[Paper Summary 📃] EfficientDet: Scalable and Efficient Object Detection

- EfficientDet -

Original paper 😙

I've read plenty of detection code, but I think this is my first post about it. Detection is harder than classification, so I kept putting it off and ended up avoiding the whole area. But my professor told me I can't just keep avoiding it and that I need to see things through to the end, so here I go, finishing it!

As the name suggests, EfficientDet is a detection model built on top of EfficientNet. This paper is much easier to follow if you read EfficientNet first, so I recommend going through that one beforehand.


The two key ideas to take away from this paper are BiFPN and the compound scaling method.

1) BiFPN

๋…ผ๋ฌธ์„ ๋ณด๋ฉด, bi-directional feature pyramid network, allows easy and fast multi-scale feature fusion ๋ผ๊ณ  ์ ํ˜€์žˆ๋Š”๋ฐ, ๋’ค์—์„œ๋„ ์ž์„ธํžˆ ๋‹ค๋ฃจ์ง€๋งŒ Feature Pyramid Network๋ฅผ ๋” ํšจ์œจ์ ์ด๊ฒŒ ๊ฐœ์กฐํ•œ ๋ฐฉ์‹์ด๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋œ๋‹ค.

2) Compound scaling method

Because EfficientNet is used as the backbone network, a compound scaling method is used so that the rest of the detector scales together with it.


In the image below, you can see that the computational cost is dramatically lower than that of earlier SOTA models, while the accuracy is clearly better. It looks a lot like the trend we saw with EfficientNet!

image

EfficientDet architecture

The overall structure is as follows. Features from the backbone, each at $1 / 2^i$ of the input resolution, are lined up like a pyramid, and the features from level 3 through level 7 are used in the computation.

The paper doesn't say so explicitly, but my guess is that the computation starts from the 3rd feature because that is roughly where the features start to carry meaningful information..!
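To make the $1 / 2^i$ scaling concrete, here is a small sketch that computes the side length of each pyramid level. It assumes a square input and that level `i` has stride `2**i`; the function name `pyramid_sizes` and the 512-pixel input are just illustrative choices, not something taken from the paper's code.

```python
def pyramid_sizes(input_size, levels=range(3, 8)):
    """Return {level: feature map side length}, assuming stride 2**level."""
    sizes = {}
    for level in levels:
        stride = 2 ** level
        # Ceil division, so the result is still valid when input_size
        # is not an exact multiple of the stride.
        sizes[level] = -(-input_size // stride)
    return sizes


sizes = pyramid_sizes(512)
for level, side in sizes.items():
    print(f"P{level}: {side}x{side}")
# P3: 64x64
# P4: 32x32
# P5: 16x16
# P6: 8x8
# P7: 4x4
```

So with a 512x512 input, the five levels that go into BiFPN shrink from 64x64 down to 4x4.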

๋˜, 5๊ฐœ์˜ feature ๋“ค์„ BiFPN ์—ฐ์‚ฐ์„ n ๋ฒˆ ๋ฐ˜๋ณตํ•˜๊ฒŒ ํ•˜์—ฌ ๋งˆ์ง€๋ง‰ feature๋“ค์„ concatenate ํ•˜์—ฌ ๊ฐ๊ฐ class prediction๊ณผ box prediction network๋กœ ๊ณ„์‚ฐํ•ด์ค๋‹ˆ๋‹ค.

image


FPN development process

If this is your first time hearing about FPN, you were probably confused about what it even is! That was exactly me, haha.

image

์œ„์˜ ์ด๋ฏธ์ง€๋Š” FPN์˜ ๋ฐœ์ „ ์–‘์ƒ์„ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ๋Š”๋ฐ์š”, (a)์˜ FPN์€ ๊ฐ€์žฅ ์˜ค๋ฆฌ์ง€๋„ ํ˜•ํƒœ๋กœ ๋‹ค์–‘ํ•œ scale์˜ ์ด๋ฏธ์ง€๋“ค์„ fusion ์‹œํ‚ค๋ฉด ์„ฑ๋Šฅ์ด ๋” ์ข‹๊ฒŒ ๋‚˜์˜ฌ๊ฒƒ์ด๋‹ค! ๋ผ๋Š” ์ง๊ด€์—์„œ ๋‚˜์˜จ ๊ฐœ๋…์ž…๋‹ˆ๋‹ค. ๋ณด๋ฉด TOP-DOWN ๋ฐฉ์‹์œผ๋กœ ๊ฐ feature๋“ค์„ upsampling(์ด๋ฏธ์ง€ ์Šค์ผ€์ผ์„ ํฌ๊ฒŒํ•˜์—ฌ ํ•ฉ์น˜๋Š”๊ฒƒ! ๋”ํ•ด์ค€๋‹ค) ํ•˜์—ฌ ํ•ฉ์น˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•ด์„œ ์„ฑ๋Šฅ์„ ๋†’์˜€์ง€๋งŒ, ์ •๋ณด๊ฐ€ ํ•œ์ชฝ์œผ๋กœ ํ๋ฅด๋Š” ๊ฒƒ์— ๋Œ€ํ•œ ํ•œ๊ณ„๋ฅผ ์ง€์ ํ•˜๋ฉฐ ๋‚˜์˜จ๊ฒƒ์ด (b)์˜ PANet์ž…๋‹ˆ๋‹ค. TOP-DOWN ํ›„ BOTTOM-UP์„ ํ•œ ๋ฒˆ ๋” ์ถ”๊ฐ€ํ•œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ๊ทธ๋ฆฌ๊ณ  (c)๋Š” NAS์˜ ๊ฐœ๋…์„ FPN์— ๋„์ž…ํ•œ ๊ฒƒ์ธ๋ฐ, ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ๋‚ด๋Š” ์ตœ์ ์˜ architecture์„ ์ฐพ๋„๋ก ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. NASNET ๋…ผ๋ฌธ๋ฆฌ๋ทฐ ๋ฅผ ์ฐธ๊ณ ํ•˜๋ฉด ์–ด๋–ค ๋ฐฉ์‹์ธ์ง€ ์•Œ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ์ด๋ ‡๊ฒŒ best architecture์„ ์ฐพ์œผ๋ ค๋‹ค๋ณด๋‹ˆ GPU ์—ฐ์‚ฐ์ด ๋Š๋ ค์ง€๋Š” ๋ฌธ์ œ๊ฐ€ ์ƒ๊ฒผ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์˜คํžˆ๋ ค PANet์˜ ์„ฑ๋Šฅ์ด ๋” ์ข‹๋‹ค๊ณ  ํ•œ๋‹ค.

imageimage

And! The BiFPN introduced here is an improvement on PANet. Compared with PANet, you can see that the nodes with only one input (highlighted in yellow) have been removed. The reasoning is that features that contribute little are dropped to cut computation. In addition, the original input node is added once more to the output node at the same level. This is said to improve accuracy by fusing more features. And then this block is repeated n times! The paper mentions repeating it 3 times.
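The connectivity described above can be written down as a small graph. This is only my reading of the figure: node names such as `P4_td` (top-down intermediate node) and `P4_out` (output node) are labels I made up for the sketch, and `bifpn_layer_edges` is a hypothetical helper, not code from the paper.

```python
def bifpn_layer_edges(min_level=3, max_level=7):
    """Map each node in one BiFPN layer to the list of nodes it fuses."""
    edges = {}
    # Top-down pass: only the middle levels get an intermediate node.
    # The top and bottom levels would have a single input, so they are skipped.
    for level in range(max_level - 1, min_level, -1):
        if level + 1 == max_level:
            upper = f"P{level + 1}_in"
        else:
            upper = f"P{level + 1}_td"
        edges[f"P{level}_td"] = [f"P{level}_in", upper]
    # Bottom-up pass: one output node per level.
    edges[f"P{min_level}_out"] = [f"P{min_level}_in", f"P{min_level + 1}_td"]
    for level in range(min_level + 1, max_level):
        # The extra skip edge: the original input is fused again at the output.
        edges[f"P{level}_out"] = [
            f"P{level}_in",
            f"P{level}_td",
            f"P{level - 1}_out",
        ]
    edges[f"P{max_level}_out"] = [f"P{max_level}_in", f"P{max_level - 1}_out"]
    return edges


for node, inputs in bifpn_layer_edges().items():
    print(f"{node} <- {inputs}")
```

With the default levels this gives 3 intermediate nodes and 5 output nodes, and only the middle output nodes (`P4_out`, `P5_out`, `P6_out`) fuse three inputs.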


BiFPN accuracy

image

์œ„์˜ ์ด๋ฏธ์ง€๋Š” FNP๋“ค์˜ ์„ฑ๋Šฅ ๋น„๊ต์ž…๋‹ˆ๋‹ค. BiFPN์ด ์—ฐ์‚ฐ๋Ÿ‰๋„ ์ž‘์€๋ฐ ์„ฑ๋Šฅ์€ ๊ฐ€์žฅ ์ข‹๊ณ  ๋˜, weight๋ฅผ ์ถ”๊ฐ€ํ•˜๋ฉด ๋” ์ข‹์•„์ง„๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.


BiFPN : Weighted feature fusion

We saw in the table above that weighting the features during fusion gives better accuracy. The method used to get that improvement is Fast normalized fusion.

image

First, unbounded fusion simply multiplies each input feature by a scalar value, which is the most basic approach! The problem is that when a scalar is too large or too small, the result becomes unbalanced overall and accuracy suffers. Softmax fusion reaches decent accuracy but is slow to compute (probably because it has to evaluate exponentials). So the paper uses Fast normalized fusion, which shows similar accuracy but is much faster.
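For reference, the three fusion rules described above can be written as follows, as far as I understand them from the paper. Here $I_i$ is the i-th input feature, $w_i$ is its learnable weight, and $O$ is the fused output.

Unbounded fusion:

$$O = \sum_i w_i \cdot I_i$$

Softmax-based fusion:

$$O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot I_i$$

Fast normalized fusion:

$$O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i$$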

In Fast normalized fusion, each weight is passed through a ReLU so that it is never negative, which guarantees that every normalized weight falls between 0 and 1. To keep the denominator from reaching 0, a very small value of 0.0001 is also added to it (the epsilon in the formula).

Looking at the code below makes it click.

image
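The same idea can also be sketched in plain Python. This is a minimal version under my own assumptions: features are 1-D lists of equal length, the weights are given by hand rather than learned, and `fast_normalized_fusion` is a name invented for this example.

```python
def fast_normalized_fusion(features, raw_weights, eps=1e-4):
    """Fuse equally sized 1-D features with weights w_i / (eps + sum_j w_j)."""
    # ReLU: clamp every weight to be non-negative.
    weights = [max(0.0, w) for w in raw_weights]
    # eps keeps the denominator away from zero when all weights are clamped.
    denom = eps + sum(weights)
    normalized = [w / denom for w in weights]
    fused = [
        sum(n * feature[k] for n, feature in zip(normalized, features))
        for k in range(len(features[0]))
    ]
    return fused, normalized


fused, normalized = fast_normalized_fusion(
    features=[[1.0, 2.0], [3.0, 4.0]],
    raw_weights=[1.0, -0.5],  # the negative weight is zeroed by the ReLU
)
print(normalized)
print(fused)
```

No exponentials are needed, only a clamp, a sum, and a division, which is where the speed advantage over softmax fusion comes from.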


Comparison of different feature fusion methods

image


Diagram of the computation

image


Compound scaling

Scaling follows the formulas below. This scaling scheme was designed to stay in line with EfficientNet.

image

image
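As a rough sketch, here is how I would evaluate the scaling rules for a given compound coefficient $\phi$. It assumes the equations are the ones I remember from the paper: BiFPN width $64 \cdot 1.35^{\phi}$, BiFPN depth $3 + \phi$, box/class head depth $3 + \lfloor \phi / 3 \rfloor$, and input resolution $512 + 128\phi$. The function name `efficientdet_scaling` and the dictionary keys are my own.

```python
def efficientdet_scaling(phi):
    """Evaluate the raw compound scaling equations for coefficient phi.

    Assumption: these are the unrounded formula values. The configurations
    actually published for each model round the widths and adjust some
    resolutions, so they will not match exactly.
    """
    return {
        "bifpn_width": 64 * (1.35 ** phi),   # number of BiFPN channels
        "bifpn_depth": 3 + phi,              # number of BiFPN layers
        "head_depth": 3 + phi // 3,          # layers in the box/class networks
        "input_resolution": 512 + phi * 128,
    }


for phi in range(4):
    cfg = efficientdet_scaling(phi)
    print(
        f"phi={phi}: width={cfg['bifpn_width']:.1f}, "
        f"depth={cfg['bifpn_depth']}, head={cfg['head_depth']}, "
        f"resolution={cfg['input_resolution']}"
    )
```

A single coefficient moves every part of the detector at once, which is the same spirit as EfficientNet scaling depth, width, and resolution together.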

Comparison of different scaling methods

image


SOTA for COCO Dataset

image


Another seminar wrapped up safely today! 🥰🥰
