[Paper Summary📃] Rethinking Model Scaling for Convolutional Neural Networks

Rethinking Model Scaling for Convolutional Neural Networks

- EfficientNet -

Original paper 😙


0. Abstract

CNNs have typically been developed under a fixed resource budget and then scaled up for better accuracy as more resources became available.

This paper studies model scaling and shows that carefully balancing network depth, width, and resolution leads to better performance.

It proposes a new scaling method that scales the depth, width, and resolution dimensions with a simple yet highly effective 'compound coefficient', and tests its effectiveness by applying it to MobileNet and ResNet.

Going further, the authors use 'Neural Architecture Search (NAS)' to design a baseline network and scale it up to obtain a family of models called EfficientNet. (NAS is a reinforcement-learning-based method for finding an optimal network; a detailed explanation is => here)

In particular, EfficientNet-B7 achieves 84.4% top-1 / 97.1% top-5 accuracy on the ImageNet dataset, while being 8.4 times smaller and 6.1 times faster than the best existing ConvNet.

It also reaches SOTA accuracy when transferred to CIFAR-100 (91.7%), Flowers (98.8%), and three other transfer learning datasets.


1. Introduction

Scaling up is a common way to improve ConvNet accuracy. ResNet, for example, improves by increasing the number of layers from ResNet-18 to ResNet-200, and more recently GPipe reached 84.3% top-1 accuracy on ImageNet by scaling up a baseline model four times. However, the process of scaling up ConvNets efficiently is still not well understood.


GPipe🔎

What is GPipe?

GPipe is a training technique published by Google Brain that is useful for efficiently training large models that take up a lot of memory. According to the benchmarks in Google's paper, 8 times as many devices (TPUs) as the baseline can train a model 25 times larger, and 4 times as many devices can train 3.5 times faster.

Using GPipe, Google trained an AmoebaNet-B model with about 560 million parameters. This model set the SOTA on ImageNet with 84.3% top-1 and 97% top-5 accuracy.

GPipe trains models that are as large as possible using two techniques, Pipeline Parallelism and Checkpointing.

- Pipeline Parallelism

GPipe splits the model into several partitions and places each one on a different device, so that more memory can be used. To keep the partitions working in parallel as much as possible, it splits each input mini-batch into several micro-batches and streams them through the model.

- Checkpointing

Each partition creates checkpoints to maximize the available memory. During forward propagation, only the inputs and outputs at the partition boundaries are kept, and the activations of the internal hidden layers are discarded. The discarded hidden-layer activations are recomputed during back propagation.
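The micro-batch pipelining described above can be sketched as a simple schedule. This is only an illustration of the idea, not GPipe's actual implementation: `split_into_micro_batches` and `forward_schedule` are hypothetical helper names, and the sketch covers the forward pass only.

```python
def split_into_micro_batches(mini_batch, num_micro_batches):
    """Split a mini-batch (a list of samples) into roughly equal micro-batches."""
    size = -(-len(mini_batch) // num_micro_batches)  # ceiling division
    return [mini_batch[i:i + size] for i in range(0, len(mini_batch), size)]


def forward_schedule(num_partitions, num_micro_batches):
    """For each clock tick, list the (partition, micro_batch) pairs that run in parallel."""
    ticks = []
    for t in range(num_partitions + num_micro_batches - 1):
        # Partition p works on micro-batch t - p, if that micro-batch exists.
        active = [(p, t - p) for p in range(num_partitions)
                  if 0 <= t - p < num_micro_batches]
        ticks.append(active)
    return ticks


micro_batches = split_into_micro_batches(list(range(8)), 4)
schedule = forward_schedule(num_partitions=3, num_micro_batches=4)
for tick, active in enumerate(schedule):
    print(tick, active)
```

With 3 partitions and 4 micro-batches, all three partitions are busy at the same time in the middle ticks, which is the parallelism that splitting the mini-batch is meant to buy.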



Until now, ConvNets have usually been scaled by adjusting only one of the three dimensions: depth, width, or resolution. Adjusting two or more of them is also possible, but it requires a lot of tedious manual tuning and has still not produced optimal results.

This paper therefore proposes a simple and effective 'compound scaling method'. Its key idea is that balancing the network's width, depth, and resolution is critical for accuracy, and that this balance can be achieved by scaling each of them with a simple constant ratio.

For example, if we want to use $2^N$ times more computational resources, we simply scale the baseline network's depth by $\alpha^N$, width by $\beta^N$, and image size by $\gamma^N$, where $\alpha, \beta, \gamma$ are constant coefficients found by a small grid search on the original small model.


As the image below shows, the models reach very high accuracy with few parameters. (The paper says they "significantly outperform other ConvNets", emphasizing that they do so with a dramatically smaller number of parameters.)

[Figure: number of parameters vs. ImageNet accuracy]

Honestly, though.. the performance looks really impressive.


The following figure illustrates the three ways of scaling up.

[Figure: (a) baseline network, (b)~(d) width, depth, and resolution scaling, (e) compound scaling]

Starting from the baseline network in (a), panels (b)~(d) show the architecture after scaling up width, depth, and resolution, respectively.

In the end, the goal of this paper is to do the compound scaling shown in (e) well.

The paper verifies this using MobileNet and ResNet. Because the accuracy gain from model scaling depends heavily on the baseline network, it uses neural architecture search (NAS) to design the baseline network.


2. Compound Model Scaling

This section looks at different approaches to the scaling problem and proposes a new scaling method.

2.1. Problem Formulation

A single ConvNet layer $i$ is defined as $Y_i = F_i(X_i)$.

  • $F_i$ is the operator, $Y_i$ is the output tensor, and $X_i$ is the input tensor.

  • $X_i$ has shape $\langle H_i, W_i, C_i \rangle$, where $H_i, W_i$ are the spatial dimensions and $C_i$ is the channel dimension.

A ConvNet $N$ is then written as $N = F_k \odot \cdots \odot F_2 \odot F_1(X_1)$.

In practice, however, ConvNet layers are partitioned into multiple stages, and all layers within a stage share the same architecture. (For example, ResNet has five stages, and all layers in each stage have the same convolutional type, except the first layer, which performs down-sampling.)

We can therefore define a ConvNet as follows.

$N = \bigodot\limits_{i=1 \ldots s} F_i^{L_i}(X_{\langle H_i, W_i, C_i \rangle})$

$F_i^{L_i}$ means that layer $F_i$ is repeated $L_i$ times in stage $i$, and $\langle H_i, W_i, C_i \rangle$ is the shape of the input tensor $X$ of layer $i$.
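The stage-wise definition can be sketched as a small data structure. This is a toy illustration, not code from the paper: `Stage` and `run_network` are hypothetical names, and each layer is a plain Python function on a number rather than a real convolution.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stage:
    layer: Callable[[float], float]  # F_i: the operator shared by every layer in the stage
    repeats: int                     # L_i: how many times F_i is applied


def run_network(stages: List[Stage], x: float) -> float:
    """Apply each stage in order, repeating its layer L_i times."""
    for stage in stages:
        for _ in range(stage.repeats):
            x = stage.layer(x)
    return x


# Toy network: stage 1 adds 1 three times, stage 2 doubles twice.
toy_network = [Stage(layer=lambda v: v + 1, repeats=3),
               Stage(layer=lambda v: v * 2, repeats=2)]
result = run_network(toy_network, 1.0)
print(result)
```

Under this view, scaling depth only changes `repeats`, while the operator stored in `layer` stays fixed.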

Our goal boils down to an optimization problem: maximize model accuracy under any given resource constraint. The final objective of the paper can be summarized in the following formula.

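$\max\limits_{d,w,r}\,\,\,\, Accuracy(N(d,w,r))$

$s.t. \,\,\,\, N(d,w,r) = \bigodot\limits_{i=1 \ldots s} \hat{F}_i^{d\cdot\hat{L}_i}(X_{\langle r\cdot\hat{H}_i,\, r\cdot\hat{W}_i,\, w\cdot\hat{C}_i \rangle})$

$Memory(N) \leq target \, memory$

$FLOPS(N) \leq target\,flops$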

$w,d,r$ are the coefficients used to scale the network's width, depth, and resolution, and $\hat{F}_i,\hat{L}_i,\hat{H}_i,\hat{W}_i,\hat{C}_i$ are parameters predefined in the baseline network.

That is,

Coefficients that vary: $w,d,r$

Fixed values: $\hat{F}_i,\hat{L}_i,\hat{H}_i,\hat{W}_i,\hat{C}_i$
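To make the split between fixed and variable quantities concrete, here is a minimal sketch of applying $d, w, r$ to one baseline stage. The names `StageSpec` and `scale_stage` are hypothetical, and the rounding rules (ceiling for the layer count, nearest integer for the sizes) are assumptions made for this sketch, not something stated in the text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StageSpec:
    repeats: int    # L_i hat: number of layers in the stage
    height: int     # H_i hat: input height
    spatial_w: int  # W_i hat: input width (spatial, not the channel count)
    channels: int   # C_i hat: number of channels


def scale_stage(spec: StageSpec, d: float, w: float, r: float) -> StageSpec:
    """Scale one baseline stage: depth by d, resolution by r, channel width by w."""
    return StageSpec(
        repeats=math.ceil(d * spec.repeats),    # assumed rounding: ceiling
        height=round(r * spec.height),          # assumed rounding: nearest integer
        spatial_w=round(r * spec.spatial_w),
        channels=round(w * spec.channels),
    )


baseline = StageSpec(repeats=2, height=112, spatial_w=112, channels=24)  # made-up values
scaled = scale_stage(baseline, d=1.2, w=1.1, r=1.15)
print(scaled)
```

The baseline values stay untouched. Only the three multipliers change the resulting network.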


2.2. Scaling Dimension

The main difficulty is that the optimal $d,w,r$ coefficients depend on one another and that their values change under different resource constraints. For this reason, widely used ConvNets have mostly been scaled along only one of the following dimensions.

  1. Depth ($d$) -> number of layers (a deeper network)

  2. Width ($w$) -> number of channels

  3. Resolution ($r$) -> input image size

The graphs below show the results of scaling up the baseline model with different width, depth, and resolution coefficients.

[Figure: accuracy when scaling up the baseline model by width, depth, and resolution coefficients]

Accuracy improves in each case, but it saturates quickly once it reaches about 80%.

Scaling up only one of the network's depth, width, or resolution dimensions improves accuracy, but the accuracy gain diminishes for bigger models.


2.3 Compound Scaling

Intuitively, the scaling dimensions depend on one another. If the input image (resolution) gets larger, the network needs more layers (depth) to obtain a receptive field that covers the larger area, and more channels (width) to extract finer patterns.

The graph below shows the results of varying the width while depth and resolution are held fixed.

The goal is to find the best width/depth/resolution combination at the same FLOPS.

[Figure: accuracy of width scaling with depth and resolution held fixed]

These results show that, to pursue better accuracy and efficiency during ConvNet scaling, it is important to balance all dimensions of the network.

The new compound scaling method proposed in the paper is as follows. It scales the network's width, depth, and resolution with a compound coefficient $\phi$.

Notation used in the compound scaling method

depth: $d\,=\,\alpha^\phi$

width: $w\,=\,\beta^\phi$

resolution: $r\,=\,\gamma^\phi$

$s.t. \,\,\,\, \alpha\cdot \beta^2 \cdot\gamma^2\approx 2$

$\alpha \geq 1, \beta \geq 1, \gamma \geq 1$

$\alpha, \beta, \gamma$ are constants determined by a small grid search, and $\phi$ is a user-specified coefficient that controls how many resources are used.

The FLOPS of a convolution operation are proportional to $d$, $w^2$, and $r^2$. Doubling the depth doubles the FLOPS. Width and resolution enter as squares: doubling the width doubles both the input and output channels, and doubling the resolution doubles both the height and width of the feature maps, so FLOPS grow four times in each case.

Because the constraint $\alpha\cdot \beta^2 \cdot\gamma^2 \approx 2$ fixes this product at about 2, the total FLOPS grow approximately in proportion to $2^\phi$.
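The FLOPS argument can be checked with a few lines of arithmetic. This is a minimal sketch: `flops_multiplier` and `compound_flops_multiplier` are hypothetical names, and the coefficients below are chosen only so that $\alpha\cdot\beta^2\cdot\gamma^2$ is 2 up to floating-point rounding. They are not values from the paper.

```python
import math


def flops_multiplier(d, w, r):
    """Relative FLOPS of a conv network scaled by depth d, width w, resolution r."""
    return d * w ** 2 * r ** 2


def compound_flops_multiplier(alpha, beta, gamma, phi):
    """FLOPS multiplier when d, w, r are set to alpha^phi, beta^phi, gamma^phi."""
    return flops_multiplier(alpha ** phi, beta ** phi, gamma ** phi)


# Doubling one dimension at a time: depth is linear, width and resolution are quadratic.
print(flops_multiplier(2, 1, 1), flops_multiplier(1, 2, 1), flops_multiplier(1, 1, 2))

# Illustrative coefficients with alpha * beta^2 * gamma^2 = 2 (not from the paper).
alpha, beta, gamma = 2 ** 0.5, 2 ** 0.25, 1.0
for phi in range(4):
    print(phi, compound_flops_multiplier(alpha, beta, gamma, phi))
```

Since $d \cdot w^2 \cdot r^2 = (\alpha\cdot\beta^2\cdot\gamma^2)^\phi$, each increment of $\phi$ roughly doubles the FLOPS.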


grid search🔎

What is grid search?

Grid search is a search method that tries the candidate values for a model's hyperparameters one by one and picks the hyperparameters that give the best performance.

In other words, among the many possible settings for training a model, it determines which one suits the model best.

  • Hyperparameter: a variable that the user sets directly when building a model. For a random forest, this includes the number of trees and the tree depth. For a deep learning model, it includes the number of layers and the number of training iterations. A parameter, by contrast, is a variable that is learned during training.
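As a rough sketch of how grid search could be applied to $\alpha, \beta, \gamma$: enumerate a grid, keep the combinations whose FLOPS product is close to 2, and pick the one with the best score. The function names, the grid, the tolerance, and above all `toy_evaluate` are assumptions for illustration. In a real search, the score would come from training and validating the scaled model.

```python
import itertools


def candidate_coefficients(values, target=2.0, tol=0.1):
    """Grid points (alpha, beta, gamma) with alpha * beta^2 * gamma^2 within tol of target."""
    return [(a, b, g) for a, b, g in itertools.product(values, repeat=3)
            if abs(a * b ** 2 * g ** 2 - target) <= tol]


def grid_search(candidates, evaluate):
    """Return the candidate with the highest score under `evaluate`."""
    return max(candidates, key=evaluate)


def toy_evaluate(coeffs):
    """Stand-in score (NOT a real accuracy): peaks at an arbitrary point of the grid."""
    a, b, g = coeffs
    return -(abs(a - 1.2) + abs(b - 1.1) + abs(g - 1.15))


grid = [i / 100 for i in range(100, 145, 5)]  # 1.00, 1.05, ..., 1.40
candidates = candidate_coefficients(grid)
best = grid_search(candidates, toy_evaluate)
print(len(candidates), best)
```

The constraint keeps the grid small: only combinations with roughly the same FLOPS budget need to be evaluated.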



3. EfficientNet Architecture

The experiments above show that it is better to consider the three scaling factors together.

Next, we describe how the optimal ratio is found, applied to an actual model, and compared against other models.

The paper fixes the model (F) and adjusts depth, width, and resolution, so choosing a good model (F) to fix is very important. No matter how the scaling factors are tuned, if the initial model itself performs poorly, the achievable ceiling is also low. The paper searches for the model with AutoML in almost the same search space as MnasNet, and the small model found this way is called EfficientNet-B0.

[Table: EfficientNet-B0 baseline network architecture]

The architecture is very similar to MnasNet and is laid out as in the table above.

The values of $\alpha, \beta, \gamma$ for EfficientNet are obtained by a simple grid search. The paper uses

$\alpha = 1.2$

$\beta = 1.1$

$\gamma = 1.15$

With these three values fixed, the model size is grown by increasing $\phi$.
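Plugging these values into the compound scaling formulas gives the multipliers for each $\phi$. This is a minimal sketch of the formulas only: `compound_scale` is a hypothetical helper, and the output should not be read as the exact configuration of any released EfficientNet variant.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # values reported in the paper


def compound_scale(phi):
    """Return the (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


for phi in range(4):
    d, w, r = compound_scale(phi)
    flops = d * w ** 2 * r ** 2
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
          f"resolution x{r:.2f}, FLOPS x{flops:.2f}")
```

With these coefficients $\alpha\cdot\beta^2\cdot\gamma^2 \approx 1.92$, so each step of $\phi$ slightly less than doubles the FLOPS.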


4. Experiments

The table below compares the results with existing hand-designed ConvNets and with ConvNets found through AutoML.

[Table: comparison of EfficientNet with existing ConvNets]

Compared with existing ConvNets, EfficientNet reaches similar accuracy while using far fewer parameters and FLOPS. It also achieves higher accuracy than GPipe, which previously held the highest accuracy on the ImageNet dataset.

Other experimental results

[Figure: Class Activation Maps (CAM) for models scaled with different methods]

The image above shows Class Activation Maps (CAM), which reveal which regions of an image the model focuses on when classifying it. Considering the three scaling factors together yields more refined CAMs than scaling each one individually.
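As a rough illustration of what a CAM is: in the standard formulation, the map for a class is a weighted sum of the final convolutional feature maps, using that class's classifier weights. The sketch below assumes that formulation. `class_activation_map` and the tiny inputs are made up for illustration and are not taken from the paper.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W, as nested lists) for one class."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for fmap, weight in zip(feature_maps, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    return cam


# Two 2x2 feature maps and the (made-up) weights of one class.
feature_maps = [[[1.0, 0.0], [0.0, 0.0]],
                [[0.0, 0.0], [0.0, 2.0]]]
cam = class_activation_map(feature_maps, class_weights=[0.5, 1.0])
print(cam)
```

Regions with large values in the resulting map are the ones that contribute most to the class score.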

[Table: depth, width, and resolution settings with FLOPS and top-1 accuracy]

The table above lists the FLOPS and top-1 accuracy for each network depth, width, and resolution setting used in Fig. 7. The model with compound scaling performs better even at similar FLOPS.



This paper took a lot of work to review.. it felt like it took twice as long as the simpler NasNet. But there was that much more to learn, and I was especially surprised that applying NAS efficiently produced a result that is a step up. The performance is also very impressive.. far fewer parameters and yet far better accuracy.. I learned that applying the NAS framework well can lead to tremendous performance. 🙌🙌🙌


