[๋…ผ๋ฌธ์ •๋ฆฌ๐Ÿ“ƒ] Going Deeper with Convolutions

Going deeper with convolutions

-GoogLeNet(Inception-v1)-

๋…ผ๋ฌธ์›๋ณธ๐Ÿ˜™

GoogLeNet code implementation page => GoogLeNet

0. ์š”์•ฝ

  • โ€œinceptionโ€ ์ด๋ผ๋Š” deep convolution neural network๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ด inception์˜ ํŠน์ง•์€ ๋„คํŠธ์›Œํฌ ์•ˆ์—์„œ ์ปดํ“จํ„ฐ ๋ฆฌ์†Œ์Šค์˜ ํ™œ์šฉ์„ ํ–ฅ์ƒ์‹œํ‚จ๋‹ค. ์ด๊ฒƒ์œผ๋กœ ๊ณ„์‚ฐ ๋น„์šฉ์€ ์œ ์ง€ํ•˜๋ฉด์„œ ๋„คํŠธ์›Œํฌ์˜ ๊นŠ์ด์™€ ๋„“์ด๋ฅผ ์ฆ๊ฐ€์‹œํ‚จ๋‹ค.
  • ILSVRC 2014์— ์ œ์ถœ ๋œ ๋ชจ๋ธ์€ 22-layer๋กœ ์ด๋ฃจ์–ด์ง„ GoogLeNet์œผ๋กœ, classification๊ณผ detection ๋ถ„์•ผ์—์„œ ๊ทธ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ–ˆ๋‹ค.

1. ๊ฐœ์š”

  • ์ตœ๊ทผ ๋”ฅ๋Ÿฌ๋‹๊ณผ convolutional networks์˜ ๋ฐœ์ „์œผ๋กœ ์ด๋ฏธ์ง€ ์ธ์‹๊ณผ ๋ฌผ์ฒด์ธ์‹์˜ ์„ฑ๋Šฅ์ด ์—„์ฒญ๋‚˜๊ฒŒ ํ–ฅ์ƒ๋˜์—ˆ์œผ๋ฉฐ ์ด๊ฒƒ์€ (1)์ƒˆ๋กœ์šด ์•„์ด๋””์–ด๋‚˜ (2)์ƒˆ๋กœ์šด ์•Œ๊ณ ๋ฆฌ์ฆ˜, (3)๊ฐœ์„ ๋œ ๋„คํŠธ์›Œํฌ ๊ตฌ์กฐ๋กœ๋ถ€ํ„ฐ ์–ป์–ด์ง„ ๊ฒฐ๊ณผ๋“ค์ด๋‹ค.

  • inception์€ย NIN๋…ผ๋ฌธ๊ณผ ํ•จ๊ป˜, โ€œwe need to go deeperโ€๋ผ๋Š” ์œ ๋ช…ํ•œ ์ธํ„ฐ๋„ท ๋ฐˆ์—์„œ ์œ ๋ž˜ํ•œ ์ด๋ฆ„์ด๋‹ค. ์ด ๋…ผ๋ฌธ์—์„œ๋Š” โ€œdeepโ€์ด๋ผ๋Š” ๋‹จ์–ด๋ฅผ (1) Inception module์ด๋ผ๋Š” ์ƒˆ๋กœ์šด ๊ตฌ์กฐ์˜ ๋„์ž… ์ด๋ผ๋Š” ์˜๋ฏธ์™€ (2) Network์˜ depth๊ฐ€ ๋Š˜์–ด๋‚œ ๊ฒƒ ์ด๋ผ๋Š” ์˜๋ฏธ๋กœ 2๊ฐ€์ง€์˜ ๋œป์œผ๋กœ ์‚ฌ์šฉํ•œ๋‹ค.

  • ์ผ๋ฐ˜์ ์œผ๋กœ Inception ๋ชจ๋ธ์€ย NIN์˜ ๋…ผ๋ฆฌ๋กœ๋ถ€ํ„ฐ ์˜๊ฐ์„ ์–ป์—ˆ์œผ๋ฉฐ,ย Arora์˜ ์ด๋ก ์  ์—ฐ๊ตฌ๊ฐ€ ์ง€์นจ์ด ๋˜์—ˆ๋‹ค. Inception ๊ตฌ์กฐ์˜ ์ด์ ์€ ILSVRC 2014 classification ๋ฐ detection ๋ถ„์•ผ์—์„œ ์‹คํ—˜์ ์œผ๋กœ ๊ฒ€์ฆ๋์œผ๋ฉฐ, ๋‹น์‹œ์˜ state-of-the-art๋ณด๋‹ค ํ›จ์”ฌ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์˜€๋‹ค.

    ๐Ÿ”ธ NIN(Network-in-Network)

์ด ํŽ˜์ด์ง€์—์„œ๋Š” ์„ธ ๊ฐ€์ง€ ์ฃผ๋ชฉํ•  ๋‚ด์šฉ์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

โ€œCNN์˜ ์ „ํ˜•์ ์ธ ๊ตฌ์กฐโ€

  • Convolutional layers optionally followed by contrast normalization or max-pooling layers, then one or more fully connected (FC) layers.
  • Variants of this structure are widely used for image classification and have achieved strong results on the MNIST, CIFAR, and ImageNet classification challenges.

โ€œImageNet๊ณผ ๊ฐ™์€ ํฐ dataset์˜ ๊ฒฝ์šฐโ€

  • layer ์ˆ˜์™€ layer-size ๋Š˜๋ฆฌ๋ฉด์„œ dropout์„ ์‚ฌ์šฉํ•ด์„œ overfitting์„ ํ”ผํ•˜๋Š” ๊ฒƒ์ด ์ตœ๊ทผ์˜ ์ถ”์„ธ์˜€๋‹ค. ๋˜ํ•œ, Maxpooling layer๊ฐ€ ๊ณต๊ฐ„์ •๋ณด์˜ ์ •ํ™•์„ฑ์„ ์†์‹ค์‹œํ‚จ๋‹ค๋Š” ๊ฑฑ์ •์—๋„ AlexNet์˜ CNN๊ตฌ์กฐ๋Š” localization, object detection, human pose estimation๋ถ„์•ผ์—์„œ ์„ฑ๊ณต์ ์ธ ์„ฑ๋Šฅ์„ ๋ณด์˜€๋‹ค.

โ€œNINโ€์€

  • ์‹ ๊ฒฝ๋ง์˜ ํ‘œํ˜„๋ ฅ์„ ๋†’์ด๊ธฐ ์œ„ํ•ด Lin et al. ์ด ์ œ์•ˆํ•œ ๋ฐฉ์‹์ด๋‹ค. ย GoogLeNet์˜ ๊ฒฝ์šฐ์—๋Š” Inception layer๊ฐ€ ์—ฌ๋Ÿฌ ๋ฒˆ ๋ฐ˜๋ณต๋˜์–ด 22-layer deep model๋กœ ๊ตฌํ˜„๋œ๋‹ค. ์ด ๋ชจ๋ธ์€ 1x1 conv layer๊ฐ€ ๋„คํŠธ์›Œํฌ์— ์ถ”๊ฐ€๋˜์–ด depth๋ฅผ ์ฆ๊ฐ€์‹œํ‚จ๋‹ค. 1x1 convolution์ด ๊ฐ€์ง€๋Š” ๋ชฉ์ ์€ ๋‹ค์Œ ๋‘ ๊ฐ€์ง€๊ฐ€ ์žˆ๋‹ค.

    1. ๋ณ‘๋ชฉํ˜„์ƒ์„ ์ œ๊ฑฐํ•˜๊ธฐ ์œ„ํ•œ ์ฐจ์›์˜ ์ถ•์†Œ
    2. ํฐ ์„ฑ๋Šฅ์˜ ์ €ํ•˜์—†์ด ๋„คํŠธ์›Œํฌ์˜ width์™€ depth๋ฅผ ์ฆ๊ฐ€์‹œํ‚ค๊ธฐ ์œ„ํ•ด

1x1 convolution๊ณผ ๋ณ‘๋ชฉํ˜„์ƒ(bottleneck) ์— ๋Œ€ํ•œ ์ž์„ธํ•œ ๋‚ด์šฉ์„ ์•„๋ž˜ ๋”๋ณด๊ธฐ๐Ÿ”Ž ์ฐธ๊ณ !!

๋”๋ณด๊ธฐ๐Ÿ”Ž

๋จผ์ €, ๋ณ‘๋ชฉํ˜„์ƒ์ด๋ž€?

image

์—ฌ๋Ÿฌ ํ˜„์ƒ์—์„œ ์“ฐ์ด๋Š” ๋‹จ์–ด๋กœ ์ปดํ“จํ„ฐ์ชฝ ์šฉ์–ด๋กœ๋Š”, ๋ณ‘์˜ ๋ชฉ ๋ถ€๋ถ„์ฒ˜๋Ÿผ ๋„“์€ ๊ธธ์ด ์ข์•„์ง์œผ๋กœ์จ ์ปดํ“จํ„ฐ ์„ฑ๋Šฅ์ด ์ €ํ•˜๋˜๋Š” ํ˜„์ƒ์„ ๋งํ•œ๋‹ค. ์‰ฝ๊ฒŒ ๋งํ•˜์ž๋ฉด, ์ˆ˜์šฉ๊ฐ€๋Šฅํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ณต๊ฐ„์€ ์ž‘์€๋ฐ ํ•œ๊บผ๋ฒˆ์— ๋งŽ์€ ์–‘์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์œ ์ž… ๋จ์œผ๋กœ์จ ์ปดํ“จํ„ฐ๊ฐ€ ๋Š๋ ค์ง€๋Š” ํ˜„์ƒ์„ ๋งํ•œ๋‹ค.

์ด ํ˜„์ƒ์€ CPU์™€ GPU๋กœ๋ถ€ํ„ฐ ๋ฐœ์ƒํ•œ๋‹ค. ๊ฐ ํ˜„์ƒ์— ๋Œ€ํ•ด ์•Œ์•„๋ณด์ž!

1. CPU bottleneck

A CPU bottleneck occurs when the processor cannot process or transfer data fast enough; that is, when the CPU's processing speed cannot keep up with the rate at which data arrives.

2. GPU bottleneck

Some claim a GPU bottleneck occurs when an entry-level graphics card is paired with a fast processor; others argue that a bottleneck has only really occurred when insufficient CPU throughput pushes GPU utilization below about 90% and the frame rate drops sharply.

๊ทธ๋ ‡๋‹ค๋ฉด 1x1 convolution์€ ์œ„์˜ ๋ณ‘๋ชฉํ˜„์ƒ์„ ์–ด๋–ป๊ฒŒ ์ œ๊ฑฐํ•  ์ˆ˜ ์žˆ์„๊นŒ?

๋จผ์ €, ์ปจ๋ณผ๋ฃจ์…˜ ์—ฐ์‚ฐ์„ ๊ณ„์‚ฐํ•˜๋Š” ๊ณต์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. (ํŒŒ๋ผ๋ฏธํ„ฐ ์ˆ˜๋Š” ๊ฒฐ๊ณผ๊ฐ’ ํฌ๊ธฐ๋ฅผ ์˜๋ฏธํ•œ๋‹ค)

Convolution Parameters = Kernel Size x Kernel Size x Input Channel x Output Channel

์ด๋ ‡๊ฒŒ Channel ๊ฐ’์ด ๋งŽ์•„์ง€๋Š” ๊ฒฝ์šฐ ์—ฐ์‚ฐ์— ๊ฑธ๋ฆฌ๋Š” ์†๋„๋„ ๊ทธ๋งŒํผ ์ฆ๊ฐ€ํ•  ์ˆ˜ ๋ฐ–์— ์—†๋Š”๋ฐ, ์ด๋•Œ Channel(์ฐจ์›)์„ ์ถ•์†Œํ•˜๋Š” ๊ฐœ๋…์ด Bottleneck layer ์ด๋‹ค.

image

์œ„ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด ์ค‘๊ฐ„์— 1x1 convolution์„ ์ถ”๊ฐ€ํ•จ์œผ๋กœ์จ channel์˜ ์ˆ˜๊ฐ€ ์ค„๋ฉด์„œ ์—ฐ์‚ฐ๋Ÿ‰์ด ํ™•์—ฐํžˆ ๊ฐ์†Œํ•˜๋Š” ๋ชจ์Šต์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. channel์ˆ˜ ์กฐ์ ˆ์€ ์ง์ ‘์ ์ธ ์—ฐ์‚ฐ๋Ÿ‰ ๊ฐ์†Œ๋กœ ์ด์–ด์ง€๋ฉด์„œ ๋„คํŠธ์›Œํฌ๋ฅผ ๊ตฌ์„ฑํ•  ๋•Œ ๋” ๊นŠ๊ฒŒ ๊ตฌ์„ฑํ•  ์ˆ˜ ์žˆ๋„๋ก ๋„์›€์„ ์ค€๋‹ค.(1x1 conv ๋‘๋ฒˆ์งธ ๋ชฉ์  ํฐ ์„ฑ๋Šฅ์˜ ์ €ํ•˜์—†์ด ๋„คํŠธ์›Œํฌ์˜ width์™€ depth๋ฅผ ์ฆ๊ฐ€์‹œํ‚ค๊ธฐ ์œ„ํ•ด) ๋˜, ์ด๋ ‡๊ฒŒ channel์ˆ˜๋ฅผ ์ค„์˜€๋‹ค๊ฐ€ ๋‹ค์‹œ ๋Š˜๋ฆฌ๋Š” ๋ถ€๋ถ„์„ bottleneck ๊ตฌ์กฐ๋ผ๊ณ  ํ‘œํ˜„ํ•˜๊ธฐ๋„ ํ•œ๋‹ค.

3. Motivation and High Level Considerations

GoogLeNet์ด ๋‚˜์˜ค๊ฒŒ ๋œ ๋ฐฐ๊ฒฝ

  • DNN์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•œ ๊ฐ€์žฅ ์†์‰ฌ์šด ๋ฐฉ๋ฒ•์€ โ€œํฌ๊ธฐโ€๋ฅผ ๋Š˜๋ฆฌ๋Š” ๊ฒƒ์ด๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ ์‰ฝ๊ณ  ๊ฐ„๋‹จํ•˜๋‚˜ ๋‘ ๊ฐ€์ง€ ๋ฌธ์ œ๊ฐ€ ์žˆ๋‹ค.

(1) ๋„คํŠธ์›Œํฌ์˜ ํฌ๊ธฐ๊ฐ€ ์ปค์ ธ parameter๊ฐ€ ๋Š˜์–ด๋‚˜๋ฉด์„œ ํ•™์Šต๋ฐ์ดํ„ฐ๊ฐ€ ์ ์€ ๊ฒฝ์šฐ overfitting์ด ์‰ฝ๊ฒŒ ์ผ์–ด๋‚œ๋‹ค.

(2) ์ปดํ“จํ„ฐ ์ž์›์˜ ์‚ฌ์šฉ๋Ÿ‰์ด ๋Š˜์–ด๋‚œ๋‹ค. -> ์ปดํ“จํ„ฐ ์ž์›์€ ์œ ํ•œํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋„คํŠธ์›Œํฌ ์‚ฌ์ด์ฆˆ๋ฅผ ๋Š˜๋ฆฌ๋Š” ๊ฒƒ๋ณด๋‹ค๋Š” ์ปดํ“จํŒ… ์ž์›์„ ํšจ์œจ์ ์œผ๋กœ ๋ถ„๋ฐฐํ•˜๋Š” ๊ฒƒ์ด ๋” ์ค‘์š”ํ•˜๋‹ค.

๐Ÿ‘‰ ์ด ๋‘ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” fully connected์—์„œ sparsely connected ๊ตฌ์กฐ๋กœ ๋ณ€๊ฒฝ ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

ํ•˜์ง€๋งŒ, ํ˜„์žฌ์˜ ํ•˜๋“œ์›จ์–ด๋กœ๋Š” sparseํ•œ ๋งคํŠธ๋ฆญ์Šค ์—ฐ์‚ฐ์— ๋น„ํšจ์œจ์ ์ด๋‹ค. ๋”ฐ๋ผ์„œ ๋งŽ์€ ๋ฌธํ—Œ์—์„œ๋Š” sparse ํ–‰๋ ฌ์„ ํด๋Ÿฌ์Šคํ„ฐ๋งํ•ด ์ƒ๋Œ€์ ์œผ๋กœ ๋ฐ€๋„๊ฐ€ ๋†’์€ ํ•˜์œ„ dense ํ–‰๋ ฌ๋กœ(submatrix)๋งŒ๋“œ๋Š” ๊ฒƒ์„ ์ œ์•ˆ(=> ์ด ๋ถ€๋ถ„์— ์ฃผ๋ชฉํ•˜์ž! ๐Ÿ’ก) ํ•˜๋ฉฐ ์ด๊ฒƒ์€ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ƒˆ๋‹ค.

4. Architectural Details

Inception๊ตฌ์กฐ์˜ ์ฃผ ์•„์ด๋””์–ด๋Š” CNN์—์„œ ๊ฐ ์š”์†Œ๋ฅผ ์ตœ์ ์˜ local sparce structure๋กœ ๊ทผ์‚ฌํ™”ํ•˜๊ณ , ์ด๋ฅผ dense component๋กœ ๋ฐ”๊พธ๋Š” ๋ฐฉ๋ฒ•์„ ์ฐพ๋Š” ๊ฒƒ์ด๋‹ค.

์•ž์„œ ๋งํ•œ ๊ฒƒ๊ณผ ๊ฐ™๋‹ค! -> sparse ํ–‰๋ ฌ์„ ํด๋Ÿฌ์Šคํ„ฐ๋งํ•˜์—ฌ ์ƒ๋Œ€์ ์œผ๋กœ ๋ฐ€๋„๊ฐ€ ๋†’์€ ํ•˜์œ„ dense ํ–‰๋ ฌ๋กœ(submatrix)๋งŒ๋“œ๋Š” ๊ฒƒ!

Inception ๊ตฌ์กฐ๋Š” 1x1, 3x3, 5x5๋กœ ์ œํ•œํ–ˆ์œผ๋ฉฐ module์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

image

๋จผ์ € (a)๋ฅผ ๋ณด๋ฉด 5x5 convolution์ด๋ผ๋„ ๋งŽ์€ filter๊ฐ€ ์Œ“์ธ๋‹ค๋ฉด ๊ณ„์‚ฐ ๋น„์šฉ์ด ์ปค์ง„๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ๋‹ค.

๋”ฐ๋ผ์„œ (b)์™€ ๊ฐ™์ด 1x1์„ ํ†ตํ•ด ์ฐจ์›์„ ์ถ•์†Œํ•˜์˜€๋‹ค. 1x1 convolution์€ 3x3๊ณผ 5x5 convolution์ด์ „์— ์‚ฌ์šฉํ•ด ์—ฐ์‚ฐ ๋Ÿ‰์„ ๊ฐ์†Œ์‹œํ‚จ๋‹ค.

๐Ÿ‘‰ ์ด ๊ตฌ์กฐ์˜ ์ด์ ์€ ์—ฐ์‚ฐ ๋Ÿ‰์„ ํฌ๊ฒŒ ๋Š˜๋ฆฌ์ง€ ์•Š์œผ๋ฉด์„œ ๋„คํŠธ์›Œํฌ์˜ ํฌ๊ธฐ๋ฅผ ๋Š˜๋ฆด ์ˆ˜ ์žˆ๊ณ  convolution ์—ฐ์‚ฐ ์ดํ›„์˜ ReLU๋ฅผ ํ†ตํ•ด ๋น„์„ ํ˜•์  ํŠน์ง•์„ ์ถ”๊ฐ€ํ•  ์ˆ˜ ์žˆ๋‹ค.

5. GoogLeNet

image

์ „์ฒด ๊ตฌ์กฐ์˜ scheme view

  • ์ฐจ์› ์ถ•์†Œ๋ฅผ ์œ„ํ•œ 1x1 convolution๊ณผ ReLUํ•จ์ˆ˜
  • 1024์œ ๋‹›์„ ๊ฐ€์ง„ fully-connected layer๊ณผ ReLUํ•จ์ˆ˜
  • Dropout layer๋Š” drop๋œ ๊ฒฐ๊ณผ๋ฌผ์˜ 70%์˜ ๋น„์œจ์„ ๊ฐ–๋Š”๋‹ค.
  • ๋ถ„๋ฅ˜๊ธฐ๋กœ๋Š” softmax๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์„ ํ˜• layer์„ ์‚ฌ์šฉํ•œ๋‹ค.

ReLU is applied to every convolutional layer, including those inside the Inception modules. The receptive field is 224 x 224 with RGB color channels, and mean subtraction is applied.
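A minimal sketch of the mean-subtraction preprocessing in plain Python, on a toy one-channel 2x2 "image" (the real input would be 3 x 224 x 224):

```python
def mean_subtract(image):
    """Subtract each channel's mean from every pixel of that channel.
    `image` is channels x height x width, given as nested lists."""
    out = []
    for channel in image:
        pixels = [v for row in channel for v in row]
        mean = sum(pixels) / len(pixels)
        out.append([[v - mean for v in row] for row in channel])
    return out

# Tiny 1-channel 2x2 image; the channel mean is 2.5
print(mean_subtract([[[1, 2], [3, 4]]]))  # [[[-1.5, -0.5], [0.5, 1.5]]]
```

After subtraction each channel has zero mean, which centers the input distribution for training.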

์ด์ œ googlenet์˜ ๊ตฌ์กฐ๋ฅผ ๋ถ€๋ถ„์ ์œผ๋กœ ์•Œ์•„๋ณด์ž! ํฌ๊ฒŒ 3๊ฐ€์ง€์˜ ๊ตฌ์กฐ๋กœ ๋ถ„์„ํ•ด ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

(1) architecture part 1

image

๋‚ฎ์€ ๋ ˆ์ด์–ด๊ฐ€ ์œ„์น˜ํ•œ ๋ถ€๋ถ„์œผ๋กœ Inception module์ด ์‚ฌ์šฉ๋˜์ง€ ์•Š์Œ.

ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ์„ ์œ„ํ•ด ๋‚ฎ์€ layer์—์„œ๋Š” ๊ธฐ๋ณธ์ ์ธ CNN ๋ชจ๋ธ์„ ์ ์šฉํ•˜๊ณ , ๋†’์€ layer์—์„œ Inception module์„ ์‚ฌ์šฉํ•œ๋‹ค.

(2) architecture part 2

image

Inception module์ด๋ฉฐ ๋‹ค์–‘ํ•œ ํŠน์ง•์„ ์ถ”์ถœํ•˜๊ธฐ ์œ„ํ•ด 1x1, 3x3, 5x5 convolution์ด ๋ณ‘๋ ฌ์ ์œผ๋กœ ์—ฐ๊ฒฐ ๋˜์–ด ์žˆ์œผ๋ฉฐ 3x3๊ณผ 5x5 convolution ์ด์ „์— 1x1์„ ํ†ตํ•ด ์ฐจ์›์„ ์ถ•์†Œํ•˜๋Š” ๊ตฌ์กฐ์ด๋‹ค. Architectural details์—์„œ ์–ธ๊ธ‰ํ•œ ๊ฒƒ๊ณผ ๊ฐ™๋‹ค.

(3) architecture part 3

image

auxiliary classifier๊ฐ€ ์ ์šฉ๋œ ๋ถ€๋ถ„์ด๋‹ค.

๋ชจ๋ธ์˜ ๋ ˆ์ด์–ด๊ฐ€ ๋งŽ์•„์ง€๋ฉด ์—ญ์ „ํŒŒ๋ฅผ ์ˆ˜ํ–‰ ํ•  ๋•Œ ๊ธฐ์šธ๊ธฐ๊ฐ€ 0์œผ๋กœ ์ˆ˜๋ ดํ•˜๋Š” gradient vanishing ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค.

์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ค‘๊ฐ„ layer์— auxiliary classifier๋ฅผ ์ถ”๊ฐ€ํ•˜์—ฌ, ์ค‘๊ฐ„์ค‘๊ฐ„์— ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅํ•ด ์ถ”๊ฐ€์ ์ธ ์—ญ์ „ํŒŒ๋ฅผ ์ผ์œผ์ผœ gradient๊ฐ€ ์ „๋‹ฌ๋ ย ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๋ฉด์„œ๋„ย ์ •๊ทœํ™” ํšจ๊ณผ๊ฐ€ ๋‚˜ํƒ€๋‚˜๋„๋ก ํ•˜์˜€๋‹ค.

์ง€๋‚˜์น˜๊ฒŒ ์˜ํ–ฅ์„ ์ฃผ๋Š” ๊ฒƒ์„ ๋ง‰๊ธฐ ์œ„ํ•ดย auxiliary classifier์˜ loss์— 0.3์„ ๊ณฑํ•˜์˜€๊ณ ,ย ์‹ค์ œ ํ…Œ์ŠคํŠธ ์‹œ์—๋Š” auxiliary classifier๋ฅผ ์ œ๊ฑฐย ํ›„, ์ œ์ผ ๋๋‹จ์˜ softmax๋งŒ์„ ์‚ฌ์šฉํ•œ๋‹ค. ย 

(4) architecture part 4

image

Output์ด ๋‚˜์˜ค๋Š” ๊ตฌ๊ฐ„์ด๋‹ค. ๊ตฌ์กฐ๋ฅผ ๋ณด๋ฉด ์ตœ์ข… classifier์ด์ „์— average pooling layer๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ๋Š”๋ฐ ์ด๋Š”ย GAPย (Global Average Pooling)๊ฐ€ ์ ์šฉ๋œ ๊ฒƒ์ด๋‹ค.

GAP๋Š”ย ์ด์ „ layer์—์„œ ์ถ”์ถœ๋œ feature map์„ ๊ฐ๊ฐ ํ‰๊ท  ๋‚ธ ๊ฒƒ์„ ์ด์–ด 1์ฐจ์› ๋ฒกํ„ฐ๋กœ ๋งŒ๋“ค์–ด ์ค€๋‹ค. (1์ฐจ์› ๋ฒกํ„ฐ๋กœ ๋งŒ๋“ค์–ด์ค˜์•ผ ์ตœ์ข…์ ์œผ๋กœ ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜๋ฅผ ์œ„ํ•œ softmax layer์™€ ์—ฐ๊ฒฐํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.)


6. Training Methodology

์ด ๋…ผ๋ฌธ์ €์ž๋Š” asynchronous stochastic gradient descent(SGD)๋ฅผ ์‚ฌ์šฉํ•˜์˜€์œผ๋ฉฐ momentum = 0.9, learning rate๋Š” ๋งค 8๋ฒˆ์˜ epoch ๋งˆ๋‹ค 4% ์”ฉ ๊ฐ์†Œํ•˜๋Š” ๊ณ ์ • ์Šค์ผ€์„ ๊ฐ€์ง„๋‹ค.

asynchronous SGD


๋˜, Polyak averaging์ด inference time์— ์‚ฌ์šฉ๋˜๋Š” final model์„ ๋งŒ๋“œ๋Š”๋ฐ ์‚ฌ์šฉ๋˜์—ˆ๋‹ค.

Polyak averaging๐Ÿ”Ž


์ตœ์ ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ํšก๋‹จํ•˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ ๊ณต๊ฐ„์˜ ์—ฌ๋Ÿฌ ํฌ์ธํŠธ๋“ค์„ ํ‰๊ท ํ™”์‹œํ‚จ๊ฒƒ์„ ํฌํ•จํ•˜๋Š” ํ‰๊ท ์‹์ด๋‹ค.

๋”ฐ๋ผ์„œ ๋งŒ์•ฝ ์ตœ์ ํ™” ๋„์ค‘ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด $\theta(1), \theta(2), โ€ฆ$๋ฅผ ๋งŒ๋‚˜๊ฒŒ ๋˜๋ฉด Polyak averaging์˜ ๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

$\hat{\theta}^{(t)} = \frac{1}{t} \sum_i \theta^{(i)}$


image

์ตœ์ ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ minima์— ๋„๋‹ฌํ•˜์ง€ ๋ชปํ•œ ์ฑ„ valley๋ฅผ ๋”ฐ๋ผ ์•ž๋’ค๋กœ ์ง„๋™(์ด๋ฆฌ์ €๋ฆฌ ์›€์ง์ž„)ํ•  ์ˆ˜ ์žˆ๋‹ค. ํ•˜์ง€๋งŒ ์ด ํฌ์ธํŠธ๋“ค์˜ ํ‰๊ท ๊ฐ’์œผ๋กœ valley์˜ ํ•˜๋‹จ,์ฆ‰ minima์— ๊ฐ€๊น๊ฒŒ ๋œ๋‹ค. ์œ„์˜ $\theta(1), \theta(2), โ€ฆ$ ํฌ์ธํŠธ๋“ค์„ ํ‰๊ท ํ•˜๋Š” ์‹์— ํฌํ•จํ•œ๋‹ค

์—ฌ๊ธฐ๋Š” ์ถ”๊ฐ€๋กœ...


๋”ฅ๋Ÿฌ๋‹์— ์žˆ์–ด ๋Œ€๋ถ€๋ถ„์˜ ์ตœ์ ํ™” ๋ฌธ์ œ๋Š” ๋ฐ”๋กœ (1) ์ตœ์ ํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ์˜ํ•ด ์ฑ„ํƒ๋œ ๊ธธ์ด ๊ฝค ๋ณต์žกํ•ด ๋ณผ๋กํ•˜์ง€ ์•Š์€(non-convex)๊ฒƒ๊ณผ ๋จผ ๊ณผ๊ฑฐ์— ๋ฐฉ๋ฌธํ•œ ํฌ์ธํŠธ๊ฐ€ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ณต๊ฐ„์˜ ์ตœ๊ทผ ํฌ์ธํŠธ๋กœ๋ถ€ํ„ฐ ๊ฝค ๋ฉ€์ง€๋„ ๋ชจ๋ฅธ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค..

๋”ฐ๋ผ์„œ ๋จผ ๊ณผ๊ฑฐ์˜ ์ด์™€ ๊ฐ™์€ ํฌ์ธํŠธ๋ฅผ ํฌํ•จ์‹œํ‚ค๋Š” ๊ฒƒ์€ ์‹ค์šฉ์ ์ด์ง€ ์•Š์„์ง€๋„ ๋ชจ๋ฅธ๋‹ค. ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— polyak average๋ณด๋‹ค๋Š” exponentially decaying running average๋ฅผ ์‚ฌ์šฉํ•˜๋Š”๋ฐ ์ด๋Š” Polyak-Ruppert Averaging์ด๋ผ๊ณ  ํ•œ๋‹ค.

Inference Time๐Ÿ”Ž


์ง์—ญํ•˜์ž๋ฉด ์ถ”๋ก  ์‹œ๊ฐ„์ด๋ผ๋Š” ๊ฒƒ์ธ๋ฐ, ํ•˜๋‚˜์˜ frame์„ detectionํ•˜๋Š”๋ฐ ๊นŒ์ง€ ๊ฑธ๋ฆฌ๋Š” ์‹œ๊ฐ„์„ inference time์ด๋ผ๊ณ  ํ•œ๋‹ค.

์˜์ƒ์€ image๋“ค์˜ ์—ฐ์†์ ์ธ ์ง‘ํ•ฉ์ด๋‹ค. FPS๋ž€ ์ดˆ๋‹น detectionํ•˜๋Š” ๋น„์œจ์„ ์˜๋ฏธํ•œ๋‹ค. ๋งŒ์•ฝ, ์ดˆ๋‹น 20๊ฐœ์˜ frame์— ๋Œ€ํ•ด detection์„ ์ˆ˜ํ–‰ํ•˜๋ฉด 20fps ๋ผ๊ณ  ํ•œ๋‹ค. ์‚ฌ๋žŒ๋“ค์ด ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์ธ์‹ํ•˜๋Š” ์˜์ƒ์˜ fps๋Š” 30fps์ด๋‹ค. ๋”ฐ๋ผ์„œ ์ดˆ๋‹น ์—ฐ์†์ ์ธ frame์„ 30๊ฐœ ์ด์ƒ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์œผ๋ฉด ๋Š๊ธฐ์ง€ ์•Š๋Š” ์ž์—ฐ์Šค๋Ÿฌ์šด ์˜์ƒ์ด๋ผ๊ณ  ์ธ์‹ํ•˜๊ฒŒ ๋˜๋Š” ๊ฒƒ์ด๋‹ค.

๋”ฐ๋ผ์„œ Object Detection๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•  ๋•Œ (m)AP๊ฐœ๋…๋„ ์ค‘์š”ํ•˜์ง€๋งŒ inference time๋„ ์ค‘์š”ํ•˜๊ฒŒ ์ƒ๊ฐํ•œ๋‹ค. ๊ทธ๋ž˜์„œ ์—ฌ๊ธฐ์„œ ๋งˆ์ง€๋ง‰ ๋ชจ๋ธ๋กœ inference time์„ ์ธก์ •ํ•˜๋‚˜๋ณด๋‹ค!


GoogLeNet์„ ์ฝ”๋“œ๋กœ ๊ตฌํ˜„ํ•œ๊ฒƒ์„ ์ •๋ฆฌํ•œ ํŽ˜์ด์ง€์ด๋‹ค. => GoogLeNet


์ฐธ๊ณ 

[1] https://sike6054.github.io/blog/paper/second-post/

[2] https://leedakyeong.tistory.com/entry/%EB%85%BC%EB%AC%B8-GoogleNet-Inception-%EB%A6%AC%EB%B7%B0-Going-deeper-with-convolutions-1

[3] https://phil-baek.tistory.com/entry/3-GoogLeNet-Going-deeper-with-convolutions-%EB%85%BC%EB%AC%B8-%EB%A6%AC%EB%B7%B0

[4] https://89douner.tistory.com/80

[5] https://medium.com/inveterate-learner/deep-learning-book-chapter-8-optimization-for-training-deep-models-part-ii-438fb4f6d135

ํƒœ๊ทธ: ,

์นดํ…Œ๊ณ ๋ฆฌ:

์—…๋ฐ์ดํŠธ:

๋Œ“๊ธ€๋‚จ๊ธฐ๊ธฐ