Notice

Contact

Recent Posts

Recent Comments

Link

« 2025/01 »
일	월	화	수	목	금	토
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28	29	30	31

Tags more

Archives

Today

Total

관리 메뉴

Deeper Learning

You Only Look Once: Unified, Real-Time Object Detection 본문

AI/Deep Learning

You Only Look Once: Unified, Real-Time Object Detection

Dlaiml 2021. 11. 30. 09:33

2016, University of Washington, Allen Institute for AI, Facebook AI Research
source: https://arxiv.org/pdf/1506.02640.pdf

Abstract

YOLO, obejct detection을 위한 새로운 method를 제시
이전 연구들은 classifier를 detection을 위해 사용하였으나 YOLO는 object detection을 spatially separated bbox와 해당하는 bbox의 class probabilities를 예측하는 regression 문제로 정의
하나의 신경망 모델이 full image에서 one-stage로 bbox와 class probabilities를 예측
전체 detection 파이프라인이 하나의 network에서 이루어지는 end-to-end 모델
초당 45 frames을 detection 할 수 있는 real-time object detection
작은 버전의 network Fast YOLO는 초당 155 frames을 detection 가능하며 이전 real-time detector의 2배의 mAP를 달성
SOTA와 비교하였을 때 localization에서 성능이 조금 떨어지나 background에 bbox를 생성하는 false positive error는 감소
object의 일반적인 representations을 학습하는데 DPM, R-CNN보다 뛰어난 성능, natural images 뿐만 아니라 artwork에서도 좋은 성능

1. Introduction

인간은 이미지를 보자마자 무슨 object가 어디에 있고 어떻게 상호작용 하는지 바로 알 수 있음
빠르고 정확한 object detection 알고리즘은 많은 분야에 유용하게 활용할 수 있음
이전 detection system은 classifier를 detection 목적으로 사용
- classifier를 사용하여 image의 여러 location과 scale에서 classify
- DPM: sliding window를 사용하여 전체 이미지를 탐색
- R-CNN: region proposal을 사용하여 먼저 가능성 있는 bbox를 만들고 proposed bbox에서 classifier를 사용 → 겹치는 detection, rescore와 같은 post-processing 과정을 통해 bbox를 조정
- 각 component를 따로 학습하여야 하기 때문에 느리고 optimize도 어려움
YOLO는 object detection을 single regression problem으로 정의 (input image → bbox coordinates, class probabilities)
object와 그 위치를 한번에 예측 가능 (you only look once)

YOLO는 매우 간단하며 단일 conv network로 동시에 bbox와 class probabilities를 예측하며 full image에서 바로 detection performance를 optimize: Unified model
unified model은 전통적인 object detection method에 비해 여러 장점을 가짐
1. Extremely fast
  - 복잡한 파이프라인이 필요없음
  - YOLO의 fast version의 경우 150 fps를 소화할 수 있으며 25 milliseconds latency의 video에서도 real-time object detection이 가능함
  - 이전 real-time object detection method에 비해 2배 이상의 mAP 달성
2. 이미지를 전역적으로 판단
  - sliding window, region proposal-based techniques과 다르게 YOLO는 train, test 과정에서 전체 이미지를 봄으로써 class에 대한 부분적인 외형뿐만 아니라 contextual information도 implicit 하게 인코딩
  - Fast R-CNN은 더 큰 context를 보지 못하기 때문에 background patch를 object로 인식하는 오류가 YOLO보다 빈번함
3. objects의 일반화 가능한 representations을 학습
  - natural image로 학습하고 artwork에서 test 하였을 때 YOLO의 성능이 DPM, R-CNN의 성능을 크게 상회
  - 새로운 도메인이나 예기치 못한 input에 대한 일반화 성능이 뛰어남
SOTA detection systems에 비해 정확도가 다소 떨어짐
작은 object에서 localization 성능이 떨어짐

2. Unified Detection

object detection의 separate components를 single neural network에 통합
전체 이미지의 feature를 사용하여 모든 classes의 모든 bbox를 동시에 예측
input image를 S x S grid로 나눔
만약 object의 중앙점이 어떤 grid cell 내부에 있다면, 그 grid cell은 object를 detect 하여야 함
각 grid cell은 B개의 bbox와 각 bbox에 해당하는 confidence를 예측
- confidence score는 box에 obejct가 있는지, box가 얼마나 정확한지에 대한 모델의 confidence
- confidence score는 $Pr(Object) * IOU^{truth}_{pred}$ 로 정의
- grid cell에 object가 없다면 confidence score는 0이 되어야 함
- grid cell에 object가 있다면 ground truth와 predicted box의 IOU가 confidence score가 되어야 함
각 bbox는 5개의 predictions을 포함: x, y, w, h, confidence
- (x, y)는 box의 중앙 좌표로 grid cell 경계 기준 상대 좌표
- w = width, h = height 전체 이미지 기준 상대적
- confidence = (ground truth box와 predicted box의 IOU)
각 grid cell은 C개의 conditional class probabilities를 예측
- $Pr(Class_i|Object)$
- 하나의 grid cell이 포함하는 bbox의 수 B에 관계없이 한 set의 class probabilities를 예측
Test time 때는 conditional class probabilities와 각 box의 confidence prediction을 곱함
- $Pr(Class_i|Object) * Pr(Object) * IOU^{truth}{pred} = Pr(Class_i) * IOU^{truth}{pred}$
- 하나의 grid에 해당하는 최종 output (C = 5, B = 2)
  - [b1_confidence, b1x, b1y, b1w, b1h, b2_confidence, b2x, b2y, b2w, b2h, C1, C2, C3, C4, C5], 5 * B + C
  - b1_confidence를 C1 ~ C5에 곱하여 bbox1에 대한 최종 class-specific confidence score를 도출
- class-specific confidence는 box에 해당 class의 object가 존재할 확률과 bbox가 object에 얼마나 잘 맞는지를 나타냄
PASCAL VOC 데이터셋에서 S=7, B=2, C=20 → 최종 output tensor의 shape는 S x S x (B * 5 + C) = 7 x 7 x 30

2.1. Network Design

GoogLeNet에서 영감을 받음
24 conv layers → 2 fc layers
Inception 모듈 대신 1x1 reduction layer + 3x3 conv layers를 사용(NiN 구조와 유사)

2.2. Training

1000-class ImageNet에서 pretrain
conv, fc layer를 pretrained network에 추가하면 성능을 높일 수 있다는 이전 연구에 따라 randomly initialized 된 4개의 conv layer와 2개의 fc layer를 pretrained network에 추가
width와 height는 image size에 따라 normalized 되어 0 ~ 1의 값을 가짐
x와 y는 grid cell에서 위치를 나타내는 offset으로 0 ~ 1의 값을 가짐
final layer에서는 linear activation function을 사용 나머지 layer에서는 0.1 leaky ReLU를 사용
sum-squared error
모든 이미지에서 많은 grid cell은 object를 포함하고 있지 않기 때문에 confidence score를 0으로 밀어버리면서 object가 있는 grid cell의 유의미한 gradient를 압도하는 상황이 발생
$\lambda_{coord}$ = 5, $\lambda_{noobj}$ = 0.5로 설정하여 안정적인 학습을 유도
큰 box에서 Sum-squared error는 작은 box에서 error와 동일한 가중치를 가짐
- $(100-104)^2 = (5 -1) ^2$
- 큰 box에서 작은 수치의 error는 작은 box의 그것보다 중요하지 않음
- width, height를 바로 예측하는 것이 아닌 squared root를 예측
- $(\sqrt{100} - \sqrt{104})^2 < (\sqrt{5} - \sqrt{1})^2$

2.3. Inference

대부분 object가 특정 grid cell에 소속되어 object마다 1개의 box가 생성
그러나 큰 object, grid 경계에 걸쳐있는 object의 경우 여러 개의 box가 생길 수 있음, Non-maximal suppression (NMS)를 사용하면 2~3%의 mAP 증가

2.4. Limitations of YOLO

YOLO는 grid마다 1개의 class만 가질 수 있는 2개의 box만 예측하는 강한 공간적 제약을 가짐
이러한 제약이 근접한 다른 작은 objects를 탐지하는데 악영향을 줌
data에서 bounding box 예측을 학습하기 때문에 특이한 비율의 object에 일반화 성능이 좋지 않음
input image에 많은 downsampling layer를 통과시키기 때문에 비교적 coarse features를 사용
IOU를 사용하기 때문에 large box에서 error보다 small box에서 error가 훨씬 크게 loss에 영향
incorrect localizations이 error의 주요 원인

3. Comparison to Other Detection Systems

DPM

sliding window
pipeline: extract static features, classify regions, predict bounding boxes
YOLO는 통합된 단일 네트워크를 사용하여 빠르고 더 정확

R-CNN

region proposal by Selective Search algorithm
complex pipeline, 0.25 fps
YOLO도 potential bounding boxes를 each grid cell에서 제안
YOLO는 proposal bounding boxes가 R-CNN의 2000개에 비해 98개로 훨씬 적으며 파이프라인을 통합하여 빠름

Other Fast Detectors

Fast & Faster R-CNN은 Selective Search의 문제를 해결하고자 structure를 수정하여 R-CNN을 빠르게 만듦, 그러나 Real-Time을 소화하기에 느린 속도가 문제
DPM의 속도를 높이기 위한 많은 연구가 있었음. HOG computation의 속도를 증가시킨 연구가 많았으나 마찬가지로 Real-Time을 하기에 느린 속도
YOLO는 파이프라인 자체를 버리고 한 번에 빠른 예측이 가능

Deep MultiBox

YOLO와 마찬가지로 RoI를 예측하기 위해 cnn을 사용
그러나 detection pipeline의 구조를 벗어나지 못함

OverFeat

효율적으로 sliding window를 적용하였으나 여전히 pipeline 구조 유지
localization을 최적화하는 method로 localizer는 오직 local information만 고려
global context를 볼 수 없어 post-processing이 필요

MultiGrasp

Grid 접근법이 유사
하나의 object를 포함하는 image의 single graspable region만 예측

4. Experiments

4.1. Comparison to Other Real-Time Systems

4.2. VOC 2007 Error Analysis

Error Analysis
- Correct: correct class and IOU > 0.5
- Localization: correct class and 0.1 < IOU < 0.5
- Similar: class is similar, 0.1 < IOU
- Other: class is wrong, IOU > 0.1
- Background: IOU < 0.1
YOLO의 경우 False Positive인 Background error가 작지만 localization 성능이 떨어진다

4.3. Combining Fast R-CNN and YOLO

Fast R-CNN의 localization 성능과 YOLO의 낮은 False Positive 성능을 얻기 위한 모델 결합
YOLO를 사용하여 Fast R-CNN의 background detection을 제거
두 알고리즘이 비슷한 bbox를 예측했다면 YOLO의 prediction probability 만큼 boost를 부여
성능이 크게 향상하였음
단지 ensemble의 효과가 아님을 증명하기 위해 다른 Fast R-CNN 모델과의 Combined 성능도 측정 → YOLO와 combine 하였을 때만 성능이 크게 향상

4.4. Generalizability: Person Detection in Artwork

5. Real-Time Detection In The Wild

YOLO는 Real-Time이 가능한 속도
http://pjreddie.com/yolo/

YOLO: Real-Time Object Detection

YOLO: Real-Time Object Detection You only look once (YOLO) is a state-of-the-art, real-time object detection system. On a Pascal Titan X it processes images at 30 FPS and has a mAP of 57.9% on COCO test-dev. Comparison to Other Detectors YOLOv3 is extremel

pjreddie.com

6. Conclusion

Unified object detection model: YOLO를 제시
구조가 간단하며 full images에 바로 학습 가능
classifier-based 접근법과 다르게 detection performance와 직접적인 관련이 있는 loss를 사용하여 전체 모델을 한번에 학습
Fast YOLO는 SOTA object detector
다른 도메인에 일반화 성능이 뛰어남

정리

object detection의 전체 pipeline을 통합한 one-stage detection method YOLO를 제시
기존 method에 비해 성능은 크게 떨어지지 않으며 Real-Time detection이 가능한 매우 빠른 속도
뛰어난 일반화 성능
demo와 source code를 함께 공개
Grid cell을 사용하여 object, grid cell이 1:1 mapping 관계이며 한 grid cell에는 오직 하나의 class의 object가 존재한다는 제약이 속도를 높여주나 localization 성능을 감소시켰을 것
한 번에 classification과 bbox regression 학습을 성공, one-stage detection 관련 후속 논문이 매우 많았을 듯 함

'AI > Deep Learning' 카테고리의 다른 글

[Vision Transformer, ViT] An Image is Worth 16x16 Words: Transformers For Image Recognition At Scale (0)	2021.12.17
ArcFace: Additive Angular Margin Loss for Deep Face Recognition (0)	2021.12.13
Pixel-Adaptive Convolutional Neural Networks (0)	2021.11.16
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (0)	2021.11.06
Adaptive Convolutional Kernels (0)	2021.11.05

'AI/Deep Learning' Related Articles

Comments