Fast methods for deep learning based object detection

R-CNN: Problems
● Training is a multi-stage pipeline.
○ R-CNN first finetunes a ConvNet on object proposals using log loss.
○ Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax
classifier learnt by fine-tuning.
○ In the third training stage, bounding-box regressors are learned.
● Training is expensive in space and time.
○ For SVM and bounding-box regressor training, features are extracted from each object proposal in
each image and written to disk.
○ With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the
VOC07 trainval set. These features require hundreds of gigabytes of storage.
● Object detection is slow.
○ At test-time, features are extracted from each object proposal in each test image.
○ Detection with VGG16 takes 47s / image (on a GPU).

Fast R-CNN
● Compute the convolutional features only once per image.
● An RoI pooling layer extracts a fixed-length feature vector for each proposal.
● Classify and regress bounding boxes with a multi-task loss for end-to-end
training.
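
To make the fixed-length idea concrete, here is a minimal NumPy sketch of RoI max pooling on a single-channel feature map. The function name `roi_max_pool`, the 2x2 output grid, and the integer bin-splitting rule are assumptions made for illustration; real implementations differ in how they quantize bin edges and they operate on all channels at once.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one RoI of a (H, W) feature map into an out_size x out_size grid.

    roi is (x0, y0, x1, y1) in feature-map cells, end-exclusive (an assumption).
    """
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    # Bin edges split the region as evenly as integer indices allow.
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            # Guard against empty bins when the RoI is smaller than the grid.
            y_hi = max(ys[i + 1], ys[i] + 1)
            x_hi = max(xs[j + 1], xs[j] + 1)
            out[i, j] = region[ys[i]:y_hi, xs[j]:x_hi].max()
    return out

fmap = np.arange(36).reshape(6, 6)          # toy feature map
small = roi_max_pool(fmap, (0, 0, 4, 4))    # 4x4 proposal
large = roi_max_pool(fmap, (1, 0, 6, 6))    # 6x5 proposal
print(small.shape, large.shape)             # both proposals give the same shape
```

Proposals of different sizes come out with the same shape, which is what lets a single fully connected head process every proposal.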

Fast R-CNN
● Instead of SVMs + bounding-box regression:
○ Softmax classifier output
○ Bounding-box regression output
● Multi-task training: both outputs are trained jointly under a single loss.
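
A rough sketch of a multi-task loss for one proposal: log loss on the softmax output plus a smooth L1 box loss that is applied only to non-background proposals. The names `smooth_l1` and `multitask_loss`, the weight `lam`, and the convention that class 0 is background are assumptions for this example. `box_pred` stands for the box output of the true class.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def multitask_loss(class_scores, true_class, box_pred, box_target, lam=1.0):
    """Classification log loss plus a box loss for non-background proposals.

    Class 0 is assumed to be background, which has no box target.
    """
    shifted = class_scores - class_scores.max()           # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())   # log-softmax
    cls_loss = -log_probs[true_class]
    loc_loss = smooth_l1(box_pred - box_target).sum() if true_class > 0 else 0.0
    return float(cls_loss + lam * loc_loss)

scores = np.zeros(3)                         # uniform scores over 3 classes
pred = np.array([0.5, 0.0, 0.0, 2.0])        # hypothetical box outputs
target = np.zeros(4)
object_loss = multitask_loss(scores, 1, pred, target)
background_loss = multitask_loss(scores, 0, pred, target)
print(object_loss, background_loss)
```

Because both terms share one scalar, a single backward pass can update the classifier, the box regressor, and the shared convolutional layers together.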

Fast R-CNN
● Advantages
○ Training is single-stage, using a multi-task loss.
○ Training can update all network layers.
○ No disk storage is required for feature caching.
○ More accurate: 66.9% mAP vs. 66.0% mAP for R-CNN.
○ Faster training: 9.5h vs. 84h (x8.8).
○ Faster test time per image: 0.32s vs. 47s (x146).
● Problem
○ These test times do not include region proposal time.
○ Test time with region proposals: 2s vs. 50s (x25).
● Solution
○ Make the CNN do region proposals too!
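
The timing figures quoted for Fast R-CNN show why proposals become the bottleneck. The short calculation below uses only the per-image times from the slide; the variable names are made up for the example.

```python
# Per-image test times quoted on the slide (seconds).
fast_rcnn_cnn_only = 0.32
fast_rcnn_with_proposals = 2.0
rcnn_with_proposals = 50.0

# Time spent outside the network, i.e. on external region proposals.
proposal_time = fast_rcnn_with_proposals - fast_rcnn_cnn_only
proposal_share = proposal_time / fast_rcnn_with_proposals

# End-to-end speedup once proposal time is counted.
speedup_with_proposals = rcnn_with_proposals / fast_rcnn_with_proposals

print(f"proposal time: {proposal_time:.2f}s ({proposal_share:.0%} of the total)")
print(f"end-to-end speedup: x{speedup_with_proposals:.0f}")
```

With roughly 1.68s of the 2s going to proposals, speeding up the network further barely helps, which motivates moving proposal generation into the CNN.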

Faster R-CNN
● Faster R-CNN: Towards Real-Time Object Detection
with Region Proposal Networks (2015)
○ Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
● Insert a Region Proposal Network (RPN) after the
last convolutional layer.
● The RPN is trained to produce region proposals directly;
no need for external region proposals!
● After the RPN, use RoI pooling and a downstream
classifier and bbox regressor, just like Fast R-CNN.

Faster R-CNN: RPN
● Slide a small window over the already computed
feature map (FREE!).
● Build a small network for:
○ Classifying object vs. not-object, and
○ Regressing bbox locations.
● The position of the sliding window provides
localization information with reference to the
image.
● Box regression provides finer localization
information with reference to this sliding
window.
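
The two localization steps on the RPN slide can be sketched as follows: the window position gives a coarse reference box in image coordinates, and the regression outputs refine it. The stride of 16, the 128-pixel reference size, the function names, and the (dx, dy, dw, dh) parameterization are assumptions chosen for illustration, not values taken from the slides.

```python
import numpy as np

def window_to_image_box(row, col, stride=16, size=128):
    """Reference box (cx, cy, w, h) in image pixels for a feature-map cell."""
    cx = (col + 0.5) * stride
    cy = (row + 0.5) * stride
    return np.array([cx, cy, size, size], dtype=float)

def apply_deltas(ref_box, deltas):
    """Refine a reference box with regression outputs (dx, dy, dw, dh).

    Offsets are relative to the reference size; scales are in log space.
    """
    cx, cy, w, h = ref_box
    dx, dy, dw, dh = deltas
    return np.array([cx + dx * w, cy + dy * h, w * np.exp(dw), h * np.exp(dh)])

ref = window_to_image_box(row=2, col=3)                # coarse location
refined = apply_deltas(ref, (0.25, 0.0, 0.0, np.log(2.0)))
print(ref, refined)
```

Zero deltas leave the reference box unchanged, so the regressor only has to learn a small correction relative to each window.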

Faster R-CNN: Training
● In the paper: ugly pipeline
○ Use alternating optimization to train the RPN, then Fast
R-CNN with RPN proposals, etc.
○ More complex than it has to be.
● Since publication: joint training!
○ One network, four losses:
■ RPN classification (anchor good / bad)
■ RPN regression (anchor -> proposal)
■ Fast R-CNN classification (over classes)
■ Fast R-CNN regression (proposal -> box)

How Many Proposals Do We Need?
● Fast R-CNN used 2000 proposals from selective search.
● Faster R-CNN needs only 300 proposals from the RPN.
● RPN is better than selective search
○ Deep learning vs. classical computer vision
○ Optimized for this task
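
One plausible way to cut scored RPN boxes down to a fixed budget such as 300 is greedy non-maximum suppression followed by a cap on the number of survivors. The slides do not describe this step, so treat the sketch below as an assumption; the 0.7 IoU threshold and the function names are also illustrative.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, as (x0, y0, x1, y1)."""
    x0 = np.maximum(box[0], boxes[:, 0])
    y0 = np.maximum(box[1], boxes[:, 1])
    x1 = np.minimum(box[2], boxes[:, 2])
    y1 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x1 - x0, 0, None) * np.clip(y1 - y0, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def select_proposals(boxes, scores, top_n=300, iou_thresh=0.7):
    """Greedy NMS over scored boxes, keeping at most top_n survivors."""
    order = np.argsort(-scores)              # highest score first
    keep = []
    while order.size > 0 and len(keep) < top_n:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Drop boxes that overlap the kept box too much.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [0, 1, 10, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = select_proposals(boxes, scores)
print(kept)
```

The second box overlaps the first heavily and is suppressed, so only two of the three candidates reach the detection head.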

Also Read:
R-FCN: Object Detection via Region-based Fully
Convolutional Networks
https://arxiv.org/abs/1605.06409

Another Approach For
Speeding Up
Proposals
Just Don’t Do It

Just RPN From Faster R-CNN
● Much faster than Faster R-CNN!
● But the RPN has only an object / not-object classifier.

Add Classification!
● What about accuracy?
● How well does it handle different object scales?

SSD: Single Shot MultiBox Detector

Why Does Stride Matter?
● Smaller stride means more scanned
windows.
● Handles close objects better.
○ Need to have enough default boxes to do
accurate matching in each.
● Handles small objects better.
○ Better IoU with objects.
○ More positive windows per object.
● Too small a stride is bad
○ Too many windows means too many false
positives to filter.
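
A quick count shows how fast the number of scanned windows grows as the stride shrinks. The 300-pixel square input and the helper name `num_windows` are assumptions for illustration, and the count ignores window size and padding.

```python
import math

def num_windows(image_size, stride):
    """Window positions when a window slides over a square image."""
    per_side = math.ceil(image_size / stride)
    return per_side * per_side

counts = {stride: num_windows(300, stride) for stride in (8, 16, 32)}
for stride, count in counts.items():
    print(f"stride {stride}: {count} windows")
```

Halving the stride roughly quadruples the number of windows, and each window carries several default boxes, so the number of candidates to filter grows just as quickly.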

Improving Accuracy
● Object detection data is unbalanced
○ 1-30 positive (object) boxes per image.
○ 8,000 - 25,000 negative (background) boxes per image.
● Solution
○ Resample at a fixed positive:negative ratio (1:3)
● Not all negatives are equal!
○ Some are harder than others
● Better Solution
○ Hard negative mining: resample the worst-misclassified negatives (the highest-scoring false positives) at a fixed ratio.
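
Hard negative mining can be sketched as a mask over candidate boxes: keep every positive, then keep only the negatives with the highest loss, up to three per positive. The function name and the use of per-box loss as the hardness measure are assumptions for this example.

```python
import numpy as np

def hard_negative_mask(losses, is_positive, neg_per_pos=3):
    """Select all positives plus the highest-loss negatives at a fixed ratio."""
    num_pos = int(is_positive.sum())
    num_neg = min(num_pos * neg_per_pos, int((~is_positive).sum()))
    neg_losses = np.where(is_positive, -np.inf, losses)   # ignore positives
    hardest = np.argsort(-neg_losses)[:num_neg]           # highest loss first
    mask = is_positive.copy()
    mask[hardest] = True
    return mask

losses = np.array([0.2, 0.1, 0.9, 0.05, 0.7, 0.3, 0.6])
is_positive = np.array([True, False, False, False, False, False, False])
mask = hard_negative_mask(losses, is_positive)
print(mask)
```

Only the boxes selected by the mask contribute to the training loss, so easy background boxes no longer dominate the gradient.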

Improving Accuracy
● Not enough data?
● Solution: Data augmentation
○ Random horizontal flip
○ Random crop
○ Random color distortion
○ Random expansion
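
In detection, geometric augmentations must transform the boxes along with the pixels. Below is a minimal sketch of a random horizontal flip; the function name, the (x0, y0, x1, y1) pixel box format, and the explicit `rng` argument are assumptions made for illustration.

```python
import numpy as np

def random_horizontal_flip(image, boxes, rng, p=0.5):
    """Flip an (H, W, C) image left-right with probability p, moving boxes too.

    boxes is an (N, 4) array of (x0, y0, x1, y1) in pixels.
    """
    if rng.random() >= p:
        return image, boxes
    width = image.shape[1]
    flipped = boxes.copy()
    # The old right edge becomes the new left edge, and vice versa.
    flipped[:, 0] = width - boxes[:, 2]
    flipped[:, 2] = width - boxes[:, 0]
    return image[:, ::-1], flipped

rng = np.random.default_rng(0)
image = np.zeros((4, 10, 3))
image[0, 0, 0] = 1.0                         # marker pixel in the left column
boxes = np.array([[1.0, 0.0, 4.0, 2.0]])
flipped_image, flipped_boxes = random_horizontal_flip(image, boxes, rng, p=1.0)
print(flipped_boxes)
```

The same pattern applies to random crops and expansions: whatever offset or scale is applied to the image has to be applied to the box coordinates as well.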

Also Read:
YOLO9000: Better, Faster, Stronger
https://arxiv.org/abs/1612.08242

Speed/accuracy factors in object detectors
● Algorithm: Faster R-CNN / SSD / R-FCN / YOLO / ...
● Backbone: VGG16 / ResNet / MobileNet / etc…
● Input size
● Many other hyperparameters...

Speed/accuracy trade-offs for modern convolutional object
detectors (Google)

Frameworks
● Caffe
○ Faster R-CNN: https://github.com/rbgirshick/py-faster-rcnn
○ SSD: https://github.com/weiliu89/caffe/tree/ssd
● TensorFlow Object Detection API:
○ https://github.com/tensorflow/models/tree/master/research/object_detection
● Detectron:
○ https://github.com/facebookresearch/Detectron
● Many more re-implementations in different languages...

Honorable mentions
● VGG16: https://arxiv.org/abs/1409.1556
● ResNet: https://arxiv.org/abs/1512.03385
● Inception-ResNet: https://arxiv.org/abs/1602.07261
● ResNeXt: https://arxiv.org/abs/1611.05431
● Xception: https://arxiv.org/abs/1610.02357
● DenseNet: https://arxiv.org/abs/1608.06993
● MobileNet: https://arxiv.org/abs/1704.04861
● SqueezeNet: https://arxiv.org/abs/1602.07360

Looking for brilliant researchers
cv@brodmann17.com
