Fast Methods for Deep Learning-Based Object Detection
R-CNN: Problems
● Training is a multi-stage pipeline.
○ R-CNN first finetunes a ConvNet on object proposals using log loss.
○ Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax
classifier learnt by fine-tuning.
○ In the third training stage, bounding-box regressors are learned.
● Training is expensive in space and time.
○ For SVM and bounding-box regressor training, features are extracted from each object proposal in
each image and written to disk.
○ With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the
VOC07 trainval set. These features require hundreds of gigabytes of storage.
● Object detection is slow.
○ At test-time, features are extracted from each object proposal in each test image.
○ Detection with VGG16 takes 47s / image (on a GPU).
Fast R-CNN
Training
● Only calculate features once.
● The RoI Pooling layer extracts a fixed-length feature vector for each proposal.
● Classify and regress bounding boxes with a multi-task loss for end-to-end
training.
Fast R-CNN
Fast R-CNN: ROI Pooling
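A minimal sketch of RoI max pooling in plain Python, to show how proposals of different sizes all end up as the same fixed-size output. It assumes a single-channel feature map stored as a list of rows and an RoI already given in feature-map cells. The name `roi_max_pool` and the floor/ceil bin-splitting rule are illustrative assumptions, not the reference implementation.

```python
import math

def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one RoI of a 2D feature map into a fixed out_h x out_w grid.

    roi = (x0, y0, x1, y1) in feature-map cells, start inclusive, end exclusive.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Floor for the bin start and ceil for the bin end, so every bin
        # is non-empty and every RoI cell falls into at least one bin.
        ys = y0 + math.floor(i * roi_h / out_h)
        ye = y0 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            xs = x0 + math.floor(j * roi_w / out_w)
            xe = x0 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled

# Toy 5x6 feature map whose value at (row, col) is row * 6 + col.
fmap = [[r * 6 + c for c in range(6)] for r in range(5)]
print(roi_max_pool(fmap, (0, 0, 6, 5)))  # RoI covering the whole map
print(roi_max_pool(fmap, (1, 1, 4, 3)))  # a smaller 3x2 RoI, same output size
```

Both calls return a 2x2 grid even though the RoIs differ in size, which is what lets the later fully connected layers accept any proposal.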
● Instead of SVM + bounding box regression:
○ SoftMax classifier output
○ Bounding box regression output
● Multi-task training:
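The multi-task objective can be sketched as a classification log loss plus a weighted box-regression loss, computed per RoI. This follows the general form described in the Fast R-CNN paper (softmax log loss plus smooth L1 on the box offsets). The function names, the `lam` weight, and the convention that class 0 is background are assumptions made for illustration.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def multi_task_loss(class_probs, true_class, pred_offsets, target_offsets, lam=1.0):
    """Classification log loss plus box-regression loss for one RoI.

    Class 0 is assumed to be background. Background RoIs have no box
    target, so the localization term is switched off for them.
    """
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target_offsets))
    return cls_loss + lam * loc_loss

# Foreground RoI: both terms contribute.
print(multi_task_loss([0.1, 0.7, 0.2], 1, [0.1, 0.2, 0.0, 2.0], [0.0, 0.0, 0.0, 0.0]))
# Background RoI: only the classification term is used.
print(multi_task_loss([0.5, 0.3, 0.2], 0, [0.0] * 4, [0.0] * 4))
```

Because both terms are differentiable and share the same backbone features, one backward pass updates the classifier, the regressor, and the convolutional layers together.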
Fast R-CNN
● Advantages
○ Training is single-stage, using a multi-task loss
○ Training can update all network layers
○ No disk storage is required for feature caching
○ More accurate: 66.9 mAP vs 66.0 mAP
○ Faster training time 9.5h vs 84h (x8.8)
○ Faster test time per image: 0.32s vs 47s (x146)
● Problem
○ The test times above do not include region proposal time.
○ Test time with region proposals: 2s vs 50s (x25)
● Solution
○ Make the CNN do region proposals too!
Fast R-CNN
● Faster R-CNN: Towards Real-Time Object Detection
with Region Proposal Networks (2015)
○ Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
● Insert a Region Proposal Network (RPN) after the
last convolutional layer.
● RPN trained to produce region proposals directly;
no need for external region proposals!
● After the RPN, use RoI Pooling followed by a
classifier and bbox regressor, just like Fast R-CNN.
Faster R-CNN
● Slide a small window on the already computed
feature map (FREE!).
● Build a small network for:
○ Classifying object or not-object, and
○ Regressing bbox locations
● Position of the sliding window provides
localization information with reference to the
image.
● Box regression provides finer localization
information with reference to this sliding
window
Faster R-CNN: RPN
● In the paper: Ugly pipeline
○ Use alternating optimization to train RPN, then Fast
R-CNN with RPN proposals, etc.
○ More complex than it has to be
● Since publication: Joint training!
○ One network, four losses
■ RPN classification (anchor good / bad)
■ RPN regression (anchor -> proposal)
■ Fast R-CNN classification (over classes)
■ Fast R-CNN regression (proposal -> box)
Faster R-CNN: Training
How Many Anchors Do We Need?
How Many Proposals Do We Need?
● Fast R-CNN used 2000 proposals from selective search.
● Faster R-CNN needs only 300 proposals from the RPN.
● RPN is better than selective search
○ Deep learning vs. classical computer vision
○ Optimized for this task
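A minimal sketch of how a small proposal set can be picked from scored RPN boxes: sort by objectness score, suppress near-duplicates with non-maximum suppression (NMS), and keep the top N. The IoU threshold of 0.7 and the function names are assumptions for illustration; `top_n=300` matches the number quoted above.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def top_proposals(boxes, scores, top_n=300, nms_thresh=0.7):
    """Return indices of at most top_n boxes after greedy NMS."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= nms_thresh for k in keep):
            keep.append(i)
            if len(keep) == top_n:
                break
    return keep

boxes = [(0, 0, 10, 10), (0, 1, 10, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(top_proposals(boxes, scores))           # second box is suppressed
print(top_proposals(boxes, scores, top_n=1))  # only the best box survives
```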
How Much Data Do We Need?
Also Read:
R-FCN: Object Detection via Region-based Fully
Convolutional Networks
https://arxiv.org/abs/1605.06409
Another Approach For
Speeding Up
Proposals
Just Don’t Do It
Just RPN From Faster R-CNN
● Much faster than Faster R-CNN!
● But RPN had only object/not object classifier.
Add Classification!
● What about accuracy?
● How well does it handle different object scales?
Add More Scales!
Add More classifiers
SSD: Single Shot MultiBox Detector
Why Does Stride Matter?
● Smaller stride means more scanned
windows.
● Handles close objects better.
○ Need to have enough default boxes to do
accurate matching in each.
● Handles small objects better.
○ Better IoU with objects.
○ More positive windows per object.
● Too small a stride is bad
○ Too many windows means too many false
positives to filter.
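A quick arithmetic sketch of the trade-off: halving the stride roughly quadruples the number of scanned windows. The 300-pixel input and 32-pixel window are illustrative assumptions, and `num_windows` is a hypothetical helper.

```python
def num_windows(image_size, window_size, stride):
    """Number of window positions along one axis, without padding."""
    if image_size < window_size:
        return 0
    return (image_size - window_size) // stride + 1

for stride in (32, 16, 8):
    per_axis = num_windows(300, 32, stride)
    # A square image has per_axis positions in each direction.
    print(stride, per_axis * per_axis)
```

More windows give small and closely spaced objects a better chance of a good match, but every extra window is also one more candidate that has to be scored and filtered.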
Improving Accuracy
● Object detection data is unbalanced
○ 1-30 positive examples (objects) per image.
○ 8,000-25,000 negative examples (background boxes) per image.
● Solution
○ Resample negatives to a fixed positive:negative ratio (1:3)
● Not all negatives are equal!
○ Some are harder than others
● Better Solution
○ Hard negative mining: resample worst-misclassified false positives at fixed ratio.
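Hard negative mining can be sketched as: keep all positives, rank the negatives by their current classification loss, and keep only the highest-loss ones up to the fixed ratio. The function name and the per-box `losses` input are assumptions for illustration.

```python
def mine_hard_negatives(losses, is_positive, neg_pos_ratio=3):
    """Return (positive indices, hardest negative indices).

    losses[i] is the current classification loss of box i;
    is_positive[i] says whether box i is matched to an object.
    """
    pos = [i for i, p in enumerate(is_positive) if p]
    neg = [i for i, p in enumerate(is_positive) if not p]
    # Hardest negatives: background boxes the classifier gets most wrong.
    neg.sort(key=lambda i: losses[i], reverse=True)
    num_neg = min(len(neg), neg_pos_ratio * len(pos))
    return pos, neg[:num_neg]

losses = [0.2, 2.0, 0.1, 1.5, 0.05, 0.9, 0.3]
is_positive = [True, False, False, False, False, False, False]
print(mine_hard_negatives(losses, is_positive))  # 1 positive, 3 hardest negatives
```

Only the selected indices contribute to the loss, so the easy background boxes no longer dominate the gradient.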
Improving Accuracy
● Not enough data?
● Solution: Data augmentation
○ Random horizontal flip
○ Random crop
○ Random color distortion
○ Random expansion
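For detection, each augmentation has to transform the ground-truth boxes along with the pixels. A minimal sketch for the horizontal flip, assuming the image is a list of pixel rows and boxes are (x0, y0, x1, y1) in pixels; the function names are hypothetical.

```python
import random

def hflip(image, boxes):
    """Flip an image (list of rows) and its boxes left to right."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    # x-coordinates mirror around the image width, so x0 and x1 swap roles.
    new_boxes = [(width - x1, y0, width - x0, y1) for x0, y0, x1, y1 in boxes]
    return flipped, new_boxes

def random_hflip(image, boxes, p=0.5, rng=random):
    """Apply hflip with probability p, otherwise return the inputs unchanged."""
    if rng.random() < p:
        return hflip(image, boxes)
    return image, boxes

image = [[1, 2, 3, 4]]
boxes = [(0, 0, 1, 1)]
print(hflip(image, boxes))
```

Random crop and random expansion need the same care: boxes must be shifted, clipped, or dropped to stay consistent with the new image.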
How Much Does It Help?
Also Read:
YOLO9000: Better, Faster, Stronger
https://arxiv.org/abs/1612.08242
Speed/accuracy factors in object detectors
● Algorithm: Faster R-CNN / SSD / R-FCN / YOLO / ...
● Backbone: VGG16 / ResNet / MobileNet / etc…
● Input size
● Many other hyperparameters...
Speed/accuracy trade-offs for modern convolutional object
detectors (Google)
Frameworks
● Caffe
○ Faster R-CNN: https://github.com/rbgirshick/py-faster-rcnn
○ SSD: https://github.com/weiliu89/caffe/tree/ssd
● TensorFlow Object Detection API:
○ https://github.com/tensorflow/models/tree/master/research/object_detection
● Detectron:
○ https://github.com/facebookresearch/Detectron
● Many more re-implementations in different languages...
Honorable mentions
● VGG16: https://arxiv.org/abs/1409.1556
● ResNet: https://arxiv.org/abs/1512.03385
● Inception-ResNet: https://arxiv.org/abs/1602.07261
● ResNeXt: https://arxiv.org/abs/1611.05431
● Xception: https://arxiv.org/abs/1610.02357
● DenseNet: https://arxiv.org/abs/1608.06993
● MobileNet: https://arxiv.org/abs/1704.04861
● SqueezeNet: https://arxiv.org/abs/1602.07360
