Advanced deep learning based object detection methods

Advanced Deep Learning based Object
Detection Methods

Improving Object Detection With One Line of Code
● Non-Maximum Suppression is a greedy
process.
○ It worked well enough in 2007 but it doesn’t
anymore.
● High scoring detections can be suppressed
just as low scoring detections.
○ Overlap with stronger detection is the only
criteria.
● Should one detection completely suppress
another detection, or simply reduce its
confidence?

● NMS:
● Linear Soft-NMS:
● Gaussian Soft-NMS:
○ Linear Soft-NMS is not continuous in terms of
overlap and a sudden penalty is applied when a
NMS threshold is reached.
○ Instead we can use a continuous function:

Learning Non-Maximum Suppression
● Object detectors are mostly trained
end-to-end, except for the NMS.
○ NMS is still fully hand-crafted, and forces a
trade-off between recall and precision.
● Training loss is not evaluation loss.
○ Training is performed without NMS
○ During evaluation, multiple detections for same
object count as false positives.
● Instead, train the network to include the
suppression process.
○ Only output one bounding box per object.
○ Learn how to handle close objects.

● Additional blocks that:
○ Encode pairwise information.
○ For each detection, pool information from all
pairings.
○ Update feature vector.
○ Repeat.
● New loss:
○ Only one positive candidate per object.
○ Instead of the current practice to take all
objects with IoU>50%

● Multi-scale object detection using image pyramid
○ Predict different scales by applying same model at different image resolutions.
● Classic method.
● But also, in OverFeat.
● Slow. Requires multiple evaluation of the same model.
Multi-Scale Object Detection

● Predict multiple scale of objects using a single feature map.
● Same as Faster R-CNN.
● Fast
● Single model (same in training as in testing).
● Bad features resolution for small objects.

● Predict different object sizes at different feature scales.
● Same as SSD.
● Good features resolution for small objects
● But features are much weaker than in deeper layers.

● Single model (same in training as in testing).
● Good features resolution for small objects.
● Strong features in all layers.
● Almost no overhead over SSD (= Fast).
Feature Pyramid Network (FPN)

Feature Pyramid Network (FPN)
● How important is top-down enrichment?
● How important are lateral connections?
● How important are pyramid representations?

Focal Loss for Dense Object Detection
● Can we train a single stage detector to be as accurate as two stage detectors?
● Contributions:
○ RetinaNet: Single stage object detector based on FPN backbone.
○ New loss.

● Class unbalance is an important issue for object detection.
● Previous solutions:
○ Random resampling at 1:3 ratio.
○ Hard negative resampling at 1:3 ratio.
● Both solutions means that at each step, we only a few samples actually matters
to the loss function.
● Instead, include all samples but use different weight for each class.
○ Regular cross entropy:
○ Weighted cross entropy:

● Using weight CE as baseline:
○ Can we do better?
○ Can we use different weight for each sample?
● Focal loss:
● Every sample is weighted according to its error.
○ We want to focus on samples which are
mislabeled.

● Different parameters for RetinaNet

● Comparison with online hard negative mining

● Accuracy/speed trade-offs

● Benchmark results

Also Read:
Deformable Convolutional Networks
https://arxiv.org/abs/1703.06211

YouTube Videos
● CS231n
○ Lecture 11 - Detection and segmentation https://youtu.be/nDPWywWRIRo
● Deep Learning for Objects and Scenes (CVPR 2017 Workshop)
○ Lecture 1: Learning Deep Representations for Visual Recognition, by Kaiming He
https://youtu.be/jHv37mKAhV4
○ Lecture 2: Deep Learning for Instance-level Object Understanding, by Ross Girshick
https://youtu.be/jHv37mKAhV4?t=39m4s

Looking for brilliant researchers
cv@brodmann17.com /
amir@brodmann17.com

Computer Vision Tasks
Source: CS231n Object detection http://cs231n.stanford.edu/slides/2016/winter1516_lecture8.pdf

Mask R-CNN
● Instance segmentation with pose
estimation for people.
● Extends faster R-CNN by adding new
branch for the instance mask task.
● Pose estimation can be added by simply
adding an additional branch.
● SOTA accuracy on detection, segmentation
and pose estimation at 5 FPS on GPU.
● https://arxiv.org/abs/1703.06870
● Girshick won young researcher award.

Mask R-CNN
● RoiPool
○ Quantization breaks pixel-to-pixel alignment
○ Too coarse and not good for fine spatial
information required for mask.
● RoiAlign
○ Bilinearly sample the proposal region and avoid
the quantization.
○ Smoothly normalize features and predictions
into coordinate frame free of scale and aspect
ratio

Mask R-CNN
● Backbone architecture
○ ResNet
○ ResNeXt
○ FPN
● Mask representation
○ FC vs. Convolutional
○ Multinomial vs. Independent Masks: softmax
vs. sigmoid
○ Class-Specific vs. Class-Agnostic Masks:
almost same accuracy
● Multi-task learning
○ Mask task improves object detection accuracy.
○ Keypoint task reduces object detection
accuracy.

Mask R-CNN
● Pose estimation
○ Simply add an additional branch.
○ Model a keypoint’s location as a one-hot mask,
and adopt Mask R-CNN to predict K masks.
○ Experiments are mainly to demonstrate the
generality of the Mask R-CNN framework.
○ RoiAlign improves this task’s accuracy as well.

Looking for brilliant researchers
cv@brodmann17.com

Advanced deep learning based object detection methods

More Related Content

What's hot

Similar to Advanced deep learning based object detection methods

More from Brodmann17

Recently uploaded

In this document

Advanced deep learning based object detection methods