Advanced Deep Learning-Based Object Detection Methods
Improving Object Detection With One Line of Code
● Non-Maximum Suppression is a greedy
process.
○ It worked well enough in 2007 but it doesn’t
anymore.
● High-scoring detections can be suppressed
just like low-scoring detections.
○ Overlap with a stronger detection is the only
criterion.
● Should one detection completely suppress
another detection, or simply reduce its
confidence?
Improving Object Detection With One Line of Code
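● NMS: s_i ← s_i if IoU(M, b_i) < N_t, and s_i ← 0 otherwise.
○ M is the current highest-scoring detection, b_i a remaining box with score s_i, and N_t the NMS threshold.
● Linear Soft-NMS: s_i ← s_i if IoU(M, b_i) < N_t, and s_i ← s_i · (1 − IoU(M, b_i)) otherwise.
● Gaussian Soft-NMS:
○ Linear Soft-NMS is not continuous in terms of
overlap: a sudden penalty is applied once the
NMS threshold is reached.
○ Instead, we can use a continuous function: s_i ← s_i · exp(−IoU(M, b_i)² / σ).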
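The three rescoring rules can be compared side by side in a minimal NumPy sketch. It assumes boxes in [x1, y1, x2, y2] format. The function names, the default thresholds and the final score cutoff are illustrative choices, not the authors' code. The only line that differs between the variants is the one that computes `weight`.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def soft_nms(boxes, scores, method="gaussian", nms_thresh=0.3, sigma=0.5,
             score_thresh=0.001):
    """Return (kept indices, rescored scores) in order of selection."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.array(scores, dtype=float)  # copy, scores are decayed in place
    remaining = np.arange(len(scores))
    keep, kept_scores = [], []
    while remaining.size > 0:
        # Greedy step: pick the highest-scoring remaining detection M.
        top = remaining[np.argmax(scores[remaining])]
        keep.append(int(top))
        kept_scores.append(float(scores[top]))
        remaining = remaining[remaining != top]
        if remaining.size == 0:
            break
        overlap = iou_one_to_many(boxes[top], boxes[remaining])
        if method == "hard":        # classic NMS: zero out overlapping boxes
            weight = np.where(overlap >= nms_thresh, 0.0, 1.0)
        elif method == "linear":    # decay linearly, but only above the threshold
            weight = np.where(overlap >= nms_thresh, 1.0 - overlap, 1.0)
        elif method == "gaussian":  # continuous decay, no threshold needed
            weight = np.exp(-(overlap ** 2) / sigma)
        else:
            raise ValueError(f"unknown method: {method}")
        scores[remaining] *= weight
        # Drop boxes whose score has decayed to (almost) nothing.
        remaining = remaining[scores[remaining] > score_thresh]
    return keep, kept_scores
```

With two heavily overlapping boxes, `method="hard"` discards the weaker one, while the two soft variants keep it with a reduced score.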
Learning Non-Maximum Suppression
● Object detectors are mostly trained
end-to-end, except for the NMS.
○ NMS is still fully hand-crafted, and forces a
trade-off between recall and precision.
● Training loss is not evaluation loss.
○ Training is performed without NMS.
○ During evaluation, multiple detections for the same
object count as false positives.
● Instead, train the network to include the
suppression process.
○ Only output one bounding box per object.
○ Learn how to handle close objects.
Learning Non-Maximum Suppression
● Additional blocks that:
○ Encode pairwise information.
○ For each detection, pool information from all
pairings.
○ Update feature vector.
○ Repeat.
● New loss:
○ Only one positive candidate per object.
○ This replaces the current practice of treating all
detections with IoU > 50% as positives.
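The "one positive per object" target can be illustrated with a small label-assignment sketch. It assumes a greedy matching in descending score order, similar to how detection benchmarks match detections to ground truth. The function names and the 0.5 threshold default are hypothetical and not taken from the paper's code.

```python
import numpy as np


def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union


def one_positive_per_object(det_boxes, det_scores, gt_boxes, iou_thresh=0.5):
    """Label each detection 1 (positive) or 0, with at most one positive per object."""
    labels = np.zeros(len(det_boxes), dtype=int)
    gt_taken = [False] * len(gt_boxes)
    # Visit detections from the most to the least confident.
    for d in np.argsort(-np.asarray(det_scores, dtype=float)):
        best_iou, best_gt = 0.0, -1
        for g, gt in enumerate(gt_boxes):
            if gt_taken[g]:
                continue  # this object already has its positive
            iou = box_iou(det_boxes[d], gt)
            if iou > best_iou:
                best_iou, best_gt = iou, g
        if best_gt >= 0 and best_iou >= iou_thresh:
            labels[d] = 1
            gt_taken[best_gt] = True
    return labels
```

Under this assignment a second, slightly weaker detection of an already matched object is labeled negative, which is the behavior the network is asked to learn in place of NMS.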
Multi-Scale Object Detection
● Multi-scale object detection using an image pyramid
○ Predict different scales by applying the same model at different image resolutions.
● Classic method, also used in OverFeat.
● Slow: requires multiple evaluations of the same model.
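A rough sketch of the image-pyramid approach follows. It assumes a hypothetical `detector` callable that returns rows of [x1, y1, x2, y2, score] in the coordinates of the image it is given. The nearest-neighbour resize and the scale set are illustrative simplifications.

```python
import numpy as np


def resize_nearest(image, scale):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    h, w = image.shape[:2]
    new_h = max(1, int(round(h * scale)))
    new_w = max(1, int(round(w * scale)))
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return image[rows][:, cols]


def detect_with_image_pyramid(image, detector, scales=(0.5, 1.0, 2.0)):
    """Run the same detector once per scale and map boxes back to the input frame."""
    all_dets = []
    for scale in scales:
        resized = resize_nearest(image, scale)
        # One full forward pass per pyramid level: this is why the method is slow.
        dets = np.array(detector(resized), dtype=float).reshape(-1, 5)
        dets[:, :4] /= scale  # back to original image coordinates
        all_dets.append(dets)
    return np.concatenate(all_dets, axis=0)
```

The detections from all levels would normally be merged with NMS afterwards. That step is omitted here.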
Multi-Scale Object Detection
● Predict objects at multiple scales using a single feature map.
● Same as Faster R-CNN.
● Fast.
● Single model (same in training as in testing).
● Poor feature resolution for small objects.
Multi-Scale Object Detection
● Predict different object sizes at different feature scales.
● Same as SSD.
● Good feature resolution for small objects.
● But the high-resolution features are much weaker than those in deeper layers.
Feature Pyramid Network (FPN)
● Single model (same in training as in testing).
● Good feature resolution for small objects.
● Strong features in all layers.
● Almost no overhead over SSD (= fast).
Feature Pyramid Network (FPN)
● How important is top-down enrichment?
● How important are lateral connections?
● How important are pyramid representations?
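The components these questions refer to (the top-down pathway and the lateral connections) can be sketched in NumPy. This is a simplified illustration under stated assumptions: the 1x1 lateral convolution is written as a per-pixel linear map with caller-supplied weights, upsampling is nearest-neighbour by a factor of 2, and the 3x3 smoothing convolution applied to each merged map in FPN is left out. Function names are hypothetical.

```python
import numpy as np


def lateral_1x1(feature, weight):
    """1x1 convolution: a per-pixel linear map over channels.

    feature: (C_in, H, W), weight: (C_out, C_in).
    """
    return np.einsum("oc,chw->ohw", weight, feature)


def upsample2x(feature):
    """Nearest-neighbour upsampling by a factor of 2 in H and W."""
    return feature.repeat(2, axis=1).repeat(2, axis=2)


def fpn_top_down(bottom_up, lateral_weights):
    """Merge backbone maps into a pyramid with a shared channel count.

    bottom_up: list of (C_i, H_i, W_i) maps, finest first, each level
    half the resolution of the previous one.
    Returns the pyramid maps, finest first.
    """
    # Start from the coarsest, semantically strongest map.
    merged = lateral_1x1(bottom_up[-1], lateral_weights[-1])
    pyramid = [merged]
    # Walk towards finer levels: upsample (top-down) and add the lateral map.
    for feat, weight in zip(bottom_up[-2::-1], lateral_weights[-2::-1]):
        merged = upsample2x(merged) + lateral_1x1(feat, weight)
        pyramid.append(merged)
    return pyramid[::-1]
```

Dropping the `upsample2x(merged)` term corresponds to removing top-down enrichment, and dropping the `lateral_1x1(feat, weight)` term corresponds to removing lateral connections.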
Focal Loss for Dense Object Detection
● Can we train a single-stage detector to be as accurate as two-stage detectors?
● Contributions:
○ RetinaNet: a single-stage object detector based on an FPN backbone.
○ A new loss: focal loss.
Focal Loss for Dense Object Detection
● Class imbalance is an important issue for object detection.
● Previous solutions:
○ Random resampling at a 1:3 ratio.
○ Hard negative resampling at a 1:3 ratio.
● Both solutions mean that, at each step, only a few samples actually matter
to the loss function.
● Instead, include all samples but use a different weight for each class.
○ Regular cross entropy: CE(p_t) = −log(p_t), where p_t = p if y = 1 and p_t = 1 − p otherwise.
○ Weighted cross entropy: CE(p_t) = −α_t log(p_t), where α_t is the weight of the sample's class.
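Focal Loss for Dense Object Detection
● Using weighted CE as the baseline:
○ Can we do better?
○ Can we use a different weight for each sample?
● Focal loss: FL(p_t) = −α_t (1 − p_t)^γ log(p_t), where p_t is the predicted probability of the true class, α_t a class weight, and γ ≥ 0 a focusing parameter.
● Every sample is weighted according to its error.
○ We want to focus on samples that are
misclassified.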
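The three losses can be compared numerically with a short sketch. It assumes the binary formulation, where p is the predicted foreground probability, p_t = p for positives and 1 − p for negatives, and α_t is α for positives and 1 − α for negatives. The function name is hypothetical, and the defaults α = 0.25, γ = 2 follow the values reported as working best in the focal loss paper.

```python
import numpy as np


def binary_losses(p, y, alpha=0.25, gamma=2.0):
    """Per-sample cross entropy, class-weighted cross entropy and focal loss.

    p: predicted probability of the positive class, y: labels in {0, 1}.
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # per-class weight
    ce = -np.log(p_t)
    weighted_ce = alpha_t * ce
    # The modulating factor (1 - p_t)^gamma shrinks the loss of easy samples.
    focal = alpha_t * (1.0 - p_t) ** gamma * ce
    return ce, weighted_ce, focal
```

For a positive sample predicted at p = 0.9 the focal loss is 1% of the weighted cross entropy, while at p = 0.1 it is still 81% of it, so well-classified samples contribute far less. With γ = 0 the focal loss reduces to the weighted cross entropy.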
Focal Loss for Dense Object Detection
● Different parameters for RetinaNet
● Comparison with online hard example mining (OHEM)
● Accuracy/speed trade-offs
● Benchmark results
Also Read:
Deformable Convolutional Networks
https://arxiv.org/abs/1703.06211
YouTube Videos
● CS231n
○ Lecture 11 - Detection and segmentation https://youtu.be/nDPWywWRIRo
● Deep Learning for Objects and Scenes (CVPR 2017 Workshop)
○ Lecture 1: Learning Deep Representations for Visual Recognition, by Kaiming He
https://youtu.be/jHv37mKAhV4
○ Lecture 2: Deep Learning for Instance-level Object Understanding, by Ross Girshick
https://youtu.be/jHv37mKAhV4?t=39m4s
Looking for brilliant researchers
cv@brodmann17.com /
amir@brodmann17.com
Computer Vision Tasks
Source: CS231n Object detection http://cs231n.stanford.edu/slides/2016/winter1516_lecture8.pdf
Mask R-CNN
● Instance segmentation with pose
estimation for people.
● Extends Faster R-CNN by adding a new
branch for the instance mask task.
● Pose estimation can be added by simply
adding another branch.
● SOTA accuracy on detection, segmentation
and pose estimation at 5 FPS on GPU.
● https://arxiv.org/abs/1703.06870
● Girshick won a young researcher award.
Mask R-CNN
● RoiPool
○ Quantization breaks pixel-to-pixel alignment.
○ Too coarse for the fine spatial
information that masks require.
● RoiAlign
○ Bilinearly sample the proposal region and avoid
the quantization.
○ Smoothly normalize features and predictions
into a coordinate frame free of scale and aspect
ratio.
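The difference from RoiPool can be made concrete with a small single-channel sketch of RoiAlign. It assumes integer coordinates sit at pixel centers, a 2x2 grid of sample points per output bin and average pooling over them. These conventions and the function names are illustrative assumptions rather than the reference implementation.

```python
import numpy as np


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at a real-valued (y, x)."""
    h, w = feature.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1 - dy) * top + dy * bottom


def roi_align(feature, roi, out_size=2, samples=2):
    """Pool a RoI into an out_size x out_size grid without any rounding.

    roi = (y1, x1, y2, x2) in feature-map coordinates, kept as floats.
    """
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            vals = []
            for a in range(samples):
                for b in range(samples):
                    # Regularly spaced sample points inside bin (i, j).
                    sy = y1 + (i + (a + 0.5) / samples) * bin_h
                    sx = x1 + (j + (b + 0.5) / samples) * bin_w
                    vals.append(bilinear_sample(feature, sy, sx))
            out[i, j] = np.mean(vals)
    return out
```

RoiPool would instead round the RoI corners and bin edges to integers before pooling, which shifts the pooled features relative to the input pixels.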
Mask R-CNN
● Backbone architecture
○ ResNet
○ ResNeXt
○ FPN
● Mask representation
○ FC vs. Convolutional
○ Multinomial vs. Independent Masks: softmax
vs. sigmoid
○ Class-Specific vs. Class-Agnostic Masks:
almost the same accuracy
● Multi-task learning
○ Mask task improves object detection accuracy.
○ Keypoint task reduces object detection
accuracy.
Mask R-CNN
● Pose estimation
○ Simply add an additional branch.
○ Model a keypoint’s location as a one-hot mask,
and adopt Mask R-CNN to predict K masks.
○ Experiments are mainly to demonstrate the
generality of the Mask R-CNN framework.
○ RoiAlign improves this task’s accuracy as well.
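The one-hot keypoint encoding can be sketched as follows. The sketch assumes keypoints are already expressed in integer pixel coordinates of the m x m mask, and that the loss is a cross entropy over a softmax across all m² locations of each mask. Function names are hypothetical.

```python
import numpy as np


def keypoint_targets(keypoints_xy, m):
    """Turn K (x, y) keypoints into K one-hot m x m masks."""
    targets = np.zeros((len(keypoints_xy), m, m))
    for i, (x, y) in enumerate(keypoints_xy):
        targets[i, int(y), int(x)] = 1.0  # a single foreground pixel per mask
    return targets


def keypoint_loss(logits, targets):
    """Mean cross entropy over an m*m-way softmax, one softmax per keypoint.

    logits, targets: arrays of shape (K, m, m).
    """
    k = logits.shape[0]
    flat = logits.reshape(k, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    log_prob = flat - np.log(np.exp(flat).sum(axis=1, keepdims=True))
    return float(-(targets.reshape(k, -1) * log_prob).sum() / k)
```

The spatial softmax makes the K masks compete over locations rather than classes, so each mask is pushed to select exactly one pixel.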
