Object Detection with SSD and MobileNet

Aditya Kunar
24 min read · Jul 6, 2020

1. Introduction

Object detection is one of the most prominent fields of research in computer vision today. It extends image classification: the goal is to identify one or more classes of objects in an image and to localize each object with a bounding box, as shown in figure 1. Object detection therefore plays a vital role in many real-world applications, such as image retrieval and video surveillance, to name just a few. With this in mind, the main aim of our project is to investigate the inner workings of the “Single Shot MultiBox Detector” (SSD) framework for object detection [1]. Our objective is to highlight some of the salient features that make this technique stand out and to address a few of its shortcomings, as discussed in more detail in the rest of this post.

Figure 1: Example of object detection
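To make the idea of "classes plus bounding boxes" concrete, here is a minimal Python sketch of what a detector's output for one image might look like. It is only an illustration, not the SSD output format: the `Detection` class, the `keep_confident` helper, the confidence scores, the 0.5 threshold, and all of the values are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detection:
    """One detected object: a class label, a confidence score, and a box."""
    label: str
    score: float   # confidence in [0, 1] (assumed convention)
    x_min: float   # box corners in pixel coordinates
    y_min: float
    x_max: float
    y_max: float

    @property
    def width(self) -> float:
        return self.x_max - self.x_min

    @property
    def height(self) -> float:
        return self.y_max - self.y_min


def keep_confident(detections: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep detections scoring at or above the threshold, highest score first."""
    kept = [d for d in detections if d.score >= threshold]
    return sorted(kept, key=lambda d: d.score, reverse=True)


# Hypothetical detector output for a single image (made-up values).
detections = [
    Detection("bicycle", 0.81, 100.0, 80.0, 520.0, 400.0),
    Detection("dog", 0.92, 48.0, 120.0, 310.0, 420.0),
    Detection("cat", 0.12, 300.0, 200.0, 340.0, 260.0),
]

for d in keep_confident(detections):
    print(f"{d.label}: score={d.score:.2f}, "
          f"box=({d.x_min}, {d.y_min}, {d.x_max}, {d.y_max})")
```

An image classifier would return a single label for the whole image. A detector returns a variable-length list like the one above, with one entry per object found.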

1.1 What makes SSD special?

To answer this question, we first need some historical context. It is the year 2016, and the competition for the best object detection method is fierce, with research teams looking for a solution that is not only accurate but also fast enough for real-time applications. In those days, two-stage approaches built on region proposals, such as the R-CNN family, dominated the field in terms of accuracy (typically measured by mean Average Precision, or mAP) on standard object detection datasets such as MS COCO and Pascal VOC 2007 and 2012, but they were computationally cumbersome and slow. This led to the creation of the well-known YOLOv1 network, which was much faster and more computationally efficient than previous methods, but only at the cost of accuracy. This is exactly where the SSD framework came into the picture. It was the first deep neural architecture that dispensed with region proposals, detecting objects end-to-end with a single deep neural network, while remaining just as accurate as methods that used them. Moreover, with the region proposal step removed, SSD also delivered faster execution times (59 FPS with 74.3% mAP on the VOC2007 test set, versus 7 FPS with 73.2% mAP for Faster R-CNN and 45 FPS with 63.4% mAP for YOLO) [1,2,3]. However, what is perhaps the most…

Aditya Kunar

I am a researcher at Generatrix, an AI-based privacy-preserving data synthesizing platform. I have an avid passion for new and emerging technologies in AI & ML.