Object Detection with SSD and MobileNet

Jul 6, 2020

1. Introduction

Object detection is one of the most prominent fields of research in computer vision today. It is an extension of image classification, where the goal is to identify one or more classes of objects in an image and localize them with bounding boxes, as can be seen in figure 1. Object detection plays a vital role in many real-world applications such as image retrieval and video surveillance, to name just a few. With this in mind, the main aim of our project is to investigate the inner workings of the “Single Shot MultiBox Detector” (SSD) framework for object detection [1]. Our objective is to highlight some of the salient features that make this technique stand out and to address a few of its shortcomings, as discussed in more detail in the rest of this post.

1.1 What makes SSD special?

To answer this question, we first need some historical context. It is the year 2016 and the competition for the best object detection method is fierce, with research teams looking for a solution that is not just accurate but also fast enough for real-time applications. In those days, two-stage approaches built on region proposals, such as the R-CNN family, dominated the field in terms of accuracy (typically measured by mean Average Precision, or mAP) on standard object detection datasets such as MS COCO and Pascal VOC 2007 and 2012, but they were computationally cumbersome and slow. This led to the creation of the well-known YOLOv1 network, which was much faster and more computationally efficient than previous methods, but this increase in speed came at the cost of accuracy. This is exactly where the SSD framework came into the picture. It was the first end-to-end detector built on a single deep neural network, without region proposals, that was just as accurate as methods that used them. Moreover, with the region proposal step removed, SSD was also faster (59 FPS with 74.3% mAP on the VOC2007 test set, vs. 7 FPS with 73.2% mAP for Faster R-CNN and 45 FPS with 63.4% mAP for YOLO) [1,2,3]. Perhaps the most crucial ingredient of the SSD framework, and what allows it to achieve good performance in terms of both speed and accuracy, is that it uses small convolution filters to predict object classes and bounding box locations for different aspect ratios, and does so on multiple feature maps from the later stages of the network, which lets it aggregate detections at multiple scales.
Therefore, by making predictions from multiple layers of relatively low resolution, each responsible for a different scale, it can provide high accuracy with fast detection speeds. To know more, we encourage readers to have a look at the SSD paper [1], which covers the framework in quite some depth.
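
To make the multi-scale idea a little more concrete, the short sketch below counts the default boxes produced by the SSD300 configuration described in the SSD paper [1]. The feature map sizes and boxes per location are the ones reported there for the VGG16-based model; the variable and function names are ours and purely illustrative.

```python
# Feature map side length and default boxes per location for SSD300,
# as reported for the VGG16-based model in the SSD paper [1].
SSD300_FEATURE_MAPS = [
    (38, 4),  # conv4_3
    (19, 6),  # fc7 (converted to a convolution)
    (10, 6),  # conv8_2
    (5, 6),   # conv9_2
    (3, 4),   # conv10_2
    (1, 4),   # conv11_2
]


def count_default_boxes(feature_maps):
    """Total number of default boxes over all prediction layers."""
    return sum(side * side * boxes for side, boxes in feature_maps)


total_boxes = count_default_boxes(SSD300_FEATURE_MAPS)
print(total_boxes)  # 8732
```

Most of the boxes come from the largest (38x38) feature map, which is the layer responsible for the smallest objects.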

1.2 What can be improved?

The biggest drawback of the SSD framework is that its accuracy depends strongly on object size: it does not fare well on small objects compared to other approaches such as the R-CNN family [1,2]. This is because the top layers of the network may not retain enough information about small objects to be useful for detection. Therefore, data augmentation techniques are commonly used to randomly crop and resize parts of the image, which helps the network learn features for small object categories. This is also why increasing the input resolution from 300x300 to 512x512 provides better results on both the MS COCO and Pascal VOC datasets [1]. In our work, we therefore conduct experiments that highlight the difference in performance between networks trained with and without data augmentation.

The second issue is that of inference time. Around 80% of the time taken to do a forward pass through the network is spent in the base/backbone network. The backbone network is essentially a truncated high-quality image classifier which is used to extract the features used for prediction in the later stages of the network. The authors use the famous VGG-16 network, pre-trained on the ILSVRC CLS-LOC dataset [1]. We see this as an opportunity to further increase detection speeds by using a faster base network. In our work, we experimented with MobileNetv1 [4], pre-trained on the same dataset as VGG16, as a replacement for the original base network to study the speed vs accuracy trade-off involved. Furthermore, we test the network’s run times on a wide range of hardware with varying computational capacities to gain practical insights into the real-time application needs of developers.

2. Related Work

2.1 MobileNet(v1)

The MobileNet network architecture is a special class of convolutional neural models that are built using depth-wise separable convolutions and are therefore more lightweight in terms of their parameter count and computational complexity. Additionally, the authors of this architecture introduced two global hyper-parameters: the width multiplier and the resolution multiplier, which control the number of input/output channels of the convolution layers and the input data resolution (i.e. height and width) respectively. These parameters can be used to directly trade off latency against accuracy depending on the end requirements of the user [4].
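
As a rough illustration of how the two multipliers act on the cost of a single layer, the sketch below evaluates the mult-add formula for a depth-wise separable layer given in the MobileNet paper [4]. The function name and the example layer shape (3x3 kernel, 512 channels in and out, 14x14 feature map) are our own choices, and rounding the scaled sizes down with int() is an assumption.

```python
def separable_mult_adds(kernel, in_ch, out_ch, fmap, alpha=1.0, rho=1.0):
    """Mult-adds of one depth-wise separable layer (formula from [4]).

    kernel: spatial kernel size, fmap: feature map side length,
    alpha: width multiplier, rho: resolution multiplier.
    """
    m = int(alpha * in_ch)   # thinned input channels
    n = int(alpha * out_ch)  # thinned output channels
    f = int(rho * fmap)      # reduced feature map side
    depthwise = kernel * kernel * m * f * f
    pointwise = m * n * f * f
    return depthwise + pointwise


full = separable_mult_adds(3, 512, 512, 14)
half_width = separable_mult_adds(3, 512, 512, 14, alpha=0.5)
half_res = separable_mult_adds(3, 512, 512, 14, rho=0.5)

# Halving either multiplier cuts the cost of this layer roughly fourfold.
print(full / half_width, full / half_res)
```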

2.2 SSD & MobileNet

The integration of MobileNet into the SSD framework forms one of the core aspects of our work. However, it is worth pointing out that combining a highly efficient base network such as MobileNet with the SSD framework has been a hot research topic in recent times, largely because of the practical limitations of running powerful neural nets on low-end devices such as mobile phones and laptops, where doing so would open up many more real-time applications. In fact, a paper by the Google Research team concretely evaluates the speed/memory/accuracy trade-off of using diverse base feature extractors (i.e. VGG16, Residual Networks, MobileNet) within various detection architectures such as Faster R-CNN, R-FCN and SSD on different hardware and software platforms [5]. Their work serves as an inspiration for ours, as we similarly try to find the right balance between speed and accuracy for real-time applications under hardware constraints, in the context of integrating MobileNet within the SSD framework.

2.3 Different Approaches to Trading off Speed for Accuracy

2.3.1 Fire SSD [6]

This paper brings up key issues regarding the execution of convolutional neural networks on what the authors call “Edge” computing devices. To address them, the authors use the “SqueezeNet” architecture as their base network. This architecture is built from “Fire Modules”, which reduce the number of input channels with a 1x1 convolution layer (point-wise convolution) before applying 3x3 and 1x1 convolution layers in parallel, and are thus quite useful for reducing the number of parameters of the network. Additionally, the authors modify this “Fire Module” into a “Wide Fire Module” (WFM) that performs group convolutions, which they say not only reduces computational complexity but also improves accuracy. The authors also employ dynamic residual multi-box detection layers. This allows them to gradually increase the depth of the network to better extract features for smaller object categories, so that predictions can be made from relatively earlier layers of the network. Lastly, the authors use a “Normalized and Dropout Module” (NDM), which comprises batch normalization to normalize gradients coming from different levels of feature maps as well as a dropout layer which regularizes training and improves the network’s ability to generalize to unseen data. Fire SSD achieves 70.7 mAP on the Pascal VOC 2007 test set. Moreover, it can provide detection speeds of up to 30.6 FPS and 22.2 FPS on low-power mainstream CPUs and integrated GPUs respectively. This makes it close to 6 times faster than SSD300, at about a quarter of the original model’s size.

2.3.2 Feature-Fused SSD [7]

The authors of this paper focus on the challenge of obtaining accurate predictions for smaller objects, a well-known weakness of the SSD framework. Compared to previous approaches, however, they also prioritize detection speed. They describe multi-level feature fusion methods consisting of two distinct modules, the concatenation module and the element-sum module, which may be added to the original SSD network architecture. These modules enhance contextual information by injecting semantic information into the shallower layers of the network, which are primarily used for generating predictions for smaller objects. Their experimental results show improved accuracy on smaller object categories at detection speeds that can still be considered for real-time applications (43 and 40 FPS respectively). It is worth noting that even though this approach is more accurate than the baseline SSD network, it is also a bit slower due to the added multi-level feature fusion modules.

3. Method

While our initial idea was to implement the entire SSD framework from scratch, we soon realized that this was quite an ordeal, especially since we wanted to subsequently perform many experiments. Considering the amount of time given, there was a choice to be made: either focus on just replicating the SSD, or use an existing implementation and perform novel experiments to reveal interesting aspects of the framework. We decided to take the latter path and use the existing SSD implementation written by qfgaohao. This does not mean that we simply used the library wholesale. The original code base was sufficient to train and evaluate SSD with a MobileNet(v1) backbone, but it was lacking in some aspects, which we elaborate on below along with how we bridged each gap. The code along with the modifications is available on our forked repository.

3.1 MobileNet(v1) Backbone

In our analysis of previous work which looked at the SSD framework, one consistent observation was that the VGG backbone was a bottleneck during training and inference. Clearly there was a need here to replace the VGG with a network that would reduce the computation time while keeping the accuracy similar. We found the perfect candidate in MobileNetv1 as we shall soon see.

MobileNet is made up of depth-wise separable convolutional layers that are computationally faster than standard convolutional layers [4]. This is simply because they need fewer mult-adds (multiplication and addition operations): the depth-wise layer filters each channel separately, and a 1x1 convolution then linearly combines the results, as shown in figure 2 below.

Empirically, the reduction in computational effort does not affect the performance of the network to a large extent which is why we would like to use it as a backbone in the SSD framework.
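
The saving can be checked with a few lines of arithmetic. The sketch below compares the mult-adds of a standard convolution with those of a depth-wise separable one, following the cost analysis in the MobileNet paper [4]; the function names and the example layer shape are our own and only illustrative.

```python
def standard_conv_mult_adds(kernel, in_ch, out_ch, fmap):
    """Mult-adds of a standard convolution on a fmap x fmap feature map."""
    return kernel * kernel * in_ch * out_ch * fmap * fmap


def depthwise_separable_mult_adds(kernel, in_ch, out_ch, fmap):
    """Mult-adds of a depth-wise convolution followed by a 1x1 convolution."""
    depthwise = kernel * kernel * in_ch * fmap * fmap  # one filter per channel
    pointwise = in_ch * out_ch * fmap * fmap           # 1x1 linear combination
    return depthwise + pointwise


std = standard_conv_mult_adds(3, 512, 512, 14)
sep = depthwise_separable_mult_adds(3, 512, 512, 14)

# The ratio equals 1/out_ch + 1/kernel**2, i.e. close to 1/9 for 3x3 kernels.
reduction = sep / std
print(round(reduction, 4))
```

For 3x3 kernels this means roughly 8 to 9 times fewer mult-adds per layer, which is the figure quoted in [4].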

The full network itself was implemented exactly as the original authors describe. It is summarized in figure 3.

When used in the SSD, we drop the last three layers, i.e., Avg Pool/s1, FC/s1 and Softmax/s1 since we do not want to classify with MobileNet.

Generally speaking, the role of the backbone network in the SSD framework is to convert the pixels from the input image into features that describe the contents of the image, and pass these along to the other layers of the SSD [1]. Hence, it is used here as a feature extractor for a second neural network. We show in figure 4 how this setup looks for a VGG16 SSD:

When replacing VGG16 with MobileNetv1, we connect layers 12 and 14 of MobileNet to the SSD. In terms of the table and image above, we connect the depth-wise separable layer with filter 1x1x512x512 (layer 12) to the SSD layer producing the feature map of depth 512 (topmost in the above image). We also connect the last pointwise convolution layer to the SSD layer with the feature map of depth 1024.

If we were doing classification, we would be interested in the features that describe high-level concepts, such as “there is a face” and “there is fur”, which the classifier layer then can use to draw a conclusion — “this image contains a cat”. In the case of object detection with SSD, we want to know not just these high-level features but also lower-level ones, which is why we also read from the previous layer. These two layers were chosen as they provide mid to high level features. We will leave the task of studying the effect of varying the layer of MobileNet that is connected to the SSD classification headers for future work.

Additionally, MobileNet has two tunable hyperparameters that allow a tradeoff between accuracy and computation: the width multiplier, which scales the number of input and output channels, and the resolution multiplier, which scales the input and output resolution. Both can reduce the computational cost dramatically. We use the width multiplier extensively in our experiments to see its effect on the object detection mAP, but do not implement the resolution multiplier, as it interferes with the differently sized feature maps which are important for SSD to perform object detection.

The above discussion makes clear why we chose MobileNet as a backbone for the SSD framework.

The original code base that we forked had a fixed MobileNet width of 1, with no way to parameterize the model during training or evaluation. We added this parameterization to the network. According to the authors of MobileNet(v1), the width parameter reduces the number of input and output channels in the depth-wise separable convolutions [4]. This is exactly how we implement it: the parameter acts as a weight (between 0 and 1) on the number of channels, and the number of channels in every layer is weighted by the same amount.

In addition to modifying the input and output channels in MobileNet, we also had to weight, by the same width factor, the input channels of the SSD regression and classification headers that are connected to MobileNet, i.e. the 12th and 14th (last) MobileNet layers, as shown in figure 5.

Figure 5: SSD network with MobileNet backbone

To our knowledge, this is the only implementation of the SSD with the MobileNet(v1) which allows for the width parameters to be tuned.
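
The channel bookkeeping described above can be sketched as follows. This is a simplified illustration rather than the code in our fork: BASE_CHANNELS lists the output channels of the MobileNet(v1) layers from [4], while scale_channels and header_in_channels are hypothetical helper names, and rounding down with int() is an assumption.

```python
# Output channels of the MobileNet(v1) layers at width 1.0 [4]:
# the first standard convolution followed by 13 depth-wise separable layers.
BASE_CHANNELS = [32, 64, 128, 128, 256, 256,
                 512, 512, 512, 512, 512, 512, 1024, 1024]


def scale_channels(channels, width_mult):
    """Weight every channel count by the same width multiplier."""
    if not 0.0 < width_mult <= 1.0:
        raise ValueError("width_mult must be in (0, 1]")
    return [max(1, int(c * width_mult)) for c in channels]


def header_in_channels(width_mult, tapped=(512, 1024)):
    """Input channels of the SSD headers attached to the backbone.

    `tapped` holds the width-1.0 depths of the two MobileNet feature maps
    that feed the SSD (512 and 1024, as described in the text above).
    """
    return tuple(scale_channels(tapped, width_mult))


print(scale_channels(BASE_CHANNELS, 0.5))
print(header_in_channels(0.25))  # (128, 256)
```

The important detail is that the headers must be resized together with the backbone; otherwise the first convolution of each header receives a tensor with the wrong number of channels.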

3.2 Data Augmentation

Data augmentation is particularly important to improve detection accuracy for small objects as it creates zoomed-in images where more of the object structure is visible to the classifier. It is also useful for handling images containing occluded objects by including cropped images in the training data where only part of the object may be visible.

The following are the data augmentation steps used in order:

Photometric Distortions
Geometric Distortions
Expand Image
Random Crop
Random Mirror

Photometric Distortions involve random changes to brightness, hue, contrast, lighting noise, and saturation. Geometric Distortions involve changes to the image dimensions. In Expand Image, the canvas of the image is expanded, with the original image placed randomly within it. We can see this in example 2 of figure 6, where the grey canvas extends beyond the image. In Random Crop, we crop a patch out of the expanded image produced by Expand Image such that the patch has some overlap with at least one ground-truth box and the centroid of at least one ground-truth box lies within the patch. Figure 6 shows three examples of Random Crop.
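
The acceptance test for a candidate crop can be sketched as below. This is a minimal illustration under our own assumptions: boxes are (x_min, y_min, x_max, y_max) tuples, overlap is measured with the Jaccard index (IoU), and the min_iou default is a hypothetical value, not necessarily the one used in the code base.

```python
def iou(a, b):
    """Jaccard overlap of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def center_inside(patch, box):
    """True if the centroid of `box` lies within `patch`."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    return patch[0] <= cx <= patch[2] and patch[1] <= cy <= patch[3]


def patch_is_valid(patch, gt_boxes, min_iou=0.1):
    """Accept a crop only if it overlaps a ground-truth box enough and
    contains the centroid of at least one ground-truth box."""
    has_overlap = any(iou(patch, box) >= min_iou for box in gt_boxes)
    has_center = any(center_inside(patch, box) for box in gt_boxes)
    return has_overlap and has_center


patch = (0, 0, 100, 100)
print(patch_is_valid(patch, [(20, 20, 60, 60)]))    # True
print(patch_is_valid(patch, [(90, 90, 200, 200)]))  # False
```

In practice a sampler would draw random patches and keep the first one that passes this check.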

The last augmentation step is Random Mirror. This one simply involves a left-right flip as shown in figure 7.

Figure 7: Random mirror data augmentation
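
The one subtlety of the mirror step is that the bounding boxes have to be flipped together with the pixels. The sketch below shows this on a toy image stored as a list of rows; the function names and the (x_min, y_min, x_max, y_max) pixel box format are assumptions made for illustration.

```python
import random


def mirror_image(image):
    """Flip an image, given as a list of pixel rows, left to right."""
    return [row[::-1] for row in image]


def mirror_boxes(boxes, image_width):
    """Flip (x_min, y_min, x_max, y_max) boxes left to right.

    The old right edge becomes the new left edge, so x_min and x_max swap roles.
    """
    return [(image_width - x_max, y_min, image_width - x_min, y_max)
            for (x_min, y_min, x_max, y_max) in boxes]


def random_mirror(image, boxes, rng, prob=0.5):
    """Mirror the image and its boxes together with probability `prob`."""
    if rng.random() < prob:
        width = len(image[0])
        return mirror_image(image), mirror_boxes(boxes, width)
    return image, boxes


flipped = mirror_boxes([(10, 5, 30, 20)], image_width=100)
print(flipped)  # [(70, 5, 90, 20)]

image, boxes = random_mirror([[1, 2, 3]], [(0, 0, 1, 1)], random.Random(0))
```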

The figures in this section are taken from this telesens blog post, which talks about these techniques in a bit more depth.

The original codebase by qfgaohao applied data augmentation on the VOC dataset by default. It performed Photometric Distortion, Random Sample Cropping and Random Mirroring as explained by the original authors. Our experiments require us to turn the augmentation on and off as needed, so we implemented this as a settable parameter.

3.3 Segregating Images by Size

One of our experiments on the effects of data augmentation required us to evaluate how well SSD detects objects of different sizes. While datasets like MS COCO label their objects as “small”, “medium” and “large”, we did not find any documentation on how sizes are classified for the Pascal VOC 2007 and 2012 datasets. Therefore, we decided to use the area of the bounding box as a percentage of the total image area to determine the size of the object. Using this technique, we were able to produce a more fine-grained division of object sizes, as described in more detail in section 4.1. To our knowledge, this has not been done in previous works.

3.4 Hard-Negative Mining

In the interest of performing an extensive ablation study of the SSD framework, we wanted to analyze the benefits of hard-negative mining, as the authors of the SSD paper did not cover this aspect in their ablation studies [1]. Hard-negative mining is used to deal with the class imbalance created by the abundant background class, which results from using a large number of default boxes. The authors of the paper sort the samples that contain only background by their individual losses and keep those with the highest loss, so that the ratio of negative to positive samples is 3:1. In our experiments without hard-negative mining, we instead used all the negative samples (background instances) to train the network. We shall see in the following section what the result of doing so was.
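
The selection rule can be sketched in a few lines. This is a simplified, framework-free illustration of the idea for a single image, not the code from the repository: select_training_boxes is a hypothetical name, losses are per-box classification losses, and passing neg_pos_ratio=None stands for our "no hard-negative mining" setting in which every background box is kept.

```python
def select_training_boxes(losses, is_positive, neg_pos_ratio=3):
    """Return a keep-mask over default boxes.

    All positive boxes are kept. Among the negative (background) boxes,
    only the ones with the highest loss are kept, at most
    neg_pos_ratio times the number of positives.
    """
    if neg_pos_ratio is None:
        # No hard-negative mining: train on every default box.
        return [True] * len(losses)

    num_pos = sum(1 for pos in is_positive if pos)
    num_neg = neg_pos_ratio * num_pos

    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    negatives.sort(key=lambda i: losses[i], reverse=True)  # hardest first
    hardest = set(negatives[:num_neg])

    return [pos or (i in hardest) for i, pos in enumerate(is_positive)]


losses = [0.9, 0.1, 0.8, 0.2, 0.7, 0.05]
is_positive = [True, False, False, False, False, False]

mined_mask = select_training_boxes(losses, is_positive)
print(mined_mask)  # [True, False, True, True, True, False]
```

With thousands of default boxes per image and only a handful of positives, the difference between the two settings is far larger than in this toy example.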

3.5 Measuring Training and Evaluation time

As one of our most important experiments was to measure the trade-off between accuracy and time, we measured the average time per epoch during training and the average time to evaluate a single image. In the case of training, measuring time was quite straightforward. For evaluation, however, we measured two separate quantities: inference time and prediction time. Whereas inference time measures the time taken to do a forward pass, prediction time also includes the non-max suppression operation. We separate evaluation time into these two components since one of our experiments revolves around exposing the non-max suppression operation as a bottleneck of the SSD. Earlier works measure the two together, hiding the effect of non-max suppression.
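
A minimal sketch of this split is shown below. The forward and postprocess arguments are placeholders for the network's forward pass and for the non-max suppression step; they are hypothetical callables, not functions from the code base.

```python
import time


def time_prediction(forward, postprocess, inputs):
    """Return (average inference time, average prediction time) in seconds.

    Inference time covers only the forward pass; prediction time also
    includes the post-processing step (non-max suppression in SSD).
    """
    if not inputs:
        raise ValueError("inputs must not be empty")

    inference_total = 0.0
    prediction_total = 0.0
    for x in inputs:
        start = time.perf_counter()
        raw = forward(x)
        # Note: on a GPU the forward pass runs asynchronously, so a
        # synchronization call (torch.cuda.synchronize() in PyTorch)
        # would be needed here before reading the clock.
        after_forward = time.perf_counter()
        postprocess(raw)
        end = time.perf_counter()

        inference_total += after_forward - start
        prediction_total += end - start

    return inference_total / len(inputs), prediction_total / len(inputs)


# Toy stand-ins for the network and for non-max suppression.
inference_time, prediction_time = time_prediction(
    forward=lambda x: [x] * 1000,
    postprocess=sorted,
    inputs=[3, 1, 2],
)
print(inference_time, prediction_time)
```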

4. Data & Experiments

4.1 Dataset

The dataset that we decided to use for our project is taken from the PASCAL Visual Object Classes Challenges for the years 2007 and 2012, also known as the VOC2007 & VOC2012 datasets. In the context of object detection, these datasets contain bounding boxes and corresponding labels for 20 different object classes divided into 4 main categories: person, animal, vehicle and indoor. To train the neural networks, we used 11540 images containing 27450 annotations from VOC2012 and 5011 images containing 12608 annotations from VOC2007. For testing, 4952 images containing 12032 annotations, taken solely from the VOC2007 test set, were used. It should also be noted that we do not use annotations which are marked as difficult, so that they do not act as a confounding variable in our experiments.

Here is one example that is taken from the dataset shown in figure 8:

Figure 8: Example of images in the VOC datasets

In order to study the effects of data augmentation in a more fine-grained manner, we decided to segregate objects based on the size of their corresponding bounding boxes with respect to the full image. In particular, each object was assigned a size between 0% and 100% calculated based on the ratio of the bounding box’s area with respect to the full image.

Finally, all images were divided into 7 bins: 0–5%, 5–10%, 10–20%, 20–40%, 40–60%, 60–80%, 80–100%. For example, an image belonging to the 40–60% bin only contains objects of size between 40% and 60%.

For most of our analyses, we used the full test set without considering the object sizes, while for the experiments described in section 4.2.3.1 we split the test-set based on the bins we just described. The images which contained multiple objects of sizes belonging to different bins were discarded.
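
The binning procedure described above can be sketched as follows. The function names are illustrative, and the handling of values that fall exactly on a bin edge (assigned here to the lower bin) is an assumption, since it makes no practical difference to the experiments.

```python
BIN_EDGES = [0, 5, 10, 20, 40, 60, 80, 100]  # in percent of the image area


def object_size_percent(box, image_width, image_height):
    """Area of an (x_min, y_min, x_max, y_max) box as a percentage of the image."""
    x_min, y_min, x_max, y_max = box
    box_area = (x_max - x_min) * (y_max - y_min)
    return 100.0 * box_area / (image_width * image_height)


def bin_index(size_percent):
    """Index of the size bin (0 for 0-5%, ..., 6 for 80-100%)."""
    for i in range(len(BIN_EDGES) - 1):
        if size_percent <= BIN_EDGES[i + 1]:
            return i
    return len(BIN_EDGES) - 2


def image_bin(boxes, image_width, image_height):
    """Bin of an image, or None if its objects fall into different bins
    (such images are discarded) or if it has no objects."""
    bins = {bin_index(object_size_percent(b, image_width, image_height))
            for b in boxes}
    return bins.pop() if len(bins) == 1 else None


print(image_bin([(0, 0, 50, 50)], 100, 100))                  # 3, i.e. the 20-40% bin
print(image_bin([(0, 0, 50, 50), (0, 0, 10, 10)], 100, 100))  # None, mixed sizes
```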

4.2.1 MobileNet Backbone Vs VGG16 Backbone

To highlight the difference in detection speed vs accuracy for different network architectures, we used the pre-trained weights and models provided by qfgaohao for training and evaluating the standard versions of both networks, i.e. VGG16-SSD and MobileNet-SSD. However, for the experiments without data augmentation (explained in more detail in section 4.2.3) and for incorporating the width parameter of the MobileNet backbone, we modified the code as mentioned previously. Moreover, the networks with varying width parameters were all trained from scratch for 100 epochs without using any pre-trained weights, even for the base network, as the width multiplier changes the entire network architecture. We varied the width parameter in regular intervals between 0 and 1 and chose the following values: 0.25, 0.5, 0.75 and 1 (default). This allowed us to study more closely the latency vs accuracy tradeoff involved in further optimizing the MobileNetv1 backbone embedded within the SSD framework.

Figure 9 below shows the results we obtained from conducting all our experiments on Google Colab using a single K80 GPU.

The visualisation above shows clearly that, in terms of mean average precision, the VGG16 backbone is superior to the MobileNet backbone; however, it also takes more time to make a prediction. Another interesting takeaway from our experiments is that data augmentation greatly affects both prediction speed and accuracy for both networks. This can be understood from the fact that, without data augmentation, the additional operations required to keep the ground-truth bounding box locations correct are not needed. Furthermore, training times without data augmentation are also much lower for both networks, as the PyTorch data loader does much less computational work. Last but not least, the width multiplier (alpha) also greatly influences both the speed and accuracy of the MobileNet-SSD networks. We note, however, that the drop in accuracy overshadows the gains in prediction speed obtained with any alpha value when data augmentation is removed. This suggests that limiting the size of the MobileNet base network greatly reduces accuracy without large benefits in detection speed. One last finding was that the MobileNet backbone (with augmentation) has greater inference times than VGG16. We explore this aspect more deeply when discussing the different hardware settings.

4.2.2 Testing On Different Hardware

Now that we have explored the speed vs accuracy trade-off on a single machine, the main goal of this next set of experiments was to broaden our understanding in the context of the real-time application needs of users. To capture such insights, we tested both default networks (i.e. VGG16-SSD and MobileNet-SSD) on a wide array of devices available on Amazon Web Services and the Google Cloud Platform. The devices we chose ranged from low-end hardware to compute-optimized hardware and finally to high-end, deep learning specific GPU-based hardware. In this manner, our readers can benefit from knowing the time it takes to train such networks on different GPUs, and better understand the time it takes to generate predictions with the computational resources commonly available to them via cloud platforms such as AWS and Google.

Figures 10, 11 and 12 below show bar plots of the training time per epoch and the average prediction and inference times measured in our experiments.

Figure 10: Time per Epoch vs Hardware for different backbones

Figure 11: Avg Prediction Time on different hardware with different backbones

Figure 12: Avg Inference Time on different hardware with different backbones

Based on these plots, we can spot interesting trends in MobileNet-SSD’s detection times vs VGG16-SSD’s detection times. Firstly, it is quite noticeable that on GPU-based hardware, the inference times of the VGG16 backbone are slightly better. However, on hardware without GPU capabilities, the MobileNet backbone takes the prize. This came as quite a shock to us, but it also explains the results in the scatterplot. Furthermore, when looking at the prediction times in figure 11, we see that compute-optimized hardware brings down the average prediction time. We believe the reason is that the Non-Max Suppression (NMS) operation, which is primarily done on the CPU (based on the code provided by qfgaohao), is a bottleneck for the SSD framework. The SSD framework produces a large number of output predictions, one set per default box, so the NMS operation can be quite expensive in this context. We therefore feel that a compute-optimized CPU or a deep learning specific GPU is needed to do this efficiently enough to achieve real-time detection speeds.
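
To see why NMS becomes expensive, it helps to look at the greedy algorithm itself. The sketch below is a plain-Python version for a single class, written for illustration only and not taken from the code base; the 0.45 overlap threshold is the value reported in the SSD paper [1].

```python
def iou(a, b):
    """Jaccard overlap of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop every
    remaining box that overlaps it by more than iou_threshold.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```

Each kept box is compared against all remaining candidates, so in the worst case the cost grows quadratically with the number of boxes that survive confidence filtering, and the procedure is repeated for every class.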

Lastly, we also provide readers with an estimate of how long it takes to train both networks on different GPUs. Interestingly, for example, we see that the VGG16-SSD trains fastest on the P100, while for MobileNet-SSD the best GPU is the T4.

4.2.3 Ablation Studies

For this set of experiments, the goal was to determine how components such as data augmentation and hard-negative mining during training affect the performance of the different networks (i.e. VGG16-SSD and MobileNet-SSD). Therefore, in each experiment, we varied only the element we wanted to investigate and kept all other settings fixed.

4.2.3.1 Data Augmentation & Image Sizes

To understand the effects of training with and without data augmentation on the prediction accuracy for different object sizes, the dataset was divided into bins using the method described in section 4.1. The experiments with data augmentation for both types of networks used the models provided by qfgaohao with weights pre-trained on the combined VOC2007 and VOC2012 datasets. To remove data augmentation, we modified the original code and retrained the networks on the same data, using pre-trained weights only for the base networks (i.e. VGG16 and MobileNet). It is also worth mentioning that for the models we trained ourselves (without data augmentation), we used the weights that gave the best validation loss over 100 epochs of training on Google Colab with a K80 GPU.

The results of these experiments are shown in figure 13 below.

Figure 13: Prediction accuracy vs bin-size

To begin with, a few things are immediately noticeable. First of all, small objects are, as expected, harder to detect for all backbone/augmentation combinations. SSD is well known to do poorly on this type of data, as small objects are not well represented in the feature maps used for prediction in the later stages of the network, where the empirical receptive field becomes too large. Furthermore, we can see that data augmentation greatly improves the performance of both types of network (i.e. VGG16-SSD and MobileNet-SSD). These were the results we expected.

Another interesting aspect is the fact that using VGG16 as backbone network is better than MobileNet in terms of performance, and this difference is much more evident for smaller object sizes. For bigger objects, the two models tend to perform similarly, with MobileNet beating VGG16 for objects between 40 and 80% when no data augmentation is used.

4.2.3.2 Hard-Negative Mining

To effectively remove hard-negative mining, we modified the codebase to use all the negative samples (background instances) during training. This is in contrast to the original authors, who used a 3:1 ratio of negative to positive samples. The aim of this experiment was to quantify the actual benefits of hard-negative mining by comparing the accuracy and speed of networks trained with and without it. This was done for both network types (i.e. VGG16-SSD and MobileNet-SSD), trained with default hyper-parameters and with the pre-trained weights provided by qfgaohao for the base networks. Moreover, all models used in this set of experiments were trained for 15 epochs and evaluated on the same hardware on Google Colab to make sure that the comparison was standardized.

Table 1 below shows the results for each of the models trained with and without hard-negative mining.

Table 1: Difference in mAP and Avg Prediction Time with and without hard-negative mining

First, we find it quite surprising that at 15 epochs the MobileNet-SSD model outperforms VGG16-SSD, suggesting that the former trains relatively faster than the latter. Perhaps more important is the fact that the mAP for both models drops to 0.40 when no hard-negative mining is used. This is as expected, as the large prior probability of the background class greatly reduces the model’s ability to learn the other, more relevant classes. Moreover, we also find that this difference in performance is greater for the MobileNet-SSD network than for the VGG16-SSD network.

Another interesting observation is the increase in prediction times for the networks. We believe this is because the networks are forced to make more output predictions for the negative class, which has a detrimental impact on the average prediction speed.

4.3 Discussion

Having presented our most important findings, we now turn to the speculations we offered in this blog post to explain the unexpected results we obtained.

Naturally, the most contentious result is that MobileNet did not turn out to be a faster backbone in terms of inference time when evaluated on GPU-based hardware. As stated before, we were quite surprised by this, since the depth-wise separable convolutional layers of MobileNet should have reduced the computation time compared to the standard convolutional layers used by VGG. At first, we suspected that the network was wrongly implemented or that the code was not optimized for running on GPUs, so we decided to investigate. Based on our analysis of the code provided by qfgaohao, we believe the main bottleneck is that PyTorch 1.0, the version used by the authors of the code, was not very well optimized for depth-wise separable convolutions. We speculate that porting the same code to a newer version of PyTorch would change the results and give the expected behavior.

Another interesting aspect to discuss is the difference in training times between the two networks on different deep learning specific GPUs. It is still not clear why training the VGG16-SSD network is slower on the T4 GPU than on the P100, even though the T4 is the newer GPU. We believe that this could be because depth-wise separable convolutions, the only operation that differs between the two networks, are not fully optimized to run on GPUs. However, this is just an armchair hypothesis and needs careful investigation.

5. Conclusion

We performed multiple experiments to reveal interesting aspects of the SSD framework for object detection. We summarize our results below.

We first experimented with replacing the VGG16 backbone of SSD with MobileNet and tweaking some of its parameters to see the trade-off we can achieve between accuracy and prediction time. We saw that data augmentation greatly affects prediction speed as well as accuracy for both networks. Furthermore, training times without data augmentation are also much lower for both networks.

Next, we looked at using different hardware, from vanilla cloud CPUs to popular cloud GPUs, for training and inference of the networks with various settings. We saw that on GPUs, inference times of the VGG16 backbone are slightly better, but the result is reversed on hardware without GPU capabilities. We hypothesize that this is possibly due to the GPUs not having optimized operations for the depth-wise separable layer, as discussed in section 4.3. We also saw that the SSD framework requires not only a good GPU but also a good CPU to perform the Non-Max Suppression during evaluation.

In the data augmentation and image sizes ablation study, we see that using VGG16 as backbone network is better than MobileNet in terms of performance. The difference between the networks is much more evident for smaller object sizes. MobileNet beats VGG16 for objects of sizes between 40% and 80% of the image when no data augmentation is used.

In the hard-negative mining ablation study, we saw that the MobileNet-SSD model without hard-negative mining outperforms the VGG-SSD in early epochs suggesting that it trains much faster. Additionally, after the removal of hard-negative mining, the prediction time rose due to more output predictions for the negative class.

Future Work: Here we list some ideas that we did not explore in our work but that hold promise for further study:

We conducted an ablation study on data augmentation as a whole, i.e. we either perform data augmentation or remove it altogether. One interesting aspect to study would be the impact of each individual data augmentation technique. Hence, in the future, we can explore a more fine-grained ablation study on data augmentation techniques.
In prior works, we see that the authors of the Fire SSD paper use SqueezeNet as a backbone, which also allows a good trade-off between speed and accuracy. Future work would be to study the difference between MobileNetv1 and SqueezeNet when used as the backbone in the SSD [8].
An interesting avenue to explore is the memory consumption of the SSD when using different backbone architectures. This could reveal how suitable different methods are for running on edge devices.
Another direction is to experiment with the tiling of default boxes and study how varying it affects the metrics and how it interacts with object sizes.
The adjustment of the feature maps generated at the later stages of the network so that they correspond to empirically viable receptive fields for the chosen backbone network is not well studied. This is something we look forward to in future work.