GstInference with MobileNetV2 + SSD

Description

MobileNetV2 is a very effective feature extractor for object detection and segmentation. SSD, on the other hand, is designed to be independent of the base network, so it can run on top of MobileNetV2. The MobileNetV2 + SSD combination uses a variant called SSDLite, which replaces the regular convolutions in the object detection portion of the network with depthwise separable convolutions, making it easier to achieve real-time performance. When paired with SSDLite, MobileNetV2 is about 35% faster than MobileNetV1 while achieving the same accuracy.

Architecture

When paired with SSD, the job of the MobileNetV2 network is to convert the pixels into features that describe the contents of the image and pass them on to the next layers, which in this case belong to SSD. Not only the output of the last MobileNetV2 layer is forwarded to SSD, but also the outputs of some earlier layers, so that both high-level and low-level features are available for detection.

GStreamer Plugin

The GStreamer plugin uses the same pre-processing and post-processing for MobileNetV2 as described in the original paper. Please take into consideration that not all deep neural networks are trained the same way, even if they share the same model architecture. If the model is trained differently, details like label ordering, input dimensions, and color normalization can change.

The pre-trained model used to test the element may be downloaded from the Coral store for the Coral framework. The labels for this model can be extracted from here.

    Important: Make sure you use RidgeRun labels to get the correct inference results.

Pre-process

Input parameters:

  • Input size: 224 x 224
  • Format: BGR

The pre-processing consists of taking the input image and transforming it to the input size (by scaling, interpolation, cropping, etc.). Then the mean is subtracted from each pixel while the image is still in RGB. Finally, the image is converted to BGR by reversing the order of the channels.
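
As an illustration, the following is a minimal Python sketch of these steps using OpenCV and NumPy. The mean value of 128 is a placeholder assumption; the actual value depends on how the model was trained:

    import cv2
    import numpy as np

    # Placeholder assumption: the actual mean depends on how the model
    # was trained and is not specified on this page.
    MEAN = 128.0
    INPUT_SIZE = (224, 224)  # width x height, as listed above

    def preprocess(image_rgb):
        """Resize, mean-subtract, and convert an RGB image to BGR."""
        # Transform the input image to the network's input size.
        resized = cv2.resize(image_rgb, INPUT_SIZE,
                             interpolation=cv2.INTER_LINEAR)
        # Subtract the mean from each pixel while still in RGB.
        normalized = resized.astype(np.float32) - MEAN
        # Convert to BGR by reversing the channel order.
        return normalized[..., ::-1]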

Post-process

The model output consists of 4 tensors arranged in the following way:

  • 0: [N * 4] tensor with the location of the N bounding boxes (top-left and bottom-right corners), normalized from 0 to 1.
  • 1: [N] tensor with the label indices of the N bounding boxes.
  • 2: [N] tensor with the probabilities of those N labels.
  • 3: [1] tensor with the number of detected boxes.

The post-processing consists of comparing the probability of each box against a threshold. If the probability is higher than the threshold, the normalized coordinates are converted to actual image coordinates and the corresponding label is assigned to the box, as in the sketch below.
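
The following is a minimal Python sketch of this post-processing over the four tensors described above. The corner ordering ([ymin, xmin, ymax, xmax]) and the 0.5 threshold are assumptions for illustration; the exact ordering depends on the model export:

    def postprocess(boxes, classes, scores, num_detections,
                    labels, img_width, img_height, threshold=0.5):
        """Filter the N candidate boxes by score and scale them to pixels.

        boxes:          [N, 4] normalized corners, assumed here to be
                        ordered [ymin, xmin, ymax, xmax].
        classes:        [N] label indices of the boxes.
        scores:         [N] probabilities of those labels.
        num_detections: [1] number of detected boxes.
        labels:         list mapping a label index to its name.
        """
        detections = []
        for i in range(int(num_detections[0])):
            # Discard boxes whose probability is below the threshold.
            if scores[i] < threshold:
                continue
            ymin, xmin, ymax, xmax = boxes[i]
            # Convert the normalized [0, 1] coordinates to pixel
            # coordinates and assign the corresponding label.
            detections.append({
                "label": labels[int(classes[i])],
                "score": float(scores[i]),
                "box": (int(xmin * img_width), int(ymin * img_height),
                        int(xmax * img_width), int(ymax * img_height)),
            })
        return detections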

