Custom Object Detection with YOLOv4

Ahlemkaabi
Sep 16, 2022
The output of the Re-trained YOLOv4 model on video

In this blog post, I will share my experience re-training the YOLOv4 algorithm to detect different pieces of equipment during turnaround operations on the airport apron (i.e., preparing the airplane for the next flight, including boarding, baggage loading, and fueling).

Turnaround processes are important for every airline company, which is why they should be optimized: to reduce downtime, save costs, and keep customers satisfied with on-time flights.

Computer vision might be the answer to managing these processes more efficiently. As a first step, the algorithm should recognize these processes, and therefore the equipment that does the job.

Content

  • Introduction
  • Training the YOLOv4
  • Results
  • Technologies
  • Timeline
  • Challenges
  • Ethical Implications
  • Conclusion

INTRODUCTION

Object detection

Before talking about the YOLOv4 algorithm, we first have to understand the task of object detection.
Object detection is a computer vision technique that locates and then classifies the objects within an image or video. This is accomplished by predicting the coordinates of a bounding box along with the class probabilities.
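To make this concrete, here is a minimal Python sketch (with hypothetical field names, not tied to any particular library) of what a single prediction typically contains:

from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object: a bounding box plus the most likely class."""
    x_center: float    # bounding box center, in pixels
    y_center: float
    width: float       # bounding box size, in pixels
    height: float
    class_name: str    # e.g. "fuel_truck"
    confidence: float  # class probability, in [0, 1]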

YOLO

The YOLO (You Only Look Once) family of algorithms is based on regression: instead of selecting interesting parts of an image, they predict classes and bounding boxes for the whole image in a single run of the algorithm.
YOLO was first described in the seminal 2015 paper by Joseph Redmon et al.

YOLOv4

The fourth version of the YOLO algorithm was released in April 2020 by Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao in their article “YOLOv4: Optimal Speed and Accuracy of Object Detection”.

Object detector (from the YOLOv4 paper)

YOLOv4 is a one-stage detector. It consists of:

  • Backbone (feature extraction): CSPDarknet53 (Cross Stage Partial Darknet-53)
  • Neck (feature aggregation): PAN (Path Aggregation Network)
  • Head (the detection step): the YOLOv3 head, with anchor-based detection and three levels of detection granularity.

You can read about the architecture in more detail in this post.

YOLOv4 employs a “Bag of Freebies”: methods that improve the network’s performance without adding to inference time in production. Most of the Bag of Freebies has to do with data augmentation: CutMix and Mosaic data augmentation, DropBlock regularization, and class label smoothing.

It also deploys strategies called a “Bag of Specials”. These add a marginal increase to inference time but significantly boost performance, so they are considered worth it: Mish activation, Cross Stage Partial connections (CSP), and multi-input weighted residual connections (MiWRC).
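As a quick taste of one of those specials, the Mish activation is a one-liner; here is a minimal Python sketch:

import math

def mish(x: float) -> float:
    # Mish: x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x)
    return x * math.tanh(math.log1p(math.exp(x)))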

TRAINING THE YOLOv4

Dataset preparation

To retrain the YOLOv4 model, I manually collected images of the airport apron taken while airplanes were being prepared for their next flight.

I used two different datasets and trained YOLOv4 on both. The first was a collection of images containing airport apron equipment; the second focused more on input resembling the camera streams at an airport.

Dataset

The most crucial step in any deep learning task is data preparation; there is a general rule that garbage in equals garbage out. It is also the most time-consuming step, since we want to ensure good images and correct annotations.

YOLO expects a dataset of .jpg images, each with a related .txt file of the same name. Each line of a .txt file contains a class ID and the coordinates of a bounding box (containing the target class object), normalized to the image dimensions.

Example:

screenshot of online CVAT
the related .txt file output
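Since a screenshot can’t show the raw text, here is a minimal Python sketch of how one annotation line is derived from a pixel-space box (the class ID and coordinates below are hypothetical):

def to_yolo_line(class_id, x_min, y_min, box_w, box_h, img_w, img_h):
    # Each line is "<class_id> <x_center> <y_center> <width> <height>",
    # with all box values normalized to [0, 1] by the image dimensions.
    x_center = (x_min + box_w / 2) / img_w
    y_center = (y_min + box_h / 2) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w / img_w:.6f} {box_h / img_h:.6f}"

# e.g. a 60x34 object at (400, 300) in a 1280x720 frame, labeled class 5:
print(to_yolo_line(5, 400, 300, 60, 34, 1280, 720))
# -> 5 0.335938 0.440278 0.046875 0.047222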

Training Configuration

In this part, I would like to focus on one specific configuration: the network dimensions. It’s something I find both cool and important. Then I will briefly present the other configurations.

Working with YOLO, we have to consider three related sizes: the image dimensions, the network dimensions, and the size of each object within the images. In fact, Darknet/YOLO only knows about one of them, the network dimensions, which are set in the .cfg file. Darknet will always resize the video frame or image to those dimensions, say 608x608.

At first thought, to improve predictions we tend to increase the network size so that smaller objects are easier to find. But (1) this increases the time inference takes to complete, and (2) larger networks require much more memory to train, and slightly more memory to run.

So the question is: what is the optimal network size? Knowing that Darknet/YOLO works best when objects are 16x16 pixels or larger, you have to make sure that objects are no smaller than about 16x16 after resizing. With a little math, you can estimate the optimal network size.

Example:

Taking an image sample from our dataset, the smallest object in the original 1280x720 image is approximately 60x34 pixels.

To get an object size that still measures roughly 16x16, the image dimensions can be cut by a factor of 3, which makes the object (say, a baggage_truck) measure 60x34 / 3 => 20x11.3 pixels in the resized image.

However, knowing that the network dimensions must be divisible by 32, the closest values above 1280x720 / 3 = 427x240 are 448x256.

I applied this technique to the second dataset, collected for the second training. Here is the relevant setting from the config file:

width = 448, height = 256
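As a sanity check, here is a small Python sketch of that estimate: shrink the frame by the chosen factor, then round each dimension up to the nearest multiple of 32:

def network_size(img_w, img_h, shrink_factor):
    # ceil each shrunken dimension to the next multiple of 32 (a Darknet requirement)
    round_up_32 = lambda v: -(-int(v / shrink_factor) // 32) * 32
    return round_up_32(img_w), round_up_32(img_h)

print(network_size(1280, 720, 3))  # -> (448, 256), matching the config above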

Preparing the .cfg file for training doesn’t end here; we have to adjust more parameters according to the number of classes (in my case, I trained YOLOv4 to detect 8 classes: pushback_truck, catering_truck, passanger_bridge, stairway, ground_power, baggage_truck, baggage_loader, fuel_truck).

After cloning the Darknet repo, edit the config file in the /cfg folder:

  • change the batch line to batch=64
  • change the subdivisions line to subdivisions=16
  • change the max_batches line to classes*2000, but NOT less than the number of training images (and not less than 6000)
  • change the steps line to 80% and 90% of max_batches (e.g. steps=4800,5400 for max_batches=6000)
  • change classes=80 to the number of object classes in each of the 3 [yolo] layers (8 in my case)
  • change filters=255 to filters=(classes+5)*3 in the 3 [convolutional] layers before the [yolo] layers; keep in mind that it only has to be the last [convolutional] before each [yolo] layer (the sketch below works these values out for my case)
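To make that arithmetic concrete, here is a small Python sketch of the derived values for my 8 classes (the 6000-batch floor follows the darknet guide):

def cfg_values(num_classes, num_train_images=0):
    max_batches = max(num_classes * 2000, 6000, num_train_images)
    steps = (int(max_batches * 0.8), int(max_batches * 0.9))
    filters = (num_classes + 5) * 3  # last [convolutional] before each [yolo] layer
    return max_batches, steps, filters

print(cfg_values(8))  # -> (16000, (12800, 14400), 39)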

For more details, this section of the darknet repository is a great guide: how-to-improve-object-detection.

Training

I used the Darknet framework to retrain the YOLOv4 model on Google Colab.

1. Files to prepare

obj.data
obj.names (one class name per line, in class-ID order)
train.txt (the path of each training image)
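As an illustration, here is a minimal Python sketch of generating two of these files; all paths are hypothetical and should match your Darknet setup:

from pathlib import Path

# list every training image in train.txt, one path per line
images = sorted(Path("data/obj").glob("*.jpg"))
Path("data/train.txt").write_text("\n".join(str(p) for p in images))

# obj.data points Darknet at the class count, file lists, names, and backup dir
Path("data/obj.data").write_text(
    "classes = 8\n"
    "train = data/train.txt\n"
    "valid = data/valid.txt\n"
    "names = data/obj.names\n"
    "backup = backup/\n"
)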

2. Train

Colab file

The training took 6 days!

# the training command
!./darknet detector train data/obj.data cfg/yolov4-custom.cfg /kaabidrive/yolov4/weights/4/yolov4-custom_last.weights -dont_show -map
Finished training

Results

Detection on images

I have implemented the trained model (the .weights, .cfg, and .data files) in a Flask application that uses OpenCV’s cv2.dnn.readNetFromDarknet function to load the model and make predictions on uploaded images. You can check my GitHub repository and try it yourself!

detecting baggage_loader

For this to work, I had to set up the OpenCV DNN module with the CUDA backend. Why?
- OpenCV’s DNN module lets you take pre-trained neural networks from popular frameworks such as TensorFlow (or, in our case, Darknet) and use those models directly in OpenCV.
- OpenCV’s DNN module gives you fast inference results on the CPU, and the CUDA backend moves inference to the GPU.
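For reference, here is a minimal sketch of loading the trained Darknet files with OpenCV’s DNN module and enabling the CUDA backend (the file names are illustrative, the input size follows my config, and the thresholds are assumptions):

import cv2

# load the custom-trained Darknet model
net = cv2.dnn.readNetFromDarknet("yolov4-custom.cfg", "yolov4-custom_final.weights")

# run inference on the GPU (requires OpenCV built with CUDA support)
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(448, 256), scale=1 / 255.0, swapRB=True)

image = cv2.imread("apron.jpg")  # hypothetical input image
class_ids, confidences, boxes = model.detect(image, confThreshold=0.25, nmsThreshold=0.4)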

These are other predicted images tested on Google Colab!

Output

Detection on videos
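Detection on video works the same way, frame by frame. Here is a minimal sketch reusing the model from the previous snippet (the file name is hypothetical):

cap = cv2.VideoCapture("apron_turnaround.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    class_ids, confidences, boxes = model.detect(frame, confThreshold=0.25, nmsThreshold=0.4)
    for (x, y, w, h) in boxes:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("detections", frame)
    if cv2.waitKey(1) == 27:  # press Esc to stop
        break
cap.release()
cv2.destroyAllWindows()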

Technologies

These are the most important technologies used in this project:

Google Colab notebook: provided the GPU to train the model.

OpenCV: its DNN module loads the model and makes the predictions.

NVIDIA CUDA: makes it possible for the OpenCV DNN module to run on the GPU.

Flask framework: the backend for the web application.

Anaconda: creates an isolated environment in which to run the application.

Online CVAT: the Computer Vision Annotation Tool, used to prepare the dataset for training.

Challenges

I had to face two main challenges in this project:

The first was learning how the YOLOv4 algorithm works and what it needs to be retrained on a new custom dataset.

The second was making the trained model run behind a Flask web application.

For that, I had to use the OpenCV DNN module so I could use the trained model directly with Flask. Up to this point, everything was fine.

But I was obliged to change my work environment from Ubuntu to Windows due to storage problems.

Here came more challenges! After getting comfortable working on Ubuntu, a whole batch of new installations in a new environment was added to the task.

Installing CUDA and cuDNN on Windows, and learning everything along the way, was a time-consuming task.

But eventually, it was all done!

Timeline

Week 1:

  • Deciding which model to use and reading the state-of-the-art papers:
  1. Deep reinforcement learning for visual object tracking in videos.
  2. Multi-Agent Deep Reinforcement Learning for Multi-Object Tracker
  3. A review paper on object detection networks that use deep reinforcement learning.

PS: At first, the project was about building an object tracker. But I later realized that I had to start with the object detection task and then use it for tracking with any machine learning method I liked, supervised or reinforcement learning. So I decided to retrain the YOLOv3 model.

Week 2:

  • Understand how to retrain the model.
  • Collect dataset (around 250 images)
  • Started training with YOLOv3, but it felt like it would never end!

I had to switch to a faster version, YOLOv4, and read its paper to understand how it works (“YOLOv4: Optimal Speed and Accuracy of Object Detection”).

  • More research on how the configuration should be done!

Week 3:

  • Start training the YOLOv4 model (it took 6 days in total)
  • Collect the second dataset (almost 300 images)
  • Train the model with the second dataset (it also took 6 days in total)

PS: I didn’t go into detail about the second dataset in this blog because it was just one more experiment for me, changing some configurations to see whether they worked.

Week 4

  • Learn how to implement the trained model within a Flask web application.
  • Prepare the environment for the application to run.

Ethical Implications

1. Accuracy

The accuracy of an ML model is the proportion of examples for which it generates a correct output. In general, high accuracy is a good thing, and low accuracy can lead to harm. This is why it is important for us to improve accuracy.

2. Transparency

Very broadly, transparency is about users and stakeholders having access to the information they need to make informed decisions about ML.

3. Accountability

Transparency is an enabler for accountability: we need to be able to see what is going wrong, and where, in order to determine responsibility.

Conclusion

Re-training YOLOv4 was quite challenging. I am satisfied with the output, given the limited amount of data and the limited time the whole project had.
A larger dataset is certainly required, and more experiments with hyper-parameters are still needed, to retrain the model and ensure more accurate results.

PS: This project was done within the framework of Holberton School, as a final project after 10 months of learning machine learning from scratch.

Team: Ahlem Kaabi

Click to check my GitHub and LinkedIn 😊

Resources:

https://www.w3.org/TR/webmachinelearning-ethics
