Source: Deep Learning on Medium
Chess, rolls or basketball? Let’s create a custom object detection model
Data Science Toolkit Part II
YOLO is one of my favorite Computer Vision algorithms and for a long time, I had a plan of writing a blog post dedicated solely to this marvel. However, I decided that I don’t want it to be another article explaining in detail how YOLO works under the hood. There are at least a few publications on Medium that cover the theoretical side of things very well. Besides that, if you want to broaden your understanding of this architecture, it is also a great idea to get your information directly at the source and read the original paper.
Instead of theory, this time I will show you how to quickly create customized models capable of detecting any objects you choose, with relatively low effort and no need for a powerful machine. This is a great approach if you need to quickly test an idea at work or just want to have a good time building a little pet project at home. Last year I had the opportunity to perform three such experiments, and all of the visualizations that appear in this article are the outcome of those projects.
Note: This time, we will mostly use open-source libraries and tools, so the amount of coding on our side will be minimal. However, to encourage you to play with YOLO and give you a starting point for your own project, I have also provided scripts that will allow you to download my pre-trained models along with all the configuration files and test datasets. As usual, you'll find it all on my GitHub.
All of you who have no idea what YOLO is, but out of curiosity decided to click on the cover image: don't worry and don't go anywhere! I will now briefly explain what I am talking about.
YOLO, or You Only Look Once, is a real-time object detection algorithm, and one of the first to balance the quality and speed of its predictions. The most powerful models of this type are built upon Convolutional Neural Networks, and this one is no different. By "object detection model" we mean that we can use it not only to indicate which objects are present in a given photo but also where they are located and how many of them there are. Models of this kind are used, among others, in robotics as well as in the automotive industry, where the speed of inference is crucial. Since 2015, there have already been three iterations of this algorithm, as well as variations designed for mobile devices, like TinyYOLO. The precision of the mobile version is limited, but it is also less computationally demanding, allowing it to run faster.
As usual in Deep Learning, the first step towards creating your model is to prepare a dataset. Supervised learning is about looking at labeled examples and finding non-obvious patterns in data. I must admit that creating a dataset is a rather tedious task. That’s why I prepared a script that will allow you to download my Chess dataset and check out how YOLO works on this example.
But those of you who want to build your own dataset face a challenge. To achieve this goal, we need to collect a set of images and create matching label files. The pictures should contain the objects that we would like to recognize, and it is recommended that all object classes be represented in roughly similar proportions across the dataset. As you can see, in the case of my first project, the Basketball detector, I used frames from game videos.
Your label files should have names identical to those of the images, but obviously with a different extension, and should be located in a parallel directory. The optimal data structure is presented below. In addition to the images and labels directories, we must also prepare a class_names.txt file that defines the names of the object classes we plan to detect. Each line of this file represents a single class and should contain a single word or multiple words with no spaces.
project
├── images
│   ├── image_1.png
│   ├── image_2.png
│   └── image_3.png
├── labels
│   ├── image_1.txt
│   ├── image_2.txt
│   └── image_3.txt
└── class_names.txt
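Before training, it is worth verifying that every image really has its label twin. Here is a minimal sketch of such a check, assuming the directory layout above and .png images; the function name is my own, not part of any library:

```python
from pathlib import Path

def check_dataset(root: Path) -> list:
    """Return the names of images that are missing a matching label file."""
    images_dir = root / "images"
    labels_dir = root / "labels"
    missing = []
    for image in sorted(images_dir.glob("*.png")):
        # A label file shares the image name, with a .txt extension
        label = labels_dir / (image.stem + ".txt")
        if not label.exists():
            missing.append(image.name)
    return missing
```

Running it on your project directory before training can save you from silent failures caused by a single misnamed file.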
Unfortunately, YOLO requires a specific label format that is not supported by most free labeling tools. To eliminate the need for parsing labels from VOC XML, VGG JSON or another widely used format, we will leverage makesense.ai. This is a free and open-source project that I develop on GitHub. The editor not only supports direct export to the YOLO format but is also intuitive and does not require installation, as it works in a browser. Additionally, it offers multiple features aimed solely at speeding up your labeling work. Take a look at the labeling process during my second project, the Rolls detector.
Once the work is done, we can download the .txt label files. Each such file corresponds to a single labeled image and describes what objects are visible in the photo. If we open one of these files, we will discover that every line follows the class_idx x_center y_center width height format, where class_idx represents the index of the assigned label from the class_names.txt file (counting from 0). The rest of the parameters describe the bounding box surrounding a single object. They can take values between 0 and 1 (relative to the image dimensions). Fortunately, most of the time we don't need to think about these details, as the editor handles it all for us. An example of a label in the YOLO format is shown below.
4 0.360558 0.439186 0.068327 0.250741
7 0.697519 0.701205 0.078643 0.228243
3 0.198589 0.683692 0.076613 0.263441
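To make the format concrete, here is a small helper that converts one such line back to pixel coordinates; the function name and the rounding choice are mine, not part of YOLO or the labeling tool:

```python
def yolo_to_pixels(line: str, img_w: int, img_h: int):
    """Convert one YOLO label line to (class_idx, x_min, y_min, x_max, y_max) in pixels."""
    class_idx, x_c, y_c, w, h = line.split()
    # Scale the normalized values back up to the image dimensions
    x_c, w = float(x_c) * img_w, float(w) * img_w
    y_c, h = float(y_c) * img_h, float(h) * img_h
    # The stored point is the box center, so step out by half the size
    return (int(class_idx),
            round(x_c - w / 2), round(y_c - h / 2),
            round(x_c + w / 2), round(y_c + h / 2))

box = yolo_to_pixels("4 0.360558 0.439186 0.068327 0.250741", 640, 480)
# For a 640x480 image this gives (4, 209, 151, 253, 271)
```

Notice that the same label file describes the same relative boxes regardless of the image resolution, which is exactly why the normalized format is convenient.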
YOLO was originally written in a niche framework for Deep Learning called Darknet. Since then, many other implementations have been created, most of them using the two very popular Python platforms — Keras and PyTorch. Among all the available solutions, there is one that I particularly like. It offers a high-level API for training and detection but is also rich in useful features. When using it, all our work boils down to preparing a dataset and creating a few configuration files, then the responsibility is transferred to the library.
Environment setup is also quite simple, as it comes down to running the several commands you may find below (assuming that you already have Python and Git installed on your computer). It is best to execute the commands from the project directory, to achieve the structure shown above. It is also worth mentioning that the environment can also be created via Docker (this can be especially useful for Windows users). You can find more instructions on this topic here.
# Clone framework
git clone https://github.com/ultralytics/yolov3.git
# Enter framework directory [Linux/MacOS]
cd ./yolov3
# Setup Python environment
pip install -U -r requirements.txt
As I mentioned in the previous paragraph, all we need to do now is create several configuration files. They define the locations of our training and test sets and the names of the object classes, as well as provide guidelines on the architecture of the neural network being used.
First, we need to split our dataset into a training set and a test set. We do this with the help of two .txt files, each containing paths leading to specific images from our dataset, one line per image. To speed up our work, I have prepared a Python script that will create these files automatically for us. All you need to do is indicate the location of your dataset and define the percentage split between the training and test sets.
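If you would rather roll your own split than use my script, a minimal sketch might look like this; the function name, the .png extension and the output layout are my assumptions, not the script's actual interface:

```python
import random
from pathlib import Path

def split_dataset(images_dir: str, out_dir: str,
                  test_fraction: float = 0.2, seed: int = 0) -> None:
    """Write train.txt and test.txt, each listing absolute image paths."""
    paths = sorted(Path(images_dir).glob("*.png"))
    # Shuffle with a fixed seed so the split is reproducible
    random.Random(seed).shuffle(paths)
    n_test = int(len(paths) * test_fraction)
    splits = {"test.txt": paths[:n_test], "train.txt": paths[n_test:]}
    for name, subset in splits.items():
        with open(Path(out_dir) / name, "w") as f:
            f.writelines(str(p.resolve()) + "\n" for p in subset)
```

Each line of the resulting files is simply an absolute path to one image, which is exactly what the training framework expects.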
.data is the final file we need to provide. Let's discuss its contents using the example of my third project, the Chess detector. In this case, I had 12 unique object classes that I wanted to recognize. Next, we give the locations of the files defining which photos belong to the training and test sets, and finally the location of the previously discussed file with the names of the labels. For everything to work properly, the train.txt, test.txt and chess.names files must be located at the paths declared inside the .data file.
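For illustration, a Darknet-style .data file for the Chess project might look like the sketch below; the exact paths are assumptions and depend on where you placed the files:

```
classes=12
train=data/train.txt
valid=data/test.txt
names=data/chess.names
backup=backup/
```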
Now we are ready to start training. As I mentioned earlier, the library we use has a high-level API, so one command in the terminal and a few parameters are enough to start this process. Underneath, however, there are several important things happening that significantly increase our chances of achieving final success.
First, we can apply transfer learning — we don’t have to start our training from scratch. We can use the weights of a model trained on different datasets, which results in shorter learning times for our own network. Our model can use the knowledge of basic shapes and focus on linking this information to new types of objects that we want to recognize. Secondly, the library performs data augmentation — so it generates new examples, based on the photos we provided. Because of that, we can train our model even when we have only a small dataset — several hundred images. The library we use also provides us with a sample of images that were created as a result of augmentation. Below you can see examples created during the training process of my Basketball detector.
And finally, the moment of pleasure has come! Our work devoted to creating the model is rewarded, as we can now use it to find the objects we seek in any photo. Once again, it's a very simple task that we can accomplish with one command in the terminal. After execution, we will find the results of our predictions in the output directory. It is worth mentioning that we can also run live predictions on the video coming from our webcam, which is especially useful when we want to make an impression by showing a demo.
Congratulations if you managed to get here. Big thanks for the time spent reading this article. If you liked the post, consider sharing it with your friend, or two friends or five friends. I hope I managed to prove that it is not difficult to train your own custom YOLO model, and my tips will be helpful in your future experiments.
This article is another part of the "Data Science Toolkit" series; if you haven't had the opportunity yet, read the other articles. Also, if you like my work so far, follow me on Twitter and Medium, and check out the other projects I'm working on on GitHub and Kaggle. Stay curious!