Original article was published on Deep Learning on Medium
Are datasets important for Machine learning?
Data is the basic necessity for any form of learning. We humans learn to perceive and understand actions by experiencing them ourselves or by reading and hearing about others' experiences. Datasets, labeled or not, are the foundation of all machine learning tasks.
Training machine learning models requires a lot of data; the saying "the more the merrier" has never been more true. Current state-of-the-art (SOTA) models require gigantic amounts of data. For example, BERT, the SOTA for many NLP tasks, is trained on tens of gigabytes of text.
ImageNet, a standard image classification dataset, is around 144 GB compressed and over 300 GB uncompressed. Needless to say, labeled data is very important for the advancement of artificial intelligence.
While working on a project, I ran into a problem: the object detector I was using did not recognize all the objects in the image frame. I was trying to index every object present in an image, which would later make searching through images easier. But since every image was simply labeled "human" and the detector could not pick out the other objects in the frame, search did not work the way I wanted.
The ideal solution would be to gather data for those objects and re-train the object detector to identify them as well. This would be not only tedious but time-consuming. I could use GANs, a class of machine learning algorithms known for generating artificial samples similar to their inputs, to create more samples after curating a few manually, but that is also tedious and would require resources to train the GANs in the first place.
The only remaining option was to use annotation services, such as Scale AI and Google Cloud Platform, to create a dataset. But these services charge around $0.05 for each image annotated. Creating a dataset of 100 images for each of 15 categories (1,500 images) would mean spending about $75 just on the dataset. Spending that kind of $$$ on a simple project is not a viable option.
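The cost estimate above is simple arithmetic. As a minimal sketch, assuming a flat per-image rate (real services often price per object or have minimums, which pushes the bill higher):

```python
# Back-of-the-envelope annotation cost, assuming a flat per-image rate.
def annotation_cost(images_per_category, categories, rate_per_image):
    """Total cost (USD) of labeling the whole dataset."""
    return images_per_category * categories * rate_per_image

# 100 images for each of 15 categories at ~$0.05/image:
total = annotation_cost(100, 15, 0.05)
print(f"${total:.2f} for {100 * 15} images")
```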
Can games and gamers help make better datasets?
Out of all the options, I started wishing for an artificial world where every object already has the appropriate label assigned to it. Then I would not have to spend time searching for samples and meticulously annotating them. I realized that game worlds are the best example of such an artificial, or virtual, world: when the world is created, the game engine has all the necessary information. It knows which textures are located where, and it knows the location of every other object.

Looking at the current state of gaming, developers are pushing the limits and creating game environments that are hardly distinguishable from the real world; the Unreal Engine 5 demo on the PS5 shows the graphical fidelity the coming generation of consoles will boast. A virtual world that closely resembles the real one should make a great simulation environment for algorithms to test and learn in. It would be great to give algorithms access to the game world's information and let them learn its nooks and crannies.
The first game that came to my mind was Uncharted 4: A Thief's End.
A stunning game with a good story, it has all the elements we would want from a testing environment. It has all the natural terrain types (plains, plateaus, caves, shores) and nearly photo-realistic graphics, which would mean the object detector learns features and shapes close to those in the real world. The problem is that I do not have my PS4 with me at the university, and there is no way to access Uncharted 4's assets in the way I'd like.
The other choice for a game that could fulfill my main demand:
- a modding interface that allows access to the game world's data
was Grand Theft Auto V. The game, though now seven years old, resembles the real world more than most others. It has extensive support for mods and for making changes to, and interacting with, the game world. With a few mods, the game can look extremely stunning, with all of its lighting and reflections.
The GTA V community has made some outrageous mods in which the default player models are replaced by custom models. This is a really awesome capability, and I will come back to it later.
I started by reading articles and guides on using the APIs of RAGE, the engine that powers GTA V. There are very few good examples or guides for what I was trying to achieve. Luckily, though, there is an amazing tool that does something close to it: it gives the locations of many objects in the user's current image frame.
The greatest thing about this tool is that its source code is available on GitHub. With this solid starting point, the task came down to figuring out how to access object information, such as an object's location and type, from the game world. After hours and hours of reading and compiling (often erroneous) code, I was able to log results from the game screen. I started with simple logging:
(The first log file created by the script running inside GTA V)
Some things were easy, like getting the locations of named objects such as characters, which are individually identified by hashes. Progressively, I learned to use the API to get the locations of pedestrians and of things like cars driving close by. I was not able to fully hack the game world to get location information for every object I wanted, but it was a good starting point. The initial results looked promising: the scripts took screenshots of the game whenever an object was detected and saved its location. The results looked something like this:
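Once the script logs a screenshot path plus an object's class and bounding box, turning the log into a training dataset is mostly reshuffling columns. A minimal sketch, where the log format (one detection per line, field order, file names) is entirely my own assumption:

```python
import csv

# Hypothetical log format, one detection per line:
#   <screenshot>,<class>,<x1>,<y1>,<x2>,<y2>
# rearranged into the per-box annotation layout that CSV-based
# detector pipelines (e.g. keras-retinanet's CSV generator) expect:
#   <path>,<x1>,<y1>,<x2>,<y2>,<class>
def log_to_annotations(log_lines):
    rows = []
    for line in log_lines:
        path, cls, x1, y1, x2, y2 = line.strip().split(",")
        rows.append([path, x1, y1, x2, y2, cls])
    return rows

def write_annotations(log_lines, out_path):
    # Write the rearranged rows out as a plain CSV file.
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(log_to_annotations(log_lines))
```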
Training the Object Detector
With these satisfactory results, I went ahead and created more samples. I decided to use 14 object classes:
- Human-like: pedestrians/humans, dogs, and cats (3)
- Living things but not human-like: trees (4)
- Non-living things: cars, trucks, bikes, trains, boats, traffic signals, and billboards (6)
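Detector pipelines generally want this class list as an explicit name-to-id mapping. A minimal sketch; the exact names are illustrative (the grouped counts above suggest some groups span multiple in-game model types):

```python
# Illustrative class roster; names are my own, not the author's exact set.
CLASSES = [
    "human", "dog", "cat",                        # human-like
    "tree",                                       # living, not human-like
    "car", "truck", "bike", "train", "boat",
    "traffic_signal", "billboard",                # non-living
]

def class_mapping(classes):
    """Name -> integer id mapping, the shape keras-retinanet-style
    classes.csv files encode."""
    return {name: idx for idx, name in enumerate(classes)}
```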
After acquiring enough data to train a model, I started training a RetinaNet model on my university's server, a moderately powerful machine with a Quadro P5000 and an Intel Xeon Gold 6132. There was no specific reason for choosing RetinaNet other than my prior experience with it. I trained the network from scratch, since using pre-trained weights would defeat the idea of the project, which was to take knowledge from the game world and apply it to the real world.
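For reference, the train-from-scratch setup can be sketched as a command template. This assumes the fizyr keras-retinanet package (whose `retinanet-train` script accepts a `--no-weights` style option); flag names vary between versions, so treat it as a template rather than an exact invocation:

```python
# Template for a from-scratch training run on the CSV annotations.
# "--no-weights" is the key choice: no ImageNet-pretrained backbone,
# so everything the detector knows comes from the game-world data.
train_cmd = [
    "retinanet-train",
    "--no-weights",
    "--backbone", "resnet50",
    "--epochs", "50",
    "csv", "annotations.csv", "classes.csv",
]
print(" ".join(train_cmd))
```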
After hours and hours of waiting, I realized that staring at the slow training progress would not speed it up, so I decided to call it a day and sleep. After what felt like an eternity, training completed and it was showtime. I tested the model on a test set of images from the game. The results seemed fair:
The results were as expected, and now it was time to test performance on real-world objects. Looking at the outcomes, I think the experiment can be counted as a qualified success:
Most objects were identified well, but I noticed a blunder:
The reason for this, I assume, is the lack of varied in-game models for cats and dogs. The network learns the shapes of the different objects in the dataset, but since the variety of cat and dog samples was limited, these classes suffer: the model cannot learn the features needed to tell dogs and cats apart. The number of in-game models for dogs and cats is limited, so modding in new models for them could be a remedy.
The model performed well on these images, but don't take my word for it: I have created a Colab notebook where you can train your own model in just 5 minutes to recognize humans in an image. To keep the runtime as low as possible, the notebook ships with a small dataset of only 200 samples of human objects from the game. Even this low number of samples is sufficient to show the potential of such a model.
After getting object detection up and running, I wanted to get semantic segmentation working, since that is where the real potential of automatic labeling would shine: exploiting the game engine's knowledge of textures to obtain pixel-level segmentation. I have not yet figured out how to extract texture-based information from RAGE, the game engine GTA V uses. Getting an object's location is one thing; getting its location and the exact area it covers is another, and the latter is where I am currently stuck.
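To make the goal concrete: a hypothetical sketch of what free segmentation labels would look like, assuming one could dump a per-pixel object-id buffer from the engine alongside each frame (that buffer is exactly the part I have not managed to extract):

```python
import numpy as np

# Hypothetical: given an (H, W) buffer of engine object ids for a frame,
# mapping ids to dataset class ids yields a segmentation mask for free.
def ids_to_class_mask(id_buffer, id_to_class):
    """id_buffer: (H, W) integer array of engine object ids.
    Returns an (H, W) array of class ids, with 0 as background."""
    mask = np.zeros_like(id_buffer)
    for obj_id, cls in id_to_class.items():
        mask[id_buffer == obj_id] = cls
    return mask
```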
I believe the potential of this is huge: with mods, we can easily replace player models with models of the things we want to build detectors for. Swapping the character model for that of any object of interest would let us build a large database of that object's appearance across varied environments.
With my GSoC selection, my available time has shrunk and my dev environment has changed, but I would love to come back to this project in the future. I was planning to upload the 31 GB dataset of all the images and their labels, but that is not currently possible because my university's internet service is unbearably slow.
My next blog here will be either a dev-log for another project I am working on or a kickoff post for the GSoC updates I will be posting over the coming three months to track the progress of my work with CERN.