Datacleaning fighter jets


When I finally got around to starting on the xView dataset, I found out that the test set and evaluation page were locked… which lobbed a bit of a torpedo at what I was planning. It can still be used as training data for any satellite-imagery task, but if you're looking to see how your model stacks up, as in a completed Kaggle competition: that's just not in the cards right now.

So I came back around to a paused homework project for fast.ai: the fighter jet detector. The idea is simple: train an aircraft classifier, then turn it into a detector. Why fighter jets? I won’t go crazy if I have to look at them for hours.

Note: You can follow along with the work notebook I used, here.


1. Getting the Data

fighter-jets-datacleaning.ipynb

I started this project in October, but wanted to give it a fresh start. I had already downloaded and cleaned the data once, and it was an experience I didn't want to repeat. The main obstacle was the lack of a GUI on the cloud machine. I used Jupyter to get around this by displaying images in batches of 16 with matplotlib. There were problems. I couldn't find a way to enter input without taking the time to learn about widgets, so I had to break a 2-layer for-loop into a series of cells that had to be run in the right order.

This is not how you want to do things.

I assigned numbers to files, and entered the number of the image that didn't belong. This was time-consuming and complicated. The worst part was that the process was hard to repeat or alter. You do it once and never come back. That's not how work is supposed to be done. As I went along, I made notes on ways things could be done better, but it was too late to implement them by then.

Aside: What’s interesting looking back on work you did while learning is that it’s as if you kept finding the single hardest way to do a task. I often hear researchers express surprise when they discover a simpler algorithm outperforming previous more complex ones, such as with OpenAI’s PPO. But I feel like complexity is often the result of an imperfect approximation to an ideal solution.

This time I approached things more methodically. The search term I used in my old script for the Panavia Tornado was just 'tornado'. Guess what happens when you do a Google image search for that? Instead of the chromedriver-based command-line script, I used a JavaScript command that captures links to the images you can actually see. This let me refine the search.

Getting the data was very easy. You build your URL files by entering the JavaScript command in your browser's console (Cmd-Alt-I or -J) when you can see the images you want. You then run a for-loop over them, calling fastai's download_images function:

# download dataset
for url_path in urls.ls():
    aircraft_type = url_path.name.split('.')[0]  # get class name
    print(f'downloading: {aircraft_type}')
    dest = path/aircraft_type; dest.mkdir(parents=True, exist_ok=True)  # set & create class folder
    download_images(url_path, dest)

Here urls is a pathlib Path to the folder holding your URL files (.csv or .txt), and url_path is each file in turn, assuming the files are named by class. I'm printing a Python f-string, which is very useful.

1.1 Keeping track of Data

Now there's an issue when you do this. The point of doing this on my Mac is that I can take advantage of the OS's GUI to quickly delete images I don't want. For this project to feel professional to me, I can't just copy all the cleaned images up to my cloud machine; I need to build a list of good links to download. The issue is that fastai's download_images renames files as consecutive integers: 00000001.jpg, 00000002.jpg, etc. There are 2 solutions to this: save url-filenames, or store a mapping between filenames and urls.

I tried the former first. I got it to work, but it wasn’t pretty and introduced 2 new problems:

  1. non-Latin alphabets encoded as UTF-8 bytes
  2. links using the same filename

The character-encoding issue was very confusing at first, especially since the URLs appeared decoded in the address bar, and the encoded bytes only showed themselves when I copied the entire address rather than just part of it. That was a simple fix:
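For reference, a minimal sketch of that kind of fix using Python's standard library (the exact call in the notebook may differ, and the link below is made up):

from urllib.parse import unquote

# hypothetical percent-encoded link; unquote() turns the UTF-8 bytes
# back into readable characters
url = 'https://example.com/%D0%A1%D1%83-34.jpg'
print(unquote(url))   # https://example.com/Су-34.jpg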

The second issue wasn't going anywhere. So I decided to figure out how to map filenames to urls. I crashed into a very confusing set of problems here. To save url-filenames I had to change the way fastai saved the images it downloaded. To do that I had to edit _download_image_inner, which is called by download_images, all from fastai.vision.data. Decoding UTF-8 bytes was necessary for this (on second thought I'm not sure it was), but it was pretty straightforward from there. To map filenames to urls I had to go a layer deeper and edit download_image, which is called by _download_image_inner.

I first tried to define a dictionary in global scope (outside of any functions) and just have download_image update it. But that didn’t work. I defined the dictionary in the format:

{class : {filepath : url}}

where the current class of aircraft is assigned in download_images, with an empty dict as its value, and that dict is filled in with filepath:url pairs inside download_image.
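Concretely, it was meant to end up looking something like this (class names, paths, and urls here are made up for illustration):

# hypothetical contents, purely for illustration
url_fname_dict = {
    'f22_raptor': {'data/f22_raptor/00000001.jpg': 'https://example.com/raptor.jpg'},
    'su34':       {'data/su34/00000001.jpg':       'https://example.com/fullback.jpg'},
}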

The top-level class-keys would be assigned, but the underlying dicts were all empty. Nothing was being stored in download_image. I didn’t want to take the time to understand python’s global keyword, and I figured it was because a copy of my url_fname_dict was being written to in the lower-level function. I did a print-out, and indeed key:value pairs were being assigned within download_image, but the original dictionary remained empty at the end.

So I created a class object to hold onto the dictionary and not have to worry about scope. After some debugging I got it to work and felt proud of my class.

Aside: This is actually how I first really started working with object-oriented programming. I never really understood what was going on in course lectures. In deep learning with fastai and my own projects, I ran into the wall of having to pass data way up and down a stack of functions. You either have to have a huge number of arguments — which easily leads to confusion and restricts the way you can work — or you use a class object whose attached data you can just access.

ImageDownloader class at the end of a debugging session. Downloading is commented out, and I’m testing if anything gets stored in the dictionary

And the new downloader.url_fname_dict … was still empty. I spent a few hours on this problem, and found a dozen ways to poke and prod at it until I had a thought:

“I had issues with download_images using multiple processes in the past, right? Wait, with max_workers=8 there are 8 processes trying to store data in the dictionary at the same time… does Python handle that? I feel like it’d just …not.”

I tested this, setting max_workers to -1, 0, and 1 (I don’t know if 0 and -1 mean use every core and a lot of processes, or just one, so I tested them). And it worked like a charm.

There's a thread on Stack Overflow about this. I don't know for certain that it's related to my issue, but it felt right enough and I didn't want to dive into multiprocessing / threading. So it turns out I didn't even need a class at all.

399 file-url maps stored when `max_workers` ≤ 1; 0 stored when > 1
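For what it's worth, here's a minimal sketch of the effect, assuming the parallel downloads run in separate worker processes: each worker writes to its own copy of the module-level dict, and the parent process never sees those writes.

from concurrent.futures import ProcessPoolExecutor

url_fname_dict = {}                    # lives in the parent process

def store(pair):
    fname, url = pair
    url_fname_dict[fname] = url        # mutates this worker's copy only

if __name__ == '__main__':
    # hypothetical filename/url pairs
    pairs = [('00000001.jpg', 'https://example.com/a.jpg'),
             ('00000002.jpg', 'https://example.com/b.jpg')]
    with ProcessPoolExecutor(max_workers=8) as ex:
        list(ex.map(store, pairs))
    print(url_fname_dict)              # {} -- the workers' writes never come back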

So this worked. And took. Forever. Only 1 process, remember? That means if a link takes 10 seconds to download… you’re waiting. And there are about 10,500 images in the dataset. Then I realized something great.

I can just print the whole damn thing out. At full speed.

Aside: I'm trying to hone in on ways of doing things, for when I'm running a company. So the question becomes: would I be okay with doing this in my own company? Funny enough: the answer for brute-force, quick, 'just get it done' type deals is not always no. Sometimes you can gain a lot of flexibility by sacrificing a little 'elegance' or automation. In this case the simple solution was a lot more powerful because it could run on the command line and write to disk. I was stuck in the Jupyter-notebook paradigm, and gained a lot by letting go.

Then copy-paste to a text file, and run a few regex filters to extract the filepath and its corresponding url, then put them into your dictionary.

I made 4 regex filters:

fail_pat = re.compile(r'Error \S+') # split
clas_pat = re.compile(r'downloading: \S+') # split
save_pat = re.compile(r'data/\S+')
link_pat = re.compile(r'\s-\s\S+') # split

The ones with 'split' comments need part of their filtered text removed, since I don't know how to do that in regex yet (capture groups would handle it; see the sketch below). The dictionaries are just defaultdicts:

from collections import defaultdict

removal_urls = defaultdict(lambda:[])
file_mapping = defaultdict(lambda:{})
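About those '# split' comments: regex capture groups would trim the match in one step. A quick sketch (assuming log lines shaped like the ones the filters above target; this isn't the notebook's code):

import re

clas_pat = re.compile(r'downloading: (\S+)')   # group(1) is just the class name
link_pat = re.compile(r'\s-\s(\S+)')           # group(1) is just the url

m = clas_pat.search('downloading: f22_raptor')  # hypothetical log line
if m: print(m.group(1))                          # -> 'f22_raptor'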

Then building the file-url mapping, and the first bit of urls to remove:

urls that don't download are the first additions to `removal_urls`

Everything’s finally ready for cleaning.


2. Datacleaning

There are 3 parts to cleaning this dataset:

  1. Verify downloads (remove corrupted/incomplete images)
  2. Visual inspection
  3. Update URL files

In practice, I did parts of step 2 before step 1, but the result is the same. fastai has a useful function that verifies images are viewable and have 3 color channels, and which also lets you resize them. The original dataset was about 2.27GB; resizing to a max of 500px on a side brought it down to 141MB.
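That check-and-resize is a one-liner per class (the same call that shows up again in section 3):

# fastai v1: delete unreadable images and resize anything over 500px on a side
for c in aircraft_types:
    verify_images(path/c, delete=True, max_size=500)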

Aside: I love that there are a lot of Deep Learning tasks you can do on a laptop. The GPUs are only needed for training-intensive models & data. Resizing and scanning 6–10k images, over 2GB worth, took only a minute or two, if that.

2.1 Visual Inspection

This was interesting, and touches a dull and controversial part of AI.

I have opinions.

A lot of people use Amazon’s Mechanical Turk (perhaps a questionable name, by the way) to bulk label data. I don’t like the idea of paying someone pennies for mindless work. So imagine the constraint:

you’re on your own and you have no money. find a smart way to do this.

Now, I can’t apply the assumption that the funds and resources always exist to tackle a problem directly, because when I look at my bank account the assumption doesn’t hold. So instead of buying off the problem, you have to consider more moving parts and how they fit together. If you’re a part of this system, besides being its architect, where do you fit in?

For vision we’re very good at picking the odd-one-out. Other fast.ai students put together a Jupyter widget that lets you review and delete or re-classify images. It creates a .csv file reflecting your changes so the dataset isn’t altered. This is definitely how I’d do it. One problem: the interface is made for smaller datasets, and there isn’t a way to turn off training-set shuffling without playing with PyTorch dataloaders:

A step in the right direction; but this just isn't going to cut it for 10,000+ images.
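For reference, this is roughly how the widget is driven in fastai v1 (taken from the docs, not from this notebook; it assumes you already have a trained Learner called learn):

from fastai.widgets import DatasetFormatter, ImageCleaner

# rank the training images by loss, then open the review widget;
# deletions/relabels are recorded to a csv rather than altering files on disk
ds, idxs = DatasetFormatter().from_toplosses(learn)
ImageCleaner(ds, idxs, path)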

What would be perfect is if I could review a full screen of images at once, and do so by class. That requires likely-major edits to the ImageCleaner widget and another side-quest into learning how to build Jupyter widgets. I don't want to spend time on that. It's also likely quixotic, because that's probably what ImageCleaner will evolve into anyway.

But by letting go of a little functionality, I can gain a lot of usability.

Turns out I can pretty much get what I need, and how I want it, on my OS:

A bit small though. Can we get a balance between ‘batch size’ and visibility?
Perfect.

This is where the work of mapping filenames to urls pays off. I can delete any images I want, and record the changes to the dataset. I started by scrolling through and picking out corrupted images (file-icon with no picture) and images that obviously didn’t belong. For 10,241 images in 27 classes this took 20 minutes. FastAI’s image verifier handles this in 1/20th the time.

I wanted finer control so I used macOS’s gallery view to scan each image:

Do I want to recognize silhouettes? I don’t know if this would hurt or help the model so I’ll leave it in. Oh, and I think the artist is Ivan Krpan.
That is definitely an Su-30, not an Su-34.

It took about 3 and a quarter hours to go through every image. What’s important is I could choose how fine-grained I wanted control. I could run through the whole dataset in under an hour if I kept the icon-view.

2.2 Updating the Dataset

I’m using a folder of text files containing the URLs of the images that make up my dataset. Thanks to the work done earlier, that’s very easy to update.

a more efficient way to look up filenames is to store them in a `dict` and do lookups, instead of searching an array: O(1) instead of O(n/c)

This is mostly reusing and modifying previous code. First you build a list of filepaths for each class folder. Then for each filepath key in your dictionary — corresponding to the current class — you check if it’s in your list. If not: add its URL to the removal_urls dictionary.

Then, for each class, you open and read the URLs file to a list — not adding URLs that exist in your removal_urls dictionary — and then overwrite the file.

Now you have a set of URL files containing links to images that belong in your dataset. You can choose to save the dictionaries if you want. PyTorch is useful for this (if you defined a defaultdict with a lambda function you’ll have to convert it to a dict):

torch.save(dict(file_mapping), path/'file_mapping.pkl')
torch.save(dict(removal_urls), path/'removal_urls.pkl')
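And to get them back later:

# torch.load round-trips the pickled dicts
file_mapping = torch.load(path/'file_mapping.pkl')
removal_urls = torch.load(path/'removal_urls.pkl')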

And that's it. You can use GitHub or anything else to copy the URL files to the cloud GPU machine, then download the images there. I ran a final check to see how large the cleaned dataset was:

tot = 0
for clas in aircraft_types: tot += len((path/clas).ls())
tot

It was 6,373. From an original 10,241.


3. Doing it again

To go from zero to done on a 10k-image dataset, I'd estimate 1 hour with a cursory visual inspection, or 4+ hours with a detailed one. I showed a lot of code here, but that was the discovery process. To do it again — after modifying the download code to print filepaths and urls (and getting your urls) — all you really need is: a download block,

for url_path in urls.ls():
    aircraft_type = url_path.name.split('.')[0]  # get class name
    print(f'downloading: {aircraft_type}')
    dest = path/aircraft_type; dest.mkdir(parents=True, exist_ok=True)  # set & create class folder
    download_images(url_path, dest)

(save the printout to a text file). One line for verification,

for c in aircraft_types:
    verify_images(path/c, delete=True, max_size=500)

(do a visual inspection); then the regex filters:

fail_pat = re.compile(r'Error \S+') # split
clas_pat = re.compile(r'downloading: \S+') # split
save_pat = re.compile(r'data/\S+')
link_pat = re.compile(r'\s-\s\S+') # split

The work is done by three blocks. One to record broken links (from printout):

with open(download_printout_path) as f:
    for line in f:
        # update class
        aircraft_type = clas_pat.findall(line)
        clas = aircraft_type[0].split()[-1] if aircraft_type else clas
        # search download path & url
        save, link = save_pat.findall(line), link_pat.findall(line)
        if save and link:
            link = link[0].split(' - ')[-1]
            file_mapping[clas][save[0]] = link
        # search failed download url
        fail_link = fail_pat.findall(line)
        if fail_link: removal_urls[clas].append(fail_link[0])

another to record the cleaned-out files:

# lookup urls of missing files in file_mapping & add to removal_urls
for clas in aircraft_types:
    flist = (path/clas).ls()  # pull all filepaths in class folder
    for fpath in file_mapping[clas].keys():
        if Path(fpath) not in flist:
            removal_urls[clas].append(file_mapping[clas][fpath])
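As the earlier note about O(1) lookups suggests, turning the folder listing into a set makes the membership check constant-time. A sketch of that variant:

# same logic as above, but `in` against a set is O(1) rather than O(n)
for clas in aircraft_types:
    flist = set((path/clas).ls())
    for fpath, url in file_mapping[clas].items():
        if Path(fpath) not in flist:
            removal_urls[clas].append(url)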

and a final block to update the url files:

for aircraft_type in removal_urls.keys():
    fpath = path/'fighterjet-urls'/(aircraft_type + '.txt')
    # open file; read lines
    with open(fpath) as f: text_file = [line for line in f]
    # drop lines whose url (minus the trailing '\n') is marked for removal;
    # building a new list avoids the skipped-element bug of pop()-ing from
    # a list while iterating over it
    text_file = [line for line in text_file
                 if line.rstrip() not in removal_urls[aircraft_type]]
    # overwrite url files
    with open(fpath, mode='wt') as f:
        for line in text_file: f.write(line)

The entire process is: download & print → verify & inspect → record broken links & removed files → update url files.


And there you have it. Time to ID some fighter jets.