Loc2vec  —  a fast pytorch implementation

Ever since I saw this blog post from sentiance on Loc2Vec, I was fascinated and wanted to replicate it. They show that it is possible to learn about geography by training in an unsupervised manner similar to Word2Vec and arrive at impressive embeddings for each location. It is a fantastic write-up, and in this post I assume that you have read it in its entirety. If you have not, please go check it out. It took them two weeks of AWS P3.2 time (retail cost of about $1000 at $3 per hour), and deep learning usually takes many iterations before getting it right, which would mean even more expense. However, I had a nostalgic calling and this project was always in the back of my mind. I recently implemented it with newer techniques and accepted wisdom that have emerged since the post was written (9 months is a generation ago in the fast-paced deep learning world). To my surprise, I was able to cut the training time by two orders of magnitude using the various techniques and shortcuts detailed below. My implementation is available here. While I am much more adept with Keras, I chose to use pytorch for two reasons: a. I wanted to get better at pytorch. b. I wanted to use mixed precision training to reduce memory requirements and perhaps speed up training.

But first, let’s walk down memory lane…. :-) Many years ago, I, along with a few of my friends, started to map cities in India with a plain-old GPS, some marine equipment called DGPS, a PDA (yes, you read that right) to record NMEA messages, and some good old hackery. We were able to survey two coastal cities in India and create maps for them using quite a concoction of open source technologies. We showed maps “google-maps style” in an era when most maps took a round trip to the server just to pan a bit. However, that adventure was very short lived; apparently, according to a 100+ year old British-era law, surveying was a “spying act” and we were committing national espionage. Playing James Bond is fine in your dreams, but the very threat of being charged with spying is not fun to deal with. We were doing it all as a non-profit, as no digital online maps existed at that time. None of us had the resources, willpower or time to fight this 100+ year old law and go through the bureaucracy while pursuing full-time jobs. I still have very fond memories of the super helpful folks in the GIS open source community though. (Hi ka-map, postgis, etc.!) I love it when someone writes a good map-related article like this one or the Loc2Vec blog post.

Literature study

Before you implement anything, it is worth scanning for similar papers published on arxiv to get an idea of the various techniques people have tried and what kind of parameters they have used. My favorite website for this purpose is arxiv-sanity. The Tile2Vec paper from Neal Jean and team achieves remarkably similar results, but using satellite imagery instead of the rendered maps of loc2vec. In that paper they:

  1. Use Resnet18 with pre-trained weights.
  2. Find that the margin size does not matter.
  3. Find that L2 normalization does not improve accuracy, but does help prevent overfitting.
  4. Use the regular Adam optimizer with a learning rate of 1e-3.

Map data generation

Openstreetmap data is available in a form that is compressed and conveniently split into sub-regions. If you have enough compute power and large enough NVMe disks, you can download the whole-world data. I didn’t; I chose the US West sub-region, which is large enough and diverse. Furthermore, unlike other countries, the US government shares road data for free via the TIGER project, hence the road data is comprehensive. This data is stored and distributed freely by several folks; I downloaded it from geofabrik.de.

I spent two days looking at the various tools to process this data and create a tile-server, mostly because I was distracted by nostalgia and by appreciating how much the landscape has evolved since I last looked at this area. Fortunately, I found a docker image that packaged everything I needed and required running only three commands to generate the tiles.

The folks at sentiance generated 12 images as input for each area, but I chose to provide only a single image as input and see how it goes (trading my time vs. compute time). This made it pretty straightforward: I just had to use the http://localhost:80/tile/{z}/{x}/{y}.png format to download the tiles (z is the zoom level; x and y are tile indices derived from long/lat). I also decided to generate the tiles ahead of time and store them on disk to simplify my implementation and remove the CPU bottleneck. Also, the fact that I used only a small part of the world helps with the storage requirements!
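As a rough sketch of that download step: the conversion from lat/long to tile indices is the standard slippy-map formula, and the function names below are placeholders of mine, not the ones in the repository.

```python
import math
import os
import requests

TILE_URL = "http://localhost:80/tile/{z}/{x}/{y}.png"

def deg2tile(lat_deg, lon_deg, zoom):
    """Standard slippy-map conversion from lat/long to tile x/y indices."""
    lat_rad = math.radians(lat_deg)
    n = 2 ** zoom
    x = int((lon_deg + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y

def download_tile(lat, lon, zoom, out_dir="tiles"):
    """Fetch one tile from the local tile-server and cache it on disk."""
    x, y = deg2tile(lat, lon, zoom)
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{zoom}_{x}_{y}.png")
    if not os.path.exists(path):          # skip tiles we already have
        resp = requests.get(TILE_URL.format(z=zoom, x=x, y=y))
        resp.raise_for_status()
        with open(path, "wb") as f:
            f.write(resp.content)
    return path

# Example: one tile over downtown San Francisco at zoom 14
download_tile(37.7749, -122.4194, zoom=14)
```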

Generate tiles

As mentioned before, for each tile I need three things: lat, long and zoom level. In the Loc2Vec experiment, they generated 128×128 tiles for each lat-long input plus nearby inputs shifted by 0–80m in both directions. Our tile-server generates 256×256 tiles. To match sentiance, one should adjust the zoom level to get approximately 200m×200m per tile. This website gives you more information on choosing the right tile size. Accordingly, I would need to use zoom level 17, but I chose level 14 for the sake of reducing compute needs. My goal is to replicate the spirit of the experiment and the general idea, not the specifics. I also generated tiles at zoom level 12 to have a smaller dataset for running smaller experiments during the initial stages.
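As a sanity check on the zoom-level arithmetic, the standard Web Mercator tile-width formula gives the ground coverage of a 256×256 tile; the latitude used below is just a representative west-coast value.

```python
import math

def tile_width_meters(lat_deg, zoom):
    """Approximate ground width covered by one 256x256 Web Mercator tile."""
    earth_circumference = 40075016.686  # meters at the equator
    return earth_circumference * math.cos(math.radians(lat_deg)) / (2 ** zoom)

for zoom in (12, 14, 17):
    print(zoom, round(tile_width_meters(37.0, zoom)), "m per tile")
# Zoom 17 comes out to roughly 250 m per tile at this latitude (close to the
# ~200 m used by sentiance); zoom 14 is roughly 2 km, trading detail for compute.
```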

Once the tiles are generated, the info-less tiles (empty land or ocean) can be removed using file size as a proxy. I found this out by plotting the histogram of file sizes: the info-less tiles stood out distinctly and in abundance.
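A minimal sketch of that filtering step, assuming the tiles sit in a flat tiles/ directory; the cutoff value below is purely illustrative and should be read off your own histogram.

```python
import glob
import os
import matplotlib.pyplot as plt

tile_paths = glob.glob("tiles/*.png")
sizes = [os.path.getsize(p) for p in tile_paths]

# Info-less tiles show up as a tall spike of tiny files in the histogram
plt.hist(sizes, bins=200)
plt.xlabel("tile file size (bytes)")
plt.ylabel("count")
plt.show()

MIN_BYTES = 2000  # hypothetical cutoff; pick the real one from the histogram
tile_paths = [p for p, s in zip(tile_paths, sizes) if s > MIN_BYTES]
```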

If the data generation is difficult, it might make sense to generate this data and upload it to Kaggle for others to experiment with. If anyone has difficulty generating this data, let me know and I can upload it.

Good Proxy for Location data

Sentiance folks used check-in data to generate tiles. We would need to source something equivalent from services like Foursquare/Twitter/Facebook, or find other ways to approximate lat-long data. Fortunately, since all the tiles are rendered, file size is a good proxy for how dense a location is: larger files correspond to busy urban locations and smaller files to sparse remote locations, which is also visible in the histogram plot. If we sampled uniformly, training would be biased towards differentiating remote locations rather than the urban areas of interest that humans inhabit and visit. In fact, I did try this first, and the network became very good at differentiating between forests, streams and other natural features like lakes. Instead, we use balanced sampling based on file size to tilt the relative importance towards larger files. The graph below shows the histogram under uniform sampling and under balanced sampling.

Blue — Original file sizes; Green — after balanced sampling

This balanced sampling does indeed generate interesting results, as seen below. There are probably many more ways to mine the openstreetmap data for interesting locations; my postgis-fu is not good enough to exploit that. If anyone wants to collaborate on that, I’d love it.
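One simple way to approximate this file-size-based sampling in pytorch is a WeightedRandomSampler with the file size as the weight; treat this as a sketch of the idea rather than the exact weighting scheme used here (tile_dataset is a placeholder for the tile dataset class).

```python
import os
import torch
from torch.utils.data import WeightedRandomSampler

def file_size_sampler(tile_paths, num_samples):
    """Draw tiles with probability proportional to their rendered file size,
    so dense urban tiles are sampled more often than sparse rural ones."""
    weights = torch.tensor([os.path.getsize(p) for p in tile_paths],
                           dtype=torch.double)
    return WeightedRandomSampler(weights, num_samples=num_samples,
                                 replacement=True)

# sampler = file_size_sampler(tile_paths, num_samples=len(tile_paths))
# loader = DataLoader(tile_dataset, batch_size=5, sampler=sampler)
```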

Data augmentation

Like Word2vec, loc2vec is unsupervised learning. It is based on the fundamental premise that nearby locations are more semantically similar to each other than to any random distant location. The Tile2Vec paper summarizes this much better than I could, in a technical yet easy-to-comprehend way, as “an unsupervised representation learning algorithm that extends the distributional hypothesis from natural language — words appearing in similar contexts tend to have similar meanings — to spatially distributed data.” As a result, when any location is sampled, nearby locations are also sampled and given the same label in the Loc2Vec experiment. I sampled each tile and generated 20 images from it. While I did experiment with 6 samples per tile and 10 samples per tile, 20 seems to provide better results. Pytorch provides an API for sampling the 4 corners and the center of an image. These five crops were flipped vertically and supplemented with other randomly translated and rotated tiles to create a stack of 20 images per tile.
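For illustration, here is roughly what building such a 20-image stack can look like with torchvision’s FiveCrop plus flips and random affine jitter; the crop size and jitter parameters are placeholders, not the exact values used in the experiment.

```python
import torch
from torchvision import transforms
from torchvision.transforms import functional as F

CROP = 128  # crop size; the rendered tiles are 256x256

five_crop = transforms.FiveCrop(CROP)          # 4 corners + center
random_aug = transforms.Compose([
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1)),
    transforms.RandomCrop(CROP),
])
to_tensor = transforms.ToTensor()

def augment_tile(img, n_random=10):
    """Turn one 256x256 tile (PIL image) into a stack of 20 images that all
    share the same label: 5 deterministic crops, their vertical flips, and
    10 randomly jittered crops."""
    crops = list(five_crop(img))
    crops += [F.vflip(c) for c in crops]
    crops += [random_aug(img) for _ in range(n_random)]
    return torch.stack([to_tensor(c) for c in crops])  # shape: (20, 3, 128, 128)
```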

Start small (with MNIST or similar)

As a general rule of thumb, whenever you are trying to build a new deep learning model, it is prudent to start with a simpler dataset and a simpler model and expand gradually. It is even better to start from an existing repository that is proven to work, as some deep learning bugs are subtle to detect and debug. This project revolves around online triplet loss, with some modification for the PN-loss described in the Loc2Vec blog. A quick search led me to this wonderful pytorch implementation. I used that as a starting point, stripped out the other loss functions, and modified it to match our goal. Specifically, I did the following:

  1. Add metrics to monitor progress: a. At a high level, we want to know if the distance between positive pairs is decreasing and the distance between negative pairs is increasing. b. The existing code already output the number of effective triplets, which was a good metric for seeing whether any of the loss functions collapsed.
  2. Simplify the code by removing other loss functions and networks.
  3. Also use just the online hard triplet mining, as this makes training a lot more efficient.
  4. Modified the existing loss function to take the min of the AN and PN distances as the negative term. This makes intuitive sense, as it is equivalent to taking the worse loss of the (A, P, N) and (P, A, N) triplets. The figure below illustrates how to think about it. I decided not to use the softmax and to use the conventional RELU for the triplet loss, as I believed the primary benefit was picking the better of the two triplet combinations mentioned above. (A minimal sketch of this loss appears after the figures below.)
  5. Experimented with L2 normalization on the embedding layer, as it should offer better generalization in theory, but it did not yield better separation on this dataset for a two-dimensional embedding output. Furthermore, since the Tile2Vec paper reported that L2 normalization did not yield better results, I abandoned it.
  6. The MNIST example allows you to choose between hard negative, semi-hard, and random online mining. Since our training times are long, I didn’t want to run experiments to find out which one is better; I modified the code to allow a graceful fallback if one of them does not yield a good triplet. By monitoring the number of triplets, I was able to detect that using just hard negative mining collapsed the training midway.
PN-loss pulls anchor-positive pair together while pushing AN/PN pair apart
MNIST Embedding without L2 Normalization
MNIST embedding with L2 normalization for embedding
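To make the loss modification above concrete, here is a minimal sketch of it applied to already-mined (anchor, positive, negative) embedding batches; the actual code performs online batch mining over a pairwise distance matrix, and the margin value here is illustrative.

```python
import torch
import torch.nn.functional as F

def pn_triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet loss with the PN modification: the negative term uses whichever
    of the anchor-negative / positive-negative distances is smaller, which is
    equivalent to taking the worse of the (A, P, N) and (P, A, N) triplets."""
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    d_pn = F.pairwise_distance(positive, negative)
    return F.relu(d_ap - torch.min(d_an, d_pn) + margin).mean()
```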

Hyperparameters

After looking at the Tile2Vec paper, it was evident that the default parameters (a learning rate of 1e-3) were good and that L2 normalization of the embedding layer was unnecessary. I tried the same and got lucky! Nevertheless, I need to run more experiments to see if a better embedding is possible.

Further model modifications

It has been 9 months since the blog was written, which in deep learning terms is about a generation given the fast pace of innovation. I made the following informed changes to Loc2Vec’s approach to speed up and simplify the training.

  1. Use batch normalization: we now know that batch normalization is a well-known way to speed up training, as the gradients propagate better. It makes sense that they didn’t use it, because in their case they had only 5 distinct samples per batch (with 20 neighbors for each sample), and batch normalization does not work well when the batch size is that small.
  2. Use smaller images or fewer input channels to reduce training time. I chose to use a single RGB image instead of a 12-channel image, which should give a 4x reduction in memory for the input. This decision is motivated by the fact that the first layer of the neural network is good at teasing out the information in the RGB channels.
  3. Use pre-trained weights instead of starting from scratch. Imagenet is comprised of natural images, and cartographic images have a different look whose image statistics do not match those of natural images, yet it is still a good idea to start with pre-trained weights. A recent paper by Maithra Raghu and team shows that transfer learning with pre-trained weights on medical images, which also do not share the image statistics of natural images, indeed offers a 10x improvement in training time over training from scratch. I expect to see similar results in this experiment.
  4. Use Densenet or Resnet instead of a VGG-like architecture. Densenet concatenates feature maps in each layer rather than summing them as Resnet does, so I’d expect Densenet to offer better embeddings than Resnet. However, I could not get the Densenet included with pytorch to work with smaller image sizes without major surgery, hence I opted for Resnet instead. This is something to revisit later. I started with Resnet18, but since the CPU was the bottleneck and I had spare memory, I upgraded to Resnet50. Furthermore, we recently learned that the loss landscapes of Resnet and Densenet are much smoother and do not have the local minima that plague VGG, so it is better to use an architecture with skip connections when possible.
  5. Use mixed precision (FP16 + FP32) instead of FP32. I have an 8GB RTX 2080, whereas the Loc2Vec experiment was done on a 16GB V100 card. I explored the FP16 route in an effort to reduce memory requirements, using Nvidia’s amp library for mixed precision training. I just had to add a few lines and it was seamless (a sketch of the two steps appears after the figures below). This almost never happens to me in deep learning! It says a lot about the maturity of the tools. It is also quite possible that amp has been fine-tuned on Resnet, and since I was using that architecture it worked seamlessly. Either way, I am impressed. The amp library also clipped a few gradients that caused overflow, which I guess also helped with training. Very impressive, Nvidia!
Using pretrained vectors accelerate training by 10x. [Maithra Raghu et. al]
Densenet and Resnet have smoother loss landscape compared to VGG and hence easier to train.
Enabling Mixed precision Training: import and initialize amp
Step 2 for mixed precision training: Wrap the backpropagation step and optimizer step. That’s all.
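As a reference for those two steps, here is a minimal sketch of the pre-trained Resnet50 embedding network wrapped with Nvidia’s apex amp; the embedding size, opt_level and the loss_fn/labels plumbing are illustrative assumptions, not a copy of the actual training loop.

```python
import torch
import torch.nn as nn
from torchvision import models
from apex import amp  # Nvidia's mixed precision (amp) library

EMBEDDING_DIM = 16  # illustrative embedding size

# Pre-trained Resnet50 with the Imagenet classifier swapped for an embedding head
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, EMBEDDING_DIM)
model = model.cuda()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Step 1: initialize amp ("O1" = mixed precision with dynamic loss scaling)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# Step 2: wrap only the backward pass; the rest of the loop is unchanged
def train_step(images, labels, loss_fn):
    optimizer.zero_grad()
    embeddings = model(images.cuda())
    loss = loss_fn(embeddings, labels)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
    return loss.item()
```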

Compounding Effect

It is important to note that while several of these adjustments are simple modifications, their collective impact compounds. Using a pre-trained Resnet with batch normalization gives a smooth loss surface, which lets us cut the training time by 10x. That knowledge of the loss surface lets us confidently train with a high learning rate. Using fewer and smaller images lets us increase the batch size, and so does migrating to FP16. This combined effect lets us use a better architecture (Resnet18 -> Resnet50). Online hard negative mining with larger batch sizes improves the quality of the loss signal provided to the network, which speeds up training further! In short, “In Resnet we trust” :-)


Results

After all these modifications, I was able to run the training for one epoch in 8 hours on an RTX 2080 with 8GB of memory, which is half of what the V100 in an AWS P3.2 instance has. Technically, the training converged in half an epoch, but I had not added the ability to save the model mid-epoch. I could claim a 100x speedup, but that really would be comparing apples and oranges.

I used the annoy library to store the embeddings and query them by either index or embedding vector; its clean API hides all the complexity beautifully, and I’ll probably expand on it in future posts (a minimal sketch of the Annoy usage follows the result figures). Below are some of the results from my experiments, which match the visualizations in the original loc2vec blog.

PCA of the embedding clearly shows that rivers, cities, and green areas (other side of the sphere) are learned
TSNE plot shows clear separation between green areas, waterfront, highways, etc.
Nearest Neighbor search based on embedding (First tile is the input query image)
Interpolation in the embedding space shows smooth transition from green to suburban space.
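For reference, here is a minimal sketch of how such an Annoy index can be built and queried; all_embeddings below is a random stand-in for the real per-tile vectors produced by the network.

```python
import numpy as np
from annoy import AnnoyIndex

EMBEDDING_DIM = 16  # must match the network's output size

# `all_embeddings` stands in for the per-tile embedding vectors
all_embeddings = np.random.rand(1000, EMBEDDING_DIM).astype("float32")

index = AnnoyIndex(EMBEDDING_DIM, "euclidean")
for i, vec in enumerate(all_embeddings):
    index.add_item(i, vec.tolist())
index.build(20)          # number of trees: more trees = better recall, bigger index
index.save("tiles.ann")

# Query either by an item already in the index (a tile id)...
neighbours = index.get_nns_by_item(123, 10)
# ...or by an arbitrary embedding vector
neighbours = index.get_nns_by_vector(all_embeddings[0].tolist(), 10)
```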

Between researching the openstreetmap tooling, trying multiple parameters and doing some literature study, it took me about a week to get this far (I hate writing, so I am not counting that time). Given that, it probably would have been cheaper to simply run the original experiment for two weeks, but where is the fun in that? I loved taking a nostalgic trip through the GIS world. Also, unlike other deep learning experiments, I got lucky and it took far less experimentation to get this far.

Bloopers?

I followed the path of most blogs and reported the happy path along with a few missteps. After writing this, I honestly feel that a bloopers section would be the most useful part for learning. However, some of the experiments are iterative and do not lend themselves to a clean blog post, and I don’t have a good solution to that, since some of the bloopers are obvious mistakes in hindsight. While developing code, I tend to liberally sprinkle print statements, as silent broadcasting is one of my main sources of errors.

Thanks!

I need to thank Rachel Thomas for nudging me to write a blog. I am not a shy person, but I keep a low-key online profile, and it took quite a few blogs/tweets from her to get over this. I was an international Fellow in the very first fast.ai course and I learned a lot from Jeremy Howard. Apart from the actual techniques and methods, nudging us to read papers and introducing us to Mendeley was one of the most useful takeaways from the class; it has been a good source of “infotainment” for me ever since. If you are even moderately interested in deep learning and have a software background, I strongly urge you to watch the videos. Given how much the community shares in the form of code, papers, books, blogs and free lessons, I decided to do my part. I hope this inspires someone the same way the loc2vec blog post inspired me.

Next Steps/ Potential Ideas

There are a lot of ways to proceed from here. I need someone who can do the GIS side to collaborate with me on exploring some of these ideas; I’ll teach you what I know about deep learning in exchange.

  • I think providing multiple channels as input is a good idea, albeit at different zoom levels. For example, I’d like the embedding to capture whether a region is a coastal city, a ski resort, or a desert: Miami is much more similar to San Diego than to Boston or Chicago. Combining lower-zoom-level tiles with climate data as input would let the network learn this. There are many more things you could do along these lines.
  • Using labels from openstreetmap, you can pretrain the network to classify highways, local streets, stores, places of worship etc. This pre-trained network would generate far better embeddings.
  • Combine this data with satellite imagery to segment roads, add detail to maps etc. For example, the Tile2Vec paper was able to predict poverty levels using a similar technique.
  • Port this to the Fast.ai framework. I love what they are doing, and tools like stochastic weight averaging and one-cycle learning would be helpful for “compute poor” people like me, especially since we now know a bit more about the loss surface. I was not sure how good mixed precision training was in the Fast.ai library, and I could not get my head around fitting the loc2vec data-loader requirements into that framework. I look forward to exploring this in the future.
  • Try Densenet instead of Resnet