Can Deep Learning save the Amazon rainforest?

Source: Deep Learning on Medium

We need to save the lungs of our planet

The Amazon is home to one million indigenous people and three million species of plants and animals. Spanning 6.7 million km² (twice the size of India), the Amazon Biome is virtually unrivaled in scale, complexity and opportunity, and truly is a region distinguished by superlatives.

In August 2019, research by the National Institute for Space Research (INPE) revealed an 84 percent increase in Amazon forest fires compared to 2018. A majority of these fires can be attributed to regional deforestation, whose rates spiked in 2019, driving the devastating August fire outbreaks that destroyed part of one of the most important carbon storehouses left on the planet. It is essential to understand the location of deforestation and human encroachment to enable quick response times and to curb further damage to the ecosystem.

How does Machine Learning/ Deep Learning help in our conservation efforts?

Previously, tracking efforts largely relied on coarse-resolution imagery from Landsat (30-meter pixels). However, advances in satellite imagery and machine learning (ML) have pulled us closer to detecting small-scale deforestation and differentiating between human and natural causes of degradation.

This advancement allows us to accurately track changes in the Amazon rainforest and focus government efforts on the most vulnerable areas. Additionally, we can maintain a log of the conditions of a particular geographic location and measure the results of conservation or encroachment.

The first step in controlling this epidemic is classifying the Amazonian landscape

Our project is an effort in this direction, and we aim to provide a tool that can accurately label the terrain from high-res satellite images. This blog traces the journey and lessons learned by our team of five who came together to tackle this problem.

Our team members are Aishwarya Pawar, Ananya Garg, Onyekachi Ugo, Sachin Balakrishnan and Sahana Subramanian.

I. Dataset

Planet Labs, a private Earth imaging company that designs Earth-observation satellites, has published a labelled dataset of land surfaces on Kaggle.

The dataset was put together by starting with an initial set of scenes covering all the phenomena Planet wished to demonstrate, spanning a land area of thirty million hectares. These scenes were processed into 256 × 256 JPG chips with three channels (R, G, B), along with 4-band TIF chips whose fourth channel is infrared.

Example of Kaggle chip creation from Planetscope Scene

The chips were labeled manually by a crowd-sourced labor force, and while utmost care was taken to produce a large, well-labelled dataset, some incorrect labels remain.

The dataset was split into 40,479 training samples and 61,192 test samples. We treat this as a multi-label classification problem, labeling each image with one or more of 17 labels that indicate atmospheric conditions, land cover, and land use.

II. Exploring the Data

Before diving right into building models, we wanted to visualize the images and analyse the label distribution. This is an important step as it will help us verify if our predictions are sensible and further enable better optimization of the models.

The labels can broadly be broken into three groups: atmospheric conditions, common land cover/land use phenomena, and rare land cover/land use phenomena. Each image will have one and potentially more than one atmospheric label and zero or more common and rare labels. We also noticed that images that are labeled as cloudy will have no other labels.

Bar graph demonstrating Label Distribution Frequency

The label distribution is very imbalanced: some labels, such as primary and clear, are extremely common, while others occur very rarely. For example, conventional_mine appears in just 99 images (0.2% of the training set).

Bar graph demonstrating Rare Label Distribution Frequency

We classified the 7 labels with fewer than 1,000 occurrences as rare, and these labels tend to coexist: in total, only 2,180 images carry a rare label. This heavy imbalance is something we attempted to handle in our project.

Sample images with labels

Once we had explored our data, we identified a few key challenges. Since the satellite images were labelled manually, there are some inconsistencies in the labels. For example, the “cultivation” label is supposed to be a subset of “agriculture”, yet many images are labeled “cultivation” but not “agriculture”.

To illustrate another example, the figure below shows two images. The first is not labelled road, yet a road is clearly visible; the second is labelled road, but we could not spot one, and neither will a machine learning model.

Mislabeled images showing human error

Another challenge involved the TIF files, which potentially contained more information thanks to the additional infrared channel. These images had severe data-quality issues, and their labels were inconsistent with the corresponding JPG files. Unfortunately, this rendered the TIF files unusable, so we restricted our analysis to the JPG files.

III. Data Pre-Processing

We resized our images to 128 × 128 × 3 and normalized the pixel values to the range [0, 1] before training our models. Additionally, we augmented the images with the Keras ImageDataGenerator, applying horizontal and vertical flips, zooms, and rotations to improve generalization. This ensures the model is exposed to a variety of training images and does not over-train on any particular view of an image; ImageDataGenerator applies these alterations randomly to each batch, so the model sees different variants every time.
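The same normalization and flip/rotation transforms can be sketched directly in NumPy (zooms omitted for brevity; the helper names are illustrative, not from our codebase):

```python
import numpy as np

def normalize(img):
    """Scale 0-255 pixel values into [0, 1]."""
    return img.astype(np.float32) / 255.0

def random_augment(img, rng):
    """Randomly flip and quarter-turn rotate one (H, W, 3) image,
    mimicking what ImageDataGenerator does per batch."""
    if rng.random() < 0.5:
        img = np.flip(img, axis=1)   # horizontal flip
    if rng.random() < 0.5:
        img = np.flip(img, axis=0)   # vertical flip
    return np.rot90(img, k=rng.integers(0, 4))

rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(4, 128, 128, 3))
augmented = np.stack([random_augment(normalize(x), rng) for x in batch])
```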

Image Augmentation

IV. Evaluation Metrics

In classification problems it is very important to use an evaluation metric designed for what we are trying to achieve. Our dataset warrants multi-label classification, so we adapt the usual definitions of the common evaluation metrics.

Definition for evaluation metrics based on our problem statement

A feature of our dataset is the lopsided label distribution (more on that later), with over 90% of images carrying the primary (forest) label. This puts our baseline accuracy at about 90% (achievable by simply labeling every image primary). That 90% accuracy is very misleading, as it says nothing about how many labels we have missed.

For our problem, it is essential that our model labels images as closely as possible to the actual labels if we are to track man-made changes (like mining, agricultural clearing or human habitation). Hence we needed to have a high precision for our model.

Secondly, we have rare labels, and correctly classifying them increases our model's recall, making it more reliable at detecting hard-to-find human encroachments in the Amazon.

Lastly, to control for both precision and recall, we used an F-beta score as our main evaluation metric, with β = 2. This weights recall more heavily: missing a label (a false negative) is costlier for our use case than predicting an extra one.

The generalized formula is (1 + β²) · p · r / (β² · p + r), where p is precision and r is recall.
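As a sketch of how this metric behaves on a single image's label set (per-image precision and recall as defined above; the helper name is illustrative):

```python
def f_beta(true_labels, pred_labels, beta=2.0):
    """F-beta over two label sets; beta > 1 weights recall more heavily."""
    true_labels, pred_labels = set(true_labels), set(pred_labels)
    tp = len(true_labels & pred_labels)   # correctly predicted labels
    if tp == 0:
        return 0.0
    p = tp / len(pred_labels)             # precision
    r = tp / len(true_labels)             # recall
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# with beta = 2, missing a true label (false negative) hurts more
# than predicting one extra label (false positive)
miss_one = f_beta({"primary", "clear", "road"}, {"primary", "clear"})
extra_one = f_beta({"primary", "clear"}, {"primary", "clear", "road"})
```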

V. Hyperparameter Tuning

After deciding on the evaluation metric, the next step was to choose the hyperparameters to tune for our deep learning models. For our multi-label classification problem, these were:

  • Loss Function: Binary cross-entropy for one-hot encoded 17 labels
  • Activation function: sigmoid and ReLU. For the final dense layer with 17 neurons, we used sigmoid activation to obtain per-label prediction probabilities.
  • Optimizer: Stochastic Gradient Descent (SGD) & Adam
  • Number of Epochs
  • Number of Neural Network Layers
  • Learning rates
  • Dropout rate: the fraction of neurons randomly dropped in a layer to avoid overfitting

Due to our class imbalance, we could not use a simple train-test split while training, as the training split might entirely miss the rarest labels (such as conventional_mine), leaving the model unable to predict them. To deal with this, we used 5-fold cross-validation while training our models.
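The split itself can be sketched with scikit-learn's KFold (index arrays stand in for our image and label arrays; each image lands in the validation set exactly once, so every rare example still contributes to training in four of the five folds):

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 40479  # size of our training set
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
for train_idx, val_idx in kf.split(np.arange(n_samples)):
    # in the real pipeline, the model is fit on the images at
    # train_idx and evaluated on those at val_idx
    fold_sizes.append((len(train_idx), len(val_idx)))
```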

We started our analysis on Google Colab while setting up Google Cloud Platform (GCP) in parallel. We took a diverse sample of 200 images (covering all labels) and began with exploratory analysis on Colab. In this initial phase, we identified hyperparameters for the CNN and transfer-learning models by overtraining them on the sample. Later, while scaling up on GCP, we initialized the models with these hyperparameters; in some cases further tuning was needed, but they were a good starting point.

VI. Models

We started by building a basic Convolutional Neural Network (CNN) to see how it performs on our dataset.


We built a 10-layer CNN with an increasing number of filters. Our first two layers had 32 filters, followed by two layers with 64 filters, and finally six layers with 128 and 256 filters. We arrived at this architecture after a lot of trial and error, and this combination worked best for our problem.

Before each increase in the number of filters, we dropped out 10% of the nodes. This regularizes the preceding layers, preventing the model from overfitting. We also performed max pooling at this stage.

Our final activation function was Sigmoid, and we used the Adam optimizer. The final F-beta score of this model was 0.792.

CNN Architecture
#hyperparameters for CNN
LR = 0.0001 #learning rate
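Filling in the rest of that fragment, a condensed Keras sketch of the architecture described above (the filter counts, pooling/dropout placement, sigmoid output, and Adam optimizer follow the text; the 3×3 kernels and same padding are assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

LR = 0.0001  # learning rate, as above

model = models.Sequential()
model.add(tf.keras.Input(shape=(128, 128, 3)))
for filters, n_convs in [(32, 2), (64, 2), (128, 3), (256, 3)]:
    for _ in range(n_convs):   # 10 convolutional layers in total
        model.add(layers.Conv2D(filters, 3, padding="same",
                                activation="relu"))
    model.add(layers.MaxPooling2D(2))   # pool, then drop 10% of nodes
    model.add(layers.Dropout(0.1))      # before the filter count grows
model.add(layers.Flatten())
model.add(layers.Dense(17, activation="sigmoid"))  # per-label probability
model.compile(optimizer=tf.keras.optimizers.Adam(LR),
              loss="binary_crossentropy")
```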


One of the first transfer-learning models we tried was VGG16, a convolutional neural network proposed for the 2014 ImageNet challenge that gained a lot of appreciation for its breakthrough results. If you are new to the deep learning world (as we were when we started this adventure!), you can read more about the ImageNet challenge here: link

VGG16 Architecture

VGG uses a CNN architecture dominated by very small (3×3) convolutional filters with stride 1 and same padding, together with max-pooling layers of stride 2. The network has a very large number of parameters, with depth pushed to 16–19 weight layers across the VGG variants.

We used the 16-layer version, VGG16, which gave good results on a subset of the data (run on Google Colab), so we trained the model on the entire dataset on Google Cloud Platform (GCP). Employing 5-fold cross-validation and 24 epochs during training, the final Kaggle F-beta score on the 61k-image test sample was 0.92595.

#hyperparameters for VGG16
LR = 0.0001 #learning rate
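In outline, the transfer-learning setup looked like the sketch below (weights=None keeps the snippet runnable offline, whereas actual training would start from the pretrained ImageNet weights via weights="imagenet"; the dense head width here is an assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

LR = 0.0001  # learning rate, as above

# weights=None keeps this sketch offline; in practice the pretrained
# ImageNet weights (weights="imagenet") are the point of transfer learning
base = tf.keras.applications.VGG16(include_top=False, weights=None,
                                   input_shape=(128, 128, 3))
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),    # head width is an assumption
    layers.Dense(17, activation="sigmoid"),  # multi-label output
])
model.compile(optimizer=tf.keras.optimizers.Adam(LR),
              loss="binary_crossentropy")
```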


Training VGG16 was tiresome, largely because of its three fully connected layers, so we moved on to ResNet to take advantage of its “identity shortcuts”. As networks grow deeper, training becomes difficult due to the vanishing gradient problem; moreover, simply stacking more layers eventually increases the training error itself. ResNet addresses this by introducing identity shortcut connections that skip one or more layers.

ResNet Architecture

ResNet behaves somewhat like an ensemble: some connections are skipped, and different paths train at different rates depending on how the error flows through the network. After tuning this transfer-learning model, we obtained an accuracy of 86.25%, which is close to the baseline accuracy. The model also failed to predict the rare labels effectively, so we did not include it in our final ensemble.

You can find more details on ResNet here: link

#hyperparameters for ResNet
LR = 0.0001 #learning rate


Given that ResNet did not give us good enough accuracy, we decided to try MobileNet, a lightweight architecture proposed by Google primarily to enable low-computation computer vision models on embedded devices.

MobileNet uses depthwise separable convolutions as opposed to a standard convolution. Depthwise separable convolution factorizes the convolutions into two parts:

  1. Depthwise convolution: a channel-wise Dk × Dk spatial convolution, i.e. a single filter applied to each of our three input channels (R, G, and B)
  2. Pointwise convolution: a 1×1 convolution layer that combines the outputs of the depthwise convolution

MobileNet Architecture

The main difference between MobileNet and a “traditional” CNN is that a standard convolution filters and combines inputs into a new set of outputs in one step, whereas the depthwise separable convolution splits this into two layers: one for filtering and one for combining.
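This split is where the parameter savings come from: a standard Dk × Dk convolution with M input and N output channels holds Dk·Dk·M·N weights, while the depthwise-plus-pointwise pair holds only Dk·Dk·M + M·N. A quick sketch of the comparison (the example channel counts are ours, not MobileNet's exact configuration):

```python
def standard_conv_params(dk, m, n):
    """Weights in one standard dk x dk convolution: dk*dk*m*n."""
    return dk * dk * m * n

def depthwise_separable_params(dk, m, n):
    """Depthwise (one dk x dk filter per input channel) plus a
    1x1 pointwise convolution combining m channels into n."""
    return dk * dk * m + m * n

# e.g. a 3x3 convolution taking 32 channels to 64
std = standard_conv_params(3, 32, 64)        # 18432 weights
sep = depthwise_separable_params(3, 32, 64)  # 2336 weights
```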

We used a pre-trained Keras model with transfer learning and 5-fold cross-validation. After training for 25 epochs, the validation loss was not converging, so we increased the number of epochs to 50; the model then converged better, with an F-beta score of 0.9244.

#hyperparameters for MobileNet
LR = 0.0001 #learning rate

VII. Haze Removal

Once we had trained the models with various hyperparameters, we realized our F-beta score had plateaued and was no longer improving with more iterations. At that point we decided to revisit the data pre-processing stage, both to improve model performance further and to handle our highly imbalanced dataset.

In general, satellite images are degraded by light scattering from turbid media in the atmosphere: dust, smoke, water droplets, and so on. This degradation limits the usefulness of the images, so addressing it is important for model performance.

One algorithm that has gained popularity in recent years is Single Image Haze Removal Using Dark Channel Prior. It is based on the observation that in haze-free outdoor images, at least one color channel has some pixels with intensities close to 0. In contrast, the additive airlight in hazy images raises these intensities in regions of denser haze.

The dark channel prior of an image J is computed with a chosen patch size (say 15 × 15): for each pixel, take the minimum of its (R, G, B) values, then apply a minimum filter over the local patch centered at that pixel.

Dark Channel Prior calculation
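That two-step minimum can be sketched in NumPy (the patch size and edge-padding choice are assumptions; a real implementation would use a fast minimum filter rather than this explicit loop):

```python
import numpy as np

def dark_channel(img, patch=15):
    """Dark channel of an (H, W, 3) image with values in [0, 1]:
    per-pixel min over R, G, B, then a min filter over each patch."""
    min_rgb = img.min(axis=2)                 # per-pixel channel minimum
    pad = patch // 2
    padded = np.pad(min_rgb, pad, mode="edge")
    h, w = min_rgb.shape
    dark = np.empty_like(min_rgb)
    for i in range(h):                        # local minimum filter
        for j in range(w):
            dark[i, j] = padded[i:i + patch, j:j + patch].min()
    return dark

# uniform "haze" lifts every channel, so the dark channel stays high
hazy = np.full((32, 32, 3), 0.8)
dark = dark_channel(hazy)
```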

The below illustration shows the extraction of the dark channel prior on a haze-free image. Most of the clear images will generate priors similar to this.

Dark Channel Prior extraction from a haze-free image

However, when we calculate the dark channel prior for hazy images, the intensity is much higher in hazy regions. These pixels directly provide an accurate estimate of the haze transmission, which is used to remove the haze from the images.

Dark Channel Prior extraction from a hazy image

We implemented the haze removal algorithm on all the training and test samples to generate a new dataset.

We trained a new model on the haze-removed images and saw really promising results, especially for some of the rare labels: artisanal_mine and slash_burn, which previously had recalls around 1%, improved to 48% and 92% respectively.

Example of Haze Removal on an image of our dataset

VIII. Class Imbalance

As noted before, some classes form less than 0.2% of our labels. We implemented the following strategies to address this imbalance.


XGBoost

Known for its ability to boost weak learners, XGBoost generally performs well on imbalanced data, and our intuition that it would improve recall was right. After overcoming the biggest challenge, tuning the model's parameters, we were able to improve the recall for rare labels. Thanks to XGBoost's algorithmic enhancements and systems optimizations, this was also our fastest model, with an F-beta score of 0.882.
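In outline, gradient boosting works on feature vectors rather than raw images and handles one binary label at a time, so the multi-label problem becomes 17 one-vs-rest classifiers. The sketch below shows the shape of that setup, with scikit-learn's GradientBoostingClassifier standing in for XGBoost and random data standing in for image-derived features:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                       # stand-in features
Y = (rng.random(size=(200, 17)) < 0.2).astype(int)   # 17 binary labels

# one boosted binary classifier per label (one-vs-rest)
clf = OneVsRestClassifier(GradientBoostingClassifier(n_estimators=20))
clf.fit(X, Y)
probs = clf.predict_proba(X)   # per-label probabilities, shape (200, 17)
```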

The “Rare” CNN

XGBoost has been the go-to model for most Kagglers dealing with class imbalance, but since it hadn't given our model performance much of a ‘boost’, we kept looking for other ways to tackle the minority classes. No surprise, it was again a convolutional neural network that came to the rescue, one we like to call the ‘Rare’ CNN.

The idea was simple but proved quite effective. With the rare labels occurring so infrequently, our models could not disentangle their effect in the full classification problem. So we developed an additional model trained exclusively on the rare labels. Since VGG16 was the best of the lot in our case, we stacked a second VGG16 on top of the base model: while the base model outputs probabilities for all 17 classes, the new one classifies images only into one or more of the 7 rare classes.

Training Phase

The very first requirement was to train the model on an equal number of ‘rare’ and ‘non-rare’ images. Many rare labels tend to coexist, so in total we had only 2,180 ‘rare’ images out of the 40k training images. We extracted all images containing rare labels along with an equal number of randomly sampled ‘non-rare’ images. After the necessary pre-processing, we trained a separate VGG16 on this data to classify an image only on the basis of rare labels. The target vector (the y value in supervised learning) was of length 7, with each element indicating the presence of one rare label (in the order shown below).

For example:

  1. An input image labeled blooming clear cultivation habitation primary slash_burn is trained on the target vector [0, 0, 1, 0, 0, 0, 1]; note that clear, cultivation, habitation, and primary are non-rare labels here
  2. An input image labeled agriculture clear habitation primary road is trained on the target vector [0, 0, 0, 0, 0, 0, 0]; none of its labels are rare
  3. An input image carrying all 7 rare labels (for illustration only; no such combination occurs in the data) would be trained on the target vector [1, 1, 1, 1, 1, 1, 1]
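Building that length-7 target vector can be sketched as follows. The label set is the seven with fewer than 1,000 training occurrences; their ordering below is alphabetical, which is an assumption, though it does reproduce the vectors in the examples above:

```python
# the seven labels we treated as rare, in (assumed) alphabetical order
RARE_LABELS = ["artisanal_mine", "bare_ground", "blooming", "blow_down",
               "conventional_mine", "selective_logging", "slash_burn"]

def rare_label_vector(tags):
    """Map an image's space-separated tag string to its length-7 target."""
    present = set(tags.split())
    return [1 if label in present else 0 for label in RARE_LABELS]

v1 = rare_label_vector("blooming clear cultivation habitation primary slash_burn")
v2 = rare_label_vector("agriculture clear habitation primary road")
```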

The prediction for an input image is again a vector of length 7, each element giving the probability of one of the 7 rare labels. Shown below is the model architecture:

The ‘Rare’ CNN Architecture

The same image data is fed to both models, the main CNN and the ‘Rare’ CNN. The probability outputs of the rare-label model are then used to scale up the rare-label probabilities in the main CNN's output for the corresponding images.
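The exact scaling rule is a tuning choice; one simple version, sketched below with assumed rare-label positions, takes the elementwise maximum of the main model's probability and the rare model's probability at each rare-label slot:

```python
import numpy as np

# positions of the 7 rare labels within the 17-label output
# (illustrative indices, not our actual label ordering)
RARE_IDX = np.array([1, 3, 4, 7, 13, 14, 15])

def combine(main_probs, rare_probs):
    """Lift the main CNN's rare-label probabilities using the
    'Rare' CNN's length-7 output (elementwise max as the rule)."""
    out = main_probs.copy()
    out[:, RARE_IDX] = np.maximum(out[:, RARE_IDX], rare_probs)
    return out

main = np.full((2, 17), 0.1)   # main CNN output for 2 images
rare = np.full((2, 7), 0.9)    # 'Rare' CNN output for the same images
merged = combine(main, rare)
```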

This approach considerably elevated our model's ability to detect and predict the rare labels, as is evident from the lift in the rare labels' precision and F1-scores shown below:

Before and After results of the ‘Rare’ CNN

IX. Model Summary

Once we had trained all our individual models, we created an ensemble of the best-performing ones before making a final Kaggle submission. The final ensemble F-beta score was 0.9257. Even though this was slightly lower than our best individual model, we succeeded in increasing the precision and recall of the rare labels.