Spotting Trees with Deep Learning

Original article was published on Deep Learning on Medium

Idea / Inspiration

I’m creating a project that can spot trees using satellite data. I got this idea from a company which uses satellite imagery and AI to track forests and vegetation around the world. There are many uses for this technology. You can track deforestation in an area by tracking the number of missing trees over a certain timescale. A company called Spacept uses satellite imagery and AI to prevent fires and power outages by spotting trees growing too close to power lines.

Using satellite imagery saves time and money, as inspecting forestry using a helicopter or drone can cost a lot of money. It also takes a lot of time to set up the helicopter or drone to fly over the forestry.


When I was first trying to get the dataset together for the machine learning model, I settled on using Google Earth Engine. I got this idea from this Medium blog post, which used satellite imagery to spot landslides. In the blog post the author highlighted the issue of getting satellite data: if you want to get images from the ESA Copernicus SciHub website, you have to download them manually, and you cannot enter the coordinates of the area you want to capture. So you get continent-sized images. I can agree with this from my personal use of the service as well. When I tried to get an image that covered a small area in my city, I instead got large images covering all of France and Britain.

With Google Earth Engine you can filter by date, area, and cloud cover, which is why Google Earth Engine was recommended. I used the code from the blog to create a dataset and added customisations for my own data. The blog post described the process below:

How our final (semi) automated pipeline ended up working was as follows:

1. Enter coordinates to calculate the coordinates of a 10 km square.

2. Copy the output and replace the variables at the top of the JavaScript code in the Google Earth Engine console (the JavaScript code we used). Don’t forget to change the date range to one that you’d like.

3. Download the images with less than 100 for the mean cloud density given by clicking the link printed in the console output.
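Step 1 of the quoted pipeline can be sketched in Python. This is only a rough sketch and not the code from the blog post: the helper name square_around, the figure of about 111.32 km per degree of latitude, and the example coordinates (central Paris) are my own assumptions, and the approximation is only reasonable for small areas away from the poles.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude


def square_around(lat, lon, side_km=10.0):
    """Return (west, south, east, north) for a square centred on lat/lon.

    Hypothetical helper using a simple flat-Earth approximation.
    """
    half = side_km / 2.0
    dlat = half / KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    dlon = half / (KM_PER_DEGREE_LAT * math.cos(math.radians(lat)))
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)


# Example coordinates are illustrative only.
west, south, east, north = square_around(48.8566, 2.3522)
print(west, south, east, north)
```

The four numbers are the corners you would paste into the variables at the top of the Earth Engine script.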

When I got the images, they looked like this:

Image by author

In the beginning I did not know why this was the case, but I later learned that satellite images are separated into different bands. Bands are certain ranges of wavelengths that the satellite captures. This allows the user to view the image with features that are not visible to the human eye, or to highlight features with a chosen wavelength. For example, a common use of satellite imagery is tracking farmland. By using the infrared spectrum, it is easier to spot vegetation.

Figure 1 This false-colour image of Florida combines shortwave infrared, near infrared, and green light. (NASA image by Matt Radcliff with Landsat 5 data from the USGS Earth Explorer.) Also found here:

Wikipedia gives a nice list of wavelengths and their uses:

The wavelengths are approximate; exact values depend on the particular satellite’s instruments:

  • Blue, 450–515..520 nm, is used for atmosphere and deep water imaging, and can reach depths up to 150 feet (50 m) in clear water.
  • Green, 515..520–590..600 nm, is used for imaging vegetation and deep water structures, up to 90 feet (30 m) in clear water.
  • Red, 600..630–680..690 nm, is used for imaging man-made objects, in water up to 30 feet (9 m) deep, soil, and vegetation.
  • Near infrared (NIR), 750–900 nm, is used primarily for imaging vegetation.
  • Mid-infrared (MIR), 1550–1750 nm, is used for imaging vegetation, soil moisture content, and some forest fires.
  • Far-infrared (FIR), 2080–2350 nm, is used for imaging soil, moisture, geological features, silicates, clays, and fires.
  • Thermal infrared, 10400–12500 nm, uses emitted instead of reflected radiation to image geological structures, thermal differences in water currents, fires, and for night studies.
  • Radar and related technologies are useful for mapping terrain and for detecting various objects.

Also, combinations:

  • True-color uses only red, green, and blue channels, mapped to their respective colors. As a plain color photograph, it is good for analyzing man-made objects, and is easy to understand for beginner analysts.
  • Green-red-infrared, where the blue channel is replaced with near infrared, is used for vegetation, which is highly reflective in near IR; it then shows as blue. This combination is often used to detect vegetation and camouflage.
  • Blue-NIR-MIR, where the blue channel uses visible blue, green uses NIR (so vegetation stays green), and MIR is shown as red. Such images allow the water depth, vegetation coverage, soil moisture content, and the presence of fires to be seen, all in a single image.

I wanted a simple RGB image, which is just visible light, as it’s best to start simple. To produce that, we need an image that combines all 3 bands into one. So, using some of the Google Earth Engine example code with my own changes, I was able to get some images.
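To make the idea of combining three bands concrete, here is a small sketch using NumPy. It is not the Earth Engine code I used: the function name make_rgb and the tiny made-up arrays are assumptions for illustration, and real bands would be loaded from the downloaded files.

```python
import numpy as np


def make_rgb(red, green, blue):
    """Stack three single-band arrays into one H x W x 3 uint8 image."""
    rgb = np.stack([red, green, blue], axis=-1).astype(np.float64)
    # Stretch to 0..255 using the overall min/max so colours stay balanced.
    lo, hi = rgb.min(), rgb.max()
    if hi > lo:
        rgb = (rgb - lo) / (hi - lo)
    else:
        rgb = np.zeros_like(rgb)
    return (rgb * 255).round().astype(np.uint8)


# Tiny made-up 2 x 2 bands standing in for real satellite data.
red = np.array([[0, 1000], [2000, 3000]])
green = np.array([[500, 1500], [2500, 3000]])
blue = np.array([[0, 0], [0, 3000]])

image = make_rgb(red, green, blue)
print(image.shape)  # (2, 2, 3)
```

Each pixel now carries a red, green and blue value, which is what an ordinary image viewer expects.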

The red square covering the map is the area that Google Earth Engine will capture for the image.

Image by author

The image below:

Image by author

As we can see, the image quality is horrible, to be frank. I started to notice problems when setting up the red square. In the image of the general area outside the red square, you can see that the imagery already lacks clarity. That means if you zoom in even further, you cannot tell what’s in the image because the quality is so poor.

The reason for the problem was the resolution of the images. Landsat has a resolution of 30 meters, and even the sharpest Sentinel-2 bands only reach 10 meters.

This website explains spatial resolution perfectly:

Spatial resolution refers to the size of one pixel on the ground. A pixel is that smallest ‘dot’ that makes up an optical satellite image and basically determines how detailed a picture is. Landsat data, for example, has a 30m resolution, meaning each pixel stands for a 30m x 30m area on the ground. It’s considered a medium-resolution image, which can cover an entire city area alone, but the level of detail isn’t fine enough to distinguish individual objects like houses or cars.
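A quick bit of arithmetic shows why this matters for trees. The 10 km square and the 30 m Landsat figure come from the text above; the 10 m tree crown is just an assumed, illustrative size.

```python
def pixels_across(ground_metres, resolution_metres):
    """Number of pixels needed to span a distance on the ground."""
    return ground_metres / resolution_metres


# A 10 km square at Landsat's 30 m resolution.
side = pixels_across(10_000, 30)
print(round(side))  # 333 pixels per side

# Assumed 10 m tree crown (illustrative figure, not a measurement).
print(pixels_across(10, 30))  # about 0.33 of a pixel
```

A tree that is smaller than a single pixel cannot be boxed, which is why individual trees are lost at this resolution.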

Custom Dataset

Freely available satellite data does not get much more detailed than this. For finer data one must pay. So I had to find a solution that would get me higher quality images without forking out a small fortune.

After a lot of thinking (and worrying), I found the solution hidden in plain sight. When you use a navigation app on your phone, it can show you a satellite image of your area. You can see which buildings are which and differentiate between objects. So I decided to use (normal) Google Earth and capture images straight from the website. To make sure the screen captures were all the same size, I used a program called PicPick.

Here’s one image I captured:

Image by author

Now we can differentiate between objects, namely trees.

Labelling data

Now we have a small sample of images. We need a way for the machine learning model to know it’s looking at a tree.

So I used a program called LabelImg from GitHub to manually label the trees in the images.

Image by author

I did not label every single tree in every image, as that would take forever. But I wanted to have enough labelling that the model could pick out one or two trees in a test image. The same reasoning explains why the dataset is very small: if I need to make changes to the dataset, it won’t take a long time. The aim was also to create a model as a simple prototype before scaling up the work.


This project is an object detection problem. As I’m dealing with image data, I will be using some type of Convolutional Neural Network. One example I was thinking of using for the model is here. But I quickly learned that, as the model was already trained, it wouldn’t be able to pick out trees in satellite images. The blog post mentioned that you could do transfer learning or fine-tuning, but said that it was outside the scope of the article. So I had to find a new model to use.


After a while of googling around, I found an interesting model that I could use called R-CNN, or more precisely Faster R-CNN. R-CNN stands for Region-Based Convolutional Neural Network. The model uses selective search to extract around 2000 region proposals from the image. After the region proposals are selected, they are reshaped to fit the input of the Convolutional Neural Network. Support vector machines then classify the regions, and bounding box regression is used to work out where the box goes.

Figure 2

But there are a few problems with using a traditional R-CNN, mainly that processing an image takes a lot of time. As the model is extracting 2000 regions to check, prediction takes around 40 seconds per image.

Fast R-CNN

A better version of R-CNN, called Fast R-CNN, was created by the same lead author. The main difference is that instead of running the CNN on thousands of regions extracted from the image, Fast R-CNN inputs the whole image once. The CNN produces a convolutional feature map, and the region proposals are identified from that map. The region proposals are then reshaped to a fixed size by a pooling layer, and the result is reshaped again to be fed into a fully connected layer. Each region of interest feature vector is passed to a softmax layer to predict the class, and bounding box regression is used to find the boxes.
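The pooling step is easier to picture with a toy example. Below is a simplified sketch of region of interest max pooling in plain Python. It is not the implementation from the paper or from any library: the function name roi_max_pool is hypothetical, the feature map is a single made-up 6 x 6 channel, and the region is assumed to be given already in feature-map cells.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one region of a 2D feature map down to out_h x out_w.

    feature_map: list of rows (lists of numbers).
    roi: (x1, y1, x2, y2) in feature-map cells, end-exclusive.
    """
    x1, y1, x2, y2 = roi
    height, width = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Split the region's rows into out_h roughly equal bins.
        top = y1 + (i * height) // out_h
        bottom = y1 + -(-((i + 1) * height) // out_h)  # ceiling division
        row = []
        for j in range(out_w):
            left = x1 + (j * width) // out_w
            right = x1 + -(-((j + 1) * width) // out_w)
            cells = [feature_map[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            row.append(max(cells))
        pooled.append(row)
    return pooled


# A made-up 6 x 6 feature map with values 0..35.
feature_map = [[r * 6 + c for c in range(6)] for r in range(6)]
print(roi_max_pool(feature_map, (0, 0, 4, 4)))  # [[7, 9], [19, 21]]
```

Whatever the size of the region going in, the output is always the same fixed size, which is what lets it feed into the fully connected layer.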

One problem is that Fast R-CNN still uses selective search, which slows down testing. But Fast R-CNN is much faster than the traditional R-CNN, as it takes around 2 seconds to predict.
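To put the two prediction times side by side, here is the arithmetic using the rough figures quoted above. The batch of 1000 images is an arbitrary number I picked for illustration.

```python
RCNN_SECONDS_PER_IMAGE = 40       # rough figure quoted above for R-CNN
FAST_RCNN_SECONDS_PER_IMAGE = 2   # rough figure quoted above for Fast R-CNN

speedup = RCNN_SECONDS_PER_IMAGE / FAST_RCNN_SECONDS_PER_IMAGE
print(f"Fast R-CNN is about {speedup:.0f}x faster")  # about 20x


def total_minutes(n_images, seconds_per_image):
    """Prediction time for a batch of images, in minutes."""
    return n_images * seconds_per_image / 60


# 1000 images is an arbitrary batch size chosen for illustration.
print(total_minutes(1000, RCNN_SECONDS_PER_IMAGE))       # about 667 minutes
print(total_minutes(1000, FAST_RCNN_SECONDS_PER_IMAGE))  # about 33 minutes
```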

Faster R-CNN

Now we have Faster R-CNN. The models above use selective search to find region proposals. Faster R-CNN instead uses a separate network to predict the region proposals from the convolutional feature map. The proposals are then reshaped by a pooling layer for classification of the region, and a bounding box regressor is used to place the boxes.

Figure 3

Coding up the model

I followed the tutorial from here. The tutorial used a dataset of blood cells. I replaced it with my custom dataset of Google Earth photos. To get the data ready for the model, I had to convert the bounding boxes from XML to CSV.

To do that I used pandas and Python’s ElementTree module, and borrowed code from Machine Learning Mastery and the export file linked in the tutorial.

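import os
import re

import pandas as pd
from xml.etree import ElementTree

# Empty dataframe with the filename, class and bounding box columns
df = pd.DataFrame(None, columns=['filename', 'tree_type', 'xmin', 'xmax', 'ymin', 'ymax'])
list_dataframe = []
folder = 'Custom_dataset\\Labels'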

This section sets up the variables to be used in the loop. An empty dataframe is set up with columns for the filename, class and bounding box coordinates. The path of the folder containing the label files is saved in the folder variable.

for file in os.listdir(folder):
    path = 'Custom_dataset\\Labels\\'
    file = path + file
    print(file)

The for loop starts by listing the files in the folder. The file path is then built by combining the folder path and the name of the file, and printed. This is done because a normal for loop without os.listdir did not work on my laptop.

    # (still inside the for loop) strip the folder prefix and swap the extension
    filename = re.sub(r'Custom_dataset\\Labels', "", file)
    filename = filename.replace("\\", "")
    filename = filename.replace(".xml", ".png")

When printing the filename, it had too many backslashes, so str.replace was used to get rid of them. Also, as the folder contains the label files and not the images, the filenames ended in .xml, so the extension was replaced with .png.

    tree_type = 'Tree'
    tree = ElementTree.parse(file)
    # get the root of the document
    root = tree.getroot()
    # extract each bounding box
    boxes = list()

This section parses the XML file. The name of the class is stored in tree_type. tree = ElementTree.parse(file) parses the file, then root = tree.getroot() gets the root of the document. An empty list is created for the bounding boxes.

    for box in root.findall('.//bndbox'):
        xmin = int(box.find('xmin').text)
        ymin = int(box.find('ymin').text)
        xmax = int(box.find('xmax').text)
        ymax = int(box.find('ymax').text)
        row_list = [filename, tree_type, xmin, xmax, ymin, ymax]
        print(row_list)
        list_dataframe.append(row_list)

The for loop goes through each of the boxes and extracts its coordinates: the minimum and maximum x and y values that define the box. A list is then created which holds the filename, class, and coordinates, and it is appended to list_dataframe from earlier.

print(list_dataframe)
df = pd.DataFrame(list_dataframe, columns=['filename', 'tree_type', 'xmin', 'xmax', 'ymin', 'ymax'])
print(df.head())
df.to_csv('testing.csv', index=False)

After that, list_dataframe, which contains the data from all the files, is turned into another dataframe with the column names. Finally, df.to_csv writes the dataframe to a CSV file.
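For anyone who wants to try the parsing step without my folder layout, here is a self-contained sketch of the same idea. It differs from my code in a few ways that are assumptions on my part: it reads the image name from the filename element that LabelImg writes into the XML, it uses the standard csv module instead of pandas, and the helper name boxes_from_xml and the sample XML are made up for illustration.

```python
import csv
import io
from xml.etree import ElementTree


def boxes_from_xml(xml_text, class_name="Tree"):
    """Return one row per <bndbox> in a LabelImg (Pascal VOC) XML string."""
    root = ElementTree.fromstring(xml_text.strip())
    # Assumes the image name is stored in <filename>; falls back to empty.
    filename = root.findtext("filename", default="")
    rows = []
    for box in root.findall(".//bndbox"):
        coords = [int(box.findtext(tag))
                  for tag in ("xmin", "xmax", "ymin", "ymax")]
        rows.append([filename, class_name] + coords)
    return rows


# Made-up annotation with a single labelled tree.
SAMPLE_XML = """
<annotation>
  <filename>example.png</filename>
  <object>
    <name>Tree</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>50</xmax><ymax>80</ymax></bndbox>
  </object>
</annotation>
"""

rows = boxes_from_xml(SAMPLE_XML)

# Write the rows as CSV text with the same column order as the dataframe.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["filename", "tree_type", "xmin", "xmax", "ymin", "ymax"])
writer.writerows(rows)
print(buffer.getvalue())
```

Working on a string rather than a file path makes the function easy to test before pointing it at a real folder of labels.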

Implementing the model

Now the tutorial asks us to convert the CSV file into a txt file, as the model uses a different format:

filepath, x1,y1,x2,y2,class_name


  • filepath is the path of the training image
  • x1 is the xmin coordinate for bounding box
  • y1 is the ymin coordinate for bounding box
  • x2 is the xmax coordinate for bounding box
  • y2 is the ymax coordinate for bounding box
  • class_name is the name of the class in that bounding box
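The conversion from one CSV row to this format can be sketched like this. The function name to_annotation_line and the train_images folder are hypothetical placeholders, and the example row is made up; the column names match the CSV created earlier.

```python
def to_annotation_line(row, image_dir="train_images"):
    """Format one CSV row as 'filepath,x1,y1,x2,y2,class_name'.

    row: dict with the CSV columns from earlier.
    image_dir: hypothetical folder name holding the training images.
    """
    filepath = f"{image_dir}/{row['filename']}"
    return ",".join([
        filepath,
        str(row["xmin"]), str(row["ymin"]),
        str(row["xmax"]), str(row["ymax"]),
        row["tree_type"],
    ])


# Made-up row for illustration.
example = {"filename": "example.png", "tree_type": "Tree",
           "xmin": 10, "xmax": 50, "ymin": 20, "ymax": 80}
print(to_annotation_line(example))  # train_images/example.png,10,20,50,80,Tree
```

Note that the order changes from xmin, xmax, ymin, ymax in the CSV to x1, y1, x2, y2 in the txt file, which is an easy place to swap two values by mistake.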

Now the tutorial asks us to clone this GitHub repository. When I ran the model I got a few errors, so I went inside the repository and edited the code. Most errors were AttributeError: module ‘keras.backend’ has no attribute ‘image_dim_ordering’. I found this helpful GitHub issue answer solving the problem. The error occurs because some of the code in the repository relies on functions that are no longer in Keras, so you need to use the newer Keras API, which has different names for those functions.
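As an illustration of the kind of change involved: newer Keras exposes K.image_data_format(), which returns 'channels_last' or 'channels_first', where the older K.image_dim_ordering() returned 'tf' or 'th'. The small shim below is a hypothetical helper of my own, not code from the repository or the GitHub issue, and it avoids importing Keras so it can run anywhere.

```python
def image_dim_ordering(data_format):
    """Translate the new Keras data format name into the old ordering name.

    Hypothetical shim: newer Keras returns 'channels_last' or
    'channels_first' from K.image_data_format(), where older code expected
    K.image_dim_ordering() to return 'tf' or 'th'.
    """
    mapping = {"channels_last": "tf", "channels_first": "th"}
    return mapping[data_format]


# With Keras installed, data_format would come from K.image_data_format().
print(image_dim_ordering("channels_last"))  # tf
```

In practice the simpler fix is usually to edit each comparison in the repository directly, for example checking for 'channels_last' instead of 'tf'.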

So, after a few hours of googling other issues, the machine learning model started to run. I used a Google Cloud Deep Learning VM to train my model.

Image by author

Because of the amount of time needed, I decided to reduce the number of epochs to 5. Each epoch took 4 hours and the default number was 1000, which would have taken far too long to train and cost quite a bit of money.

Now I needed to test the model. I did this on my laptop, as Google’s VM did not show images.

To do this I ran the statement:

python -p test_images

Image by author

The output shows it found some trees:

Image by author
Image by author

The model was able to find some areas with trees. Mainly thick forestry.

But others, not at all:

Image by author
Image by author

It looks like the model can’t pick out individual trees, but it can point to big areas of forestry.


Like I mentioned above, the model can’t spot individual trees, only areas of forestry. But even with the areas of forestry, it can only spot some sections. A few small boxes cover a large area of forestry.

The dataset was kept small so I could get a workable prototype up and running. Labelling can be improved by boxing every single tree in an image, and by using custom segmentation for areas of forestry instead of boxes.

The dataset is very small (20 images) and the model was trained for only 5 epochs. This can be improved by simply adding more images to the dataset and spending more time training the model.

A serious question to ask is how I would label a large dataset. Labelling 20 images by hand already took long enough, around an hour or so. I can’t imagine labelling a thousand images by hand. Also, with custom segmentation for large areas of forestry, I’m not sure how to make custom shapes that cover the shape of the forestry, as the program I was using only allows boxes, not custom shapes like polygons.