How to Use ROI Pool and ROI Align in Your Neural Networks (PyTorch 1.0)

Source: Deep Learning on Medium


If you’ve ever wanted to do a deep learning project related to computer vision or image processing, you may have come across the ROI Pool and ROI Align layers. While originally built for object detection, ROI Pool variants are also useful to extract information from localized regions in an image. For example, you may want to extract specific body parts from a person:

In this diagram, ROI Pool is used to extract texture information from six arbitrarily sized regions in an image. ROI Pooling transforms the rectangles into a nice square-shaped tensor. Source: Raj et al. 2018.

I found many helpful articles explaining how ROI Pooling and ROI Align work conceptually (kudos to those authors!). However, I didn’t find any clear tutorials on how to code ROI Pooling/Alignment layers into my neural networks.

Unfortunately, ROI Pooling (and its variants) are not built into PyTorch. You could, of course, implement the layers yourself. But to make a practical GPU-compatible implementation, you’d have to spend time coding in CUDA. The more practical option is to use a third-party library. Yet most of these libraries are frustratingly undocumented.

So this post summarizes what I learned from weeks of exploration, experimentation, and struggle with undocumented libraries. I explain how to install and compile a third-party implementation for use in your project, as well as the API of the provided ROI layers. Hopefully this guide saves others a lot of time!


I used the ROI layer implementation from jwyang’s faster-rcnn.pytorch repository. This is the most popular Faster-RCNN PyTorch repository on GitHub, and so presents a solid choice. As an aside, I believe some of the ROI layer code here was heavily influenced by Facebook’s maskrcnn-benchmark repository. (Mask R-CNN introduced the improved variant, ROI Align!)

Note: I used Python 3.7, but this should work with any Python version at or above 2.7. I also use PyTorch 1.0, but PyTorch 0.4 users should be able to follow along with some minor adjustments.

First, clone jwyang’s faster-rcnn.pytorch repository. Then make sure to checkout the pytorch-1.0 branch. This is important! The compilation steps differ across the master branch (for PyTorch 0.4) and the pytorch-1.0 branch.

git clone https://github.com/jwyang/faster-rcnn.pytorch.git
cd faster-rcnn.pytorch
git checkout pytorch-1.0

As copied from the instructions in the README, install the requirements with pip, then build and compile using Python setup tools:

A screenshot from jwyang/faster-rcnn.pytorch’s README on the pytorch-1.0 branch, showing the compilation instructions.
pip install -r requirements.txt
cd lib
python setup.py build develop

Important: To be able to use the ROI-Pool and ROI-Align layers, the dependencies in requirements.txt MUST be installed in your Python environment. Else you’ll encounter segfault errors. If you use conda, make sure the activated environment is the same as the one used to compile the library.

To make sure the installation succeeded, open up a Python prompt and type:

>>> import sys
>>> sys.path.append("/[location_to]/faster-rcnn.pytorch/lib")
>>> from model.roi_layers import ROIPool # PyTorch 1.0 specific!
>>> roi_pool = ROIPool((2,2), 1)

…where [location_to] is wherever the faster-rcnn.pytorch repository is cloned on your system. The sys.path.append statement appends the compiled library’s directory to Python’s module search path, which lets us import ROIPool.

If all goes well, no import errors should pop up. If the import failed, then something went wrong with compilation or the path is incorrect.

Note: the import statement is PyTorch 1.0 specific. If you’re on PyTorch 0.4, the correct import statement is this: 
> from model.roi_pooling.modules import roi_pool # PyTorch 0.4

Basic Usage

Great! Now that we’ve compiled the library and verified it works, how do we actually use ROIPool and ROIAlign?

The usage is thus:

# your own implementations to load data
image = get_image() # returns a (batch×channel×height×width) tensor
rois = get_rois() # returns a (batch×n×5) tensor
# init the ROI layers
roi_pool = ROIPool((width, height), spatial_scale)
roi_align = ROIAlign((width, height), spatial_scale, sampling_ratio)
# turn our (batch_size×n×5) ROI into just (n×5)
rois = rois.view(-1, 5)
# reset ROI image-ID to align with the 0-indexed minibatch
rois[:, 0] = rois[:, 0] - rois[0, 0]
# feed-forward our data
pooled_output = roi_pool(image, rois)
aligned_output = roi_align(image, rois)

If you’re already comfortable with loading ROI files and know what spatial_scale and sampling_ratio mean, you’re good to go! Just note that ROIAlign samples sampling_ratio² points per bin; e.g. sampling_ratio=2 samples 2×2=4 points per bin via bilinear interpolation, then averages them.

If you have no idea what any of those words meant, then read on!

The Details

First, if you don’t already know how ROI Pool works conceptually, read a tutorial here.

ROI Pool takes 1) an image, and 2) regions of interest (ROIs) to extract from. The image is straightforward — it’s just your standard tensor. The output from your DataLoader yields a (batch×channel×height×width)-shaped tensor. But how do we work with ROIs? What do they look like?

What ROI Data Looks Like

It turns out ROIs, by de facto standard, are formatted with the image-ID in the first column. The remaining four columns contain the coordinates of the bounding box’s upper-left and lower-right corners.

image_id, upper_left_x, upper_left_y, lower_right_x, lower_right_y
0, 10, 10, 20, 20
0, 15, 15, 25, 25
1, 27, 10, 36, 20
1, 15, 15, 25, 25

In the example above, we have two ROIs for image-ID 0 and two ROIs for image-ID 1.

Typically, all the ROIs are stored in a single CSV file. We can load it into a NumPy array using pandas, then convert that array into a PyTorch tensor. Lastly, create a custom Dataset and DataLoader to feed the image and its ROIs to your neural network.
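As a sketch of that loading step (stdlib-only here for illustration; in a real pipeline you would use pandas.read_csv on your file and wrap the result with torch.tensor inside a custom Dataset — the function name below is my own):

```python
import csv
import io

def load_rois(csv_text):
    """Parse ROI rows (image_id, x1, y1, x2, y2) from CSV text
    into a list of float rows. In practice, convert the result
    with torch.tensor(...) and serve it from a custom Dataset."""
    reader = csv.reader(io.StringIO(csv_text))
    return [[float(v) for v in row] for row in reader if row]

# The four example rows from above:
rois = load_rois("0,10,10,20,20\n0,15,15,25,25\n1,27,10,36,20\n1,15,15,25,25\n")
```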

When we feed the data to our ROI layers, the input dimensions must look like this:

def roi_pool(image: torch.Tensor, rois: torch.Tensor):
image: a (batch_size×channel×height×width) tensor input to pool
rois: a (n×5) tensor to represent the regions of interest,
where n is the number of rois

Problems with ROI Dimensions and Image-ID

However, you may notice that DataLoaders always prepend an additional dimension for batch size. For example, if your minibatch size is 4, the DataLoader will yield a (4×n×5) ROI tensor. But roi_pool and roi_align only work with (n×5) tensors. What do we do?

The solution is to reshape our ROI tensor with PyTorch’s view() function:

# turn our (batch_size×n×5) ROI into just (n×5)
rois = rois.view(-1, 5)
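To see what that reshape does with concrete numbers, here is a pure-Python analogue (no torch required) that collapses a batch_size×n×5 nested list into (batch_size·n)×5:

```python
def flatten_batch(batched_rois):
    # Equivalent of rois.view(-1, 5): merge the batch and n dimensions
    return [row for sample in batched_rois for row in sample]

# 2 images in the batch, 2 ROIs each -> 4 rows of 5 values
batched = [
    [[0, 10, 10, 20, 20], [0, 15, 15, 25, 25]],
    [[1, 27, 10, 36, 20], [1, 15, 15, 25, 25]],
]
flat = flatten_batch(batched)  # 4 rows, each of length 5
```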

Another problem is that the image-ID will NOT be aligned with the batch index. This is because each image-ID in your dataset is unique, but the batch index runs from 0 to batch_size − 1. Therefore we must manually “reset” the image-ID:

# reset ROI image-ID to align with the 0-indexed minibatch
rois[:, 0] = rois[:, 0] - rois[0, 0]
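In pure Python, the reset looks like this (a sketch; note it assumes the image-IDs within a minibatch are consecutive, e.g. 6, 6, 7, 7 becomes 0, 0, 1, 1):

```python
def reset_image_ids(rois):
    # Subtract the first row's image-ID from every row's image-ID,
    # mirroring rois[:, 0] = rois[:, 0] - rois[0, 0]
    first_id = rois[0][0]
    return [[row[0] - first_id] + row[1:] for row in rois]

rois = [[6, 10, 10, 20, 20], [6, 15, 15, 25, 25],
        [7, 27, 10, 36, 20], [7, 15, 15, 25, 25]]
reset = reset_image_ids(rois)  # image-IDs become 0, 0, 1, 1
```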

ROI Layer Initialization Parameters

We construct ROI layers as shown below, but what do these parameters mean?

# init the layers
roi_pool = ROIPool((width, height), spatial_scale)
roi_align = ROIAlign((width, height), spatial_scale, sampling_ratio)

Let’s start with spatial_scale by looking at a typical CNN. Here is a diagram of VGG16:

VGG16 convolutional neural network. Source: Simonyan et al. 2014.

A CNN effectively downscales an image as it progresses through the network. This scale factor is the spatial scale. For example, the spatial scale of the fourth layer’s 28×28 feature map relative to the 224×224 input is 28/224 = 0.125. If we were to ROI pool on that layer, we’d pass 0.125 to the spatial_scale parameter.
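The arithmetic is simple enough to sanity-check in code, using the VGG16 sizes from the text:

```python
# spatial_scale = feature-map side length / input side length
input_size = 224    # VGG16 input: 224x224
feature_size = 28   # fourth-layer feature map: 28x28
spatial_scale = feature_size / input_size
# 0.125 is the value you'd pass to ROIPool/ROIAlign when
# pooling on that feature map
```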

How about sampling_ratio in ROI align? To understand this, we need to understand a bit about how ROI align works. Page 3 from this source provides an excellent explanation.

The value of each “bin” in the output of the ROI Align layer is determined by averaging bilinear-interpolation samples. In the diagram from that source, there are 4 samples (the blue dots) per bin.

The sampling_ratio parameter determines how densely each bin is sampled. For example, if sampling_ratio=2, each bin gets a 2×2 grid of 4 sample points. (If you’d like to verify this yourself, take a look at the implementation’s underlying C source code.)
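To make the sampling concrete, here is a minimal sketch of the bilinear-interpolation primitive that ROI Align evaluates at each of its sampling_ratio² points per bin (my own illustrative version, not the library’s C/CUDA code; no boundary handling):

```python
import math

def bilinear_sample(grid, x, y):
    """Bilinearly interpolate a value at a fractional (x, y)
    position from a 2D grid of floats."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    dx, dy = x - x0, y - y0
    return (grid[y0][x0]         * (1 - dx) * (1 - dy)
          + grid[y0][x0 + 1]     * dx       * (1 - dy)
          + grid[y0 + 1][x0]     * (1 - dx) * dy
          + grid[y0 + 1][x0 + 1] * dx       * dy)

# With sampling_ratio=2, ROI Align averages 4 such samples per bin.
grid = [[0.0, 1.0],
        [2.0, 3.0]]
center = bilinear_sample(grid, 0.5, 0.5)  # 1.5, the mean of all four cells
```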


Congrats! You made it to the end. Hopefully by now you understand how to add ROI layers to your own neural networks in PyTorch. We walked through how to install the ROI implementation from jwyang’s repository, work with the layers and ROIs in code, and explained the initialization parameters. Happy coding!