Serving Pytorch NLP models on AWS Lambda


Deep learning models are achieving state-of-the-art performance on many NLP problems and are seeing rapid growth in online applications. Recently, at work, I deployed a deep learning NLP model on AWS Lambda. My model uses Spacy to tokenize text and the Fastai library (a wrapper for Pytorch) to run NLP models. Although my model is NLP focused, this blog post also applies to other deep learning and machine learning models.

A quick introduction to AWS Lambda

There are many good introductions to AWS Lambda elsewhere, including AWS's own documentation and tutorials. In short, Lambda is a serverless service from AWS: you can run code for your services or applications without provisioning or managing any servers. You pay only for the time your code actually runs, and your code can be triggered by a wide range of events, including scheduled jobs. Lambda currently supports several runtime environments, including Python, the natural habitat for (most) data scientists.

Speaking from my own learning experience, there are two key concepts for understanding how Lambda works: the event and the handler. The event is what triggers the Lambda to run; in my case, it is a new data set being saved to an AWS S3 bucket, which emits a JSON document describing the event to the Lambda function. My handler, a Python function that takes the event and its context as inputs, parses that JSON and carries out the subsequent steps. For this reason, Lambda jobs are also called “functions”.
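As an illustrative sketch (the event shape below is the standard S3 notification format; the specific keys my project parsed are not shown), a minimal handler for an S3-triggered Lambda might look like this:

```python
import json

def handler(event, context):
    # An S3 "ObjectCreated" event arrives as a JSON document; each record
    # identifies the bucket and key of the newly saved file.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    # ...download the file, tokenize it, invoke the next Lambda, etc. ...
    return {"statusCode": 200,
            "body": json.dumps({"bucket": bucket, "key": key})}
```

You then register this function as the Lambda's handler (e.g. `lambda_function.handler`) when configuring the function.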

For a Lambda function to execute properly, you need to package up your application, including both your code and all of its dependencies. A virtual environment is a perfect venue for building those dependencies. Note that if your dependencies include any C-compiled libraries (which is true for almost all data science stacks), you will have to build them on an AWS Linux AMI, not a regular Ubuntu instance! This is because Lambda functions execute in an AWS Linux environment, which has a slightly different C/C++ compilation toolchain.

The challenge in deploying deep learning NLP models on Lambda is that Lambda imposes very, very strict limits. The total size of all code and dependencies for a single Lambda function is capped at 250 MB; Spacy with its English model alone exceeds this limit, let alone Pytorch, Numpy, and Pandas. A regular pip install of everything I need into a virtual environment easily surpasses 800 MB, and my model itself is over 80 MB. The key, therefore, is to find a way to split and shrink the packages.

Splitting and chaining Lambdas

Luckily, Lambda functions can invoke each other, which is easily done via the AWS boto3 library. You can therefore divide your program into multiple Lambda functions, each with fewer dependencies. In my case, the first Lambda was responsible for processing the dataset and tokenizing the text with Spacy, and the results were passed to the second Lambda for prediction with the pre-trained NLP model.

Splitting and chaining NLP models

To invoke another Lambda function, all you have to do is pass its function name. Since you are triggering the function directly (not through another event), the InvocationType is RequestResponse, and you can pass any values/variables via the Payload argument. Alternatively, you can save the results from the first Lambda in S3 and let the second Lambda retrieve them. After invoking the second Lambda function, the main Lambda simply waits for the invoked function to produce and return the desired result.

import json
import boto3

lambda_client = boto3.client('lambda')
response = lambda_client.invoke(
    FunctionName='second_lambda',
    InvocationType='RequestResponse',
    Payload=json.dumps({"key": value}).encode())
# RequestResponse invocations block until the second Lambda returns,
# so its result can be read back from the response payload
result = json.loads(response['Payload'].read())

Also keep in mind that AWS Lambda gives you 512 MB of ephemeral storage under the path /tmp/. That is where I download my model to, and later load it from via Pytorch. You could probably also load the model directly from S3, but I ran into some difficulties and took the easy way out.
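A sketch of that download step (the function name and default path are my own; substitute your bucket and key): since a warm Lambda container keeps /tmp between invocations, it is worth checking whether the model file is already there before downloading it again.

```python
import os

def ensure_model_downloaded(s3_client, bucket, key, local_path="/tmp/model.pt"):
    # /tmp persists across invocations of a warm container, so only
    # download the model once; later calls reuse the cached file.
    if not os.path.exists(local_path):
        s3_client.download_file(bucket, key, local_path)
    return local_path  # hand this path to torch.load() afterwards
```

In the handler you would call something like `ensure_model_downloaded(boto3.client('s3'), 'my-model-bucket', 'models/nlp_model.pt')` and then load the returned path with Pytorch.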

Making slim versions of libraries

Even with the above trick of separating Spacy and Pytorch into different Lambda functions, they are still far too big for individual Lambdas. As a general step, I followed the code from Rustem Feyzkhanov’s Github repo to make slim versions of the libraries, which removes test cases and other unnecessary files.

cd /home/
source env/bin/activate
mkdir lambdapack
cd lambdapack
cp -R ${virtualenv_path}/lib/python3.6/site-packages/* .
cp -R ${virtualenv_path}/lib64/python3.6/site-packages/* .
rm -r external
find . -type d -name "tests" -exec rm -rf {} +
rm -r pip
rm -r wheel
find . -name \*.pyc -delete
zip -FS -r9 /pack.zip * 

Additionally, for Spacy, depending on the language you are working in, you can delete the many unnecessary tokenizers for other languages, which can be found in virtualenv/lib/python3.6/site-packages/spacy/lang. Deleting several languages easily reduces the package size by another 100 MB.
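A small helper along those lines (a sketch; `prune_spacy_languages` and its `keep` list are my own names, and you should double-check that nothing in your pipeline imports the deleted languages):

```python
import shutil
from pathlib import Path

def prune_spacy_languages(lang_dir, keep=("en",)):
    # Delete tokenizer data for every language subpackage under spacy/lang
    # except those listed in `keep`; returns the removed directory names.
    removed = []
    for entry in Path(lang_dir).iterdir():
        if entry.is_dir() and entry.name not in keep:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return sorted(removed)
```

You would run this against .../site-packages/spacy/lang in the build directory before zipping the package.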

For Pytorch, you have to make sure you install the CPU version. Lambda does not provide any GPU support, so installing the GPU version is pointless anyway. I tried many methods for this, and the easiest, bullet-proof one was to find the wheel file for the Pytorch version you are using and do a simple pip install. For example, because I am using Pytorch 0.3.1, I can just do pip3 install http://download.pytorch.org/whl/cpu/torch-0.3.1-cp36-cp36m-linux_x86_64.whl.

A few words for Fastai library users

The Fastai library is an awesome wrapper for Pytorch that includes some of the state-of-the-art techniques for training deep learning models, plus quick pipelines for data processing and loading. If you are using the Fastai library for your Pytorch models (you should) and trying to deploy them on Lambda, make sure you build the dependencies from the bottom up rather than pip installing everything in the Fastai yml file. This also means slightly tweaking the Fastai source code to replace non-essential functionality (for example, I modified the Fastai tokenizer so it would not try to use multiple CPU cores) and deleting unnecessary classes, functions, and imports. There is no way around going through your code and performing the trimmings one line at a time.
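To give a flavor of that kind of tweak (a generic sketch, not Fastai's actual code): where the library fans tokenization out across CPU cores, a Lambda-friendly replacement just loops in a single process.

```python
def tokenize_all(texts, tokenize):
    # Single-process replacement for multiprocessing-based tokenization:
    # Lambda containers expose limited CPU, so spawning worker processes
    # mostly adds overhead (and another failure mode).
    return [tokenize(text) for text in texts]
```

Any callable tokenizer (Spacy's, or even str.split for testing) can be dropped in for the `tokenize` argument.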

Other useful sources

Certainly this is neither the only nor the optimal way to deploy your Pytorch models on Lambda. I chose to build all the dependencies myself because I can test the build locally in a virtual environment and easily integrate it into the downstream AWS pipelines my company uses. Alternatively, you can use frameworks such as Serverless and Zappa for fast and easy deployment. For example, Alec Rubin has deployed his CNN model using Serverless, which I highly recommend checking out.

Source: Deep Learning on Medium