Way to store large Deep Learning Models in production ready environments


A data scientist usually deals with and experiments with a number of machine learning (ML) and deep learning (DL) techniques in day-to-day work. At times it takes several days to train a single model, which produces output that is later used for evaluation.

As long as these models reside on a single server that other team members also have access to, it is easy and manageable to work on the same project and share references to the models. But this becomes difficult as team size, dependencies, and other dimensions grow, and such a setup also makes it impossible to use the models from different machines.

So what’s the best way to store large DL models trained on huge and complex datasets?

Let’s see what the possible choices are:

  • I regularly try multiple variations of the same model to find which parameters work best. To do so, I make a simple NumPy dump of the model, since it is easy to share between servers or colleagues; however, importing such a model on another machine might not work if the Python environment differs slightly (a minimal sketch of this approach follows this list)
  • You should avoid pickle, since it stores much more (class instances, libraries…) than just the parameters learned by your model
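A minimal sketch of that NumPy-dump idea, using made-up parameter names and shapes purely to illustrate persisting only the learned arrays, might look like this:

import numpy as np

# hypothetical learned parameters of a model (weights and biases)
weights = np.random.randn(784, 128)
biases = np.random.randn(128)

# persist only the arrays, not the surrounding Python objects
np.savez('model_params.npz', weights=weights, biases=biases)

# later, on the same or a different machine
params = np.load('model_params.npz')
restored_weights = params['weights']
restored_biases = params['biases']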

When pushing a model to production, one needs:

  • A version of the model that can be loaded fast in case of a server breakdown, typically a binary format, storing only what is necessary, such as weights of a neural network
  • Some way to keep the model in RAM to quickly serve API requests (a sketch of this pattern follows this list)
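To illustrate the in-RAM point, a common pattern is to load the weights once when the server process starts and reuse the loaded object for every request. The sketch below is only one way to do it: it assumes a Keras model saved as model.h5 and uses Flask as an example web framework, neither of which is prescribed here:

from flask import Flask, request, jsonify
from tensorflow.keras.models import load_model
import numpy as np

app = Flask(__name__)

# loaded once at startup and kept in RAM for all subsequent requests
model = load_model('model.h5')  # assumed file name

@app.route('/predict', methods=['POST'])
def predict():
    # assumed request schema: {"features": [...]} with a flat feature vector
    features = np.array(request.json['features'])
    prediction = model.predict(features.reshape(1, -1))
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run()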

In such cases, some possible solutions are:

  1. Store your model in object storage (e.g. Amazon S3) — this method is good if your models are very large; you get unlimited storage and a fairly easy API, but you certainly pay more. Advantages: unlimited space and the ability to store arbitrary file formats (a short sketch follows this list). Disadvantages: cost, and the fact that to do it right you’ll need to develop your own tracking system
  2. Store them in document storage (e.g. MongoDB) — this method is recommended when your model files (or their joblib shards) are less than 16 MB, in which case you can store the model as binary data. In addition, some ML libraries support model export and import in JSON (e.g. LightGBM), which makes it a perfect candidate for document storage. Advantages: easy tracking of model generations and easy access. Disadvantages: things get messy if the model object is too large
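For the object-storage option, a minimal sketch with boto3 might look like the following; the bucket name and object key are placeholders, and the version folder in the key stands in for the tracking system you would have to build yourself:

import boto3

s3 = boto3.client('s3')

# hypothetical bucket and key; the key encodes a model version for tracking
bucket_name = 'my-model-store'
model_key = 'models/my_model/v1/model.h5'

# upload the trained model file
s3.upload_file('model.h5', bucket_name, model_key)

# download it again on another machine
s3.download_file(bucket_name, model_key, 'model.h5')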

But hey, just 16 MB with MongoDB? Isn’t there a way to increase it?

Actually, the maximum BSON document size (the format in which MongoDB stores documents) is 16 megabytes. The maximum document size helps ensure that a single document cannot use an excessive amount of RAM or, during transmission, an excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API.

GridFS

Instead of storing a file in a single document, GridFS divides the file into parts, or chunks, and stores each chunk as a separate document. By default, GridFS uses a chunk size of 255 kB; that is, GridFS divides a file into chunks of 255 kB, with the exception of the last chunk.

The last chunk is only as large as necessary. Similarly, files that are no larger than the chunk size only have a final chunk, using only as much space as needed plus some additional metadata.

It is noteworthy that GridFS stores files in two collections: fs.files, which holds each file’s metadata, and fs.chunks, which holds the binary chunks.
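A quick way to see those two collections in action, and to tune the chunk size, is the GridFSBucket API in pymongo, which accepts a chunk_size_bytes option. The file name below is assumed:

import pymongo
import gridfs

db = pymongo.MongoClient()['test']

# GridFSBucket allows a custom chunk size (here 1 MB instead of the 255 kB default)
bucket = gridfs.GridFSBucket(db, chunk_size_bytes=1024 * 1024)

with open('model.h5', 'rb') as f:  # assumed file name
    file_id = bucket.upload_from_stream('model.h5', f)

# GridFS keeps the metadata in fs.files and the binary pieces in fs.chunks
print(db['fs.files'].count_documents({'_id': file_id}))
print(db['fs.chunks'].count_documents({'files_id': file_id}))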

Let’s see how to store large machine learning models in GridFS using Python.

Traditionally, storing files in GridFS with pymongo is simple (note that the old Connection class has been replaced by MongoClient in recent pymongo versions):

from pymongo import MongoClient
from gridfs import GridFS

# connect to the database and create a GridFS handle on it
db = MongoClient().mydb
fs = GridFS(db)

# open the file in binary mode and store it in GridFS
myFile = open('myFile.ext', 'rb')
fid = fs.put(myFile, filename='myFile')

This works well for common file types (text, images, etc.), but it doesn’t play well with HDF5 files, since the HDF5 Python module (h5py) doesn’t provide a read() method on its file instances. Python’s built-in io module is helpful in this case:

import io

# io.FileIO yields a raw binary stream whose read() method fs.put can consume
with io.FileIO(filename, 'r') as fileObject:
    objectId = fs.put(fileObject, filename=filename)

Remember that the ObjectId obtained after saving a file to MongoDB is important: it is the only way to retrieve the file from the database, which is also easy:

with open(filename, 'wb') as fileObject:
    fileObject.write(fs.get(objectId).read())

Attached is a full-fledged script that stores all the models in a folder to MongoDB using GridFS:

import pymongo, gridfs
import io, glob as g, pandas as pd, sys
from bson import ObjectId

MONGO_HOST = "127.0.0.1"
MONGO_PORT = 27017
MONGO_DB = "test"

# connect to MongoDB and create a GridFS handle on the target database
con = pymongo.MongoClient(MONGO_HOST, MONGO_PORT)
db = con[MONGO_DB]
fs = gridfs.GridFS(db)

signal = sys.argv[1]

if signal == 'save':
    # collect every model file in the current folder
    listOfFiles = []
    for filetype in ['*.h5', '*.json', '*.hdf5']:
        listOfFiles.extend(g.glob(filetype))

    # store each file in GridFS and keep its ObjectId for later retrieval
    df = pd.DataFrame(columns=['filename', 'objectId'])
    for idx, filename in enumerate(listOfFiles):
        with io.FileIO(filename, 'r') as fileObject:
            objectId = fs.put(fileObject, filename=filename)
        df.loc[idx, 'filename'] = filename
        df.loc[idx, 'objectId'] = objectId

    df.to_csv('objectIdsOfSavedFiles.csv', index=False)
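Assuming the script is saved under the name used in the retrieval command later in this post (connectMongoForStoringAndRetrievingModels), the save branch would be run as:

$ python3 connectMongoForStoringAndRetrievingModels save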

I used it to save four large models, and I could verify them in my local MongoDB instance. All the ObjectIds are stored in a CSV file.

This CSV can later be used to retrieve the models from the database:

import pymongo, gridfs
import io, glob as g, pandas as pd, sys
from bson import ObjectId

MONGO_HOST = "127.0.0.1"
MONGO_PORT = 27017
MONGO_DB = "test"

# same connection boilerplate as the save branch
con = pymongo.MongoClient(MONGO_HOST, MONGO_PORT)
db = con[MONGO_DB]
fs = gridfs.GridFS(db)

signal = sys.argv[1]

if signal == 'retrieve':
    # read the ObjectIds saved earlier and rebuild each model file locally
    objects = pd.read_csv('objectIdsOfSavedFiles.csv')
    for idx in objects.index:
        with open(objects.loc[idx, 'filename'], 'wb') as fileObject:
            fileObject.write(
                fs.get(ObjectId(objects.loc[idx, 'objectId'])).read())

Before running the retrieval script, the model files are not present in the folder; after running it with the command below, they are restored from MongoDB:

$ python3 connectMongoForStoringAndRetrievingModels retrieve

Keep exploring new techniques and employ technology to its best use.