Original article was published by Attila Gróf on Deep Learning on Medium
Git Large File Storage for ML/DL projects
Introduction to Git Large File Storage
In some projects, storing the change history of files that are wired into the code is a must. Git can technically store large files as well, but the repository size quickly builds up and performance degrades: Git is very efficient at line-by-line diffs of text files and very bad at storing large binary files.
This is where Git Large File Storage (git lfs) comes into the picture. With git lfs, the repository stores only small text pointers, while the original large files (pictures, videos, model weights) are kept on a separate server, similar to Google Drive or OneDrive. This way the git repository stays clean of large files, and they won't hurt performance.
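To make this concrete, here is what a git lfs pointer file looks like inside the repository (the format is the standard LFS pointer; the oid and size below are illustrative, not from a real file):

```
version https://git-lfs.github.com/spec/v1
oid sha256:4665a5ea42…
size 84977953
```

The pointer records only the content hash and byte size; the actual binary is fetched from the LFS server on checkout.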
There is a common problem with small and medium-sized machine and deep learning projects: keeping track of train/test/validation data and trained models.
Having version history for these is very useful, and having them right in the repository can save developers a lot of time. Most of the time, small and medium-sized projects don't have the budget to build custom training pipelines and special environments to handle data and code together. With git lfs you can keep your training code, training data, application code, and final models together in one place.
Some actual numbers from GitHub
Using Git LFS, you can store files up to:
- GitHub Free 2 GB
- GitHub Pro 2 GB
- GitHub Team 4 GB
- GitHub Enterprise Cloud 5 GB
One data pack costs $5 per month and provides a monthly quota of 50 GB of bandwidth and 50 GB of storage. You can purchase as many data packs as you need. For example, if you need 150 GB of storage, you'd buy three data packs.
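The data-pack arithmetic is just a ceiling division; a quick sketch using the quotas quoted above (the variable names are my own):

```shell
# How many $5 data packs cover a given storage need?
needed_gb=150      # storage you want
pack_gb=50         # storage included per data pack
# Ceiling division without floating point
packs=$(( (needed_gb + pack_gb - 1) / pack_gb ))
echo "$packs data packs -> \$$(( packs * 5 )) per month"
```

For 150 GB this prints 3 data packs at $15 per month.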
Some practical examples
git lfs is easy to set up on both Linux and Windows by following the official guide: https://git-lfs.github.com/
To use it with a remote repository, the hosting provider needs to support git lfs. The largest providers (GitHub, Bitbucket, GitLab) all support it.
# Set up git lfs for your account (needs to be done only once)
git lfs install

# Track model files with git lfs
git lfs track "*.h5"

# Make sure .gitattributes is tracked
git add .gitattributes
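After the track command, `.gitattributes` contains an entry like the following (this is the standard line git lfs writes; the pattern matches the `*.h5` example above):

```
*.h5 filter=lfs diff=lfs merge=lfs -text
```

Committing this file is what tells every clone of the repository to route `*.h5` files through git lfs.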
There are no more steps: simply commit and push your large files just as you would with plain git.
My example can be found at: https://github.com/grofattila/git-lfs-for-machine-learning
# Train the MNIST model; this will also save the model.
python train_mnist.py

# Track the saved model files with git lfs
# (quote the pattern so git lfs, not the shell, expands it)
git lfs track "trained_models/*"

# Stage the models and the updated .gitattributes
git add .gitattributes trained_models/*
git commit -m "Added trained model files."
After these steps you can see on GitHub that the files are tracked via git lfs.
In the image above, the model file is stored with git lfs.