Source: Deep Learning on Medium
Getting Ready for Deep Learning using GPUs: A Setup Guide
The past decade has seen explosive adoption of data-intensive approaches to address critical challenges such as recommendation, demand sensing, fraud detection, vision guidance etc. across diverse industries. With data playing an increasingly significant role, and with growth in its volume, velocity and veracity, it is imperative that appropriate analysis is performed to capture the desired patterns. Such analysis requires us to extend compute capabilities beyond the CPU and execute applications using distributed, higher-powered compute. In this context, several Deep Learning (DL) implementations utilize Graphics Processing Units (GPUs) to accelerate the model development cycle.
In a recent project, we utilized Bidirectional Encoder Representations from Transformers (BERT) to address a language modeling problem, and the process exposed us to the complexity of setting up a GPU for executing DL applications. This guide covers the detailed setup process, discusses the necessary components and their dependencies, and identifies the appropriate resources to confirm compatibility across components. Cloud vendors like Amazon Web Services (AWS), Google Cloud Platform (GCP) etc. provide machine images with these components pre-installed. Still, this guide can help you understand how the different components intertwine, and it can be used when setting up infrastructure locally or on-premise.
Overall, GPU setup requires four key components:
- Nvidia Driver: Software that enables operating system and applications to communicate with Nvidia Graphics card.
- CUDA: As an enabling hardware and software technology, CUDA makes it possible to use many computing cores in a graphics processor to perform general-purpose mathematical calculations, achieving dramatic speedups in computing performance.
Generally speaking, drivers are backward compatible with respect to CUDA toolkits. For example, the latest driver should work with any older CUDA toolkit, whereas an older driver may not work with a newer CUDA toolkit (reference: Stack Overflow).
The minimum driver version required for a CUDA toolkit can be obtained here.
- cuDNN: A GPU-accelerated library of primitives for deep neural networks, providing highly tuned implementations of standard neural network routines.
- Deep Learning Framework: Frameworks for developing Deep Learning applications, such as PyTorch, TensorFlow etc. PyTorch comes packaged with CUDA/cuDNN. However, other frameworks' packaging might differ and require additional components. This guide covers Nvidia driver and CUDA toolkit installation followed by PyTorch setup; other frameworks might require additional configuration.
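The driver/toolkit compatibility note above can also be checked mechanically. The sketch below compares an installed driver version against a minimum required version using a version-aware sort; both version numbers are illustrative placeholders, not authoritative values (look up the real minimum in Nvidia's compatibility table).

```shell
# Illustrative check: does the installed driver meet a CUDA toolkit's
# minimum driver requirement? Both versions below are example values;
# the installed version can be obtained on a live machine with:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
min_required="418.39"    # example: a minimum driver version for a toolkit
installed="440.33.01"    # example: driver version installed on the machine

# sort -V orders version strings numerically; if the minimum sorts first,
# the installed driver is at least as new as required.
if [ "$(printf '%s\n' "$min_required" "$installed" | sort -V | head -n1)" = "$min_required" ]; then
  echo "driver meets the minimum requirement"
else
  echo "driver is older than the minimum requirement"
fi
```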
We will use Google Cloud Platform virtual machines to illustrate the process, but this guide should work just as well for other cloud services or local systems. In addition, Ubuntu 18.04 will be used as the underlying Operating System (OS); one may observe differences when using another OS.
1. Obtain a GPU enabled Machine
An obvious requirement for the setup process is a GPU-enabled machine. When using cloud platforms, one usually needs to submit a request to increase the GPU quota limit, which is set to zero by default. The following image briefly illustrates the process on GCP. Typically, the approval process finishes within a few hours.
Once approved, GPUs can be added to virtual machines at launch. For this tutorial, the launch configuration consists of one Tesla K80, 8 vCPU cores and Ubuntu 18.04 LTS with a 30 GB persistent disk.
SSH into the instance and verify access to GPU.
#Fetch new package versions
sudo apt-get update && sudo apt-get dist-upgrade -y

#Install lspci through pciutils
sudo apt-get install pciutils -y

#Verify Nvidia device is present
lspci | grep -i nvidia
One should observe output confirming GPU accessibility.
00:04.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
2. Install Nvidia driver
The Nvidia driver can be installed through the Advanced Package Tool (apt-get) or, alternatively, through a runfile available on the Nvidia Driver Downloads page. APT installation is easier to execute but might not provide the latest driver version.
In addition, it is assumed that no Nvidia driver is already installed. If one is, additional steps are required, as detailed in Section 2B. One can run nvidia-smi on the terminal to confirm this.
2A. Installation using APT
- As a first step, install ubuntu-drivers-common. This package aggregates and abstracts Ubuntu-specific logic and knowledge about third-party driver packages.
sudo apt-get install ubuntu-drivers-common -y
- Identify available versions of Nvidia drivers
sudo ubuntu-drivers devices
- Install the recommended driver (or one of your choice) along with nvidia-modprobe, a utility that loads Nvidia kernel modules and creates Nvidia character device files automatically every time the machine boots up.
sudo apt-get install nvidia-driver-435 nvidia-modprobe -y
- On completion of the installation process, running nvidia-smi should display an output similar to Figure 4. Though the output displays a CUDA version, it only reflects the highest compatible CUDA version, not an actual CUDA installation.
2B. Uninstallation of existing Nvidia Driver
If an Nvidia driver already exists on the machine, additional uninstallation steps need to be executed before installing a new one. The method depends on whether the driver was installed through APT or a runfile. One would need the installer file for the latter method (check out this Stack Overflow post).
#If installation was performed through APT
sudo apt-get purge nvidia* -y
sudo apt autoremove -y

#If installation was performed through the installer runfile
sudo ./NVIDIA-Linux-x86_64-440.33.01.run --uninstall
Before proceeding further, one must disable the default Nouveau kernel driver. To do so, create a configuration file
sudo nano /etc/modprobe.d/blacklist-nouveau.conf
and copy in the following contents
blacklist nouveau
options nouveau modeset=0
initramfs is a tiny version of the OS that gets loaded by the boot loader right after the kernel. It lives in RAM and provides just enough tools and instructions to tell the kernel how to set up the real filesystem, mount the disk read/write, and start loading system services. update-initramfs is a script that updates the initramfs to work with a new kernel. Run it and reboot the machine for the changes to take effect.
sudo update-initramfs -u
2C. Installation using Runfile from Nvidia
- For installation through the runfile, additional dependencies are required.
sudo apt-get install build-essential gcc-multilib dkms -y
- As the last step in the Nvidia driver installation process, download and install the .run file using the link obtained above. A warning message about installing libglvnd EGL vendor library config files might appear and can be ignored.
#Download the runfile
#Add execute permission
chmod +x NVIDIA-Linux-x86_64-440.33.01.run
#Execute the run script
sudo ./NVIDIA-Linux-x86_64-440.33.01.run --dkms -s
- Run nvidia-smi to confirm Nvidia driver version has been updated.
3. Install CUDA
The next step in the setup process is to install CUDA. One can confirm whether there is an existing installation by running the following command; it returns no results in the absence of an installation.
ls /usr/local | grep cuda
The CUDA toolkit comes bundled with a default driver. If a more recent driver is required, refer to Section 2 for detailed steps. A reboot might be required after the installation.
Visit the Nvidia CUDA downloads link and select the appropriate options to obtain and execute the download instructions.
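Once the toolkit is installed, one can confirm it is visible with a quick check. This is a sketch; /usr/local/cuda is the default install prefix and is an assumption here, so adjust the path if you chose a custom location.

```shell
# Add the default CUDA install location to PATH (assumption: default prefix)
export PATH=/usr/local/cuda/bin:${PATH}

# Print the toolkit version if the CUDA compiler is found
if command -v nvcc >/dev/null 2>&1; then
  nvcc --version
else
  echo "nvcc not found on PATH - CUDA toolkit may not be installed"
fi
```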
4. Install Deep Learning Framework
Among the popular frameworks, PyTorch is user-friendly, intuitive, Pythonic and supports dynamic graphs (personally, I prefer PyTorch). That being said, TensorFlow has great features such as TensorFlow Extended and visualization capabilities, which can add value depending on the requirements.
PyTorch also comes packaged with CUDA, so only the Nvidia driver is additionally required to execute PyTorch applications successfully. Visit the PyTorch link and select the appropriate configuration to obtain installation instructions.
Notice that the latest CUDA version supported by PyTorch at the time of writing is 10.1; as long as the driver version is at least the minimum version that supports CUDA 10.1, it should all work out.
In order to install PyTorch through pip, one might need to install pip first.
sudo apt install python3-pip -y
pip3 install torch torchvision
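To see that the pip wheel indeed bundles its own CUDA/cuDNN, the versions it was built against can be printed. This is a sketch assuming the torch package installed above is available.

```python
# Inspect the CUDA and cuDNN versions bundled with the installed PyTorch wheel.
import torch

print(torch.version.cuda)              # CUDA version PyTorch was built against, e.g. '10.1'
print(torch.backends.cudnn.version())  # bundled cuDNN version, or None if unavailable
```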
5. It’s all ready now!
The installation process is now complete. Run the following torch commands in a Python3 terminal; they should identify one Tesla K80 device.
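The checks below are a sketch of the commonly used torch device calls rather than the exact commands from the original post.

```python
# Verify that PyTorch can see the GPU through the installed Nvidia driver.
import torch

print(torch.cuda.is_available())   # True when the driver and bundled CUDA work
print(torch.cuda.device_count())   # expected: 1 for this setup
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # expected: something like 'Tesla K80'
```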
Thank you for reading this through. I hope it helps you get your GPU ready for developing Deep Learning applications.