NB: This is my first blog post on Medium. It might go through several edits before its formatting is up to my standard, if not the best Medium bloggers’ standard.
I have a buddy who has spent a good amount of money and time setting up a GPU computing cluster. It was a non-trivial effort to bring the beast online. After that, I was tasked with getting TensorFlow and PyTorch to work on it.
I started right away. Along the way, I was surprised to hit multiple roadblocks, and I spent a lot of time searching for solutions. Only recently was I finally able to put everything together. I am sharing my experience here, and hopefully it can help you.
Here is my setup:
– OS: Ubuntu 16.04 (codename: xenial)
– Nvidia GeForce GTX 1080 Ti
– Target CUDA version: 9.1.85, Driver Version: 390.30
I could choose the CUDA driver version when I installed the graphics cards; the default is CUDA 9.1. The first confusion I ran into was the many conflicting opinions on whether TensorFlow works with CUDA 8 only, 9.0 only, or even 9.1, the latest version. After a few trials, I found that 9.1 was totally fine.
Install Nvidia CUDA Toolkit:
– Nvidia’s installation guide is good enough. I focused on the following sections:
– Pre-installation Actions
– The CUDA toolkit is available at https://developer.nvidia.com/cuda-downloads. Based on my platform, this was my choice:
- After I downloaded the base installer, I followed the “Installation Instructions” on the same page. Note that I used
sudo apt-get install cuda-9.1 to pin my version. It turned out to be a good practice, especially once I had multiple CUDA versions on my platform.
– The installation might take quite a few minutes. When it completed, I updated my ~/.bash_profile:
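The exact lines I added are not reproduced above; a typical update for a default CUDA 9.1 install looks like the following sketch (the /usr/local/cuda-9.1 prefix is the installer’s default — adjust it if yours differs):

```shell
# Append to ~/.bash_profile: put the CUDA 9.1 compiler and libraries on the path.
# /usr/local/cuda-9.1 is the default install prefix; adjust if yours differs.
export PATH=/usr/local/cuda-9.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

After editing, run source ~/.bash_profile (or open a new shell) so the changes take effect.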
Install cuDNN 7.0.5
– cuDNN is part of Nvidia’s Deep Learning SDK, so I needed it.
– The download page is at https://developer.nvidia.com/cudnn. I needed to register as a developer, and so will you.
– There are 3 libraries: the runtime library, the developer library, and the code samples. I had no problem following the installation instructions, so I won’t repeat them here.
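To confirm which cuDNN version actually landed on the system, you can read the version macros out of the header. A small sketch (the /usr/include/cudnn.h path assumes the deb-package install; the tar install puts the header under /usr/local/cuda/include instead):

```shell
# Print cuDNN's version macros; fall back to a message if the header is absent.
grep -m1 -A 2 "#define CUDNN_MAJOR" /usr/include/cudnn.h 2>/dev/null \
  || echo "cudnn.h not found"
```

For cuDNN 7.0.5 this should report CUDNN_MAJOR 7, CUDNN_MINOR 0, and CUDNN_PATCHLEVEL 5.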
How to Check CUDA Versions
– Run $ nvcc --version, and my result is like: “…release 9.1, V9.1.85”
– Alternatively, I can run $ cat /usr/local/cuda/version.txt and get “CUDA Version 9.1.85”
– Run $ nvidia-smi, and my result is like: “…NVIDIA-SMI 390.30, Driver Version: 390.30”
Verify Device Versions before Installing TensorFlow
– Verify that /usr/local/cuda/ is symlinked to /usr/local/cuda-9.1/
– $ cd /usr/local/cuda/samples/
– Run $ make clean && make
– The make process took quite a few minutes to complete. After that, I went to /usr/local/cuda/samples/bin/x86_64/linux/release/
– Run $ ./deviceQuery and my result is: “deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.1, NumDevs = 1, Result = PASS”
– A detailed introduction to the full list of samples is here:
Install TensorFlow from Source
– This was the big discovery: I was not able to install TensorFlow from its pre-built binary. Instead, I had to build it from source!
– TensorFlow provides its own installation instructions: https://www.tensorflow.org/install/install_sources. They are quite detailed but easy to follow, so I won’t repeat them here.
– Note that I was using Anaconda 3 (Python 3.6.3) as my Python backend, and it was working fine.
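For reference, the build-from-source flow on that page boils down to the following sketch (the r1.5 branch and the /tmp output path are my choices, not prescribed; ./configure is interactive, and I answered yes to CUDA support, giving 9.1 as the CUDA version and 7.0.5 for cuDNN):

```shell
# Sketch of a TensorFlow 1.5 source build with CUDA support (requires bazel).
git clone https://github.com/tensorflow/tensorflow
cd tensorflow
git checkout r1.5
./configure          # interactive: enable CUDA, point it at CUDA 9.1 / cuDNN 7.0.5
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip3 install /tmp/tensorflow_pkg/tensorflow-*.whl
```

The bazel build step alone can take an hour or more, depending on your machine.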
– After TF was installed, I could verify my installation with:
>>> import tensorflow as tf
>>> tf.__version__  # My answer is '1.5.0'
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
>>> print(sess.run(hello))  # b'Hello, TensorFlow!'
To summarize, the biggest gotcha moments: first, TF can work with CUDA 9.1, so you do not have to wade through the dozens of debates on this topic on Stack Overflow. Second, you need to install TensorFlow from source; a simple
pip3 install tensorflow-gpu will not work. (At least it did not work for me.)
Source: Deep Learning on Medium