How to Install Deep Learning Frameworks (PyTorch and JAX)
Sources:
- Pytorch GPU Setup Guide
For CUDA installation, see my How to Install and Setup CUDA post.
PyTorch (CUDA version)
This command installs only a CPU-compiled build of PyTorch (note that the conda package is named pytorch, not torch):
    conda install pytorch
To install a GPU-compiled build, you need to:
- Go to the PyTorch website.
- Choose a CUDA build of PyTorch in the install selector.
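But how do we know which CUDA version we need?
Answer: run nvidia-smi and read the "CUDA Version" field in its header. That is the highest CUDA version the installed driver supports, so pick a PyTorch build for that version or an older one:
    watch -n 0.1 nvidia-smi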
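If you would rather script the lookup than read the header by eye, something like the following may help. It is only a sketch: `parse_cuda_version` and `driver_cuda_version` are hypothetical helper names of mine, and the regular expression assumes the `CUDA Version: X.Y` header that recent drivers print.

```python
import re
import shutil
import subprocess


def parse_cuda_version(smi_output):
    """Return the 'CUDA Version' value from nvidia-smi's header text, or None."""
    match = re.search(r"CUDA Version:\s*([0-9]+(?:\.[0-9]+)?)", smi_output)
    return match.group(1) if match else None


def driver_cuda_version():
    """Run nvidia-smi once and return the CUDA version it reports, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # nvidia-smi is not on PATH, so there is nothing to query
    result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    return parse_cuda_version(result.stdout)
```

Calling `driver_cuda_version()` returns a string such as the version shown in the header, or `None` when the tool is missing or the driver is not responding.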
Check your GPU
To check whether PyTorch can find your GPU, use the following:
    import torch
    torch.cuda.is_available()
If your GPU cannot be found, it would be helpful to get some more feedback. Try sending something to the GPU. It will fail, and give you the reason:
    torch.zeros(1).cuda()
If a GPU is recognized, you can now get more information about it, or even decide which tensors and operations should go on which GPU.
    torch.cuda.current_device()  # The ID of the current GPU.
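To put the checks above in one place, here is a small sketch that lists every GPU PyTorch can see. `describe_gpus` is a hypothetical helper name; the `torch.cuda` calls it uses (`is_available`, `current_device`, `device_count`, `get_device_name`) are part of PyTorch, and the import is done lazily so the helper still runs in an environment without PyTorch.

```python
def describe_gpus():
    """Return one human-readable line per GPU that PyTorch can see."""
    try:
        import torch  # imported lazily so the helper also runs without PyTorch
    except ImportError:
        return ["PyTorch is not installed in this environment"]

    if not torch.cuda.is_available():
        return ["PyTorch is installed, but no CUDA device is available"]

    lines = [f"current device: {torch.cuda.current_device()}"]
    for index in range(torch.cuda.device_count()):
        lines.append(f"cuda:{index}: {torch.cuda.get_device_name(index)}")
    return lines


for line in describe_gpus():
    print(line)
```

Once a device shows up here, a tensor can be placed on it explicitly, for example with `torch.zeros(1, device="cuda:0")`.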
Hopefully, that will resolve some issues for you. Happy hacking!
JAX
See the JAX installation guide.
Check whether cuDNN and JAX are compatible:
    python -c "import jax; m = jax.numpy.array([1,]); m@m"
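The one-liner only tells you whether the computation runs. To also see which backend JAX picked, a sketch like the following may be useful. `jax_backend_summary` is a hypothetical helper name; `jax.default_backend()` and `jax.devices()` are JAX calls. If the summary says `cpu` on a GPU machine, the CUDA build of JAX is probably not the one installed.

```python
def jax_backend_summary():
    """Return a short description of the backend and devices JAX is using."""
    try:
        import jax  # imported lazily so the helper also runs without JAX
    except ImportError:
        return "JAX is not installed in this environment"

    try:
        devices = ", ".join(str(device) for device in jax.devices())
        backend = jax.default_backend()
    except Exception as error:  # backend initialisation can fail on broken setups
        return f"JAX is installed, but backend initialisation failed: {error}"
    return f"backend={backend} devices=[{devices}]"


print(jax_backend_summary())
```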
Notes
- If you want to install PyTorch and JAX together in one environment, you MUST install the CPU version of one of them, since the current releases of PyTorch and JAX have incompatible CUDA version dependencies. (-> Source)
Common Problems
Error:
    NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
This often happens after a kernel update.
Solution:
Remove the current CUDA driver:
    # First of all, we need to remove all of the previous dependencies
    sudo apt-get remove --purge '^nvidia-.*'
    sudo apt-get purge nvidia-*
    sudo apt-get update
    sudo apt-get autoremove
Reinstall CUDA.
Error:
    Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory
Solution:
First, I checked whether libcudnn_cnn_infer.so.8 exists on my system:
- ldconfig -p | grep libcudnn_cnn returned nothing.
- ldconfig -p | grep libcuda returned:
    libcudart.so.12 (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
    libcudart.so.12 (libc6,x86-64) => /usr/local/cuda-12.3/targets/x86_64-linux/lib/libcudart.so.12
    libcudart.so.12 (libc6,x86-64) => /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudart.so.12
    libcudart.so (libc6,x86-64) => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so
    libcudart.so (libc6,x86-64) => /usr/local/cuda-12.3/targets/x86_64-linux/lib/libcudart.so
    libcudart.so (libc6,x86-64) => /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudart.so
    libcudadebugger.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcudadebugger.so.1
    libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1
    libcuda.so.1 (libc6) => /lib/i386-linux-gnu/libcuda.so.1
    libcuda.so (libc6) => /lib/i386-linux-gnu/libcuda.so
So the system had the CUDA runtime and driver libraries, but still no libcudnn_cnn_infer.so.8.
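The same lookup can be done from Python, which is handy when you want to check several libraries at once. This is a sketch under assumptions: `find_libraries` and `ldconfig_cache` are hypothetical helper names, and the parser assumes the usual `name (flags) => path` line format of `ldconfig -p`.

```python
import re
import shutil
import subprocess

# One cache entry looks like: "libcuda.so.1 (libc6,x86-64) => /lib/.../libcuda.so.1"
_LDCONFIG_LINE = re.compile(r"^\s*(\S+)\s+\(([^)]*)\)\s+=>\s+(\S+)\s*$")


def find_libraries(ldconfig_output, needle):
    """Return (name, path) pairs whose library name contains `needle`."""
    found = []
    for line in ldconfig_output.splitlines():
        match = _LDCONFIG_LINE.match(line)
        if match and needle in match.group(1):
            found.append((match.group(1), match.group(3)))
    return found


def ldconfig_cache():
    """Return the text printed by `ldconfig -p`, or an empty string if unavailable."""
    if shutil.which("ldconfig") is None:
        return ""
    result = subprocess.run(["ldconfig", "-p"], capture_output=True, text=True)
    return result.stdout
```

For example, `find_libraries(ldconfig_cache(), "libcudnn_cnn")` returning an empty list corresponds to the empty grep result above.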
Make sure that both the NVIDIA driver and the CUDA toolkit are installed. Check the output of
    nvidia-smi
and
    nvcc --version
According to this reply, we just need to:
    conda activate <your env>
Error:
    /miniconda3/lib/libtinfo.so.6: no version information available (required by /bin/bash)
Source:
It seems that the libtinfo shared library shipped with conda does not provide its version information. So it's a problem on their end.
I was able to work around the problem by using another shared library of the same libtinfo.so with the same version as the one shipped with conda and that contains the version information. For instance in my case:
    rm ${CONDA_PREFIX}/lib/libtinfo*
Error:
    AttributeError: 'NoneType' object has no attribute 'glGetError'
You need to:
    pip install --upgrade pyrender
https://github.com/50ButtonsEach/fliclib-linux-dist/issues/44
    strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX_3.4.
https://askubuntu.com/questions/1418016/glibcxx-3-4-30-not-found-in-conda-environment
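First, list the GLIBCXX symbols in the system libstdc++:
    strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX
Then compare the GLIBCXX_3.4.x versions provided by the system library
    strings /usr/lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX_3.4.
with those provided by the copy inside the conda environment:
    strings /home/lyk/miniconda3/envs/recall2imagine/bin/../lib/libstdc++.so.6 | grep GLIBCXX_3.4.
If the conda copy lacks the version you need, replace it with a symlink to the system library:
    ln -sf /usr/lib/x86_64-linux-gnu/libstdc++.so.6 /home/lyk/miniconda3/envs/recall2imagine/bin/../lib/libstdc++.so.6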
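Comparing two long `strings` listings by eye is error-prone, so here is a sketch that does the diff for you. It is a rough stand-in for `strings | grep`, not a real ELF parser: it simply scans the file bytes for `GLIBCXX_<version>` tokens. `glibcxx_versions` and `missing_versions` are hypothetical helper names.

```python
import re
from pathlib import Path

# Matches tokens such as GLIBCXX_3.4.30 inside the raw bytes of the library.
_GLIBCXX = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)*)")


def glibcxx_versions(library_path):
    """Return the set of GLIBCXX version strings embedded in a libstdc++ file."""
    data = Path(library_path).read_bytes()
    return {match.group(1).decode("ascii") for match in _GLIBCXX.finditer(data)}


def missing_versions(system_lib, env_lib):
    """Return versions present in the system library but absent from the env copy."""
    return glibcxx_versions(system_lib) - glibcxx_versions(env_lib)
```

Pass the two paths from the commands above as `system_lib` and `env_lib`; a non-empty result means the conda copy is older than the system one.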
Error:
    2024-04-11 06:16:01.747343: E external/xla/xla/stream_executor/cuda/cuda_dnn.cc:439] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
It looks like the GPU is out of memory.
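If memory really is the cause, one thing that may be worth trying is to stop JAX from preallocating most of the GPU at startup. The sketch below assumes the `XLA_PYTHON_CLIENT_PREALLOCATE` and `XLA_PYTHON_CLIENT_MEM_FRACTION` environment variables described in the JAX GPU memory allocation docs. `configure_jax_gpu_memory` is a hypothetical helper name, and the values are illustrative rather than recommended.

```python
import os


def configure_jax_gpu_memory(preallocate=False, mem_fraction=None):
    """Set XLA client memory options; must run before `import jax`."""
    os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "true" if preallocate else "false"
    if mem_fraction is not None:
        # Fraction of total GPU memory the XLA client may claim, e.g. 0.50
        os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = f"{mem_fraction:.2f}"


configure_jax_gpu_memory(preallocate=False)
# import jax  # import JAX only after the variables above are set
```

Another process holding GPU memory gives the same symptom, so it is also worth checking nvidia-smi for other jobs before changing any settings.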
Error:
    libGL error: failed to load driver: swrast
Solution:
In my case, rebooting and waiting for a while made it recover. I don't know why...
One solution is: (-->Source)
From the latter link:
According to online information, there is a problem with the libstdc++.so file in Anaconda (I use this commercial Python distribution). It cannot be associated with the system's driver, so we remove it and use the libstdc++ that comes with Linux instead, by creating a soft link in its place.
To solve this problem, run this in bash:
    cd miniconda3/{your env}/lib
where $USER should be your own username.
If you see
    xcb_connection_has_error() returned true
it won't affect your training, but you can use xvfb-run to avoid it.