GPU Check

Check the NVIDIA driver

[1]:
!nvidia-smi
Thu Apr 13 17:18:51 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10-4Q       On   | 00000000:00:07.0 Off |                  N/A |
| N/A   N/A    P0    N/A /  N/A |    128MiB /  3932MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
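The same driver information can be read programmatically instead of by eye. The following is a minimal sketch, assuming `nvidia-smi` is on the `PATH`; the helper names `parse_gpu_csv` and `query_gpus` are hypothetical and not part of the original notebook. It relies on the `--query-gpu` and `--format=csv,noheader` options of `nvidia-smi`.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi; the order here defines the CSV column order.
QUERY_FIELDS = ["name", "driver_version", "memory.total", "memory.used"]


def parse_gpu_csv(text):
    """Turn header-less nvidia-smi CSV output into a list of dicts (hypothetical helper)."""
    rows = []
    for line in text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        # Skip lines that do not have exactly one value per requested field.
        if len(values) == len(QUERY_FIELDS):
            rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus():
    """Return one dict per GPU, or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


for gpu in query_gpus():
    print(gpu)
```

An empty result means the driver tool could not be found or returned an error, which is usually the first thing to rule out before debugging any framework.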

Import PyTorch to check the GPU device

[2]:
import torch

Check the PyTorch version

[3]:
torch.__version__
[3]:
'2.0.0+cu117'

Check whether PyTorch can use the GPU

[4]:
torch.cuda.is_available()
[4]:
True

Check the number of GPUs visible to PyTorch

[5]:
gpu_num = torch.cuda.device_count()
print(gpu_num)
1

Check the GPU model in PyTorch

[6]:
for i in range(gpu_num):
    print('GPU {}.: {}'.format(i,torch.cuda.get_device_name(i)))
GPU 0.: NVIDIA A10-4Q
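The PyTorch checks above can be combined into a single call. This is a sketch only: `describe_torch_gpus` is a hypothetical helper name, and it returns an empty list when PyTorch is not installed or CUDA is unavailable rather than raising. It uses `torch.cuda.get_device_properties`, which also reports total memory and compute capability.

```python
def describe_torch_gpus():
    """Return one dict per CUDA device; empty if PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    gpus = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        gpus.append({
            "index": index,
            "name": props.name,
            # total_memory is reported in bytes; convert to MiB for readability.
            "total_memory_mib": props.total_memory // (1024 ** 2),
            "compute_capability": f"{props.major}.{props.minor}",
        })
    return gpus


for gpu in describe_torch_gpus():
    print(gpu)
```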

Import TensorFlow to check the GPU device

[7]:
import tensorflow as tf
2023-04-13 17:18:52.308723: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-13 17:18:52.422989: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-04-13 17:18:52.453216: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-04-13 17:18:52.918465: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvrtc.so.11.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-11.2/lib64:/home/limingbo/anaconda3/envs/tensorflow/lib:/home/limingbo/TensorRT-7.2.3.4/lib
2023-04-13 17:18:52.918653: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvrtc.so.11.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/cuda-11.2/lib64:/home/limingbo/anaconda3/envs/tensorflow/lib:/home/limingbo/TensorRT-7.2.3.4/lib
2023-04-13 17:18:52.918661: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
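Most of the lines above are informational messages from TensorFlow's C++ runtime rather than failures. If they get in the way, one option is the `TF_CPP_MIN_LOG_LEVEL` environment variable, sketched below. It has to be set before TensorFlow is imported, so in a notebook the kernel would need a restart first. It only affects the C++ log lines, not Python-level messages prefixed with `WARNING:tensorflow`.

```python
import os

# Must be set before the first `import tensorflow`.
# "0" shows everything, "1" hides INFO, "2" also hides WARNING, "3" also hides ERROR.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"

try:
    import tensorflow as tf
    print(tf.__version__)
except ImportError:
    # Keeps the sketch runnable in environments without TensorFlow.
    print("TensorFlow is not installed in this environment")
```

Level `"1"` is a conservative choice here: it keeps the `W` and `E` lines, such as the missing TensorRT libraries reported above, which may still matter for diagnosing the setup.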

Check the TensorFlow version

[8]:
tf.__version__
[8]:
'2.10.1'

Check whether TensorFlow can use the GPU

[9]:
tf.test.is_gpu_available()
WARNING:tensorflow:From /tmp/ipykernel_345917/2294581100.py:1: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.list_physical_devices('GPU')` instead.
2023-04-13 17:18:54.000431: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-13 17:18:54.001471: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-04-13 17:18:54.012039: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-04-13 17:18:54.012246: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-04-13 17:18:56.755971: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-04-13 17:18:56.756159: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-04-13 17:18:56.756317: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-04-13 17:18:56.756460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /device:GPU:0 with 1415 MB memory:  -> device: 0, name: NVIDIA A10-4Q, pci bus id: 0000:00:07.0, compute capability: 8.6
[9]:
True

Check the physical GPUs in TensorFlow

[10]:
tf.config.list_physical_devices('GPU')
2023-04-13 17:18:56.761259: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-04-13 17:18:56.761449: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-04-13 17:18:56.761587: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[10]:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

Check the GPU model in TensorFlow

[11]:
from tensorflow.python.client import device_lib

local_device_protos = device_lib.list_local_devices()
for x in local_device_protos:
    if x.device_type == 'GPU':
        print(x.physical_device_desc)
device: 0, name: NVIDIA A10-4Q, pci bus id: 0000:00:07.0, compute capability: 8.6
2023-04-13 17:18:56.765979: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-04-13 17:18:56.766168: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-04-13 17:18:56.766340: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-04-13 17:18:56.766507: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-04-13 17:18:56.766647: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:980] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-04-13 17:18:56.766791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1616] Created device /device:GPU:0 with 1415 MB memory:  -> device: 0, name: NVIDIA A10-4Q, pci bus id: 0000:00:07.0, compute capability: 8.6
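The cell above reads the model name through `tensorflow.python.client.device_lib`, which is not part of the public API, and its log shows that it also creates the GPU device. A possible alternative using only `tf.config` is sketched below. `describe_tf_gpus` is a hypothetical helper name, and the keys returned by `tf.config.experimental.get_device_details` may be absent on some platforms, hence the `.get` calls.

```python
def describe_tf_gpus():
    """Return one dict per physical GPU; empty if TensorFlow is missing or sees no GPU."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    gpus = []
    for device in tf.config.list_physical_devices("GPU"):
        # Returns a dict that may contain 'device_name' and 'compute_capability'.
        details = tf.config.experimental.get_device_details(device)
        gpus.append({
            "name": device.name,
            "device_name": details.get("device_name"),
            "compute_capability": details.get("compute_capability"),
        })
    return gpus


for gpu in describe_tf_gpus():
    print(gpu)
```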

Import JAX and jaxlib to check the GPU device

[12]:
import jax
import jaxlib

Check the jax and jaxlib versions

[13]:
jax.__version__
[13]:
'0.4.8'
[14]:
jaxlib.__version__
[14]:
'0.4.7'

List all devices of the default JAX backend

[15]:
jax.devices()
[15]:
[StreamExecutorGpuDevice(id=0, process_index=0, slice_index=0)]

Check the total number of JAX devices

[16]:
jax.device_count()
[16]:
1

Check the number of JAX processes associated with the backend

[17]:
jax.process_count()
[17]:
1

Check the platform of the jaxlib backend

[18]:
from jax.lib import xla_bridge
print(xla_bridge.get_backend().platform)
gpu
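The `jax.lib.xla_bridge` import works with the versions shown here, but it is an internal module and may not be available in other JAX releases. A sketch of the same check through the public API follows, using `jax.default_backend()` together with the `platform` and `device_kind` attributes of each device. `describe_jax_devices` is a hypothetical helper name, and it returns `None` when JAX is not installed.

```python
def describe_jax_devices():
    """Return the default backend name and per-device info, or None without JAX."""
    try:
        import jax
    except ImportError:
        return None
    return {
        # Name of the default backend platform as a string.
        "backend": jax.default_backend(),
        "devices": [
            {"id": d.id, "platform": d.platform, "device_kind": d.device_kind}
            for d in jax.devices()
        ],
    }


print(describe_jax_devices())
```

If the reported backend is `cpu` on a machine with a working GPU driver, the installed jaxlib build is likely a CPU-only wheel.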

Import SecretFlow to verify that no errors are reported in this environment

[19]:
import secretflow
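The import checks in this notebook can also be run in one pass, so that a failure in one package does not stop the rest. The sketch below is an illustration under that assumption; `check_imports` is a hypothetical helper, not a SecretFlow API, and it reports either the `__version__` attribute of each module or the error raised while importing it.

```python
import importlib


def check_imports(module_names):
    """Try to import each module and map its name to a version or an error string."""
    report = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception as exc:  # broad on purpose: report any import-time failure
            report[name] = f"FAILED: {type(exc).__name__}: {exc}"
        else:
            report[name] = getattr(module, "__version__", "unknown version")
    return report


packages = ["torch", "tensorflow", "jax", "jaxlib", "secretflow"]
for name, status in check_imports(packages).items():
    print(f"{name}: {status}")
```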