This article covers removing the GPU driver on TrueNAS, installing the vGPU driver, and installing a specific version of the Linux driver.

Removing the driver

Use dpkg -l | grep nvidia to list the installed Nvidia packages that need to be removed, then remove them with dpkg --purge.

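Update:

From TrueNAS-SCALE-24.10.* onward, TrueNAS no longer bundles the Nvidia driver; it is downloaded and installed automatically over the network.

In the end, only the following packages should remain: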

dpkg -l | grep nvidia

root@XIMCloudNAS[~]# dpkg -l | grep nvidia                                                                   
ii  libnvidia-container-tools                     1.13.4-1                       amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                    1.13.4-1                       amd64        NVIDIA container runtime library
ii  nvidia-container-runtime                      3.13.0-1                       all          NVIDIA container runtime
ii  nvidia-container-toolkit                      1.13.4-1                       amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base                 1.13.4-1                       amd64        NVIDIA Container Toolkit Base
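
The cleanup above can be sketched as a small filter over dpkg -l output. This is a minimal sketch, not the exact procedure used here: the function names nvidia_pkgs_to_purge and purge_nvidia_pkgs are hypothetical, and the rule "keep every package with container in its name" is an assumption inferred from the list of packages that remain.

```shell
#!/usr/bin/env bash
# Read `dpkg -l` output on stdin and print the names of installed ("ii")
# nvidia packages that are not part of the container runtime set.
# Assumption: every package to keep has "container" in its name.
nvidia_pkgs_to_purge() {
    awk '$1 == "ii" && $2 ~ /nvidia/ && $2 !~ /container/ { print $2 }'
}

# Purge the selected packages in a single dpkg call (run as root).
# Hypothetical helper; review the list before calling it.
purge_nvidia_pkgs() {
    dpkg -l | nvidia_pkgs_to_purge | xargs -r dpkg --purge
}
```

Running dpkg -l | nvidia_pkgs_to_purge on its own prints the candidates without changing anything, which is a reasonable first step before calling purge_nvidia_pkgs.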

Configuring an apt proxy

Configure an apt proxy to speed up downloads:

vim /etc/apt/apt.conf.d/proxy.conf

Acquire::http::Proxy "http://192.168.5.1:1088/";
Acquire::https::Proxy "http://192.168.5.1:1088/";
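
If you prefer not to edit the file by hand, the same two lines can be written by a small helper. This is a sketch only: write_apt_proxy is a hypothetical function name, and the second argument exists so the output can be tried on a scratch file first.

```shell
#!/usr/bin/env bash
# Write the two Acquire::*::Proxy lines for a given proxy URL.
# Arguments: proxy URL, optional target file (defaults to the path used above).
write_apt_proxy() {
    local proxy_url="$1"
    local conf_file="${2:-/etc/apt/apt.conf.d/proxy.conf}"
    printf 'Acquire::http::Proxy "%s";\nAcquire::https::Proxy "%s";\n' \
        "$proxy_url" "$proxy_url" > "$conf_file"
}
```

For example, write_apt_proxy "http://192.168.5.1:1088/" reproduces the file shown above, and apt-config dump | grep -i proxy should then list both settings.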

Before TrueNAS-SCALE-24.*, make the apt binaries executable:

chmod +x /usr/bin/apt /usr/bin/apt-get /usr/bin/apt-key

On TrueNAS-SCALE-24.10.* and later, run:

install-dev-tools
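
When it is unclear which of the two cases applies, one option is to check whether install-dev-tools exists rather than parse the version string. The sketch below assumes that the presence of the command on PATH is a good enough signal; apt_enable_method is a hypothetical name.

```shell
#!/usr/bin/env bash
# Print which of the two methods applies on this system.
# Assumption: if install-dev-tools is on PATH, it is the supported way;
# otherwise fall back to the chmod approach.
apt_enable_method() {
    if command -v install-dev-tools >/dev/null 2>&1; then
        echo "install-dev-tools"
    else
        echo "chmod"
    fi
}
```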

Installing the vGPU driver

Download: Nvidia vGPU driver

./NVIDIA-Linux-x86_64-535.161.07-grid.run --tmpdir /tmp

After the installation finishes, confirm it with nvidia-smi.
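
To go one step beyond eyeballing nvidia-smi, the loaded driver version can be compared with the version encoded in the installer file name. This is a sketch under the assumption that the file keeps NVIDIA's usual NVIDIA-Linux-x86_64-<version>...run naming; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Extract the driver version from an NVIDIA .run installer file name,
# e.g. NVIDIA-Linux-x86_64-535.161.07-grid.run gives 535.161.07
runfile_version() {
    basename "$1" |
        sed -E 's/^NVIDIA-Linux-x86_64-([0-9]+(\.[0-9]+)+).*\.run$/\1/'
}

# Compare the installer version with what the loaded driver reports.
# Returns 0 when they match.
driver_matches_runfile() {
    local expected loaded
    expected=$(runfile_version "$1")
    loaded=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    [ "$expected" = "$loaded" ]
}
```

For example, driver_matches_runfile ./NVIDIA-Linux-x86_64-535.161.07-grid.run && echo "driver OK".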

Configuring the gridd service

cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf

vim /etc/nvidia/gridd.conf

# /etc/nvidia/gridd.conf.template - Configuration file for vGPU Licensing Daemon

# This is a template for the configuration file for vGPU Licensing Daemon.
# For details on the file format, please refer to the nvidia-gridd(1)
# man page.

# Description: Set License Server Address
# Data type: string
# Format:  "<address>"
ServerAddress=nvidia-dls address

# Description: Set License Server port number
# Data type: integer
# Format:  <port>, default is 7070
ServerPort=443
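
The two settings above can also be applied without opening an editor. The helper below is a minimal sketch: set_gridd_option is a hypothetical name, and it assumes the value contains no "|", "&" or backslash characters, since those are special in the sed replacement.

```shell
#!/usr/bin/env bash
# Set KEY=VALUE in a gridd.conf-style file: rewrite an existing (possibly
# commented-out) KEY= line, or append one if the key is absent.
# Assumption: VALUE contains no "|", "&" or "\" characters.
set_gridd_option() {
    local key="$1"
    local value="$2"
    local file="${3:-/etc/nvidia/gridd.conf}"
    if grep -Eq "^#?[[:space:]]*${key}=" "$file"; then
        sed -E "s|^#?[[:space:]]*${key}=.*|${key}=${value}|" "$file" > "${file}.tmp" &&
            mv "${file}.tmp" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}
```

For example, set_gridd_option ServerAddress dls.example.internal followed by set_gridd_option ServerPort 443, where dls.example.internal stands in for your own license server address.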

Reboot the host and check the license with nvidia-smi -q; if an expiry time is shown, the license is active.
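
The license check can be scripted by pulling the License Status field out of the nvidia-smi -q output. This is a sketch that assumes the field is printed as "License Status : Licensed (Expiry: <date>)"; the exact wording may differ between driver releases, and both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Print the value of the "License Status" field from `nvidia-smi -q`
# output read on stdin.
# Assumption: the field looks like "License Status : Licensed (Expiry: ...)".
license_status() {
    sed -n 's/^[[:space:]]*License Status[[:space:]]*:[[:space:]]*//p' | head -n 1
}

# Succeeds only when the status mentions an expiry, which is the signal
# used here for a working license.
license_has_expiry() {
    license_status | grep -q 'Expiry'
}
```

For example, nvidia-smi -q | license_has_expiry && echo "licensed".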

Issues you may run into on TrueNAS-SCALE-24.10.* and later

Q1：

[EFAULT] Command /root/tmpj_sx5af4/NVIDIA-Linux-x86_64-550.127.05-no-compat32.run --tmpdir /root/tmpj_sx5af4 -s failed (code 1):
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.127.05....
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release. Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

The automatic driver download by TrueNAS failed. Check whether the GPU is a vGPU, or check the network.
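
The error text itself points at two things worth checking: whether nouveau is loaded, and the end of /var/log/nvidia-installer.log. A minimal sketch of both checks follows; the function names are hypothetical and the default of 40 log lines is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Succeeds if `lsmod` output on stdin shows the nouveau module loaded.
nouveau_loaded() {
    awk '$1 == "nouveau" { found = 1 } END { exit !found }'
}

# Print the last lines of the installer log named in the error message.
# Argument: number of lines (defaults to 40).
installer_log_tail() {
    tail -n "${1:-40}" /var/log/nvidia-installer.log
}
```

For example, lsmod | nouveau_loaded && echo "nouveau is loaded", then installer_log_tail to read the 'Kernel module load error' entries.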

Q2:

TrueNAS 24.04 disables the apt command. To turn it back on, see https://www.truenas.com/community/threads/no-apt-after-update-to-release.99579/post-808108

In DragonFish you can enable apt / toggle "developer" mode by running the command "install-dev-tools" or /usr/local/libexec/disable-rootfs-protection.
This makes the boot device read-write and sets an internal flag so that we know the base install has been altered (helps for triaging bug reports).

Q3:

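Configuring a containerd registry mirror on TrueNAS 23.10:
Create registries.yaml under /etc/rancher/k3s with the following content:
mirrors:
  "docker.io":
    endpoint:
      - "https://docker.nju.edu.cn/"  # mirror address; I use the Nanjing University open-source mirror site
      - "https://registry-1.docker.io"
Then restart the k3s service: systemctl restart k3s.service
Confirmed to survive a reboot.
It will probably need to be reconfigured after an upgrade; this is not yet verified.
Reference: https://www.cnblogs.com/rancherlabs/p/14324469.html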
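
Since the file may have to be recreated after an upgrade, it can help to keep the step as a small function. This is a sketch only: write_k3s_registries is a hypothetical name, and the optional second argument lets you write to a scratch path first.

```shell
#!/usr/bin/env bash
# Write a k3s registries.yaml that lists a docker.io mirror first and the
# upstream registry second. Arguments: mirror URL, optional target file.
write_k3s_registries() {
    local mirror="$1"
    local file="${2:-/etc/rancher/k3s/registries.yaml}"
    cat > "$file" <<EOF
mirrors:
  "docker.io":
    endpoint:
      - "$mirror"
      - "https://registry-1.docker.io"
EOF
}
```

For example, write_k3s_registries "https://docker.nju.edu.cn/" followed by systemctl restart k3s.service.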

Q3:

[EFAULT] Failed to render compose templates: base_v1_1_4.utils.TemplateException: Expected [uuid] to be set for GPU inslot [0000:02:00.0] in [nvidia_gpu_selection]

The GPU UUID is not configured.

Run:

midclt call -job docker.update '{"nvidia": true}'

https://ixsystems.atlassian.net/browse/NAS-132086



midclt call app.gpu_choices | jq

'{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"PCI_SLOT": {"use_gpu": true, "uuid": "GPU_UUID"}}}}}}'

Now, for each app that hits the error above, edit the following command before running it:

- Replace APP_NAME with the name you entered for the application (example: "plex")
- Replace PCI_SLOT with the PCI slot from the error (example: "0000:2d:00.0")
- Replace GPU_UUID with the UUID retrieved from the midclt call app.gpu_choices command above that matches the PCI slot.

midclt call -job app.update APP_NAME '{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"PCI_SLOT": {"use_gpu": true, "uuid": "GPU_UUID"}}}}}}'
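
Substituting the placeholders by hand is easy to get wrong because of the nested braces, so the payload can be generated instead. The sketch below builds the same JSON with printf; gpu_payload is a hypothetical helper, and it assumes the slot and UUID contain no characters that need JSON escaping.

```shell
#!/usr/bin/env bash
# Build the app.update payload for one GPU.
# Arguments: PCI slot (e.g. 0000:2d:00.0) and the matching GPU UUID.
# Assumption: neither value contains quotes or backslashes.
gpu_payload() {
    local pci_slot="$1"
    local gpu_uuid="$2"
    printf '{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"%s": {"use_gpu": true, "uuid": "%s"}}}}}}' \
        "$pci_slot" "$gpu_uuid"
}
```

For example, midclt call -job app.update plex "$(gpu_payload 0000:2d:00.0 GPU_UUID)", with GPU_UUID replaced by the value reported for that slot.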

Q4:

root@XIMCloudNAS[...loudMassStorage/XIMCloudSharedStorage]# install-dev-tools 
+ FORCE_ARG=
+ [[ '' == \-\-\f\o\r\c\e ]]
+ [[ ! -S /var/run/middleware/middlewared.sock ]]
+ PACKAGES=(make open-iscsi python3-cryptography python3-pip python3-pyfakefs python3-pyotp python3-pytest python3-pytest-asyncio python3-pytest-dependency python3-pytest-rerunfailures python3-pytest-timeout snmp sshpass zstd)
+ PIP_PACKAGES=()
+ '[' -f /usr/local/libexec/disable-rootfs-protection ']'
+ /usr/local/libexec/disable-rootfs-protection
/usr is currently provided by a readonly systemd system extension. This may occur if nvidia module support is enabled. System extensions must be disabled prior to disabling rootfs protection.

Solution

$ sudo systemd-sysext unmerge
$ sudo install-dev-tools
$ sudo systemd-sysext merge

https://forums.truenas.com/t/is-install-dev-tools-broken-in-24-10-2/28673/3
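
If install-dev-tools fails midway, the system extensions stay unmerged until you merge them again. One way to guard against that is a wrapper that always re-merges. This is a sketch: with_sysext_unmerged is a hypothetical name, and the SYSEXT variable exists only so the wrapper can be tried against a stand-in command.

```shell
#!/usr/bin/env bash
# Run a command with system extensions unmerged, then merge them again
# even if the command fails. Returns the command's exit status, or the
# merge's exit status if the merge itself fails.
# SYSEXT overrides the systemd-sysext binary (for dry runs only).
with_sysext_unmerged() {
    local sysext="${SYSEXT:-systemd-sysext}"
    local rc=0
    "$sysext" unmerge || return 1
    "$@" || rc=$?
    "$sysext" merge || rc=$?
    return "$rc"
}
```

For example, run as root: with_sysext_unmerged install-dev-tools.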
