This post covers removing the GPU driver on TrueNAS, installing the vGPU driver, and installing a specific version of the Linux driver.


Removing the driver

Run dpkg -l | grep nvidia to list the installed NVIDIA packages that need to be removed, then remove them with dpkg --purge.


Update:

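Starting with TrueNAS-SCALE-24.10.*, TrueNAS no longer bundles the NVIDIA driver; it is downloaded and installed automatically over the network instead.


In the end, keep only the following packages: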

dpkg -l | grep nvidia

root@XIMCloudNAS[~]# dpkg -l | grep nvidia                                                                   
ii  libnvidia-container-tools                     1.13.4-1                       amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                    1.13.4-1                       amd64        NVIDIA container runtime library
ii  nvidia-container-runtime                      3.13.0-1                       all          NVIDIA container runtime
ii  nvidia-container-toolkit                      1.13.4-1                       amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base                 1.13.4-1                       amd64        NVIDIA Container Toolkit Base
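
The cleanup described above can be sketched as a small shell helper. This is only a sketch under assumptions: the function name list_purgeable_nvidia_packages is hypothetical, and it assumes that every package whose name contains "container" belongs to the keep list shown above. It reads dpkg -l output on standard input and prints the package names that could then be passed to dpkg --purge.

```shell
# Hypothetical helper: read `dpkg -l` output on stdin and print the installed
# NVIDIA packages that are NOT part of the container toolkit keep list.
list_purgeable_nvidia_packages() {
    awk '$1 == "ii" && $2 ~ /nvidia/ && $2 !~ /container/ { print $2 }'
}

# Example usage (review the list before purging anything):
#   dpkg -l | list_purgeable_nvidia_packages
#   dpkg -l | list_purgeable_nvidia_packages | xargs -r dpkg --purge
```

Printing the list first and purging in a second step makes it easier to catch a package that should have stayed.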

Configuring an apt proxy


Configure an apt proxy to speed up downloads:

vim /etc/apt/apt.conf.d/proxy.conf

Acquire::http::Proxy "http://192.168.5.1:1088/";
Acquire::https::Proxy "http://192.168.5.1:1088/";
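
If you prefer not to edit the file by hand, the same two lines can be written by a script. A minimal sketch, assuming a hypothetical helper name write_apt_proxy_conf; the proxy URL is the one from this post and should be replaced with your own.

```shell
# Hypothetical helper: write an apt proxy configuration.
#   $1 = proxy URL
#   $2 = target file (defaults to the path used in this post)
write_apt_proxy_conf() (
    proxy_url=$1
    conf_path=${2:-/etc/apt/apt.conf.d/proxy.conf}
    # Both http and https traffic go through the same proxy.
    printf 'Acquire::http::Proxy "%s";\nAcquire::https::Proxy "%s";\n' \
        "$proxy_url" "$proxy_url" > "$conf_path"
)

# Example usage:
#   write_apt_proxy_conf "http://192.168.5.1:1088/"
```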


Before TrueNAS-SCALE-24.*, make the apt binaries executable:

chmod +x /usr/bin/apt /usr/bin/apt-get /usr/bin/apt-key


On TrueNAS-SCALE-24.10.* and later, run:

install-dev-tools


Installing the vGPU driver

Download: NVIDIA vGPU driver

./NVIDIA-Linux-x86_64-535.161.07-grid.run --tmpdir /tmp

Once the installation finishes, run nvidia-smi to confirm that the driver works.

Configuring the gridd service


cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf

vim /etc/nvidia/gridd.conf



# /etc/nvidia/gridd.conf.template - Configuration file for vGPU Licensing Daemon

# This is a template for the configuration file for vGPU Licensing Daemon.
# For details on the file format, please refer to the nvidia-gridd(1)
# man page.

# Description: Set License Server Address
# Data type: string
# Format:  "<address>"
ServerAddress=<nvidia-dls address>

# Description: Set License Server port number
# Data type: integer
# Format:  <port>, default is 7070
ServerPort=443
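
The two settings above can also be applied non-interactively. The following is a sketch, not part of the NVIDIA tooling: set_gridd_option is a hypothetical helper that replaces an existing (possibly commented-out) key=value line or appends one if the key is missing.

```shell
# Hypothetical helper: set KEY=VALUE in a gridd.conf style file.
#   $1 = key, $2 = value, $3 = config file
set_gridd_option() (
    key=$1
    value=$2
    file=$3
    tmp=$(mktemp)
    awk -v k="$key" -v v="$value" '
        # Replace the first "KEY=" line, whether or not it is commented out.
        !seen && $0 ~ ("^#?[ \t]*" k "=") { print k "=" v; seen = 1; next }
        { print }
        # Append the setting if the key was not found at all.
        END { if (!seen) print k "=" v }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
)

# Example usage (replace the address placeholder with your DLS server):
#   set_gridd_option ServerAddress "<nvidia-dls address>" /etc/nvidia/gridd.conf
#   set_gridd_option ServerPort 443 /etc/nvidia/gridd.conf
```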


Reboot the host and check the license with nvidia-smi -q; if an expiry time is shown, licensing is working.
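
The license check can be scripted as well. This sketch assumes that nvidia-smi -q prints a "License Status" line that contains the word "Expiry" once a license has been acquired; that format is an assumption here, so verify it against the output of your driver version. The helper name has_license_expiry is hypothetical.

```shell
# Hypothetical helper: read `nvidia-smi -q` output on stdin and succeed only
# if a "License Status" line reports an expiry time.
has_license_expiry() {
    grep 'License Status' | grep -q 'Expiry'
}

# Example usage:
#   nvidia-smi -q | has_license_expiry && echo "vGPU license acquired"
```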

Troubleshooting problems you may hit on TrueNAS-SCALE-24.10.* and later

Q1:

[EFAULT] Command /root/tmpj_sx5af4/NVIDIA-Linux-x86_64-550.127.05-no-compat32.run --tmpdir /root/tmpj_sx5af4 -s failed (code 1):
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 550.127.05...
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources, with a version of gcc that differs from the one used to build the target kernel, or if another driver, such as nouveau, is present and prevents the NVIDIA kernel module from obtaining ownership of the NVIDIA device(s), or no NVIDIA device installed in this system is supported by this NVIDIA Linux graphics driver release. Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.
ERROR: Installation has failed. Please see the file '/var/log/nvidia-installer.log' for details. You may find suggestions on fixing installation problems in the README available on the Linux driver download page at www.nvidia.com.

TrueNAS failed to download and install the driver. Check whether the GPU is a vGPU, or check the network connection.


Q2:

TrueNAS 24.04 disables the apt command, so it has to be re-enabled as described in https://www.truenas.com/community/threads/no-apt-after-update-to-release.99579/post-808108

In DragonFish you can enable apt / toggle "developer" mode by running the command "install-dev-tools" or /usr/local/libexec/disable-rootfs-protection.
This makes the boot device read-write and sets an internal flag so that we know the base install has been altered (helps for triaging bug reports).



Q3:

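Configuring a containerd registry mirror on TrueNAS 23.10
Create registries.yaml under /etc/rancher/k3s with the following content:
mirrors:
  "docker.io":
    endpoint:
      - "https://docker.nju.edu.cn/"  # mirror address; I use the Nanjing University open-source mirror
      - "https://registry-1.docker.io"
Then restart the k3s service: systemctl restart k3s.service
The setting is confirmed to survive a reboot.
It will probably need to be reconfigured after an upgrade; this has not been verified yet.
Reference: https://www.cnblogs.com/rancherlabs/p/14324469.html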



Q4:

[EFAULT] Failed to render compose templates: base_v1_1_4.utils.TemplateException: Expected [uuid] to be set for GPU inslot [0000:02:00.0] in [nvidia_gpu_selection]


The GPU UUID has not been configured.

Run:

midclt call -job docker.update '{"nvidia": true}'



https://ixsystems.atlassian.net/browse/NAS-132086



midclt call app.gpu_choices | jq

'{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"PCI_SLOT": {"use_gpu": true, "uuid": "GPU_UUID"}}}}}}'

Then, for each app that shows the error above, run the command below after making these substitutions:

- Replace APP_NAME with the name you gave the application (for example "plex").
- Replace PCI_SLOT with the PCI slot from the error message (for example "0000:2d:00.0").
- Replace GPU_UUID with the UUID returned by the command above for that PCI slot.

midclt call -job app.update APP_NAME '{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"PCI_SLOT": {"use_gpu": true, "uuid": "GPU_UUID"}}}}}}'
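
Substituting the placeholders by hand is error-prone because of the nested braces. A small sketch that builds the payload from two arguments; build_gpu_payload is a hypothetical helper name, and the JSON layout is copied from the command above.

```shell
# Hypothetical helper: print the app.update payload for a single GPU.
#   $1 = PCI slot (for example 0000:02:00.0)
#   $2 = GPU UUID as reported by `midclt call app.gpu_choices`
build_gpu_payload() {
    printf '{"values": {"resources": {"gpus": {"use_all_gpus": false, "nvidia_gpu_selection": {"%s": {"use_gpu": true, "uuid": "%s"}}}}}}\n' "$1" "$2"
}

# Example usage (APP_NAME, PCI_SLOT and GPU_UUID are placeholders):
#   midclt call -job app.update APP_NAME "$(build_gpu_payload PCI_SLOT GPU_UUID)"
```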

Q5:

root@XIMCloudNAS[...loudMassStorage/XIMCloudSharedStorage]# install-dev-tools 
+ FORCE_ARG=
+ [[ '' == \-\-\f\o\r\c\e ]]
+ [[ ! -S /var/run/middleware/middlewared.sock ]]
+ PACKAGES=(make open-iscsi python3-cryptography python3-pip python3-pyfakefs python3-pyotp python3-pytest python3-pytest-asyncio python3-pytest-dependency python3-pytest-rerunfailures python3-pytest-timeout snmp sshpass zstd)
+ PIP_PACKAGES=()
+ '[' -f /usr/local/libexec/disable-rootfs-protection ']'
+ /usr/local/libexec/disable-rootfs-protection
/usr is currently provided by a readonly systemd system extension. This may occur if nvidia module support is enabled. System extensions must be disabled prior to disabling rootfs protection.


Solution

$ sudo systemd-sysext unmerge
$ sudo install-dev-tools
$ sudo systemd-sysext merge
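
The three steps can be wrapped so that the system extensions are merged again even when install-dev-tools fails. This is a sketch under assumptions: the wrapper name install_dev_tools_with_sysext and the DRY_RUN variable are hypothetical, and the wrapper is meant to be run as root.

```shell
# Hypothetical wrapper around the three commands above.
# Set DRY_RUN=1 to print the commands instead of running them.
install_dev_tools_with_sysext() (
    run=${DRY_RUN:+echo}
    # Stop early if the extensions cannot be unmerged.
    $run systemd-sysext unmerge || exit 1
    $run install-dev-tools
    status=$?
    # Always merge again so the NVIDIA extension is restored.
    $run systemd-sysext merge
    exit "$status"
)

# Example usage:
#   DRY_RUN=1 install_dev_tools_with_sysext   # preview only
#   install_dev_tools_with_sysext             # run for real
```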


https://forums.truenas.com/t/is-install-dev-tools-broken-in-24-10-2/28673/3



Comments

  7. usami mizugi (author) wrote:

    {
        "data-root": "/mnt/.ix-apps/docker",
        "exec-opts": ["native.cgroupdriver=cgroupfs"],
        "iptables": true,
        "storage-driver": "overlay2",
        "default-address-pools": [{
            "base": "172.17.0.0/12",
            "size": 24
        }],
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        },
        "registry-mirrors": [
            "https://docker.1panel.live"
        ],
        "default-runtime": "nvidia"
    }

  9. usami mizugi (author) wrote:

    {
        "live-restore": true,
        "proxies": {
            "http-proxy": "http://192.168.5.1:1088",
            "no-proxy": "localhost,127.0.0.0/8",
            "https-proxy": "http://192.168.5.1:1088"
        }
    }
