Preparing nodes for GPU virtualization

Before configuring GPU virtualization, you need to check whether your NVIDIA graphics card supports SR-IOV. The SR-IOV technology enables splitting a single physical device (physical function) into several virtual devices (virtual functions).

  • Legacy GPUs are based on the NVIDIA Tesla architecture and have no SR-IOV support. For such GPUs, virtualization is performed by creating a mediated device (mdev) over the physical function.
  • Modern GPUs are based on the NVIDIA Ampere architecture or newer and support SR-IOV. For such GPUs, virtualization is performed by creating an mdev over a virtual function.
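
To find out which category your card falls into, you can check whether the device reports the SR-IOV PCI capability. This is a minimal check that assumes the example PCI address 0000:01:00.0 (see how to obtain it below); a card without SR-IOV support produces no output:

# lspci -s 0000:01:00.0 -vvv | grep -i "sr-iov"

If the card supports SR-IOV, the output contains a capability line such as "Single Root I/O Virtualization (SR-IOV)".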

For vGPU to work, enable it on the node by installing the NVIDIA vGPU kernel module, and then enable IOMMU. If you are using a modern GPU that is based on the NVIDIA Ampere architecture or newer, you also need to enable the virtual functions for the GPU. For more details, refer to the official NVIDIA documentation.

Note that if you want to virtualize a GPU that was previously detached from the node for GPU passthrough, you need to additionally modify the GRUB configuration file.
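
The exact change depends on how passthrough was configured. As an illustration only: if the GPU was bound to the pci-stub or vfio-pci driver through a kernel parameter, remove that parameter (for example, pci-stub.ids=10de:1eb8, where 10de:1eb8 is just a sample vendor and device ID) from the GRUB_CMDLINE_LINUX line in /etc/default/grub, then regenerate the GRUB configuration and reboot:

# grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot

On EFI systems, the path to grub.cfg may differ.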

To obtain the GPU PCI address

List all graphics cards on the node and obtain their PCI addresses:

# lspci -D | grep NVIDIA
0000:01:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
0000:81:00.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

In the command output, 0000:01:00.0 and 0000:81:00.0 are the PCI addresses of the graphics cards.
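
If you also need the numeric vendor and device IDs of a card (for example, to clean up passthrough-related kernel parameters), add the -nn option. The output below is an illustration for a Tesla T4:

# lspci -Dnn | grep NVIDIA
0000:01:00.0 3D controller [0302]: NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8] (rev a1)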

To enable vGPU on a node

  1. On the node with the physical GPU, install the vGPU NVIDIA driver:

    1. Install the kernel-devel and dkms packages:

      # dnf install kernel-devel dkms
      
    2. Enable and start the dkms service:

      # systemctl enable dkms.service 
      # systemctl start dkms.service
    3. Install the vGPU KVM kernel module from the NVIDIA GRID package with the --dkms option:

      # bash NVIDIA-Linux-x86_64-xxx.xx.xx-vgpu-kvm*.run --dkms
    4. Re-create the Linux boot image by running:

      # dracut -f
  2. Enable IOMMU on the node:

    1. Run the pci-helper.py enable-iommu script:

      # /usr/libexec/vstorage-ui-agent/bin/pci-helper.py enable-iommu

      The script works for both Intel and AMD processors.

    2. Reboot the node to apply the changes:

      # reboot
    3. Check that IOMMU is successfully enabled in the dmesg output:

      # dmesg | grep -e DMAR -e IOMMU
      [    0.000000] DMAR: IOMMU enabled
  3. [For modern GPUs with SR-IOV support] Enable the virtual functions for your GPU:

    # /usr/libexec/vstorage-ui-agent/bin/pci-helper.py nvidia-sriov-mgr --enable
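
    After enabling the virtual functions, you can check that they have appeared in sysfs. This is a minimal check that assumes the example PCI address 0000:c1:00.0; the number of virtual functions depends on the GPU model:

    # cat /sys/bus/pci/devices/0000:c1:00.0/sriov_numvfs
    # ls -d /sys/bus/pci/devices/0000:c1:00.0/virtfn*

    The first command prints the number of enabled virtual functions, and each virtfn<N> entry corresponds to one virtual function that an mdev can be created on.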

To check that vGPU is enabled

  • [For legacy GPUs without SR-IOV support] Check the /sys/bus/pci/devices/<pci_address>/mdev_supported_types directory. For example, for the GPU with the PCI address 0000:01:00.0, run:

    # ls /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types
    nvidia-222  nvidia-223  nvidia-224  nvidia-225  nvidia-226  nvidia-227  nvidia-228  nvidia-229  nvidia-230  nvidia-231
    nvidia-232  nvidia-233  nvidia-234  nvidia-252  nvidia-319  nvidia-320  nvidia-321

    For a vGPU-enabled card, the directory contains a list of supported vGPU types. A vGPU type is a vGPU configuration that defines the vRAM size, maximum resolution, maximum number of supported vGPUs, and other parameters. (You can inspect the parameters of a particular type as shown after this list.)

  • [For modern GPUs with SR-IOV support] Check supported vGPU types and the number of available instances per vGPU type. For example, for the GPU with the PCI address 0000:c1:00.0, run:

    # cd /sys/bus/pci/devices/0000:c1:00.0/virtfn0/mdev_supported_types
    # grep -vR --include=available_instances 0
    nvidia-568/available_instances:1
    nvidia-558/available_instances:1
    nvidia-556/available_instances:1

    In the command output, the supported types are nvidia-568, nvidia-558, and nvidia-556, and each virtual function can host one vGPU instance.
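
For both legacy and modern GPUs, each vGPU type directory contains files that describe the type. As a sketch, assuming the legacy example above and an arbitrary type from your own listing (nvidia-222 here), you can read the human-readable type name, its configuration details, and the number of instances that can still be created:

# cat /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/nvidia-222/name
# cat /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/nvidia-222/description
# cat /sys/bus/pci/devices/0000:01:00.0/mdev_supported_types/nvidia-222/available_instances

The description file lists parameters such as the frame buffer size, the maximum resolution, and the maximum number of vGPU instances for that type.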