Enabling PCI passthrough and vGPU support

To enable PCI passthrough and vGPU support for the compute cluster, create a configuration file in YAML format, and then use it to reconfigure the compute cluster.

To create the PCI passthrough and vGPU configuration file

Specify the identifier (node_id) of a compute node that hosts PCI devices, and then add the host devices that you want to pass through or virtualize (for help identifying devices, see the lookup commands after this list):

  • To create virtual functions for a network adapter, add these lines:

    - device_type: sriov
      device: enp2s0
      physical_network: sriovnet
      num_vfs: 8

    where:

    • sriov is the device type for a network adapter
    • enp2s0 is the device name of a network adapter
    • sriovnet is an arbitrary name that will be used as an alias for a network adapter
    • num_vfs is the number of virtual functions to create for a network adapter

    The maximum number of virtual functions supported by a PCI device is specified in the /sys/class/net/<device_name>/device/sriov_totalvfs file. For example:

    # cat /sys/class/net/enp2s0/device/sriov_totalvfs
    63
  • To enable GPU passthrough, add these lines:

    - device_type: generic
      device: 1b36:0100
      alias: gpu

    where:

    • generic is the device type for a physical GPU that will be passed through
    • 1b36:0100 is the vendor and product ID (VID:PID) pair of a physical GPU
    • gpu is an arbitrary name that will be used as an alias for a physical GPU
  • To enable vGPU, add these lines:

    - device_type: pgpu
      device: "0000:03:00.0"
      vgpu_type: nvidia-224

    where:

    • pgpu is the device type for a physical GPU that will be virtualized
    • "0000:03:00.0" is the PCI address of a physical GPU
    • nvidia-224 is the vGPU type that will be enabled for a physical GPU
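
If you are not sure which device names, identifiers, or addresses to use, standard Linux tools can help you look them up. The commands below are a sketch; the sample output is illustrative and will differ on your hardware. The sysfs glob lists SR-IOV-capable network adapters, and lspci shows the VID:PID pair and PCI address of each GPU:

# ls /sys/class/net/*/device/sriov_totalvfs
/sys/class/net/enp2s0/device/sriov_totalvfs
# lspci -nn | grep -i nvidia
03:00.0 3D controller [0302]: NVIDIA Corporation TU104GL [Tesla T4] [10de:1eb8] (rev a1)

In the lspci output, 03:00.0 is the PCI address (written as "0000:03:00.0" in the configuration file, with the PCI domain prefix), and [10de:1eb8] is the VID:PID pair.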

The entire configuration file may look as follows:

# cat config.yaml
- node_id: c3b2321a-7c12-8456-42ce-8005ff937e12
  devices:
    - device_type: sriov
      device: enp2s0
      physical_network: sriovnet
      num_vfs: 8
    - device_type: generic
      device: 1b36:0100
      alias: gpu
    - device_type: pgpu
      device: "0000:01:00.0"
      vgpu_type: nvidia-232
- node_id: 1d6481c2-1fd5-406b-a0c7-330f24bd0e3d
  devices:
    - device_type: generic
      device: 10de:1eb8
      alias: gpu
    - device_type: pgpu
      device: "0000:03:00.0"
      vgpu_type: nvidia-224
    - device_type: pgpu
      device: "0000:81:00.0"
      vgpu_type: nvidia-228
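
Because the file is parsed as YAML, you can sanity-check its syntax before applying it. This optional step assumes that Python 3 with the PyYAML module is available on the node:

# python3 -c 'import yaml; yaml.safe_load(open("config.yaml")); print("OK")'
OK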

To configure the compute cluster for PCI passthrough and vGPU support

Pass the configuration file to the vinfra service compute set command. For example:

# vinfra service compute set --pci-passthrough-config config.yaml
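
After the command completes, you can verify on the node that the requested virtual functions were created. This is an optional check; the device name follows the earlier example:

# cat /sys/class/net/enp2s0/device/sriov_numvfs
8
# lspci | grep -i "virtual function"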

If the compute configuration fails

Check whether the following error appears in /var/log/vstorage-ui-backend/ansible.log:

2021-09-23 16:42:59,796 p=32130 u=vstoradmin | fatal: [32c8461b-92ec-48c3-ae02-
4d12194acd02]: FAILED! => {"changed": true, "cmd": "echo 4 > /sys/class/net/
enp103s0f1/device/sriov_numvfs", "delta": "0:00:00.127417", "end": "2021-09-23 
19:42:59.784281", "msg": "non-zero return code", "rc": 1, "start": "2021-09-23 
19:42:59.656864", "stderr": "/bin/sh: line 0: echo: write error: Cannot allocate 
memory", "stderr_lines": ["/bin/sh: line 0: echo: write error: Cannot allocate memory"], 
"stdout": "", "stdout_lines": []}

In this case, run the pci-helper.py script, and then reboot the node:

# /usr/libexec/vstorage-ui-agent/bin/pci-helper.py enable-iommu --pci-realloc
# reboot

When the node is up again, repeat the vinfra service compute set command.
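
To confirm that the script took effect, you can inspect the kernel command line after the reboot. This assumes that pci-helper.py enables the IOMMU through kernel boot parameters, which its options suggest; the exact output depends on your platform:

# grep -o -E 'intel_iommu=[^ ]+|amd_iommu=[^ ]+|pci=realloc' /proc/cmdline
intel_iommu=on
pci=realloc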