8.9. Troubleshooting Shaman Resources

The high availability feature is managed by the shaman-monitor service running on each cluster node. Each node’s service has its own resources repository in the <cluster_mount>/.shaman/md.<host_ID>/resources directory.

Once started, shaman-monitor enumerates all resources on the node, that is, virtual machines and containers, and maintains a list of them in its repository directory. Each resource is represented by a file in the node’s shaman repository that contains a path to that resource.
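
To make the repository layout more concrete, the following Python sketch lists the resource files in one node's repository directory together with the path stored in each. It assumes that every resource is a regular file whose first line holds the path. This guide only states that the file contains a path, so the exact file format is an assumption, and list_resources is a hypothetical helper that is not part of shaman.

```python
from pathlib import Path


def list_resources(repo_dir):
    """Return {resource_file_name: stored_path} for one shaman repository.

    Assumption: each resource is a regular file whose first line is the
    path to the VE. The real on-disk format may differ.
    """
    resources = {}
    for entry in sorted(Path(repo_dir).iterdir()):
        if not entry.is_file():
            continue  # skip subdirectories and anything that is not a plain file
        lines = entry.read_text().splitlines()
        # An empty file is reported with an empty path instead of failing.
        resources[entry.name] = lines[0].strip() if lines else ""
    return resources
```

Such a helper would be called with the <cluster_mount>/.shaman/md.<host_ID>/resources directory of the node in question. It only reads the files and does not modify the repository.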

One of the running shaman-monitor services in the cluster is elected the master shaman service, and the node it runs on becomes the master shaman node. If a node fails, the shaman master starts processing the failed node’s shaman repository. Resources that were running on the failed node are relocated to healthy nodes, where they are registered and started by the respective shaman services. Resources that were not running or could not be relocated according to the policy (see Configuring Resource Relocation Modes) are registered on the master shaman node and remain stopped.
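
The failover decision described above can be sketched as a small Python model. This is an illustration only and not shaman's implementation: plan_failover is a hypothetical function, the relocatable flag stands in for the relocation policy, and the round-robin choice of target nodes is an assumption, since this section does not describe how targets are chosen.

```python
from itertools import cycle


def plan_failover(resources, healthy_nodes, master_node):
    """Decide where each resource of a failed node ends up.

    resources: list of (name, was_running, relocatable) tuples.
    Returns {name: (target_node, state)}.
    """
    if not healthy_nodes:
        raise ValueError("at least one healthy node is required")
    targets = cycle(healthy_nodes)  # assumption: simple round-robin placement
    plan = {}
    for name, was_running, relocatable in resources:
        if was_running and relocatable:
            # Relocated to a healthy node, registered and started there.
            plan[name] = (next(targets), "started")
        else:
            # Not running, or the policy does not allow relocation:
            # registered on the master shaman node and left stopped.
            plan[name] = (master_node, "stopped")
    return plan
```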

8.9.1. Possible Issues

In certain situations that involve moving and re-registering virtual environments (VEs), such as failed migrations or backups, shaman’s list of resources may become outdated, and some resources may not be where shaman expects them to be. This may cause the following issues during HA events:

  • If a command for a resource returned an error during a previous shaman action, the resource is marked as broken and moved to <cluster_mount>/.shaman/broken. Broken resources are ignored during HA events.

  • A resource may be present in the shaman repository of a node while the corresponding VE is not actually registered on that node. The VE may in fact be running on a different node. If an HA failover occurs on the node whose repository holds resources for such unregistered VEs, the master will try to re-register these VEs on other nodes while they are already running elsewhere. VEs duplicated in this way will not start, and the hypervisor will consider them invalid.

  • A virtual environment may be registered and running on one node while the corresponding resource exists in the shaman repository of a different node. If an HA failover occurs on the node where the VE is running, such a VE will not be relocated, because shaman is unaware of it. In addition, if you attempt to migrate such a VE, the source node will try to pass the corresponding shaman resource to the destination node. This will result in an error, since the resource is not present on the source node.

  • Two shaman resources in different shaman repositories may point to the same virtual environment. As a result, the VE may become duplicated, and the hypervisor will consider one of the duplicates invalid.

  • A virtual environment’s shaman resource may be missing from all repositories. This is equivalent to disabling HA for that VE, except that an attempt to migrate the VE will result in an error because the resource cannot be found.
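
The location mismatches listed above can be illustrated with a short Python sketch that compares, node by node, which VEs are registered with which VEs have a shaman resource. It is a conceptual model under stated assumptions: find_inconsistencies is a hypothetical function, and the two input dictionaries are assumed to have been collected by other means (for example, from the output of prlctl list and from the repository directories). Broken resources are not covered, because they are kept in a separate directory.

```python
def find_inconsistencies(registered, resources):
    """Compare VE registrations with shaman resources, node by node.

    registered: {node: set of VE IDs registered on that node}
    resources:  {node: set of VE IDs with a resource in that node's repository}
    Returns a sorted list of (ve_id, issue) tuples.
    """
    issues = set()

    # Map every VE to the nodes whose repository holds a resource for it.
    holders = {}
    for node, ves in resources.items():
        for ve in ves:
            holders.setdefault(ve, set()).add(node)

    for ve, nodes in holders.items():
        if len(nodes) > 1:
            issues.add((ve, "resource duplicated in several repositories"))
        for node in nodes:
            if ve not in registered.get(node, set()):
                issues.add((ve, f"resource on {node} but VE not registered there"))

    for node, ves in registered.items():
        for ve in ves:
            if ve not in holders:
                issues.add((ve, "no shaman resource in any repository"))
            elif node not in holders[ve]:
                issues.add((ve, f"registered on {node} but resource is elsewhere"))

    return sorted(issues)
```

An empty result means that every registered VE has exactly one resource, located on the node where the VE is registered.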

8.9.2. Shaman Resource Consistency Best Practices

Resources may become broken if the underlying storage was not writable during an HA event. To address this, starting from Virtuozzo Hybrid Server 7 Update 10, shaman automatically registers a VE on the node where it is actually running whenever the VE restarts. This also fixes any mismatch between the location of a VE and the location of its shaman resource.

In general, you can locate broken resources in shaman repositories by looking for VE records marked “B” in the output of shaman stat. If such VEs are shown in the output of prlctl list on the nodes where they reside, you can restart them to fix their shaman resources. Otherwise, you can try to re-register all broken resources on a node by running

# shaman [-c <cluster_name>] cleanup-broken

In particular, this command cleans the shaman repository of resources with incorrect paths.

In addition, you can check whether local resources point to local VEs, and fix those that do not, by running

# shaman [-c <cluster_name>] sync

Note

Since the shaman sync command only fixes shaman resources, you need to manually clean up any unneeded (stopped or invalid) VE duplicates before launching shaman sync in your cluster.

The shaman sync command updates shaman resources on the local node according to the following rules:

  • If a VE is present on a node but its shaman resource is missing or located on another node, the resource is re-registered on the local node.

  • If a VE is not present on a node but its resource is, this shaman resource is deleted.

  • If both the VE and its shaman resource are present on a node but their parameters (e.g., resource relocation priority) differ, VE parameters overwrite resource parameters so that both are in sync.
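
The three rules above can be expressed as a minimal Python model. It is a sketch for illustration, not the actual implementation of shaman sync: sync_node is a hypothetical function, VEs and resources are represented as plain dictionaries of parameters, and removing a resource from another node's repository (the second half of the first rule) is not modeled.

```python
def sync_node(local_ves, local_resources):
    """Apply the three sync rules to one node.

    local_ves:       {ve_id: params dict} for VEs present on this node
    local_resources: {ve_id: params dict} for resources in this node's repository
    Returns (synced_resources, actions); the inputs are not modified.
    """
    synced = {}
    actions = []

    for ve_id, ve_params in local_ves.items():
        if ve_id not in local_resources:
            # Rule 1: VE is present, resource is missing locally.
            actions.append((ve_id, "re-registered"))
        elif local_resources[ve_id] != ve_params:
            # Rule 3: parameters differ, VE parameters win.
            actions.append((ve_id, "parameters updated"))
        synced[ve_id] = dict(ve_params)

    for ve_id in local_resources:
        if ve_id not in local_ves:
            # Rule 2: resource without a local VE is deleted.
            actions.append((ve_id, "deleted"))

    return synced, actions
```

After the call, the synced dictionary contains exactly one resource per local VE, with the VE's parameters.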

Because shaman sync only updates resources on the local node, you may want to run it on every node in the cluster to fix all of the resources.