Garbage collection

In erasure-coded storage, updating or truncating files can leave behind stale fragments, that is, data and parity pieces that are no longer needed. Because erasure coding splits the incoming data stream into multiple fragments and distributes them across different failure domains, removing unneeded fragments safely requires a coordinated process. Virtuozzo Hybrid Infrastructure uses a background garbage-collection (GC) mechanism to reclaim storage space without compromising data integrity and redundancy.

The GC process operates in several stages:

  1. When a file is updated or truncated, the corresponding old fragments are marked as stale in the cluster metadata.
  2. A background GC process periodically scans the metadata for stale fragments and schedules them for removal. Deletions are performed in batches to limit the impact on cluster performance.
  3. Stale fragments are removed only from complete chunks in which user data occupies less than 3/4 of the chunk. Chunks with a smaller share of garbage are left untouched, so clusters with many partially filled chunks may retain about 1/8 garbage on average.

    Because of erasure coding constraints, individual fragments cannot always be deleted without affecting valid data. In such cases, GC relocates non-stale data into new chunks.

  4. The last chunk of a file is handled differently: only empty stripes within this chunk are removed. If the last chunk is heavily fragmented, some garbage may remain in it, because partially filled stripes cannot be safely deleted.
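The 3/4 threshold from step 3 can be illustrated with a minimal Python sketch. This is not the product's implementation: the function names are hypothetical, and the assumption that garbage shares in retained chunks are spread evenly between 0 and 1/4 is only one plausible reading of where the 1/8 average comes from.

```python
from fractions import Fraction

# Threshold taken from the text: a complete chunk is cleaned only when
# user data occupies less than 3/4 of it.
LIVE_THRESHOLD = Fraction(3, 4)


def should_reclaim(live_bytes: int, chunk_bytes: int) -> bool:
    """Return True if user data fills less than 3/4 of the chunk.

    Hypothetical helper; chunk_bytes must be greater than zero.
    """
    return Fraction(live_bytes, chunk_bytes) < LIVE_THRESHOLD


def average_retained_garbage(steps: int = 1000) -> Fraction:
    """Average garbage share across chunks that are NOT reclaimed.

    Assumption (not stated in the documentation): garbage shares of
    retained chunks are spread evenly between 0 and 1/4.
    """
    max_garbage = 1 - LIVE_THRESHOLD
    shares = [max_garbage * Fraction(i, steps) for i in range(steps + 1)]
    return sum(shares, Fraction(0)) / len(shares)
```

Under that assumption, `average_retained_garbage()` evaluates to exactly 1/8, which matches the figure quoted in step 3. A chunk with exactly 3/4 user data is not reclaimed in this sketch, because the condition is "less than 3/4".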

To check a file for garbage

Use the following command:

vstorage usage-info <path_to_file>

For example:

# vstorage usage-info /mnt/vstorage/file-example
{
 "total-ondisk"              : 3180912,
 "total-usage"               : 3145736,
 "garbage-percent"           : 1.1,
 "retained-holes"            : 0,
 "retained-usage"            : 0
}

In this example, the file contains only 1.1% garbage.
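The reported percentage can be cross-checked from the other fields. The sketch below assumes that the command output is valid JSON, as in the example, and that garbage is the difference between total-ondisk and total-usage relative to total-ondisk. The formula is an assumption that happens to agree with the sample numbers; the value printed by vstorage remains authoritative. The function name is hypothetical.

```python
import json

# Output copied from the example above.
SAMPLE_OUTPUT = """
{
 "total-ondisk"              : 3180912,
 "total-usage"               : 3145736,
 "garbage-percent"           : 1.1,
 "retained-holes"            : 0,
 "retained-usage"            : 0
}
"""


def garbage_percent(usage_info_text: str) -> float:
    """Recompute the garbage share from the usage-info fields.

    Assumed formula: (total-ondisk - total-usage) / total-ondisk * 100.
    """
    info = json.loads(usage_info_text)
    ondisk = info["total-ondisk"]
    used = info["total-usage"]
    if ondisk == 0:
        return 0.0  # an empty file has nothing to reclaim
    return 100.0 * (ondisk - used) / ondisk


print(round(garbage_percent(SAMPLE_OUTPUT), 1))  # prints 1.1
```

For the sample file, 3180912 - 3145736 = 35176 bytes on disk do not hold current data, which is about 1.1% of 3180912.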

If a file contains a large amount of garbage, you can reduce it by copying the file to a new one and then renaming the copy to replace the original. The new file will be free of garbage.
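The copy-and-rename procedure could be scripted as in the following minimal sketch. The function name and the ".gc-copy" suffix are hypothetical. The sketch assumes that the file is not being written to during the copy, that there is enough free space for a second copy, and that the copy is created in the same directory so that the rename stays within one file system.

```python
import os
import shutil


def rewrite_file(path: str) -> None:
    """Replace a file with a fresh copy of itself.

    Hypothetical helper: copies the file next to the original and then
    renames the copy over the original.
    """
    tmp_path = path + ".gc-copy"  # hypothetical temporary name
    if os.path.exists(tmp_path):
        raise FileExistsError(tmp_path)
    # Copy contents plus timestamps and permission bits.
    shutil.copy2(path, tmp_path)
    # Rename the copy over the original in one step.
    os.replace(tmp_path, path)
```

Note that shutil.copy2 does not preserve file ownership, and any hard links to the original file will keep pointing to the old data. After the rewrite, running vstorage usage-info on the file again shows whether the garbage share has dropped.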