Object storage metrics and alerts

Metrics used for monitoring object storage are defined as Prometheus recording rules and can be found in the following files on any node in the cluster:

  • /var/lib/prometheus/rules/s3.rules
  • /var/lib/prometheus/rules/ostor.rules
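
Entries in these files follow the standard Prometheus recording rule naming scheme (level:metric:operations), where the level part lists the labels preserved by aggregation. As a minimal sketch of what such a rule might look like (the raw counter name ostor_s3gw_req_total and the label names are illustrative assumptions, not the rule actually shipped in these files):

  groups:
    - name: s3
      rules:
        # Hypothetical example: per-instance, per-volume, per-service request
        # rate averaged over 5 minutes. The source counter name is an assumption.
        - record: instance_vol_svc:ostor_s3gw_req:rate5m
          expr: sum by (instance, vol, svc) (rate(ostor_s3gw_req_total[5m]))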

Metrics used to generate object storage alerts are added to the alerting rules in /var/lib/prometheus/alerts/s3.rules. These metrics are described below:

  • instance_vol_svc:ostor_s3gw_req:rate5m
    Number of requests per second served by a particular S3 gateway service, averaged over 5 minutes
  • instance_vol_svc:ostor_s3gw_req_cancelled:rate5m
    Number of canceled requests per second for a particular S3 gateway service, averaged over 5 minutes
  • instance_vol_svc:ostor_req_server_err:rate5m
    Number of requests per second that failed with a server error (5XX status code) for a particular S3 gateway service, averaged over 5 minutes
  • instance_vol_svc:ostor_s3gw_get_req_latency_ms_bucket:rate5m
    Current GET request latency, in milliseconds, of a particular S3 gateway service, averaged over 5 minutes, for each histogram bucket
  • instance_vol_svc:ostor_commit_latency_us_bucket:rate5m
    Current commit latency, in microseconds, of the object storage service, averaged over 5 minutes, for each histogram bucket
  • instance_vol_svc_req:ostor_os_req_latency_ms_bucket:rate5m
    Current request latency, in milliseconds, of a particular OS (Object) service, averaged over 5 minutes, for each histogram bucket
  • instance_vol_svc_req:ostor_ns_req_latency_ms_bucket:rate5m
    Current request latency, in milliseconds, of a particular NS (Name) service, averaged over 5 minutes, for each histogram bucket
  • pcs_process_inactive_seconds_total
    Total time, in seconds, a process has been inactive
  • process_cpu_seconds_total
    Total CPU time, in seconds, used by a process
  • ostor_svc_start_failed_count_total
    Total number of failed attempts to start a service
  • ostor_svc_registry_cfg_failed_total
    Total number of failed attempts to connect to the configuration service
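
The latency metrics above are Prometheus histogram bucket series (hence the _bucket suffix), so the median latencies referenced in the alerts below are derived with histogram_quantile. As a minimal sketch, assuming the recording rule preserves the standard le label along with instance and svc labels (the exact label set is an assumption):

  # Median (0.5 quantile) GET request latency per S3 gateway service,
  # in milliseconds; the grouping labels here are assumptions.
  histogram_quantile(0.5,
    sum by (le, instance, svc) (instance_vol_svc:ostor_s3gw_get_req_latency_ms_bucket:rate5m))

A result above 1000 would correspond to the 1-second warning threshold used by the GET latency alert below.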

Based on the metrics above, the following object storage alerts are generated in Prometheus:

  • S3 Gateway service has high GET request latency
    Warning: S3 Gateway service (<service_id>) on <node> has the median GET request latency higher than 1 second.
    Critical: S3 Gateway service (<service_id>) on <node> has the median GET request latency higher than 5 seconds.
  • Object service has high request latency
    Warning: Object service (<service_id>) on <node> has the median request latency higher than 1 second.
    Critical: Object service (<service_id>) on <node> has the median request latency higher than 5 seconds.
  • Name service has high request latency
    Warning: Name service (<service_id>) on <node> has the median request latency higher than 1 second.
    Critical: Name service (<service_id>) on <node> has the median request latency higher than 5 seconds.
  • Name service has high commit latency
    Warning: Name service (<service_id>) on <node> has the median commit latency higher than 1 second. Check the storage performance.
    Critical: Name service (<service_id>) on <node> has the median commit latency higher than 10 seconds. Check the storage performance.
  • Object service has high commit latency
    Warning: Object service (<service_id>) on <node> has the median commit latency higher than 1 second. Check the storage performance.
    Critical: Object service (<service_id>) on <node> has the median commit latency higher than 10 seconds. Check the storage performance.
  • S3 Gateway service has high cancel request rate
    Warning: S3 Gateway service (<service_id>) on <node> has the cancel request rate higher than 5%. It may be caused by connectivity issues, request timeouts, or a small limit for pending requests.
    Critical: S3 Gateway service (<service_id>) on <node> has the cancel request rate higher than 30%. It may be caused by connectivity issues, request timeouts, or a small limit for pending requests.
  • Object storage agent is frozen for a long time
    Critical: Object storage agent on <node> has the event loop inactive for more than 1 minute.
  • S3 service is frozen for a long time
    Critical: S3 service (<service_name>, <service_id>) on <node> has the event loop inactive for more than 1 minute.
  • S3 Gateway service has high CPU usage
    Warning: S3 Gateway service (<service_id>) on <node> has CPU usage higher than 75%. The service may be overloaded.
    Critical: S3 Gateway service (<service_id>) on <node> has CPU usage higher than 90%. The service may be overloaded.
  • S3 Gateway service has too many failed requests
    Critical: S3 Gateway service (<service_id>) on <node> has a lot of failed requests with a server error (5XX status code).
  • S3 service failed to start
    Critical: Object storage agent failed to start <service_name> (<service_id>) on <node>.
  • FS failed to start
    Critical: Object storage agent failed to start the file service on <node>.
  • Object storage agent is offline
    Warning: Object storage agent is offline on <node>.
  • Object storage agent is not connected to the configuration service
    Warning: Object storage agent failed to connect to the configuration service on <node>.
  • S3 cluster has unavailable object services
    Warning: Some Object services are not running on <node>. Check the service status in the command-line interface.
  • S3 cluster has unavailable name services
    Warning: Some Name services are not running on <node>. Check the service status in the command-line interface.
  • S3 cluster has unavailable S3 Gateway services
    Warning: Some S3 Gateway services are not running on <node>. Check the service status in the command-line interface.
  • S3 cluster has unavailable Geo-replication services
    Warning: Some Geo-replication services are not running on <node>. Check the service status in the command-line interface.
  • NFS service has unavailable FS services
    Warning: Some File services are not running on <node>. Check the service status in the command-line interface.
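
For reference, the alerting rules in /var/lib/prometheus/alerts/s3.rules combine the metrics above with thresholds and severities. A hedged sketch of what the 5% cancel-rate warning could look like (the alert name, for duration, and label names are assumptions, not the exact shipped rule):

  groups:
    - name: s3-alerts
      rules:
        # Hypothetical sketch: warn when more than 5% of requests handled by
        # an S3 gateway service are cancelled over the 5-minute window.
        - alert: S3GatewayHighCancelRequestRate
          expr: >
            instance_vol_svc:ostor_s3gw_req_cancelled:rate5m
            / instance_vol_svc:ostor_s3gw_req:rate5m > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: S3 Gateway service has high cancel request rate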

Bucket and user size metrics

Metrics that report object storage usage per bucket and per user are not available by default. To collect these statistics, enable them by running the following command on an S3 node:

# ostor-ctl set-vol -V 0100000000000002 --enable-stat

The following metrics will appear in Prometheus:

  • account_control_buckets_size: Bucket size, in bytes
  • account_control_user_size: Total size of all user buckets, in bytes
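
Once enabled, both metrics can be queried in Prometheus like any other series. For example, to list the ten largest buckets and the ten largest users by consumed space (topk is standard PromQL; the per-bucket and per-user label names on these series are not shown here):

  # Ten largest buckets by size, in bytes
  topk(10, account_control_buckets_size)

  # Ten users with the largest total bucket size, in bytes
  topk(10, account_control_user_size)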