Object storage metrics and alerts

Metrics used for monitoring object storage are defined as Prometheus recording rules and can be found in the following files on any node in the cluster:

  • /var/lib/prometheus/rules/s3.rules
  • /var/lib/prometheus/rules/ostor.rules
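
Entries in these files follow the standard Prometheus recording rule naming scheme (level:metric:operations), where the level part lists the labels preserved by aggregation. As a minimal sketch of what such a rule might look like (the raw counter name ostor_s3gw_req_total and the label names are illustrative assumptions, not the rule actually shipped in these files):

  groups:
    - name: s3
      rules:
        # Hypothetical example: per-instance, per-volume, per-service request
        # rate averaged over 5 minutes. The source counter name is an assumption.
        - record: instance_vol_svc:ostor_s3gw_req:rate5m
          expr: sum by (instance, vol, svc) (rate(ostor_s3gw_req_total[5m]))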

Metrics used to generate object storage alerts are added to the alerting rules in /var/lib/prometheus/alerts/s3.rules. These metrics are described below:

  • instance_vol_svc:ostor_s3gw_req:rate5m
    Number of requests per second served by a particular S3 gateway service, averaged over 5 minutes
  • instance_vol_svc:ostor_s3gw_req_cancelled:rate5m
    Number of canceled requests per second for a particular S3 gateway service, averaged over 5 minutes
  • instance_vol_svc:ostor_req_server_err:rate5m
    Number of requests per second that failed with a server error (5XX status code) for a particular S3 gateway service, averaged over 5 minutes
  • instance_vol_svc:ostor_s3gw_get_req_latency_ms_bucket:rate5m
    Current GET request latency, in milliseconds, of a particular S3 gateway service, averaged over 5 minutes, for each histogram bucket
  • instance_vol_svc:ostor_commit_latency_us_bucket:rate5m
    Current commit latency, in microseconds, of the object storage service, averaged over 5 minutes, for each histogram bucket
  • instance_vol_svc_req:ostor_os_req_latency_ms_bucket:rate5m
    Current request latency, in milliseconds, of a particular OS (Object) service, averaged over 5 minutes, for each histogram bucket
  • instance_vol_svc_req:ostor_ns_req_latency_ms_bucket:rate5m
    Current request latency, in milliseconds, of a particular NS (Name) service, averaged over 5 minutes, for each histogram bucket
  • pcs_process_inactive_seconds_total
    Total time, in seconds, a process has been inactive
  • process_cpu_seconds_total
    Total CPU time, in seconds, used by a process
  • ostor_svc_start_failed_count_total
    Total number of failed attempts to start a service
  • ostor_svc_registry_cfg_failed_total
    Total number of failed attempts to connect to the configuration service
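
The latency metrics above are Prometheus histogram bucket series (hence the _bucket suffix), so the median latencies referenced in the alerts below are derived with histogram_quantile. As a minimal sketch, assuming the recording rule preserves the standard le label along with instance and svc labels (the exact label set is an assumption):

  # Median (0.5 quantile) GET request latency per S3 gateway service,
  # in milliseconds; the grouping labels here are assumptions.
  histogram_quantile(0.5,
    sum by (le, instance, svc) (instance_vol_svc:ostor_s3gw_get_req_latency_ms_bucket:rate5m))

A result above 1000 would correspond to the 1-second warning threshold used by the GET latency alert below.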

Based on the metrics above, the following object storage alerts are generated in Prometheus:

  • S3 Gateway service has high GET request latency
    Warning: S3 Gateway service (<service_id>) on <node> has the median GET request latency higher than 1 second.
    Critical: S3 Gateway service (<service_id>) on <node> has the median GET request latency higher than 5 seconds.
  • Object service has high request latency
    Warning: Object service (<service_id>) on <node> has the median request latency higher than 1 second.
    Critical: Object service (<service_id>) on <node> has the median request latency higher than 5 seconds.
  • Name service has high request latency
    Warning: Name service (<service_id>) on <node> has the median request latency higher than 1 second.
    Critical: Name service (<service_id>) on <node> has the median request latency higher than 5 seconds.
  • Name service has high commit latency
    Warning: Name service (<service_id>) on <node> has the median commit latency higher than 1 second. Check the storage performance.
    Critical: Name service (<service_id>) on <node> has the median commit latency higher than 10 seconds. Check the storage performance.
  • Object service has high commit latency
    Warning: Object service (<service_id>) on <node> has the median commit latency higher than 1 second. Check the storage performance.
    Critical: Object service (<service_id>) on <node> has the median commit latency higher than 10 seconds. Check the storage performance.
  • S3 Gateway service has high cancel request rate
    Warning: S3 Gateway service (<service_id>) on <node> has the cancel request rate higher than 5%. It may be caused by connectivity issues, request timeouts, or a small limit for pending requests.
    Critical: S3 Gateway service (<service_id>) on <node> has the cancel request rate higher than 30%. It may be caused by connectivity issues, request timeouts, or a small limit for pending requests.
  • Object storage agent is frozen for a long time
    Critical: Object storage agent on <node> has the event loop inactive for more than 1 minute.
  • S3 service is frozen for a long time
    Critical: S3 service (<service_name>, <service_id>) on <node> has the event loop inactive for more than 1 minute.
  • S3 Gateway service has high CPU usage
    Warning: S3 Gateway service (<service_id>) on <node> has CPU usage higher than 75%. The service may be overloaded.
    Critical: S3 Gateway service (<service_id>) on <node> has CPU usage higher than 90%. The service may be overloaded.
  • S3 Gateway service has too many failed requests
    Critical: S3 Gateway service (<service_id>) on <node> has a lot of failed requests with a server error (5XX status code).
  • S3 service failed to start
    Critical: Object storage agent failed to start <service_name> (<service_id>) on <node>.
  • FS failed to start
    Critical: Object storage agent failed to start the file service on <node>.
  • Object storage agent is offline
    Warning: Object storage agent is offline on <node>.
  • Object storage agent is not connected to the configuration service
    Warning: Object storage agent failed to connect to the configuration service on <node>.
  • S3 cluster has unavailable object services
    Warning: Some Object services are not running on <node>. Check the service status in the command-line interface.
  • S3 cluster has unavailable name services
    Warning: Some Name services are not running on <node>. Check the service status in the command-line interface.
  • S3 cluster has unavailable S3 Gateway services
    Warning: Some S3 Gateway services are not running on <node>. Check the service status in the command-line interface.
  • S3 cluster has unavailable Geo-replication services
    Warning: Some Geo-replication services are not running on <node>. Check the service status in the command-line interface.
  • NFS service has unavailable FS services
    Warning: Some File services are not running on <node>. Check the service status in the command-line interface.
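
For reference, the alerting rules in /var/lib/prometheus/alerts/s3.rules combine the metrics above with thresholds and severities. A hedged sketch of what the 5% cancel-rate warning could look like (the alert name, for duration, and label names are assumptions, not the exact shipped rule):

  groups:
    - name: s3-alerts
      rules:
        # Hypothetical sketch: warn when more than 5% of requests handled by
        # an S3 gateway service are cancelled over the 5-minute window.
        - alert: S3GatewayHighCancelRequestRate
          expr: >
            instance_vol_svc:ostor_s3gw_req_cancelled:rate5m
            / instance_vol_svc:ostor_s3gw_req:rate5m > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: S3 Gateway service has high cancel request rate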

Bucket and user size metrics

Metrics that report object storage usage per bucket and per user are not available by default. To collect these statistics, enable them by running the following command on an S3 node:

# ostor-ctl set-vol -V 0100000000000002 --enable-stat

The following metrics will appear in Prometheus:

  • account_control_buckets_size: Bucket size, in bytes
  • account_control_user_size: Total size of all user buckets, in bytes
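
Once enabled, both metrics can be queried in Prometheus like any other series. For example, to list the ten largest buckets and the ten largest users by consumed space (topk is standard PromQL; the per-bucket and per-user label names on these series are not shown here):

  # Ten largest buckets by size, in bytes
  topk(10, account_control_buckets_size)

  # Ten users with the largest total bucket size, in bytes
  topk(10, account_control_user_size)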