Object storage metrics and alerts
Metrics used for monitoring object storage are configured in the Prometheus recording rules and can be found in these files on any node in the cluster:
- /var/lib/prometheus/rules/s3.rules
- /var/lib/prometheus/rules/ostor.rules
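These recording rules follow the common Prometheus `level:metric:operations` naming convention, so a name such as `instance_vol_svc:ostor_s3gw_req:rate5m` denotes a 5-minute rate aggregated by instance, volume, and service. A minimal sketch of such a rule is shown below; the raw counter name (`ostor_s3gw_req_total`) and its label set are assumptions for illustration, so consult the files above for the exact expressions.

```yaml
# Hypothetical excerpt from a recording rule file such as s3.rules.
# The raw counter name (ostor_s3gw_req_total) and its labels are assumed;
# the real expressions are in the files listed above.
groups:
  - name: s3
    rules:
      - record: instance_vol_svc:ostor_s3gw_req:rate5m
        expr: sum by (instance, vol, svc) (rate(ostor_s3gw_req_total[5m]))
```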
Metrics that are used to generate object storage alerts are added to the alerting rules in /var/lib/prometheus/alerts/s3.rules. These metrics are described in the following table:

| Metric | Description |
|---|---|
| `instance_vol_svc:ostor_s3gw_req:rate5m` | Per-second rate of all requests served by a particular S3 gateway service, averaged over 5 minutes |
| `instance_vol_svc:ostor_s3gw_req_cancelled:rate5m` | Per-second rate of canceled requests on a particular S3 gateway service, averaged over 5 minutes |
| `instance_vol_svc:ostor_req_server_err:rate5m` | Per-second rate of requests that failed with a server error (5XX status code) on a particular S3 gateway service, averaged over 5 minutes |
| `instance_vol_svc:ostor_s3gw_get_req_latency_ms_bucket:rate5m` | Current GET request latency on a particular S3 gateway service, averaged over 5 minutes, per latency histogram bucket |
| `instance_vol_svc:ostor_commit_latency_us_bucket:rate5m` | Current commit latency of the Object storage service, averaged over 5 minutes, per latency histogram bucket |
| `instance_vol_svc_req:ostor_os_req_latency_ms_bucket:rate5m` | Current request latency on a particular OS (object) service, averaged over 5 minutes, per latency histogram bucket |
| `instance_vol_svc_req:ostor_ns_req_latency_ms_bucket:rate5m` | Current request latency on a particular NS (name) service, averaged over 5 minutes, per latency histogram bucket |
| `pcs_process_inactive_seconds_total` | Total time, in seconds, a process has been inactive |
| `process_cpu_seconds_total` | Total CPU time, in seconds, used by a process |
| `ostor_svc_start_failed_count_total` | Total number of failed attempts to start a service |
| `ostor_svc_registry_cfg_failed_total` | Total number of failed attempts to connect to the configuration service |
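The latency metrics above are Prometheus histograms (note the `_bucket` suffix), so the median latencies referenced by the alerts below are typically computed with `histogram_quantile`. A sketch, assuming the recorded series keep the standard `le` label:

```promql
# Median (0.5 quantile) GET request latency per S3 gateway service, in ms.
# Assumes the recorded series carry the standard "le" histogram label.
histogram_quantile(0.5,
  sum by (instance, vol, svc, le)
    (instance_vol_svc:ostor_s3gw_get_req_latency_ms_bucket:rate5m))
```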
Based on the metrics above, the following object storage alerts are generated in Prometheus:
| Title | Message | Severity |
|---|---|---|
| S3 Gateway service has high GET request latency | S3 Gateway service (<service_id>) on <node> has the median GET request latency higher than 1 second. | warning |
| | S3 Gateway service (<service_id>) on <node> has the median GET request latency higher than 5 seconds. | critical |
| Object service has high request latency | Object service (<service_id>) on <node> has the median request latency higher than 1 second. | warning |
| | Object service (<service_id>) on <node> has the median request latency higher than 5 seconds. | critical |
| Name service has high request latency | Name service (<service_id>) on <node> has the median request latency higher than 1 second. | warning |
| | Name service (<service_id>) on <node> has the median request latency higher than 5 seconds. | critical |
| Name service has high commit latency | Name service (<service_id>) on <node> has the median commit latency higher than 1 second. Check the storage performance. | warning |
| | Name service (<service_id>) on <node> has the median commit latency higher than 10 seconds. Check the storage performance. | critical |
| Object service has high commit latency | Object service (<service_id>) on <node> has the median commit latency higher than 1 second. Check the storage performance. | warning |
| | Object service (<service_id>) on <node> has the median commit latency higher than 10 seconds. Check the storage performance. | critical |
| S3 Gateway service has high cancel request rate | S3 Gateway service (<service_id>) on <node> has the cancel request rate higher than 5%. It may be caused by connectivity issues, request timeouts, or a small limit for pending requests. | warning |
| | S3 Gateway service (<service_id>) on <node> has the cancel request rate higher than 30%. It may be caused by connectivity issues, request timeouts, or a small limit for pending requests. | critical |
| Object storage agent is frozen for a long time | Object storage agent on <node> has the event loop inactive for more than 1 minute. | critical |
| S3 service is frozen for a long time | S3 service (<service_name>, <service_id>) on <node> has the event loop inactive for more than 1 minute. | critical |
| S3 Gateway service has high CPU usage | S3 Gateway service (<service_id>) on <node> has CPU usage higher than 75%. The service may be overloaded. | warning |
| | S3 Gateway service (<service_id>) on <node> has CPU usage higher than 90%. The service may be overloaded. | critical |
| S3 Gateway service has too many failed requests | S3 Gateway service (<service_id>) on <node> has a lot of failed requests with a server error (5XX status code). | critical |
| S3 service failed to start | Object storage agent failed to start <service_name> (<service_id>) on <node>. | critical |
| FS failed to start | Object storage agent failed to start the file service on <node>. | critical |
| Object storage agent is offline | Object storage agent is offline on <node>. | warning |
| Object storage agent is not connected to configuration service | Object storage agent failed to connect to the configuration service on <node>. | warning |
| S3 cluster has unavailable object services | Some Object services are not running on <node>. Check the service status in the command-line interface. | warning |
| S3 cluster has unavailable name services | Some Name services are not running on <node>. Check the service status in the command-line interface. | warning |
| S3 cluster has unavailable S3 Gateway services | Some S3 Gateway services are not running on <node>. Check the service status in the command-line interface. | warning |
| S3 cluster has unavailable Geo-replication services | Some Geo-replication services are not running on <node>. Check the service status in the command-line interface. | warning |
| NFS service has unavailable FS services | Some File services are not running on <node>. Check the service status in the command-line interface. | warning |
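For reference, the warning variant of the first alert could be expressed in /var/lib/prometheus/alerts/s3.rules roughly as follows. This is a sketch only: the alert name, grouping labels, `for` duration, and annotation fields are assumptions, and the shipped rules may differ.

```yaml
# Hypothetical alerting rule for "S3 Gateway service has high GET request
# latency". The threshold is 1000 ms (1 second); names and labels are
# illustrative, not the shipped definitions.
groups:
  - name: s3-alerts
    rules:
      - alert: S3GatewayHighGetRequestLatency
        expr: |
          histogram_quantile(0.5,
            sum by (instance, vol, svc, le)
              (instance_vol_svc:ostor_s3gw_get_req_latency_ms_bucket:rate5m)
          ) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: S3 Gateway service has high GET request latency
```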
Bucket and user size metrics
Metrics that report object storage usage per bucket and per user are not available by default. To collect these statistics, enable them by running the following command on an S3 node:
```
# ostor-ctl set-vol -V 0100000000000002 --enable-stat
```
The following metrics will appear in Prometheus:
account_control_buckets_size
: Bucket size, in bytes

account_control_user_size
: Total size of all user buckets, in bytes
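Once collected, these series can be queried like any other Prometheus metric. For example, an illustrative query to list the ten largest buckets:

```promql
# Top 10 buckets by size, in bytes
topk(10, account_control_buckets_size)
```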