System Metrics and Alarms

In managed environments (cloud, single-tenant, and managed on-premise), Grafana, Prometheus, and several exporters are deployed to aggregate system metrics, render them in custom dashboards, and send alerts when a metric deviates from its normal range or reaches a dangerous value.

Monitored systems include:

  • Load balancer

  • Redis

  • Postgres

  • Postgres worker queue

  • Redis worker queue

  • Workflow executions

  • Connect Proxy Requests

  • Microservices

Load Balancer

Metrics and alarms include:

  • Request count / response time

  • HTTP response codes

  • Connection count

  • Bytes processed

  • TLS errors

Redis

Metrics and alarms include:

  • Uptime

  • Clients

  • Memory usage

  • Commands executed/second

  • Hits/misses per second

  • Total items per database

  • Network I/O

  • Expiring vs non-expiring keys

  • Expired/evicted keys

  • Command calls/second
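Per-second figures such as hits/misses per second are typically derived from cumulative counters (for example, the `keyspace_hits` and `keyspace_misses` fields reported by Redis `INFO`) by comparing two samples. The sketch below illustrates that calculation only; it is not the exporter's or Prometheus's actual implementation, and the function name and reset handling are assumptions made for illustration.

```python
def per_second_rate(earlier: int, later: int, elapsed_seconds: float) -> float:
    """Approximate a per-second rate from two samples of a cumulative counter.

    Hypothetical helper: if the counter went backwards, we assume the server
    restarted and the counter began again from zero, so the increase is taken
    to be the later value on its own.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    increase = later - earlier if later >= earlier else later
    return increase / elapsed_seconds


# Two samples of a hit counter taken 60 seconds apart.
hits_per_second = per_second_rate(earlier=1000, later=1600, elapsed_seconds=60)
print(hits_per_second)
```

In practice this differencing is usually left to the monitoring stack (for instance, a rate query in Prometheus) rather than written by hand.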

Postgres

Metrics and alarms include:

  • CPU usage

  • Memory usage

  • Transactions

  • Locks

  • Conflicts/deadlocks

  • Cache hit rate
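As an illustration of the cache hit rate metric: Postgres reports cumulative `blks_hit` (blocks served from the buffer cache) and `blks_read` (blocks read from disk) counters in the `pg_stat_database` view, and the hit rate is commonly computed as the share of block requests served from cache. The sketch below assumes those two counters have already been fetched; the function name is hypothetical, and the exact query used by the deployed dashboards is not specified here.

```python
from typing import Optional


def cache_hit_rate(blks_hit: int, blks_read: int) -> Optional[float]:
    """Return the fraction of block requests served from the buffer cache.

    Hypothetical helper. Returns None when there has been no block activity,
    because the ratio is undefined in that case.
    """
    total = blks_hit + blks_read
    if total == 0:
        return None
    return blks_hit / total


# 990 blocks found in cache, 10 blocks read from disk.
print(cache_hit_rate(blks_hit=990, blks_read=10))
```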

Postgres Worker Queue

Metrics and alarms include:

  • Workers

  • Throughput

  • Average wait

  • Job statuses

  • Job duration

  • Error rate

  • Average wait per queue

  • Workers per queue

Redis Worker Queue

Metrics and alarms include:

  • Queue length

  • Queue states

  • Failures by queue

  • Job duration

Workflow Executions

Metrics and alarms include:

  • Workflow executions

  • Step executions

  • Workflow completion rate

Connect Proxy Requests

Metrics and alarms include:

  • Total request count

  • Latency

  • Open requests

  • Status code

  • Status code by integration

  • Status code by credential

Microservices

Metrics and alarms include:

  • Requests

  • Apdex score

  • Error rate

  • Event loop lag

  • CPU

  • Heap

  • Request duration
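For readers unfamiliar with the Apdex score: it summarizes request durations as a single value between 0 and 1. Under the standard definition, requests at or below a target threshold T count as satisfied, requests between T and 4T count as tolerating (weighted by half), and slower requests count as frustrated. The sketch below shows that calculation; the function name and the 0.5-second threshold in the usage line are assumptions for illustration, not values taken from the deployed dashboards.

```python
from typing import Optional, Sequence


def apdex(durations: Sequence[float], threshold: float) -> Optional[float]:
    """Compute an Apdex score from request durations (same unit as threshold).

    Hypothetical helper. Returns None when there are no samples.
    """
    if not durations:
        return None
    satisfied = sum(1 for d in durations if d <= threshold)
    tolerating = sum(1 for d in durations if threshold < d <= 4 * threshold)
    # Frustrated requests (slower than 4 * threshold) contribute nothing.
    return (satisfied + tolerating / 2) / len(durations)


# Two satisfied, one tolerating, and one frustrated request, with T = 0.5 s.
print(apdex([0.1, 0.2, 0.6, 3.0], threshold=0.5))
```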
