Monitoring & Observability

This section gives some context on our monitoring infrastructure and on our alerting setup, both of which are built around our Grafana instance.

Access

Grafana is available only to whitelisted people. If you need access, please ask an administrator. If you already have access rights, you can add a new person in our Azure Portal.

If you need to add an external developer, you can invite them as a guest into the Bratislava Active Directory (Azure portal -> Active Directory -> Users -> New/Invite -> Guest/External).

A Bit of Context

For monitoring and observability we use Grafana with a plugin/application stack made up of Prometheus, Loki and Infinity.

You can read more in their documentation, but to describe the setup in short:

  • Grafana is a visualization tool with alerting capabilities. You can add plugins and applications to it to extend its functionality.
  • Prometheus is a monitoring tool that sits on top of our Kubernetes infrastructure and collects metrics (exposed through a /metrics endpoint) about node health and resources, and about application (pod) state and resources.
  • Loki is a Grafana application that specializes in log monitoring and alerting. We use promtail to push application logs into Loki.
  • Infinity is a very simple Grafana plugin that can make HTTP requests. It can be used to monitor a health endpoint, and it can parse JSON responses and alert on them.
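
To make the /metrics endpoint mentioned above more concrete, here is a minimal sketch of the Prometheus text exposition format that such an endpoint returns. The function name `render_metrics` and the sample metric `app_up` are hypothetical and used only for illustration. The sketch handles gauges only, does not escape label values, and expects samples of the same metric to be passed next to each other.

```python
def render_metrics(samples):
    """Render gauge samples in the Prometheus text exposition format.

    `samples` is a list of (name, help_text, labels, value) tuples.
    Samples sharing a metric name are expected to be adjacent.
    """
    lines = []
    seen = set()
    for name, help_text, labels, value in samples:
        if name not in seen:
            # HELP and TYPE lines are emitted once per metric name.
            lines.append(f"# HELP {name} {help_text}")
            lines.append(f"# TYPE {name} gauge")
            seen.add(name)
        # Labels are rendered as key="value" pairs inside curly braces.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample: one gauge reporting that a pod is up.
    print(render_metrics([("app_up", "Whether the app is up", {"pod": "demo-0"}, 1)]))
```

In practice an application would normally use an official Prometheus client library rather than rendering this text by hand; the sketch only shows what the scraped payload looks like.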

All of these applications can be used to monitor your application and alert in case of any issues. Currently, we use them to monitor and observe our Kubernetes infrastructure, together with some critical applications. For example, we monitor hardware resources for all our nodes and pods, and we alert when usage reaches the critical level of >=95%.
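
The >=95% rule can be sketched as a small check over a Prometheus query result. This is an illustration, not our actual alert rule: it assumes the query already returns usage as a percentage, and it works on a decoded JSON body in the shape returned by the Prometheus HTTP API instant query endpoint (`/api/v1/query`). The function name `critical_series` and the sample labels are hypothetical.

```python
CRITICAL_USAGE_PERCENT = 95.0  # threshold taken from the text above


def critical_series(response, threshold=CRITICAL_USAGE_PERCENT):
    """Return (labels, value) pairs whose usage is at or above the threshold.

    `response` is a decoded instant-query response with a vector result.
    """
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    hits = []
    for series in response["data"]["result"]:
        _timestamp, raw_value = series["value"]
        value = float(raw_value)  # sample values are encoded as strings
        if value >= threshold:
            hits.append((series["metric"], value))
    return hits


if __name__ == "__main__":
    # Hypothetical response with one healthy and one critical pod.
    sample = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"pod": "demo-0"}, "value": [1700000000, "42.5"]},
                {"metric": {"pod": "demo-1"}, "value": [1700000000, "97.1"]},
            ],
        },
    }
    for labels, value in critical_series(sample):
        print(labels, value)
```

In Grafana itself this comparison is configured in the alert rule; the sketch only shows the logic behind it.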

Dashboards

If you have access to Grafana, you can take a look at our dashboards. They provide more information about the state of our infrastructure and of individual applications, together with their logs.

Pod Dashboard

The Pod Dashboard is an application dashboard where you can see the logs of applications together with their log volume and system statistics.

The dashboard is driven by filters that go down to container granularity.

  • You can search through the logs with a regex pattern. Log volume is color coded based on the stream it was emitted to (stdout/stderr).
  • It shows the current running status of all associated containers.
  • It shows application system statistics, such as memory, CPU, network and disk (PVC) usage.
  • At the bottom of the dashboard there is an alerting panel where you can see all the alerting rules associated with your filter selection, together with their state.
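
The regex search and the per-stream log volume described above can be sketched as follows. This is only an illustration of the idea, not how Loki or the dashboard implements it. The entry layout (dicts with `stream` and `message` keys) and the function name `search_logs` are assumptions made for the example.

```python
import re
from collections import Counter


def search_logs(entries, pattern):
    """Return entries whose message matches `pattern`, plus a per-stream count.

    Each entry is a dict with hypothetical keys "stream" and "message".
    """
    regex = re.compile(pattern)
    matched = [entry for entry in entries if regex.search(entry["message"])]
    # Count matches per output stream, mirroring the stdout/stderr color coding.
    volume = Counter(entry["stream"] for entry in matched)
    return matched, volume


if __name__ == "__main__":
    logs = [
        {"stream": "stdout", "message": "GET /health 200"},
        {"stream": "stderr", "message": "Error: connection refused"},
        {"stream": "stderr", "message": "error: timeout after 30s"},
    ]
    matched, volume = search_logs(logs, r"(?i)error")
    print(len(matched), dict(volume))
```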

Persistent Volumes Dashboard

The Persistent Volumes Dashboard monitors disk usage of Kubernetes PVs/PVCs.

The dashboard is driven by filters that go down to application granularity.

  • It shows current volume usage with "standard" gradient color coding from green to red, with the color starting to shift once usage reaches ~60%.
  • It provides a simple table showing the full capacity of each volume.
  • It shows historical disk usage as a percentage of full capacity.
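
The usage percentage and the green-to-red gradient can be illustrated with a short sketch. Only the ~60% starting point comes from the description above; the linear interpolation up to 100% and the function names are assumptions, and the real color steps are whatever the Grafana panel is configured with.

```python
def usage_percent(used_bytes, capacity_bytes):
    """Return volume usage as a percentage of full capacity."""
    if capacity_bytes <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * used_bytes / capacity_bytes


def gradient_position(percent, start_at=60.0):
    """Map usage to a gradient position: 0.0 is green, 1.0 is red.

    Assumption: fully green below `start_at`, then linear up to 100%.
    """
    if percent <= start_at:
        return 0.0
    return min((percent - start_at) / (100.0 - start_at), 1.0)


if __name__ == "__main__":
    # Hypothetical volume: 8 GiB used out of 10 GiB.
    pct = usage_percent(8 * 1024**3, 10 * 1024**3)
    print(pct, gradient_position(pct))
```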

Health Status

The Health Status Dashboard is a complex monitoring dashboard where you can find everything from the state of a single application and its logs all the way up to Kubernetes node resource utilization.

The dashboard is driven by filters that go down to individual pod granularity.

  • It provides statistics on the health status of all the applications running within the cluster, together with a listing of them.
  • It shows the current and historical system resource usage (CPU, memory, ...) and the running state of each pod.
  • It shows the current and historical system resource usage (CPU, memory, ...) and the running state of each container.
  • It shows information about application replicas.
  • It shows resource (CPU, memory, disk) and health information for the Kubernetes cluster nodes.

Alerting

For the actual alerting we have set up:

  • A Grafana Bratislava Slack application/bot that you can add to your channel
  • The email address grafana[at]devops.bratislava.sk, which you can use to send alert notifications to your mailbox

To set up a new contact point, for instance if you and only you want to get some specific alert, please follow our "Add New Contact Point" recipe.

For recipes on how to create your own alerts, take a look at the following:

  • Alerting on application system resources (CPU, memory, disk, etc.)
  • Alerting on application logs and on specific keywords or patterns in those logs
  • Alerting on the availability of specific endpoints or on data provided by those endpoints
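
As a rough illustration of the last kind of alert, the sketch below evaluates a health endpoint response the way an endpoint check might: the alert condition is a non-200 status code, a body that is not valid JSON, or a missing or unexpected field value. The field name `status` and the value `ok` are hypothetical; use whatever your application's health endpoint actually returns.

```python
import json


def evaluate_health(status_code, body, field="status", healthy_value="ok"):
    """Return True when the endpoint looks healthy, False when an alert should fire.

    `field` and `healthy_value` are hypothetical defaults for illustration.
    """
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A body that is not valid JSON is treated as unhealthy.
        return False
    return isinstance(payload, dict) and payload.get(field) == healthy_value


if __name__ == "__main__":
    print(evaluate_health(200, '{"status": "ok"}'))
    print(evaluate_health(503, '{"status": "ok"}'))
    print(evaluate_health(200, "<html>Bad gateway</html>"))
```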