Enable Observability for FastAPI Service

Debugging and tracing issues in a distributed system is a nightmare if you can't observe a service from the inside. Enabling Observability for the service addresses this: it gives you a better understanding of how the service is doing.

This post gives a brief introduction to Observability and demonstrates how to enable Observability in a FastAPI application. The observation targets are logs, metrics, and traces in our sample project. We use OpenTelemetry, Prometheus, and a set of Grafana tools to collect and present data. The sample project is available on our GitHub repository fastapi-observability.

What is Observability

CNCF defines Observability as follows:

Observability is a characteristic of an application that refers to how well a system’s state or status can be understood from its external outputs. Computer systems are measured by observing CPU time, memory, disk space, latency, errors, etc. The more observable a system is, the easier it is to understand how it’s doing by looking at it.

Source: https://glossary.cncf.io/observability/

However, we are focusing on the service's state instead of the machine's state. So our observation targets are:

Traces: recording the path of a request across services in the distributed system as a set of spans
Metrics: time series data of service indicators, e.g. latency, request rate, and process duration
Logs: things that happened in the service, e.g. error messages, exceptions, and request logs

These have been called the three pillars of Observability since Ben Sigelman, the founder of LightStep, gave the Three Pillars of Observability talk at KubeCon NA 2018.

OpenTelemetry

To standardize Observability, CNCF proposed OpenTelemetry, a vendor-neutral open-source Observability framework. OpenTelemetry provides a collection of tools, APIs, and SDKs for instrumenting, generating, collecting, and exporting telemetry data such as traces, metrics, and logs. In our sample project, however, we only use the OpenTelemetry Python SDK.

Connecting All

Observing logs, metrics, and traces separately is not good enough. We need a mechanism to connect this information and view it in a unified tool. Grafana provides a great solution: it can follow a specific action in a service across traces, metrics, and logs through the trace ID from traces and the exemplar from OpenMetrics.

Observability Correlations

Image Source: Grafana

Sample Project

For a better understanding of Observability, we created a sample project with FastAPI, a Python API framework, and a set of Grafana tools. All services are defined in the docker compose file. The architecture is as follows:

Traces with Tempo and OpenTelemetry Python SDK
Metrics with Prometheus and Prometheus Python Client
Logs with Loki

Telemetry Architecture

Quick Start

Clone the sample project repository

 git clone https://github.com/Blueswen/fastapi-observability.git
 cd fastapi-observability

Install Loki Docker Driver

 docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions

Build the application image and start all services with docker-compose
```
 docker-compose build
 docker-compose up -d
```
Send requests with siege to FastAPI app
```
 bash request-script.sh
 bash trace.sh
```
Check the predefined dashboard FastAPI Observability on Grafana at http://localhost:3000/

Dashboard screenshot:

The dashboard is also available on Grafana Dashboards.

Explore with Grafana

Metrics to Traces

Get Trace ID from an exemplar in metrics, then query in Tempo.

Query: histogram_quantile(.99,sum(rate(fastapi_requests_duration_seconds_bucket{app_name="app-a", path!="/metrics"}[1m])) by(path, le))

Metrics to Traces
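
To make the query above more concrete, here is a minimal sketch of the idea behind histogram_quantile: find the bucket that contains the requested rank, then interpolate linearly inside it. The function name estimate_quantile and the sample bucket counts are assumptions for illustration. The sketch works on raw cumulative counts rather than on per-second rates, and it simplifies some edge cases that Prometheus handles.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def estimate_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative buckets sorted by upper bound.

    The last bucket is expected to be the +Inf bucket holding the total count.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be within [0, 1]")
    if not buckets or buckets[-1][1] == 0:
        return math.nan

    rank = q * buckets[-1][1]           # position of the quantile in the data
    lower_bound, lower_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for index, (upper_bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls into +Inf: report the highest finite bound.
                return buckets[index - 1][0] if index > 0 else math.nan
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Linear interpolation between the bucket's bounds.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical request-duration buckets for a single path.
latency_buckets = [(0.1, 50.0), (0.5, 90.0), (1.0, 99.0), (math.inf, 100.0)]
p75 = estimate_quantile(0.75, latency_buckets)
print(f"p75 is about {p75:.3f}s")
```

Because the result is interpolated, its accuracy depends on how well the bucket boundaries match the real latency distribution.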

Traces to Logs

Get the Trace ID and tags (here compose_service) defined in the Tempo data source from a span, then query with Loki.

Traces to Logs

Logs to Traces

Get the Trace ID parsed from a log line (regex defined in the Loki data source), then query in Tempo.

Logs to Traces

FastAPI Application

For a more complex scenario, we use three FastAPI applications with the same code in this project. There is a cross-service action in /chain endpoint, which provides a good example of how to use OpenTelemetry SDK and how Grafana presents trace information.

Traces and Logs

We use the OpenTelemetry Python SDK to send trace info to Tempo over gRPC. Each request span contains other child spans when using OpenTelemetry instrumentation, because the instrumentation catches each internal ASGI interaction (opentelemetry-python-contrib issue #831). If you want to get rid of the internal spans, the same issue describes a workaround: a new OpenTelemetry middleware with two overridden span-processing methods.

We use OpenTelemetry Logging Instrumentation to replace the default logger format with one that includes the trace id and span id.

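# fastapi_app/utils.py

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.logging import LoggingInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from starlette.types import ASGIApp

def setting_otlp(app: ASGIApp, app_name: str, endpoint: str, log_correlation: bool = True) -> None: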
    # Setting OpenTelemetry
    # set the service name to show in traces
    resource = Resource.create(attributes={
        "service.name": app_name, # for Tempo to distinguish source
        "compose_service": app_name # as a query criteria for Trace to logs
    })

    # set the tracer provider
    tracer = TracerProvider(resource=resource)
    trace.set_tracer_provider(tracer)

    tracer.add_span_processor(BatchSpanProcessor(
        OTLPSpanExporter(endpoint=endpoint)))

    if log_correlation:
        LoggingInstrumentor().instrument(set_logging_format=True)

    FastAPIInstrumentor.instrument_app(app, tracer_provider=tracer)

The following image shows the span info sent to Tempo and queried on Grafana. The trace span info provided by FastAPIInstrumentor includes the trace ID (17785b4c3d530b832fb28ede767c672c), span ID (d410eb45cc61f442), service name (app-a), custom attributes (service.name=app-a, compose_service=app-a), and so on.

Span Information

The log format with trace id and span id, as set by LoggingInstrumentor:

%(asctime)s %(levelname)s [%(name)s] [%(filename)s:%(lineno)d] [trace_id=%(otelTraceID)s span_id=%(otelSpanID)s resource.service.name=%(otelServiceName)s] - %(message)s

The following image is what the logs look like.

Log With Trace ID And Span ID
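
As a rough illustration of what this format produces, the sketch below renders one log line with the standard logging module only. FakeTraceContextFilter is a hypothetical stand-in for LoggingInstrumentor: it attaches fixed IDs (the ones from the span example above) to each record, whereas the real instrumentation reads them from the current span.

```python
import io
import logging

LOG_FORMAT = (
    "%(asctime)s %(levelname)s [%(name)s] [%(filename)s:%(lineno)d] "
    "[trace_id=%(otelTraceID)s span_id=%(otelSpanID)s "
    "resource.service.name=%(otelServiceName)s] - %(message)s"
)


class FakeTraceContextFilter(logging.Filter):
    """Hypothetical stand-in: adds fixed trace fields to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.otelTraceID = "17785b4c3d530b832fb28ede767c672c"
        record.otelSpanID = "d410eb45cc61f442"
        record.otelServiceName = "app-a"
        return True  # never drop the record, only enrich it


def render_example_log(message: str) -> str:
    """Format one INFO message with LOG_FORMAT and return the rendered line."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    handler.addFilter(FakeTraceContextFilter())

    logger = logging.getLogger("example.correlation")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the example output out of the root logger
    logger.addHandler(handler)
    try:
        logger.info(message)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue().strip()


line = render_example_log("Hello World")
print(line)
```

The `trace_id=...` part of the line is what the Loki derived field later matches to build a link to Tempo.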

Span Inject

If you want other services to use the same Trace ID, you have to use the inject function to add the current span information to the request headers. OpenTelemetry FastAPI instrumentation only takes care of the ASGI app's requests and responses; it does not affect other modules or actions such as sending HTTP requests to other servers or function calls.

# fastapi_app/main.py

import httpx
from fastapi import Response
from opentelemetry.propagate import inject

@app.get("/chain")
async def chain(response: Response):

    headers = {}
    inject(headers)  # inject trace info to header

    async with httpx.AsyncClient() as client:
        await client.get("http://localhost:8000/", headers=headers)
    async with httpx.AsyncClient() as client:
        await client.get(f"http://{TARGET_ONE_HOST}:8000/io_task", headers=headers,)
    async with httpx.AsyncClient() as client:
        await client.get(f"http://{TARGET_TWO_HOST}:8000/cpu_task", headers=headers,)

    return {"path": "/chain"}
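
With OpenTelemetry's default propagator, inject writes a W3C Trace Context traceparent header into the dict. The sketch below shows only the shape of that header using the standard library. The helpers build_traceparent and parse_traceparent are hypothetical names for illustration and are not part of the OpenTelemetry API.

```python
import re
from typing import Dict, Optional, Tuple

# version - trace id (32 hex) - parent span id (16 hex) - trace flags
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Render a version 00 traceparent header value."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


def parse_traceparent(headers: Dict[str, str]) -> Optional[Tuple[str, str, bool]]:
    """Return (trace_id, span_id, sampled), or None if the header is absent or malformed."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    _version, trace_id, span_id, flags = match.groups()
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


# What the calling service would send (IDs taken from the span example above).
headers: Dict[str, str] = {
    "traceparent": build_traceparent(
        0x17785B4C3D530B832FB28EDE767C672C, 0xD410EB45CC61F442
    )
}
print(headers["traceparent"])

# What the receiving service can recover from the incoming request.
print(parse_traceparent(headers))
```

The downstream service reads this header, so its spans join the same trace instead of starting a new one.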

Metrics

Use the Prometheus Python Client to generate OpenMetrics format metrics with exemplars and expose them on /metrics for Prometheus.

To add an exemplar to a metric, we retrieve the trace id from the current span and pass it as an exemplar dict when observing Histogram or Counter metrics.

# fastapi_app/utils.py

from opentelemetry import trace
from prometheus_client import Histogram

REQUESTS_PROCESSING_TIME = Histogram(
    "fastapi_requests_duration_seconds",
    "Histogram of requests processing time by path (in seconds)",
    ["method", "path", "app_name"],
)

# retrieve trace id for exemplar
span = trace.get_current_span()
trace_id = trace.format_trace_id(
      span.get_span_context().trace_id)

REQUESTS_PROCESSING_TIME.labels(method=method, path=path, app_name=self.app_name).observe(
      after_time - before_time, exemplar={'TraceID': trace_id}
)

Because exemplars are a new datatype proposed in OpenMetrics, /metrics has to use CONTENT_TYPE_LATEST and generate_latest from the prometheus_client.openmetrics.exposition module instead of the prometheus_client module. With the wrong generate_latest, the exemplars attached to Counter and Histogram will never show up; with the wrong CONTENT_TYPE_LATEST, the Prometheus scrape will fail.

# fastapi_app/utils.py

from prometheus_client import REGISTRY
from prometheus_client.openmetrics.exposition import CONTENT_TYPE_LATEST, generate_latest
from starlette.requests import Request
from starlette.responses import Response

def metrics(request: Request) -> Response:
    return Response(generate_latest(REGISTRY), headers={"Content-Type": CONTENT_TYPE_LATEST})

Metrics with exemplars

Metrics With Exemplars
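
For orientation, an exemplar in the OpenMetrics text format is appended to a sample after a `#` separator, as its own label set followed by the observed value. The sketch below builds such a line by hand. The function format_bucket_with_exemplar and the sample numbers are assumptions for illustration; the output of generate_latest may differ in details such as number formatting, label escaping, and an optional timestamp.

```python
from typing import Dict


def _format_labels(labels: Dict[str, str]) -> str:
    """Render labels as key="value" pairs (no escaping, for illustration only)."""
    return ",".join(f'{key}="{value}"' for key, value in labels.items())


def format_bucket_with_exemplar(
    name: str,
    labels: Dict[str, str],
    cumulative_count: int,
    exemplar_labels: Dict[str, str],
    exemplar_value: float,
) -> str:
    """Build one histogram bucket sample line that carries an exemplar."""
    sample = f"{name}_bucket{{{_format_labels(labels)}}} {cumulative_count}"
    exemplar = f"{{{_format_labels(exemplar_labels)}}} {exemplar_value}"
    return f"{sample} # {exemplar}"


exemplar_line = format_bucket_with_exemplar(
    "fastapi_requests_duration_seconds",
    {"method": "GET", "path": "/chain", "app_name": "app-a", "le": "0.1"},
    3,
    {"TraceID": "17785b4c3d530b832fb28ede767c672c"},
    0.054,
)
print(exemplar_line)
```

The TraceID label in the exemplar is the value Grafana uses to build the link from a metric data point to the trace in Tempo.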

Prometheus - Metrics

Collects metrics from applications.

Prometheus Config

Define all FastAPI applications metrics scrape jobs in etc/prometheus/prometheus.yml

...
scrape_configs:
  - job_name: 'app-a'
    scrape_interval: 5s
    static_configs:
      - targets: ['app-a:8000']
  - job_name: 'app-b'
    scrape_interval: 5s
    static_configs:
      - targets: ['app-b:8000']
  - job_name: 'app-c'
    scrape_interval: 5s
    static_configs:
      - targets: ['app-c:8000']

Grafana Data Source

Add an exemplar trace ID destination that uses the value of the TraceID label to create a Tempo link.

Grafana data source setting example:

Data Source of Prometheus: Exemplars

Grafana data sources config example:

name: Prometheus
type: prometheus
typeName: Prometheus
access: proxy
url: http://prometheus:9090
password: ''
user: ''
database: ''
basicAuth: false
isDefault: true
jsonData:
  exemplarTraceIdDestinations:
    - datasourceUid: tempo
      name: TraceID
  httpMethod: POST
readOnly: false
editable: true

Tempo - Traces

Receives spans from applications.

Grafana Data Source

Trace to logs setting:

Data source: target log source
Tags: keys of tags or process-level attributes from the trace; each becomes a log query criterion if the key exists in the trace
Map tag names: converts an existing key of a tag or process-level attribute from the trace to another key, which is then used as a log query criterion. Use this feature when the values of the trace tag and log label are identical but the keys are different.
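
The settings above can be pictured as a small query builder: keep the configured tags that exist on the span, optionally rename them, and filter by trace ID. The function build_log_query below is a hypothetical sketch of that logic, not Grafana's implementation, and the exact LogQL that Grafana generates may differ between versions.

```python
from typing import Dict, Iterable, Optional


def build_log_query(
    span_attributes: Dict[str, str],
    tags: Iterable[str],
    trace_id: Optional[str] = None,
    tag_mapping: Optional[Dict[str, str]] = None,
) -> str:
    """Build a LogQL-style query from span attributes.

    tags:        attribute keys allowed to become log label matchers
    trace_id:    if given, also filter log lines containing this trace ID
    tag_mapping: optional rename from trace attribute key to log label name
    """
    mapping = tag_mapping or {}
    matchers = []
    for key in tags:
        if key in span_attributes:  # only keys present on the span are used
            label = mapping.get(key, key)
            matchers.append(f'{label}="{span_attributes[key]}"')

    query = "{" + ", ".join(matchers) + "}"
    if trace_id:
        query += f' |= "{trace_id}"'
    return query


span = {"service.name": "app-a", "compose_service": "app-a"}
trace_id = "17785b4c3d530b832fb28ede767c672c"

# Same key on both sides, as in this project's configuration.
print(build_log_query(span, ["compose_service"], trace_id))

# Different keys on each side: map the trace attribute to a log label named "app".
print(build_log_query(span, ["service.name"], trace_id, {"service.name": "app"}))
```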

Grafana data source setting example:

Data Source of Tempo: Trace to logs

Grafana data sources config example:

name: Tempo
type: tempo
typeName: Tempo
access: proxy
url: http://tempo
password: ''
user: ''
database: ''
basicAuth: false
isDefault: false
jsonData:
  nodeGraph:
    enabled: true
  tracesToLogs:
    datasourceUid: loki
    filterBySpanID: false
    filterByTraceID: true
    mapTagNamesEnabled: false
    tags:
      - compose_service
readOnly: false
editable: true

Loki - Logs

Collects logs from all services with the Loki Docker Driver.

Loki Docker Driver

Use the YAML anchor and alias feature to set logging options for each service.
Set the Loki Docker Driver options:
1. loki-url: Loki service endpoint
2. loki-pipeline-stages: processes multiline log from FastAPI application with multiline and regex stages (reference)

x-logging: &default-logging # anchor(&): 'default-logging' defines a reusable chunk of configuration
  driver: loki
  options:
    loki-url: 'http://localhost:3100/api/prom/push'
    loki-pipeline-stages: |
      - multiline:
          firstline: '^\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}'
          max_wait_time: 3s
      - regex:
          expression: '^(?P<time>\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2},\d{3}) (?P<message>(?s:.*))$$'
# Use $$ (double-dollar sign) when your configuration needs a literal dollar sign.

version: "3.4"

services:
   foo:
      image: foo
      logging: *default-logging # alias(*): refer to 'default-logging' chunk
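
To show what the two pipeline stages are meant to do, the sketch below imitates them in plain Python: lines are grouped into one entry until the next line that starts with a timestamp, then the timestamp and message are split out. This is only an approximation of Loki's behaviour under stated assumptions: it ignores max_wait_time, writes the millisecond part as `,\d{3}`, uses a single `$` because the doubled `$$` is only Docker Compose escaping, and the sample log lines are made up.

```python
import re
from typing import Iterable, List

# A new log entry starts with a timestamp such as "2022-03-01 12:00:00".
FIRSTLINE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}")
ENTRY = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2},\d{3}) (?P<message>(?s:.*))$"
)


def group_multiline(lines: Iterable[str]) -> List[str]:
    """Merge continuation lines (e.g. tracebacks) into the entry they belong to."""
    entries: List[str] = []
    current: List[str] = []
    for line in lines:
        if FIRSTLINE.match(line) and current:
            entries.append("\n".join(current))  # a new entry begins, flush the old one
            current = []
        current.append(line)
    if current:
        entries.append("\n".join(current))
    return entries


raw_lines = [
    "2022-03-01 12:00:00,123 ERROR [app] Exception in ASGI application",
    "Traceback (most recent call last):",
    '  File "main.py", line 42, in error_test',
    "ValueError: value error",
    "2022-03-01 12:00:01,456 INFO [app] Hello World",
]

entries = group_multiline(raw_lines)
for entry in entries:
    match = ENTRY.match(entry)
    if match:
        print(match.group("time"), "|", match.group("message").splitlines()[0])
```

Without the multiline stage, each traceback line would be stored as a separate log entry, which makes exceptions hard to read and to correlate with a trace.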

Grafana Data Source

Add a TraceID derived field to extract the trace id and create a Tempo link from the trace id.

Grafana data source setting example:

Data Source of Loki: Derived fields

Grafana data source config example:

name: Loki
type: loki
typeName: Loki
access: proxy
url: http://loki:3100
password: ''
user: ''
database: ''
basicAuth: false
isDefault: false
jsonData:
  derivedFields:
    - datasourceUid: tempo
      matcherRegex: (?:trace_id)=(\w+)
      name: TraceID
      url: $${__value.raw}
      # Use $$ (double-dollar sign) when your configuration needs a literal dollar sign.
readOnly: false
editable: true
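
The matcherRegex can be tried out in isolation. The sketch below applies the same pattern to a made-up log line with Python's re module; extract_trace_id is a hypothetical helper, and Grafana performs the equivalent match itself when it renders the derived field.

```python
import re
from typing import Optional

# Same pattern as matcherRegex: capture the word characters after "trace_id=".
TRACE_ID_MATCHER = re.compile(r"(?:trace_id)=(\w+)")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the first trace ID found in the log line, or None if there is none."""
    match = TRACE_ID_MATCHER.search(log_line)
    return match.group(1) if match else None


sample_line = (
    "2022-03-01 12:00:00,123 INFO [main] [main.py:58] "
    "[trace_id=17785b4c3d530b832fb28ede767c672c span_id=d410eb45cc61f442 "
    "resource.service.name=app-a] - Hello World"
)
print(extract_trace_id(sample_line))
print(extract_trace_id("a line without any trace context"))
```

The captured value is what `${__value.raw}` refers to in the url setting, so it becomes the Tempo query for that log line.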

Grafana

Add Prometheus, Tempo, and Loki to the data source with the config file etc/grafana/datasource.yml.
Load a predefined dashboard with etc/dashboards.yaml and etc/dashboards/fastapi-observability.json.

# grafana in docker-compose.yaml
grafana:
   image: grafana/grafana:8.4.3
   volumes:
      - ./etc/grafana/:/etc/grafana/provisioning/datasources # data sources
      - ./etc/dashboards.yaml:/etc/grafana/provisioning/dashboards/dashboards.yaml # dashboard setting
      - ./etc/dashboards:/etc/grafana/dashboards # dashboard json files directory

Conclusion

In this post, we introduced Observability and showed how to enable it for a service. We focused only on logs, metrics, and traces with Prometheus and Grafana tools. If you prefer not to use Grafana, there are many alternative open-source solutions, such as OpenTelemetry, Jaeger with Service Performance Monitoring (SPM), or Elastic Observability.

Besides tools, there is much room for development in Observability itself. The concept of the three pillars of Observability was proposed in 2018, almost 4 years ago. After years of community discussion, the CNCF Technical Advisory Group gives more details about Observability in their whitepaper. They use Observability Signals instead of pillars to describe logs, metrics, and traces, and add two more signals: profiles and dumps. In the near future, we may see many more tools for profiles and dumps that are compatible with current systems.
