Airflow on Kubernetes, Part 4: XCom on Object Storage, Secrets, and Slack Alerts

Moving XCom payloads to object storage instead of bloating the metadata DB, managing secrets and connections the isolated-worker way, and wiring up Slack notifications for failing DAGs.

#airflow #kubernetes #data-engineering

The last part is about the data that flows around your tasks: the XCom values they pass to each other, the secrets and connections they need, and the notification that tells you when one of them dies at 3am. Each has a Kubernetes-specific wrinkle, and the first one — XCom — is a database problem in disguise.

XCom: get the payloads out of the database

XCom (“cross-communication”) is how tasks pass small values to each other. By default those values are serialized into the metadata database via the BaseXCom backend. That’s fine for a return code or an S3 path; it’s a problem the moment people start passing anything chunky. The docs warn directly: XComs “are only designed for small amounts of data; do not use them to pass around large values, like dataframes.” (XComs)

Every oversized XCom is a row in the DB you covered in Part 2: it bloats the table, slows the queries the scheduler runs constantly, and drags out backups. The fix is the Object Storage XCom backend, which moves large XCom payloads out of the database and into S3/GCS/Azure Blob, leaving only a reference behind.

Configuring it

It ships in the apache-airflow-providers-common-io provider (add it to your pyproject.toml, which per Part 1 means an image rebuild). Then point Airflow at it in the config (Object Storage XCom Backend):

[core]
xcom_backend = airflow.providers.common.io.xcom.backend.XComObjectStorageBackend

[common.io]
# The user part of the URL is the Airflow connection id used to reach the bucket
xcom_objectstorage_path = s3://aws_default@my-airflow-xcom/xcoms
# Bytes. Anything smaller stays in the DB; anything larger goes to object storage.
# 1048576 bytes = 1 MB. Comments are kept on their own lines so they are never read as part of a value.
xcom_objectstorage_threshold = 1048576
# Optional
xcom_objectstorage_compression = gzip

The mechanics are the nice part. It’s hybrid, not all-or-nothing: values under xcom_objectstorage_threshold bytes stay in the database (fast, no round-trip), and only values above it get written to object storage with a reference row saved in the DB. Set the threshold to something like 1 MB and you keep small XComs snappy while guaranteeing a big one can never bloat the database. compression (gzip, zip, snappy — snappy needs python-snappy installed) shrinks what you store. (source)
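The hybrid rule can be sketched in a few lines of plain Python. This is not the provider's code: the function name storage_target is hypothetical, and measuring size on the UTF-8 encoded JSON form (and sending a value of exactly the threshold size to object storage) are assumptions made for illustration.

```python
import json

# Mirrors xcom_objectstorage_threshold from the config above (bytes).
THRESHOLD_BYTES = 1048576


def storage_target(value, threshold=THRESHOLD_BYTES):
    """Say where a JSON-serializable value would land under the hybrid rule.

    Assumption: size is measured on the UTF-8 encoded JSON form of the value.
    """
    size = len(json.dumps(value).encode("utf-8"))
    if size < threshold:
        return "database"
    return "object_storage"


small = {"path": "s3://bucket/key.parquet"}   # a few dozen bytes
large = {"rows": ["x" * 100] * 20000}         # about 2 MB once serialized

print(storage_target(small))
print(storage_target(large))
```

The point of the sketch is that the decision is made per value at write time, so a single DAG can have some XComs in the database and others in the bucket.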

A few practical notes:

  • The connection id comes from the user part of the path (s3://CONN_ID@bucket/key), so you don’t pass credentials in the config — you reference an Airflow connection (see secrets below).

  • On Kubernetes, verify the backend actually loaded inside the container — environment drift makes this easy to get wrong. Exec into a worker and print the active class (XComs):

    from airflow.sdk.execution_time.xcom import XCom
    print(XCom.__name__)   # should NOT be BaseXCom

  • This does not change how your DAGs are written: xcom_push / xcom_pull and TaskFlow returns work exactly the same. It’s a storage swap, transparent to task code.

  • One behavioural reminder unaffected by the backend: if a task is retried, its XComs are cleared so the retry is idempotent — don’t lean on XCom to carry state across retries. (XComs)
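To make the first note concrete, here is how the parts of the path break down, using only the standard library. Airflow does its own parsing of this value; the helper split_xcom_path is hypothetical and only illustrates which piece of the URL plays which role.

```python
from urllib.parse import urlsplit


def split_xcom_path(path):
    """Break an xcom_objectstorage_path-style URL into its roles."""
    parts = urlsplit(path)
    return {
        "scheme": parts.scheme,             # which object store: s3, gs, ...
        "conn_id": parts.username,          # Airflow connection id (None if absent)
        "bucket": parts.hostname,           # bucket or container name
        "prefix": parts.path.lstrip("/"),   # key prefix inside the bucket
    }


info = split_xcom_path("s3://aws_default@my-airflow-xcom/xcoms")
print(info)
```

No credential appears anywhere in the URL: the user part is only a name that Airflow looks up through its connection machinery.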

Rule of thumb: enable the object-storage backend on day one with a 1 MB threshold. It costs nothing for small XComs and saves you from the slow-motion database bloat that’s miserable to unwind later.

Secrets and connections

Airflow resolves connections and variables through a chain of backends, in order, stopping at the first hit:

  1. A configured external secrets backend, if you have one: Vault, AWS Secrets Manager, GCP Secret Manager, etc. When enabled, it is searched first.
  2. Environment variables: AIRFLOW_CONN_<ID> for connections, AIRFLOW_VAR_<KEY> for variables. (env var ref)
  3. The metadata database: connections you create in the UI/CLI, encrypted with the Fernet key.
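The "stop at the first hit" behaviour is easy to model. The sketch below is a simplified stand-in, not Airflow's implementation: resolve, env_lookup, and the fake dictionaries are all hypothetical, and only two sources are wired in to keep it short. An external backend would be one more entry in the same list.

```python
def env_lookup(environ):
    """Build a lookup that maps a connection id to AIRFLOW_CONN_<ID>."""
    def lookup(conn_id):
        return environ.get(f"AIRFLOW_CONN_{conn_id.upper()}")
    return lookup


def resolve(conn_id, chain):
    """Walk the chain in order and return (source_name, value) for the first hit."""
    for name, lookup in chain:
        value = lookup(conn_id)
        if value is not None:
            return name, value
    raise KeyError(f"connection {conn_id!r} not found in any backend")


# Fake stores standing in for the real environment and metadata DB.
fake_env = {"AIRFLOW_CONN_MY_PROD_DB": "postgres://from-env"}
fake_db = {
    "my_prod_db": "postgres://from-db",
    "reporting": "postgres://reporting-db",
}

chain = [("env", env_lookup(fake_env)), ("metadata_db", fake_db.get)]

print(resolve("my_prod_db", chain))  # defined in both; the earlier source wins
print(resolve("reporting", chain))   # only the later source has it
```

The practical consequence is shadowing: a connection defined in an earlier source silently hides a same-named one further down the chain, which is worth remembering when a UI edit appears to have no effect.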

The Airflow 3 isolation model makes where this runs matter. Workers never use the database-backed MetastoreBackend; they resolve secrets via environment variables and the Execution API (which proxies to the server’s backends). The server components (scheduler, API server) use env vars then the metastore directly. External backends like Vault and AWS Secrets Manager work in all contexts — worker, supervisor, and server. (release notes 3.1.3) Net effect: an external secrets backend is the clean choice on Kubernetes, because it works identically everywhere and keeps secrets out of both the database and your DAG code.

The options, ranked for a Kubernetes deployment

External secrets backend (recommended). Point [secrets] backend at your provider and store connections/variables there. Tasks fetch them by id at runtime; nothing sensitive lives in the DB or in git.

[secrets]
backend = airflow.providers.hashicorp.secrets.vault.VaultBackend
backend_kwargs = {"connections_path": "airflow/connections", "variables_path": "airflow/variables", "url": "https://vault.internal:8200"}

Kubernetes Secrets → environment variables. The Helm chart already turns its built-in secrets (DB connection, Fernet key, API/JWT keys) into env vars, and you can add your own connections the same way via extraEnv. The chart also supports _CMD and _SECRET variants if you want Airflow to fetch a value at runtime rather than have it sit in the env — but note that if the plain <VAR> is set it takes precedence, so to use a _CMD/_SECRET variant you must disable the built-in via enableBuiltInSecretEnvVars.<VAR>: false. (production guide)

extraEnv: |-
  - name: AIRFLOW_CONN_MY_PROD_DB
    valueFrom:
      secretKeyRef:
        name: my-db-conn
        key: connection_uri
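Whatever you store under that Secret key has to be a value Airflow can parse as a connection. When using the URI form, reserved characters in the login, password, or schema need to be percent-encoded. A minimal sketch with the standard library follows; the helper build_conn_uri and the credentials are hypothetical.

```python
from urllib.parse import quote, unquote, urlsplit


def build_conn_uri(conn_type, login, password, host, port, schema):
    """Assemble a connection URI, percent-encoding parts that may hold reserved characters."""
    return (
        f"{conn_type}://{quote(login, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(schema, safe='')}"
    )


# Hypothetical credentials, for illustration only.
uri = build_conn_uri("postgres", "etl_user", "p@ss/w:rd", "db.internal", 5432, "analytics")
print(uri)

# Round trip: decoding the password component gives back the original secret.
assert unquote(urlsplit(uri).password) == "p@ss/w:rd"
```

Without the encoding step, the "@" and "/" in a password like this one would be read as URI delimiters and the connection would point at the wrong host.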

Database connections (UI/CLI). Still fine for low-sensitivity connections; passwords are Fernet-encrypted at rest. Just know that with worker isolation these resolve via the Execution API, and that storing everything in the DB couples your secrets to your metadata DB’s backup/restore lifecycle. For a Kubernetes deployment I’d reserve this for convenience connections, not crown jewels.

Whichever you choose: never put the Fernet key or DB credentials inline in values.yaml. Reference pre-created Kubernetes Secrets by name (fernetKeySecretName, data.metadataSecretName, etc.). The Fernet key in particular is what decrypts every DB-stored password — lose it and you re-enter every connection; leak it and they’re all readable. (connections)

Slack alerts for failing DAGs

You don’t want to discover a failed DAG by someone asking where their data is. Wire failures to Slack with the Slack provider’s notifier.

Add apache-airflow-providers-slack to pyproject.toml (image rebuild per Part 1), create a Slack connection (an API token, ideally stored in your secrets backend above), and attach send_slack_notification to a callback. It works at both DAG level and task level, on success or failure (Slack notifications how-to):

from datetime import datetime
from airflow.sdk import DAG
from airflow.providers.standard.operators.bash import BashOperator
from airflow.providers.slack.notifications.slack import send_slack_notification

slack_failed = send_slack_notification(
    slack_conn_id="slack_default",
    text=(
        ":red_circle: *DAG failed* — `{{ dag.dag_id }}`\n"
        "Run: `{{ run_id }}`\n"
        "Task: `{{ ti.task_id }}`\n"
        "<{{ ti.log_url }}|View logs>"
    ),
    channel="#airflow-alerts",
    username="Airflow",
)

with DAG(
    dag_id="etl_pipeline",
    schedule="@daily",
    start_date=datetime(2025, 1, 1),
    catchup=False,
    on_failure_callback=[slack_failed],   # fires when the DAG run fails
):
    BashOperator(
        task_id="extract",
        bash_command="run_extract.sh",
        on_failure_callback=[slack_failed],  # also alert per-task if you want granularity
    )

The message body is Jinja-templated, so you have the full task context — dag_id, run_id, ti.task_id, and ti.log_url for a one-click jump to the failing task’s logs. That log link is the difference between an alert that’s actionable and one that just makes you anxious.

A few tips that make Slack alerts survive contact with reality:

  • Put the alert at the DAG level (on_failure_callback on the DAG) so you get exactly one message per failed run, then add task-level callbacks only on the few tasks where you want immediate, granular paging.
  • Store the Slack token in your secrets backend, not in a DAG or in values.yaml — it’s a credential like any other.
  • The same send_slack_notification works for on_success_callback and SLA/deadline callbacks. Resist alerting on success for everything; alert fatigue is how real failures get ignored.
  • For richer routing (severity, on-call escalation) the notifier is just the delivery mechanism — pair it with SlackWebhookHook or a dedicated alerting tool if you outgrow a single channel.

That closes the series. You now have the full picture: the component architecture and git-sync deploy workflow, a database that won’t fall over on connection counts, the executor and lifecycle model including what happens to code mid-run, and the data plumbing above: XCom off the DB, secrets done the isolated-worker way, and a Slack hook so you hear about failures before your users do.