Airflow on Kubernetes, Part 1: Setup, Architecture, and the git-sync Sidecar

The moving pieces of Airflow 3, the git-sync sidecar pattern for DAG delivery, and the image-rebuild-on-dependency-change workflow with Helmsman.

#airflow #kubernetes #data-engineering

Most “Airflow on Kubernetes” tutorials stop at helm install. That gets you a running UI and nothing resembling a production deployment. This part lays out the actual moving pieces in Airflow 3, the DAG-delivery pattern we use (a git-sync sidecar), and the deployment workflow that keeps DAG edits cheap and dependency changes deliberate.

The components you’re actually running

Airflow 3 split the monolith into more processes than Airflow 2. On Kubernetes, the official chart gives you a deployment/statefulset per role:

  • API server: serves the React UI, the public REST API, and the new internal Execution API that workers call. (This replaced the Airflow 2 webserver.)
  • Scheduler: the brain. Creates DAG runs, decides which task instances are runnable, and hands them to the executor. The executor logic runs inside the scheduler process; there is no separate “executor pod.”
  • DAG processor: a standalone component in Airflow 3. It parses DAG files out of bundles and writes the serialized DAG into the database. The scheduler never imports your DAG code; it reads the serialized form. This is a security and stability boundary, and it’s new.
  • Triggerer: runs async deferrable operators/sensors so they don’t hog a worker slot while waiting.
  • Workers: only present for the CeleryExecutor (persistent pods). For KubernetesExecutor, “workers” are ephemeral per-task pods the scheduler launches directly.
  • Metadata database: Postgres (covered in Part 2).
  • PgBouncer: connection pooler, shipped as an optional pod in the chart.

Figure: Airflow 3 component architecture on Kubernetes: DAG processor, scheduler, API server, triggerer, executor, workers/task pods, metadata DB behind PgBouncer, and git-sync sidecars feeding the DAG processor and workers.

The single most important architectural change from Airflow 2: workers don’t connect to the metadata database. They reach the Execution API over HTTP and authenticate with a short-lived JWT. The scheduler, DAG processor, triggerer, and API server hold direct DB access; everything task-side is isolated. This is why you can, and should, network-policy the database off from worker pods entirely. See Workload Isolation.
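A minimal sketch of that isolation, assuming PgBouncer runs in-cluster and that pods carry `component` and `release` labels. The label values, the `airflow` release name, the policy name, and port 6543 are all assumptions to check against your own pods. It builds an ingress NetworkPolicy as a plain dict and prints JSON, which `kubectl apply -f -` accepts:

```python
import json

# Components assumed to need direct database access.
# Workers are deliberately absent from this allow-list.
DB_CLIENTS = ["scheduler", "api-server", "dag-processor", "triggerer"]


def db_ingress_policy(release: str = "airflow", namespace: str = "airflow",
                      port: int = 6543) -> dict:
    """Allow only DB_CLIENTS to open connections to the pooler pods."""
    allowed = [
        {"podSelector": {"matchLabels": {"component": c, "release": release}}}
        for c in DB_CLIENTS
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"{release}-db-clients-only", "namespace": namespace},
        "spec": {
            # Assumed labels for the PgBouncer pods this policy protects.
            "podSelector": {"matchLabels": {"component": "pgbouncer", "release": release}},
            "policyTypes": ["Ingress"],
            # Any pod not matched by `allowed` is denied once the policy selects the pooler.
            "ingress": [{"from": allowed, "ports": [{"protocol": "TCP", "port": port}]}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(db_ingress_policy(), indent=2))
```

The allow-list is not exhaustive: one-off jobs such as database migrations also need a path to the database and would have to be added. The sketch also assumes workers use no other store on the same Postgres; if a Celery result backend points at it, workers would still need access.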

Delivering DAGs: the git-sync sidecar

DAGs need to be on disk wherever code is parsed or executed: the DAG processor (to parse) and every worker (to run). Baking DAGs into the image means an image rebuild and a full rollout for every DAG edit. That’s miserable. The git-sync sidecar pattern fixes it.

In the chart, you enable git-sync and point it at your repo. The sidecar continuously pulls the repo into a shared volume that the main container reads as its dags/ folder (the LocalDagBundle):

# values.yaml
dags:
  gitSync:
    enabled: true
    repo: git@github.com:your-org/airflow-dags.git
    branch: main
    subPath: dags # the folder inside the repo that holds your DAGs
    period: 30s # how often to pull
    depth: 1
    sshKeySecret: airflow-git-ssh # SSH deploy key as a k8s secret
    # When using sshKeySecret you MUST also pin knownHosts:
    knownHosts: |
      github.com ssh-rsa AAAA...your-pinned-key...

A few things that are easy to get wrong here:

Pin knownHosts. If you use an SSH key, the chart docs are explicit: also set dags.gitSync.knownHosts, generated with ssh-keyscan -t rsa github.com and verified against GitHub’s published fingerprints. Skip it and you’re one MITM away from running arbitrary DAG code. (source)

The sidecar runs on every component that needs DAGs. Both the DAG processor and the Celery workers get their own git-sync sidecar and their own copy. They pull independently, so at any instant two workers can be on slightly different commits. Usually harmless; occasionally the source of “why did this task run old code?” Keep period short and be aware of the skew window. (Part 3 covers how Airflow 3’s DAG versioning interacts with this.)
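To get a feel for the skew window, here is a toy model rather than a description of how git-sync schedules itself: it assumes each sidecar pulls on a fixed timer with its own start offset and that a pull takes no time. The function names are ours.

```python
import math


def first_sync_after(push_time: float, period: float, offset: float) -> float:
    """Time of the first pull tick at or after push_time, for a sidecar
    whose ticks fire at offset, offset + period, offset + 2 * period, ..."""
    ticks_needed = max(0, math.ceil((push_time - offset) / period))
    return offset + ticks_needed * period


def skew_window(push_time: float, period: float, offsets: list[float]) -> float:
    """Seconds during which some pods have the new commit and others do not."""
    arrivals = [first_sync_after(push_time, period, o) for o in offsets]
    return max(arrivals) - min(arrivals)


if __name__ == "__main__":
    # Three pods started at different times, 30s period, commit pushed at t=12s.
    print(skew_window(push_time=12, period=30, offsets=[0, 10, 25]))
```

In this model every pod picks up the commit within one period of the push, so the window is always shorter than `period`. Real pulls take time, which only widens it.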

git-sync ≠ DAG bundle versioning. The LocalDagBundle that git-sync populates is a single mutable directory. Airflow’s stronger bundle versioning (pinning a DAG run to an immutable commit) is a property of the GitDagBundle, where Airflow itself does the checkout per version. More on that trade-off in Part 3. For now, just register that “git-sync into the dags folder” and “GitDagBundle” are two different mechanisms.
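To make the distinction concrete, here is a sketch of how the two bundle types could be declared through the `[dag_processor] dag_bundle_config_list` setting, which takes a JSON list. The bundle names, the mount path, and the connection id are hypothetical, and GitDagBundle assumes the git provider package is installed in your image.

```python
import json

# A directory that git-sync keeps overwriting in place.
local_bundle = {
    "name": "dags-folder",
    "classpath": "airflow.dag_processing.bundles.local.LocalDagBundle",
    "kwargs": {"path": "/opt/airflow/dags/repo/dags"},  # hypothetical mount path
}

# Airflow performs the checkout itself, so a run can be tied to a commit.
git_bundle = {
    "name": "airflow-dags",
    "classpath": "airflow.providers.git.bundles.git.GitDagBundle",
    "kwargs": {
        "tracking_ref": "main",
        "subdir": "dags",
        "git_conn_id": "airflow_dags_git",  # hypothetical Airflow connection id
    },
}


def bundle_config(bundles: list[dict]) -> str:
    """Serialize bundle definitions into the JSON string the setting expects."""
    return json.dumps(bundles)


if __name__ == "__main__":
    # Printed in environment-variable form for the DAG processor.
    print("AIRFLOW__DAG_PROCESSOR__DAG_BUNDLE_CONFIG_LIST=" + bundle_config([git_bundle]))
```

You would normally configure one or the other for a given repo, not both.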

Trying it locally on Kind

Before talking about CI and Helmsman, it’s worth showing the shortest path from zero to a running UI on a laptop. Kind (“Kubernetes in Docker”) is the friendliest local cluster for this; Minikube or a cloud cluster works the same way once you swap the first command.

# 1. Create a single-node cluster.
kind create cluster --name airflow

# 2. Create the namespace and add the Helm repo.
kubectl create namespace airflow
helm repo add apache-airflow https://airflow.apache.org
helm repo update

# 3. A minimal local values file. Leaves the embedded Postgres on
#    (toy-only, see Part 2) and enables git-sync against a public repo.
cat > values.local.yaml <<'YAML'
executor: CeleryExecutor
dags:
  gitSync:
    enabled: true
    repo: https://github.com/apache/airflow.git
    branch: main
    subPath: airflow-core/src/airflow/example_dags
    period: 30s
    depth: 1
postgresql:
  enabled: true     # embedded Postgres fine for Kind, never for prod
pgbouncer:
  enabled: false    # not needed at this scale
YAML

# 4. Install the chart.
helm install airflow apache-airflow/airflow \
  --namespace airflow \
  --values values.local.yaml \
  --wait --timeout 10m

# 5. Port-forward the API server and open http://localhost:8080
#    (default credentials: admin / admin).
kubectl port-forward -n airflow svc/airflow-api-server 8080:8080

A few notes on what just happened so the production sections below don’t feel disconnected:

  • The chart installed the same components listed above: scheduler, API server, DAG processor, triggerer, Celery workers, Redis, and an embedded Postgres. The architecture is identical to a cloud deployment; only the database and the scale differ.
  • dags.gitSync pointed at the Airflow repo’s example DAGs to give you something to click. Swap repo/subPath/sshKeySecret to point at your own bundle and you’re a step from real use.
  • The embedded Postgres is there so helm install works standalone. Never run anything you care about against it; Part 2 covers managed Postgres and PgBouncer.

To tear it down: helm uninstall airflow -n airflow && kind delete cluster --name airflow.

The deployment workflow: cheap DAGs, deliberate dependencies

Here’s the rule we operate by, and the reasoning:

  • DAG-only change → nothing to deploy. git-sync picks it up within one sync period.
  • Dependency change (pyproject.toml) → rebuild the image, then roll out with Helmsman.

Why split them? Because a dependency change means a new Python environment, which means new container images for the scheduler, DAG processor, triggerer, and workers. Rolling that out restarts pods, including workers. If we coupled DAG changes to deployments, every trivial DAG edit would churn worker pods and risk interrupting running tasks. By delivering DAGs out-of-band via git-sync, the only thing that ever forces a worker restart is a genuine environment change.
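The rule is simple enough to encode as a CI gate. The sketch below is one way to do it, under the assumption that the environment is fully described by three files; the file set and the function name are ours, so adjust them to your repo layout.

```python
from pathlib import PurePosixPath

# Files whose change implies a new Python environment (assumed repo layout).
ENVIRONMENT_FILES = {"pyproject.toml", "uv.lock", "Dockerfile"}


def needs_image_rebuild(changed_paths: list[str]) -> bool:
    """Return True when any changed file alters the image's environment."""
    return any(PurePosixPath(p).name in ENVIRONMENT_FILES for p in changed_paths)


if __name__ == "__main__":
    print(needs_image_rebuild(["dags/etl_orders.py"]))           # DAG-only edit
    print(needs_image_rebuild(["dags/etl_orders.py", "uv.lock"]))  # dependency change
```

A pipeline could feed it the output of `git diff --name-only` and skip the build and rollout stages whenever it returns False.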

Building the image

Extend the official image; don’t fight it. (Extending the Airflow image)

FROM apache/airflow:3.2.2-python3.12

# Bring uv in as a static binary, no Python runtime cost.
# For reproducible builds, pin a specific uv tag rather than :latest.
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/

# System packages, if you truly need them
USER root
RUN apt-get update && apt-get install -y --no-install-recommends \
    some-binary && rm -rf /var/lib/apt/lists/*
USER airflow

# Sync straight into the venv the base image already provisioned for Airflow,
# so `airflow` on PATH sees our deps.
ENV UV_PROJECT_ENVIRONMENT=/home/airflow/.local

WORKDIR /opt/airflow
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev --no-cache

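uv sync --frozen installs exactly what uv.lock says, refusing to re-resolve, so the Python environment in the image matches the lock that landed in review. Because uv sync is exact by default and removes packages that aren’t in the lock, declare apache-airflow itself in pyproject.toml (or pass --inexact) so the sync doesn’t uninstall what the base image provisioned. The constraints belong outside this file: when you regenerate the lock, pin against the published Airflow constraints at https://raw.githubusercontent.com/apache/airflow/constraints-3.2.2/constraints-3.12.txt (matched to your Airflow and Python versions). Be aware that uv’s build-constraint options only restrict the packages used to build source distributions; to constrain the runtime dependencies that uv lock resolves, list the pins under constraint-dependencies in the [tool.uv] table of pyproject.toml. Without that, a transitive dependency will eventually upgrade something Airflow pins and you’ll get an import error at parse time that’s painful to bisect. The Dockerfile stays simple because all the resolution decisions are already encoded in the lock.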
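One way to carry the published constraints into the lock, assuming you use uv’s `constraint-dependencies` setting under `[tool.uv]`: convert the constraints file into a TOML fragment and paste or merge it into pyproject.toml before running uv lock. The helper name is ours, and the sketch assumes every line is a plain `name==version` pin with no quotes or markers.

```python
def constraints_to_toml(constraints_text: str) -> str:
    """Turn a pip-style constraints file into a [tool.uv] constraint-dependencies block."""
    pins = []
    for raw in constraints_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if line:
            pins.append(line)
    body = "\n".join(f'    "{pin}",' for pin in pins)
    return f"[tool.uv]\nconstraint-dependencies = [\n{body}\n]\n"


if __name__ == "__main__":
    # Made-up pins for illustration; feed in the downloaded constraints file instead.
    sample = "# header comment\nexamplepkg==1.0.0\n\notherpkg==2.3.4  # trailing comment\n"
    print(constraints_to_toml(sample))
```

If pyproject.toml already has a `[tool.uv]` table, merge the `constraint-dependencies` key into it rather than appending a second table, since TOML rejects duplicate tables.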

Rolling it out with Helmsman

Helmsman lets you declare the desired release state in a DSF (desired state file) and reconcile it, instead of running imperative helm upgrade commands by hand. The CI flow:

  1. Dependency PR merges → CI builds and pushes your-registry/airflow:<git-sha>.
  2. CI updates the image tag in the Helmsman DSF (or a values file it references).
  3. helmsman --apply -f airflow.dsf reconciles: Helm computes the diff and rolls the changed deployments.

# airflow.dsf (excerpt)
[apps.airflow]
namespace = "airflow"
enabled   = true
chart     = "apache-airflow/airflow"
version   = "1.22.0"          # the CHART version, pinned
valuesFiles = ["values.yaml", "values.image.yaml"]

# values.image.yaml, the only file CI rewrites
images:
  airflow:
    repository: your-registry/airflow
    tag: "a1b2c3d" # the git sha CI just built
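
Step 2 of the CI flow can be a few lines of scripting. This is a sketch, not our actual pipeline code: it regenerates the whole file from a template instead of editing YAML in place, and the function name and sha validation are our own choices.

```python
import re

TEMPLATE = """\
# values.image.yaml, the only file CI rewrites
images:
  airflow:
    repository: {repository}
    tag: "{tag}" # the git sha CI just built
"""


def render_image_values(repository: str, git_sha: str) -> str:
    """Render values.image.yaml for a freshly built image tag."""
    # Refuse anything that is not a short or full lowercase hex sha.
    if not re.fullmatch(r"[0-9a-f]{7,40}", git_sha):
        raise ValueError(f"not a git sha: {git_sha!r}")
    return TEMPLATE.format(repository=repository, tag=git_sha)


if __name__ == "__main__":
    print(render_image_values("your-registry/airflow", "a1b2c3d"))
```

CI would write the returned text to values.image.yaml and commit it before running Helmsman. Quoting the tag matters: an all-digit sha would otherwise be parsed by YAML as a number.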

Two operational wins from this split: the chart version is pinned and reviewed like code, and the only thing that moves on a routine deploy is the image tag. Diffs are trivial to read in review.

Make pods restart only when they must

The chart signs internal tokens and session cookies with secret keys. If those keys are generated dynamically, every helm upgrade rotates them and restarts your components unnecessarily, sometimes failing in-flight log fetches or DAG runs. Set them statically:

# Generate once: python3 -c 'import secrets; print(secrets.token_hex(16))'
apiSecretKeySecretName: airflow-api-secret # signs API sessions
jwtSecretName: airflow-jwt-secret # signs worker JWTs
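
A sketch for creating those two Secrets once, outside the deploy path. The key names inside each Secret (`api-secret-key`, `jwt-secret`) are assumptions; confirm them against the chart’s documentation before relying on this. The output is a single `List` object that `kubectl apply -f -` accepts.

```python
import base64
import json
import secrets


def secret_manifest(name: str, key: str, namespace: str = "airflow") -> dict:
    """Build a Kubernetes Secret holding one freshly generated 16-byte hex token."""
    token = secrets.token_hex(16)
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # Values under `data` must be base64-encoded.
        "data": {key: base64.b64encode(token.encode()).decode()},
    }


if __name__ == "__main__":
    items = [
        secret_manifest("airflow-api-secret", "api-secret-key"),  # assumed key name
        secret_manifest("airflow-jwt-secret", "jwt-secret"),      # assumed key name
    ]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

Run it once and keep the result; regenerating on every deploy would rotate the keys, which is the problem this section is trying to avoid.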

The docs call this out directly: a static API secret key “will help ensure your Airflow components only restart when necessary,” and for the JWT secret, “consider creating a custom JWT Secret rollover procedure which will not cause failures in dag runs due to mismatch in tokens.” (production guide) Treat these as the difference between a deploy that’s a no-op for running tasks and one that quietly kills them.

Pitfalls to avoid on day one

The embedded Postgres is a toy. The chart ships a Postgres pod so helm install works standalone. The docs are blunt that “you might experience data loss when you are using it.” Disable it and point at managed Postgres (RDS, Cloud SQL) before you put anything real on it. See Part 2.

postgresql:
  enabled: false
data:
  metadataSecretName: airflow-metadata-db # your external DB, as a k8s secret

safeToEvict on Kubernetes workers must stay false. When you run the cluster-autoscaler, evicting a worker pod mid-task kills the task. The chart defaults workers.kubernetes.safeToEvict to false for exactly this reason; don’t flip it. (production guide)

load_examples defaults to on outside the official image. The example DAGs clutter a production UI and parse on every loop. The official Docker image already sets AIRFLOW__CORE__LOAD_EXAMPLES=False, but confirm it: [core] load_examples defaults to True in raw config. (config ref)

Don’t bake credentials into values.yaml. Database creds, the Fernet key, and the API/JWT secrets all belong in pre-created Kubernetes Secrets referenced by name, not inline in the values file. The chart supports *SecretName parameters for every one of them.


Next: Part 2: the metadata database and PgBouncer, where we get into why Airflow is so hard on Postgres connection counts and how to actually size the pool.