Airflow on Kubernetes, Part 2: The Metadata Database and PgBouncer
Why Airflow opens so many connections, how to size the SQLAlchemy pool, and why PgBouncer in transaction mode is non-negotiable for Postgres.
The metadata database is the one stateful thing Airflow cannot run without, and on Kubernetes it’s the component most likely to fall over under load, not because the queries are heavy, but because of how many connections Airflow opens. This part explains where those connections come from, how to size the SQLAlchemy pool, and why PgBouncer is mandatory rather than optional for any serious Postgres deployment.
Why Airflow hammers Postgres connection counts
PgBouncer earned its place in this stack for one concrete reason: Airflow opens a connection per task, and a busy cluster opens enough of them to take an RDS instance down. Postgres uses a process-per-connection model and every open connection is a separate backend process with its own memory so it’s the number of connections, not the weight of the queries, that saturates the database first.
The Airflow docs say it directly:
“Airflow is known especially in high-performance setup to open many connections to metadata database. This might cause problems for Postgres resource usage, because in Postgres, each connection creates a new process and it makes Postgres resource-hungry when a lot of connections are opened. Therefore we recommend to use PGBouncer as database proxy for all Postgres production installations.” Set up a Database Backend
And the Helm chart’s production guide makes the same point:
“Due to distributed nature of Airflow, it can open a lot of database connections. Using a connection pooler can significantly reduce the number of open connections on the database.” - Helm production guide → PgBouncer
That’s the entire motivation for the pooler: multiplex a large, spiky number of client connections — one per running task, plus every server component — onto a small, fixed set of real Postgres backends. With Celery, every running task talks to the database, so your peak connection count tracks your peak task concurrency. Scale concurrency up and RDS marches toward max_connections and starts refusing connections — the whole instance gets overwhelmed by connection count long before CPU or IO is the bottleneck. PgBouncer is how you decouple the two: collapse ~100 client connections onto ~10 real backends so concurrency stops being a database-killer.
Version note. This per-task connection pressure is most acute on Airflow 2, where workers and tasks open metadata-DB connections directly. Airflow 3 isolates workers behind the Execution API, so task pods no longer open their own metadata connections — but the server components (scheduler, DAG processor, triggerer, API server) still each hold a pool, and the Celery result backend still needs connections, so PgBouncer remains the right call. If you sized PgBouncer on Airflow 2 to survive task concurrency, expect that pressure to drop after upgrading to 3.x.
Two databases, two pools
A Celery-based deployment actually puts two distinct kinds of connection on Postgres, and the chart pools them separately:
- The metadata database — task and DAG state, the scheduler’s bookkeeping, connections/variables, and XCom rows. This is
sql_alchemy_conn, pooled bypgbouncer.metadataPoolSize(default 10). This is the connection that “manages the tasks.” - The Celery result backend — where workers write the result/return state of each executed task so the scheduler can collect the outcome. This is
[celery] result_backend, pooled bypgbouncer.resultBackendPoolSize(default 5). This is the connection that carries task output back. (config ref → celery, helm production guide)
One precision worth stating, since it’s easy to conflate: the Celery result backend holds Celery’s per-task execution result, not your XCom values. XCom lives in the metadata DB by default (or in object storage once you configure that backend — see Part 4). In most deployments both backends are the same Postgres server, which is exactly why a single PgBouncer in front of it — with one pool for the metadata DB and one for the result backend — is the clean solution.
pgbouncer:
enabled: true
metadataPoolSize: 10 # pool for the metadata DB (sql_alchemy_conn) — task/DAG state
resultBackendPoolSize: 5 # pool for the Celery result backend — task results/output
Where do the server-side connections come from? Every server component holds its own SQLAlchemy pool: each scheduler replica, each DAG processor, the triggerer, and the API server. Multiply pool size by replica count and you can saturate a small Postgres’ max_connections before a single task runs — on top of the per-task connections above. That arithmetic is next.
Sizing the SQLAlchemy pool
Each component’s pool is governed by these [database] settings (config ref):
| Setting | Default | What it controls |
|---|---|---|
sql_alchemy_pool_size | 5 | Persistent connections kept open per component |
sql_alchemy_max_overflow | 10 | Extra connections allowed above pool size under burst (discarded when returned) |
sql_alchemy_pool_recycle | 1800 | Seconds before an idle connection is recycled |
sql_alchemy_pool_pre_ping | True | SELECT 1 on checkout to detect dead connections |
So a single component can hold up to pool_size + max_overflow = 15 connections under burst. The arithmetic that matters:
worst-case connections ≈ Σ over components of (pool_size + max_overflow) × replicas
Two schedulers + two API servers + one DAG processor + one triggerer at defaults is 6 × 15 = 90 connections before headroom for migrations jobs, db clean, and your own monitoring. That’s already past the default max_connections = 100 on many managed Postgres tiers. You have two levers: shrink the pools, or put a pooler in front. Do both.
pool_pre_ping = True is worth keeping on — it costs a trivial round-trip but saves you from the stale-connection errors that managed databases love to throw after idle periods.
Managed Postgres: also set keepalives (a separate gotcha)
This one is unrelated to connection counts — it’s about idle connections being dropped, and it’s a distinct issue from why you need PgBouncer. RDS, Cloud SQL and Azure Postgres silently close connections that sit idle (typically after ~300s), and Airflow then trips over the dead socket: psycopg2.OperationalError: SSL SYSCALL error: EOF detected. The Airflow docs call this out specifically for managed Postgres:
“For managed Postgres such as Azure Postgresql, CloudSQL, Amazon RDS, you should use
keepalives_idlein the connection parameters and set it to less than the idle time because those services will close idle connections after some time of inactivity (typically 300 seconds), which results with error …psycopg2.operationalerror: SSL SYSCALL error: EOF detected.” — Set up a Database Backend
The fix is TCP keepalives via sql_alchemy_connect_args, pointed at a dict in your local settings (same source):
# airflow_local_settings.py
keepalive_kwargs = {
"keepalives": 1,
"keepalives_idle": 30,
"keepalives_interval": 5,
"keepalives_count": 5,
}
# airflow.cfg / env
[database]
sql_alchemy_connect_args = airflow_local_settings.keepalive_kwargs
PgBouncer: not optional for Postgres
A connection pooler sits between Airflow and Postgres, multiplexing many client connections onto a small set of real server connections. The Airflow docs recommend PgBouncer “for all Postgres production installations,” and the chart ships a pre-configured instance behind a boolean flag.
# values.yaml
pgbouncer:
enabled: true
maxClientConn: 100 # max client conns PgBouncer accepts
metadataPoolSize: 10 # real server conns to the metadata DB
resultBackendPoolSize: 5 # real server conns to the Celery result backend
The shape that makes this work is transaction pooling. The chart’s PgBouncer config uses pool_mode = transaction on listen_port = 6543 (production guide):
[pgbouncer]
pool_mode = transaction
listen_port = 6543
auth_type = scram-sha-256
max_client_conn = 100
ignore_startup_parameters = extra_float_digits
In transaction mode a server connection is assigned to a client only for the duration of a transaction, then returned to the pool. That’s what lets metadataPoolSize: 10 serve a hundred Airflow client connections — the real backends are released between transactions instead of being held idle. The effect: your 90 worst-case client connections collapse onto ~10–15 real Postgres backends.
When you enable PgBouncer, Airflow components connect to PgBouncer’s service and port (6543), not to Postgres directly. If you store DB creds in a Kubernetes Secret rather than the values file, the connection string has to target PgBouncer:
kubectl create secret generic airflow-metadata \
--from-literal=connection=postgresql://user:pass@RELEASE-pgbouncer.airflow:6543/airflow-metadata
Scheduler ×2 ─┐
API server ×2 ┤
DAG processor ┼──▶ PgBouncer (transaction mode, :6543) ──▶ Postgres
Triggerer ─┘ ~10–15 real conns
Two gotchas with transaction pooling
extra_float_digits must be ignored. Postgres clients set session parameters on connect; in transaction mode those don’t survive across the shared backends. The chart already adds ignore_startup_parameters = extra_float_digits, which is why it works out of the box — but if you hand-roll PgBouncer, omitting it produces cryptic connection failures.
Prepared statements and session state don’t carry across transactions. Transaction pooling is incompatible with anything that relies on session-scoped state surviving between transactions. Airflow core is built to tolerate this; just don’t point your own session-stateful tooling at the 6543 port expecting session semantics.
Running more than one scheduler (and what the DB needs)
Scheduler HA in Airflow relies entirely on the database — no Raft, no Zookeeper. The schedulers coordinate through row-level locks (SELECT ... FOR UPDATE) so that only one scheduler is in the “critical section” (where task instances move from scheduled to queued) at a time, respecting pool and concurrency limits. (Scheduler HA)
This needs a database that implements SKIP LOCKED / NOWAIT:
- PostgreSQL 12+ and MySQL 8.0+ — fully supported, run as many schedulers as you like, no extra config.
- MariaDB — didn’t implement those clauses until 10.6; multiple schedulers have produced deadlocks. Avoid for HA.
- SQL Server — untested with HA.
- Keep
[scheduler] use_row_level_locking = True(the default). If you set it toFalse, you must run exactly one scheduler.
Practically: use Postgres 12+ or MySQL 8+, and the HA story is free.
Keep the database from growing forever
The metadata DB accretes task instance history, logs metadata, XCom rows, and serialized DAG versions. Left alone it grows without bound and queries slow down. The chart can schedule airflow db clean as a CronJob:
databaseCleanup:
enabled: true
retentionDays: 90
This matters more once you’re versioning DAGs and storing XCom references (Part 4) — both add rows. Decide a retention window early; reclaiming space from a bloated Airflow DB later is no fun.
Next: Part 3: executors and the execution lifecycle, including how a worker actually receives the task it runs and what happens when code changes mid-run.