Low severity · connectivity

PostgreSQL Error: 08003

Impact

When 08003 appears more than once per hour or recurs after a server maintenance window, your pipeline is running without connection validation and will cascade: dbt runs abort mid-graph leaving downstream models stale, ADF copy activities fail and retry from the beginning wasting compute, and any downstream dashboard or alert built on those tables serves yesterday's data.

A single 08003 failure aborts the entire active transaction — no partial writes are committed, so the target table is not corrupted, but the pipeline run fails completely. In dbt this means all downstream models depending on the failed model are skipped and show as errored in the run results. In ADF, the copy activity fails and the pipeline moves to its error handler; if no retry policy is configured the downstream datasets are not refreshed, causing stale data for the full refresh interval. Where SLAs require fresh data by a specific time, a single uncaught 08003 at the start of a run can miss the window entirely.

What does this error mean?

SQLSTATE 08003 (`connection_does_not_exist`) means the connection handle your application or driver references is no longer backed by a live session: the server, or something on the network path, closed it before the query was sent. In a data pipeline this typically surfaces mid-transaction: an ADF copy activity, dbt model run, or SQLAlchemy session tries to execute a statement, the driver hands it a pooled socket that the server already killed, and the statement fails with this error before any data is read or written. The symptom is an abrupt failure with no partial result, often wrapped in a higher-level message like 'connection closed unexpectedly' from your orchestration layer.

Common causes

1. **`idle_in_transaction_session_timeout` fired mid-pipeline.** PostgreSQL closes sessions that sit inside an open transaction without activity for longer than the configured threshold. Long-running dbt models or ADF pipelines that open a transaction, pause for upstream processing, then resume will hit this when the pause exceeds the timeout, which is typically 30 s–5 min in managed services like Azure Database for PostgreSQL Flexible Server.
2. **Connection pool returned a stale socket after a server restart or failover.** When the database server restarts (patch, failover, maintenance window), existing TCP connections are torn down. The pool does not know this until it tries to use one. The next query that gets that dead socket receives 08003 immediately.
3. **Application held a connection open across a slow external call.** A pipeline that opens a transaction, calls an external API or writes to blob storage mid-transaction, then comes back to write to PostgreSQL will find the connection gone if the external call took longer than the server timeout allows.
4. **TCP keepalive not configured, connection silently dropped by a firewall or NAT gateway.** Cloud NAT gateways (Azure NAT Gateway, AWS NAT) silently drop idle TCP connections after 4–10 minutes. The OS and application think the socket is still open, but the gateway has already discarded the flow. The error only appears when the driver next writes to the dead socket.
5. **`pool_recycle` interval longer than the server-side idle timeout.** If the pool recycles connections every 3600 s but the server closes idle connections after 600 s, there is a 3000-second window during which the pool can hold connections that are already dead server-side.
6. **pgBouncer or other connection pooler in transaction mode closing the backend connection mid-session.** In transaction-pooling mode, pgBouncer may assign a different backend connection between transactions. If the pipeline assumes connection-level state (temp tables, session-level `SET`, prepared statements), the next statement can land on a connection that does not have that state, or on one that is closing.
7. **Concurrent `pg_terminate_backend()` call from an admin or monitoring script.** A DBA running `SELECT pg_terminate_backend(pid)` to clear long-running queries will terminate any pipeline connection it matches, which raises 08003 on the application side the next time that connection is used.
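
Cause 5 comes down to simple arithmetic. The sketch below is illustrative only: `stale_window_seconds` is a hypothetical helper, not part of SQLAlchemy or PostgreSQL, and it assumes the server closes an idle connection exactly when its timeout elapses.

```python
def stale_window_seconds(pool_recycle: int, server_idle_timeout: int) -> int:
    """Worst-case time a connection can sit in the pool after the server
    has already closed it. Returns 0 when the pool recycles soon enough.

    Hypothetical helper for illustration. A timeout of 0 is treated as
    "disabled", meaning the server never closes idle connections.
    """
    if server_idle_timeout <= 0:
        return 0
    return max(0, pool_recycle - server_idle_timeout)


if __name__ == "__main__":
    print(stale_window_seconds(3600, 600))  # recycle is far too slow
    print(stale_window_seconds(300, 600))   # recycle fits inside the timeout
```

Any non-zero result means the pool can hand out a connection the server has already dropped, unless validation at checkout catches it first.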

How to fix it

Step 1: Enable pre-ping / connection validation in your pool. In SQLAlchemy: `create_engine(url, pool_pre_ping=True)`. In psycopg2 connection pools: issue a `SELECT 1` before handing out a connection. This adds about 0.3 ms of overhead per checkout on a low-latency network, but it stops stale sockets from reaching your queries at checkout time.
Step 2: Set `pool_recycle` shorter than the server's idle timeout. Check the server timeouts with `SHOW idle_in_transaction_session_timeout;` and `SHOW tcp_keepalives_idle;` (and `SHOW idle_session_timeout;` on PostgreSQL 14 or later). Set `pool_recycle` to roughly 50–80 % of the lowest non-zero value. Example SQLAlchemy: `create_engine(url, pool_recycle=300)` when the server timeout is 600 s.
Step 3: Verify and tune server-side timeouts on Azure Database for PostgreSQL: in the Azure Portal → your server → Server parameters → search `idle_in_transaction_session_timeout`. For pipelines that need long transactions, raise it or set it to 0 (disabled) only for the pipeline role: `ALTER ROLE pipeline_user SET idle_in_transaction_session_timeout = 0;`. The role-level setting applies to sessions started after the change.
Step 4: Configure TCP keepalives so the OS probes the connection before the NAT gateway drops it. In a libpq connection URI, append the query parameters `keepalives=1&keepalives_idle=60&keepalives_interval=10&keepalives_count=5`. This sends a keepalive probe after 60 s of inactivity, retrying every 10 s up to 5 times.
Step 5: Identify currently open idle-in-transaction connections before they time out: `SELECT pid, usename, state, now() - state_change AS idle_duration, query FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY idle_duration DESC;` Terminate the longest ones if they are from a hung pipeline: `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND now() - state_change > interval '5 minutes';` Add a `usename` filter to the second query so it does not terminate unrelated sessions.
Step 6: Add retry logic with exponential backoff around database calls in your pipeline. Catch `psycopg2.OperationalError` (or `sqlalchemy.exc.OperationalError`, whose underlying driver exception is available as `.orig`) where `pgcode == '08003'`. When the socket died before the server could respond, `pgcode` may be `None`, so also treat a closed connection (a non-zero `conn.closed` in psycopg2) as a trigger. Close and discard the connection, then re-acquire a fresh one from the pool and retry up to 3 times. Do not retry on the same connection object.
Step 7: If using pgBouncer, switch the pipeline's database to session-pooling mode (`pool_mode = session` in pgbouncer.ini for that database entry) or ensure the pipeline does not rely on connection-level state across statements. Alternatively, connect the pipeline directly to PostgreSQL, bypassing pgBouncer for long-running jobs.
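
The keepalive settings from Step 4 can be appended to a connection URI programmatically. The following is a minimal sketch using only the standard library. The helper names, host, database, and user are hypothetical, and the detection-time estimate assumes the OS honours the three values exactly.

```python
from typing import Mapping
from urllib.parse import urlencode

# Values from Step 4. Keep keepalives_idle well below the gateway's idle
# timeout (the causes above cite 4-10 minutes for cloud NAT gateways).
KEEPALIVE_PARAMS: Mapping[str, int] = {
    "keepalives": 1,
    "keepalives_idle": 60,
    "keepalives_interval": 10,
    "keepalives_count": 5,
}


def with_keepalives(base_uri: str, params: Mapping[str, int] = KEEPALIVE_PARAMS) -> str:
    """Append libpq keepalive parameters to a connection URI."""
    separator = "&" if "?" in base_uri else "?"
    return f"{base_uri}{separator}{urlencode(params)}"


def dead_peer_detection_seconds(params: Mapping[str, int] = KEEPALIVE_PARAMS) -> int:
    """Approximate time until the OS gives up on an unresponsive peer:
    the idle period plus every unanswered probe interval."""
    return params["keepalives_idle"] + (
        params["keepalives_interval"] * params["keepalives_count"]
    )


if __name__ == "__main__":
    # Hypothetical host, database, and user; substitute your own.
    print(with_keepalives("postgresql://pipeline_user@db.example.com:5432/warehouse"))
    print(dead_peer_detection_seconds())
```

With these values a dead peer is noticed after roughly 110 s, and the probes themselves keep the NAT mapping from going idle.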

Example log output

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL:  connection does not exist
SSLERROR: decryption failed or bad record mac
[SQL: SELECT id, updated_at FROM public.fact_orders WHERE updated_at > %(ts)s]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

ERROR 1 of 1 in model fact_orders (models/marts/fact_orders.sql)
  Database Error in model fact_orders
  connection does not exist
  compiled Code at target/compiled/marts/fact_orders.sql

Activity failed: CopyData | ErrorCode: UserErrorOdbcOperationFailed | Message: ODBC Source, Error [08003] connection does not exist | Pipeline: pl_ingest_postgres | Duration: 00:02:14

Frequently asked questions

08003 fix — what is the single fastest change I can make?

Add `pool_pre_ping=True` to your SQLAlchemy engine (or the equivalent validation call in your driver). This makes the pool test the socket with a lightweight `SELECT 1` before each checkout and discard dead connections immediately instead of handing them to your query. This one change eliminates the majority of 08003 occurrences from stale pools without requiring a server restart or schema change.
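
As a sketch of that change, the snippet below keeps the pool settings in one dictionary and passes them to SQLAlchemy's `create_engine`. The `ENGINE_OPTIONS` name and the `make_engine` wrapper are hypothetical conveniences, and the `pool_recycle` value of 300 is the example figure from the fix steps, not a recommendation for every server.

```python
# Pool settings discussed in this article, kept in one place.
ENGINE_OPTIONS = {
    "pool_pre_ping": True,  # validate each connection at checkout
    "pool_recycle": 300,    # replace connections older than 300 s
}


def make_engine(url: str):
    """Create a SQLAlchemy engine with connection validation enabled."""
    # Imported here so the settings above can be inspected or unit-tested
    # without SQLAlchemy and a PostgreSQL driver being installed.
    from sqlalchemy import create_engine

    return create_engine(url, **ENGINE_OPTIONS)
```

Calling `make_engine("postgresql+psycopg2://...")` with your own URL returns an engine whose pool discards dead connections at checkout instead of handing them to a query.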

08003 retry — can I just retry the failed query automatically?

Yes, but only after you discard the broken connection and get a fresh one from the pool. Retrying on the same connection object will always fail again. Catch `OperationalError` where `pgcode == '08003'` (with SQLAlchemy, read it from the wrapped driver exception, `exc.orig.pgcode`), call `conn.close()` or mark it invalid for the pool (in SQLAlchemy, `Connection.invalidate()`), then call `engine.connect()` again to get a new socket. Wrap this in exponential backoff (0.5 s, 1 s, 2 s) with a maximum of 3 attempts before raising to the orchestration layer.
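
The retry pattern described here can be sketched in a driver-agnostic way. `run_with_reconnect` and its parameters are hypothetical names. The caller supplies how to connect and how to recognise a dead-connection error, so nothing below depends on a specific driver API.

```python
import time
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def run_with_reconnect(
    connect: Callable[[], Any],
    work: Callable[[Any], T],
    is_connection_error: Callable[[Exception], bool],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run work(conn) on a fresh connection, retrying on dead connections."""
    for attempt in range(attempts):
        conn = connect()  # always a new connection, never the failed one
        try:
            return work(conn)
        except Exception as exc:
            if not is_connection_error(exc) or attempt == attempts - 1:
                raise  # unrelated error, or out of attempts
            sleep(base_delay * 2 ** attempt)  # 0.5 s, then 1 s, then 2 s, ...
        finally:
            try:
                conn.close()  # discard this connection in every case
            except Exception:
                pass  # closing an already-dead connection may itself fail
    raise RuntimeError("attempts must be at least 1")
```

With psycopg2, `is_connection_error` could be something like `lambda e: isinstance(e, psycopg2.OperationalError)`. Because `work` is re-run from the start, it should contain the whole transaction so that a retry never resumes half-finished state.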

Why does 08003 happen after a server restart or Azure maintenance window?

When PostgreSQL restarts, all existing TCP connections are terminated at the OS level. Connection pools (SQLAlchemy, HikariCP, psycopg2 pool) do not detect this until they try to use a socket. If `pool_pre_ping` is off and `pool_recycle` is longer than the restart interval, the pool serves dead sockets for hours after the restart. The fix is pre-ping plus setting `pool_recycle` to under 5 minutes for managed services that have regular maintenance windows.
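
To make the mechanism concrete, here is a toy pool that validates connections at checkout. It is a simplified illustration of the pre-ping idea, not SQLAlchemy's implementation. `ValidatingPool` and its `ping` callback are hypothetical.

```python
from collections import deque
from typing import Any, Callable, Deque


class ValidatingPool:
    """Toy pool that pings each idle connection before handing it out."""

    def __init__(self, connect: Callable[[], Any], ping: Callable[[Any], bool]):
        self._connect = connect
        self._ping = ping
        self._idle: Deque[Any] = deque()

    def checkout(self) -> Any:
        while self._idle:
            conn = self._idle.popleft()
            if self._ping(conn):  # e.g. run SELECT 1, return True on success
                return conn
            # Dead socket: drop it and try the next idle connection.
        return self._connect()  # nothing usable left, open a new connection

    def checkin(self, conn: Any) -> None:
        self._idle.append(conn)
```

After a restart every idle connection fails its ping, so the first checkout opens a new connection instead of passing a dead socket to the query.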

postgresql connection does not exist vs connection refused — what is the difference?

SQLSTATE 08003 ('connection does not exist') means your application had a valid connection that was closed while it was in use or between uses: the handshake succeeded earlier, but the session is gone now. A 'connection refused' error (typically from the OS, not PostgreSQL) or SQLSTATE 08001 means a new connection could not be established at all, for example because the port is closed or the server is down, while SQLSTATE 08006 ('connection failure') is the general code for a connection that failed. When `max_connections` is exhausted, PostgreSQL rejects the new session with SQLSTATE 53300 ('too many connections') rather than refusing the TCP connection. 08003 is a pool/lifecycle problem; connection refused is a server availability problem.
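
A small lookup can make this distinction explicit in pipeline error handling. The condition names below come from the PostgreSQL error codes appendix, while the `triage` function and its three category labels are an illustrative grouping assumed for this sketch.

```python
from typing import Optional

# SQLSTATE -> condition name, as listed in the PostgreSQL error codes appendix.
SQLSTATE_NAMES = {
    "08001": "sqlclient_unable_to_establish_sqlconnection",
    "08003": "connection_does_not_exist",
    "08006": "connection_failure",
    "53300": "too_many_connections",
}

# Hypothetical grouping used only for this example.
_CATEGORIES = {
    "08003": "pool/lifecycle",
    "08001": "server availability",
    "08006": "server availability",
    "53300": "capacity",
}


def triage(sqlstate: Optional[str]) -> str:
    """Map a SQLSTATE to a coarse category for alerting or retry decisions."""
    if sqlstate is None:  # drivers may report no code when the socket died
        return "unknown"
    return _CATEGORIES.get(sqlstate, "other")
```

A pipeline might reconnect and retry on "pool/lifecycle", but page an operator on "server availability" or "capacity".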

Source · www.postgresql.org/docs/current/errcodes-appendix.html