Set HikariCP connectionTimeout to 30s to prevent thread starvation #310

Open
NicoPiel wants to merge 1 commit into OpenIntegrationEngine:main from NicoPiel:fix/hikari-connection-timeout

@NicoPiel
Collaborator

Previously connectionTimeout was set to 0, which HikariCP interprets as
wait indefinitely for a pool connection. Under any connection pressure
this causes caller threads to block forever, silently starving the
application with no way to detect or recover from the condition.

Change the value to 30000 ms (30 seconds), which matches the HikariCP
default. Callers will now receive a SQLException after 30 s instead of
hanging, surfacing pool starvation quickly and allowing the application
to fail fast and recover.

Signed-off-by: Nico Piel <nico.piel@hotmail.de>
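
The difference between an unbounded and a bounded wait can be sketched with a plain semaphore standing in for the pool. This is an illustration under stated assumptions, not HikariCP's implementation: `BoundedWaitPool`, its methods, and the exception message are hypothetical names invented for the example.

```java
import java.sql.SQLException;
import java.sql.SQLTransientConnectionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a connection pool; each permit models one connection. */
class BoundedWaitPool {
    private final Semaphore slots;
    private final long connectionTimeoutMs;

    BoundedWaitPool(int maxConnections, long connectionTimeoutMs) {
        this.slots = new Semaphore(maxConnections, true);
        this.connectionTimeoutMs = connectionTimeoutMs;
    }

    /** Borrows a slot. A timeout of 0 models the old setting: wait with no upper bound. */
    void acquire() throws SQLException {
        try {
            if (connectionTimeoutMs == 0) {
                slots.acquire(); // blocks until a slot frees up, possibly forever
            } else if (!slots.tryAcquire(connectionTimeoutMs, TimeUnit.MILLISECONDS)) {
                // Bounded wait: the caller gets an exception it can log or map to an error response.
                throw new SQLTransientConnectionException(
                        "no pool slot available after " + connectionTimeoutMs + " ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new SQLException("interrupted while waiting for a pool slot", e);
        }
    }

    /** Returns a slot to the pool. */
    void release() {
        slots.release();
    }
}
```

With a non-zero timeout, a caller that cannot get a slot fails after a bounded delay instead of parking its thread indefinitely.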
@github-actions

Test Results

  111 files  ±0    214 suites  ±0   6m 57s ⏱️ + 1m 7s
  654 tests ±0    654 ✅ ±0  0 💤 ±0  0 ❌ ±0 
1 308 runs  ±0  1 308 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 3634d7a. ± Comparison against base commit 5423492.

@NicoPiel NicoPiel requested review from a team, gibson9583, jonbartels, kayyagari, mgaffigan, pacmano1, ssrowe and tonygermano and removed request for a team May 25, 2026 21:20
@mgaffigan
Contributor

Can you clarify the problem case? What is calling this on a threadpool?

I don't understand how this would block forward progress. With a queuing system, forward progress is usually the primary test. Presumably there is some excess of work entering the threadpool - which I presume is the problem.

If this is a "just in case" for infinite waits, I don't have a problem with a pathologically long timeout (e.g. 30 minutes).

@NicoPiel
Collaborator Author

NicoPiel commented May 25, 2026

There's no scenario where thread A holds a connection while waiting for thread B, which in turn holds a connection while waiting for thread A. Each getConnection() call is wrapped in a try/finally that calls dao.close(). Connections are never held across two acquisitions.

Risks with connectionTimeout=0:

  1. DB outage or network partition: all threads block forever on getConnection(). The process appears alive, the queue grows without bound, no exception surfaces, and there is no recovery path -> silent hang.
  2. Pool misconfiguration (e.g. maxConnections set too low): same silent hang, no diagnostic signal.

The other pool (DBCPConnectionPool) has no maxWaitMillis set, so it also waits indefinitely. This fix is therefore Hikari-specific and leaves the two pools behaving inconsistently.
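
For reference, the two settings discussed above could look roughly like this when set programmatically. This is a configuration sketch that assumes HikariCP and Apache Commons DBCP 2 are on the classpath. The class `PoolTimeoutSketch`, its method names, and the parameters are hypothetical, and nothing in this thread says the project wires its pools this way or plans a matching DBCP change.

```java
import com.zaxxer.hikari.HikariConfig;
import org.apache.commons.dbcp2.BasicDataSource;

/** Hypothetical helper; only the setter calls are real library API. */
class PoolTimeoutSketch {

    /** HikariCP: bound the wait for a pool connection to 30 s. */
    static HikariConfig hikari(String jdbcUrl, int maxConnections) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);
        config.setMaximumPoolSize(maxConnections);
        config.setConnectionTimeout(30_000L); // milliseconds; 0 is treated as "no practical limit"
        return config;
    }

    /** DBCP 2: the comparable knob is maxWaitMillis, which waits indefinitely when left unset. */
    static BasicDataSource dbcp(String jdbcUrl, int maxConnections) {
        BasicDataSource dataSource = new BasicDataSource();
        dataSource.setUrl(jdbcUrl);
        dataSource.setMaxTotal(maxConnections);
        dataSource.setMaxWaitMillis(30_000L); // milliseconds
        return dataSource;
    }
}
```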

@mgaffigan
Contributor

That sounds like you agree this is not a deadlock and not threadpool exhaustion. I still do not understand the repro.

What is this meant to address? For example, I'm guessing:

Repro:

  1. Configure the server to use Postgres and Hikari (options: ...)
  2. After the server is booted, block network communication between the server and the database engine
  3. Make a request to the management API (e.g. list channels using the client's "refresh" button)

Expected behavior:

  • the client receives a 5xx error after a timeout

Actual behavior:

  • the connection hangs indefinitely
  • with enough connections hung, the server stops accepting connections due to jetty threadpool starvation

@NicoPiel
Collaborator Author

Your repro is exactly right.

  1. DB network blocked -> all active pool connections hang on in-flight queries (TCP read() blocks forever with no socket timeout)
  2. Pool slots are "in use", even though no work is progressing
  3. New management API requests arrive -> Jetty threads call getConnection() -> block indefinitely waiting for a pool slot to free
  4. Jetty threadpool fills with blocked threads -> server stops accepting connections
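
The sequence above can be modeled without a database. The sketch below is a simplification under stated assumptions: a fixed thread pool stands in for Jetty's worker threads, a semaphore with zero permits stands in for a pool whose connections are all stuck on in-flight queries, and `StarvationSketch` and `requestsAnswered` are hypothetical names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class StarvationSketch {

    /**
     * Submits `requests` tasks to `workerThreads` workers and returns how many of them
     * gave up waiting (and could therefore answer with an error) within `observeMs`.
     */
    static int requestsAnswered(int workerThreads, int requests,
                                long connectionTimeoutMs, long observeMs)
            throws InterruptedException {
        Semaphore poolSlots = new Semaphore(0); // every connection is hung, none will be returned
        ExecutorService workers = Executors.newFixedThreadPool(workerThreads);
        AtomicInteger answered = new AtomicInteger();

        for (int i = 0; i < requests; i++) {
            workers.submit(() -> {
                try {
                    if (connectionTimeoutMs == 0) {
                        poolSlots.acquire(); // never returns: the worker thread is lost
                    } else if (!poolSlots.tryAcquire(connectionTimeoutMs, TimeUnit.MILLISECONDS)) {
                        answered.incrementAndGet(); // timed out: the request can fail with an error
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        workers.shutdown();
        workers.awaitTermination(observeMs, TimeUnit.MILLISECONDS);
        int result = answered.get();
        workers.shutdownNow(); // interrupt threads that are still blocked so the sketch can exit
        return result;
    }
}
```

With a timeout of 0, no request is ever answered and every worker stays blocked. With a finite timeout, each request fails after the timeout and the worker moves on to the next one.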

@mgaffigan
Copy link
Copy Markdown
Contributor

Can you document the configuration required to exhibit the issue? I assume it does not happen in the default derby config.

Unless this is an unusual configuration, this seems like something that needs to be configurable. 30s is not that long - database servers can exceed that during HA failover or high load.

Especially given the legacy behavior, I would suggest 60s or more. The primary consumer of the pool (I presume) is message processing, where it is beneficial to insulate from intermittent failures. The admin API would benefit from a shorter timeout, but a longer timeout there is merely annoying, whereas losing messages is critical.

@pacmano1
Contributor

It should default to the current value of 0 and be configurable. Changing the default may break working configurations.
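
A configurable timeout that keeps the legacy default might be read along these lines. This is a minimal sketch: the property key `database.connection-timeout-ms` and the class `PoolTimeoutSetting` are hypothetical and do not correspond to any existing project setting.

```java
import java.util.Properties;

class PoolTimeoutSetting {
    /** Hypothetical property key, invented for this example. */
    static final String KEY = "database.connection-timeout-ms";

    /** Legacy behavior: 0 means wait indefinitely for a pool connection. */
    static final long LEGACY_DEFAULT_MS = 0L;

    /** Returns the configured timeout in milliseconds, or the legacy default when unset. */
    static long connectionTimeoutMs(Properties props) {
        String raw = props.getProperty(KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return LEGACY_DEFAULT_MS; // existing installations keep their current behavior
        }
        long value = Long.parseLong(raw.trim());
        if (value < 0) {
            throw new IllegalArgumentException(KEY + " must not be negative: " + value);
        }
        return value;
    }
}
```

An operator who wants the 60 s suggested earlier would set the property to `60000`, while installations that leave it unset keep the current unbounded wait. HikariCP documents 250 ms as the lowest acceptable non-zero value, so a real implementation would likely validate against that as well.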
