backend/src/db/migrations/20260416180000_add-identity-access-token-idle-index.ts
@@ -0,0 +1,42 @@
import { Knex } from "knex";

import { TableName } from "../schemas";

const MIGRATION_TIMEOUT = 4 * 60 * 60 * 1000; // 4 hours

export async function up(knex: Knex): Promise<void> {
  const result = await knex.raw("SHOW statement_timeout");
  const originalTimeout = result.rows[0].statement_timeout;

  try {
    await knex.raw(`SET statement_timeout = ${MIGRATION_TIMEOUT}`);

    if (
      (await knex.schema.hasTable(TableName.IdentityAccessToken)) &&
      (await knex.schema.hasColumn(TableName.IdentityAccessToken, "accessTokenLastUsedAt"))
    ) {
      // No AT TIME ZONE 'UTC' cast here — COALESCE(timestamptz, timestamptz)
      // returns timestamptz, which is already immutable for index purposes.
      // The existing expiration index applies AT TIME ZONE to convert to
      // timestamp (no-tz) before arithmetic; we skip that cast so the query
      // predicate can compare timestamptz directly without a matching cast.
      await knex.raw(`
        CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_identity_access_tokens_idle
        ON ${TableName.IdentityAccessToken} (
          (COALESCE("accessTokenLastUsedAt", "createdAt"))
        )
      `);
    }
  } finally {
    await knex.raw(`SET statement_timeout = '${originalTimeout}'`);
  }
}

export async function down(knex: Knex): Promise<void> {
  await knex.raw(`
    DROP INDEX IF EXISTS idx_identity_access_tokens_idle
  `);
}

Check failure on line 39 in backend/src/db/migrations/20260416180000_add-identity-access-token-idle-index.ts


Claude / Claude Code Review

DROP INDEX without CONCURRENTLY in down migration causes table lock on 64M-row table

Comment on lines +35 to +39
🔴 The down() migration uses plain DROP INDEX IF EXISTS rather than DROP INDEX CONCURRENTLY IF EXISTS, which acquires an AccessExclusiveLock that blocks all reads and writes on the identity_access_tokens table for the full duration of the drop. On the 64M-row throttled-EBS table this PR explicitly calls out, rolling back this migration in production would cause a write outage on identity access token operations; change line 36 to DROP INDEX CONCURRENTLY IF EXISTS idx_identity_access_tokens_idle to avoid the lock.

Extended reasoning...

What the bug is and how it manifests

PostgreSQL's plain DROP INDEX acquires an AccessExclusiveLock on the parent table, which blocks every concurrent read and write for the entire duration of the drop. This is in direct contrast to DROP INDEX CONCURRENTLY, which only takes brief metadata locks and allows normal DML to proceed throughout. On a large table, this distinction is critical.

The specific code path that triggers it

The bug is in the down() function of 20260416180000_add-identity-access-token-idle-index.ts (lines 35–39):

export async function down(knex: Knex): Promise<void> {
  await knex.raw(`
      DROP INDEX IF EXISTS idx_identity_access_tokens_idle
    `);
}

This runs whenever the migration is rolled back (e.g., knex migrate:down or a failed deployment rollback).

Why existing code doesn't prevent it

The up() migration correctly uses CREATE INDEX CONCURRENTLY — but the same care was not applied to the rollback path. The migration already exports config = { transaction: false }, which is the exact prerequisite PostgreSQL requires before it will accept DROP INDEX CONCURRENTLY (concurrent index operations cannot run inside a transaction block). So the infrastructure for a safe drop is already in place; only the keyword is missing.

What the impact would be

The PR description explicitly states this table has ~64M rows on a throttled EBS volume, which is the reason CONCURRENTLY was chosen for the forward migration. While dropping an index is faster than building one, on throttled I/O it can still take minutes. During that window, every request that touches identity_access_tokens — token validation, authentication, refresh — would be blocked, causing a production outage on all identity access token operations for the full duration.

How to fix it

Replace line 36 with:

await knex.raw(`DROP INDEX CONCURRENTLY IF EXISTS idx_identity_access_tokens_idle`);

No other changes are needed; config = { transaction: false } is already exported.
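For reference, a minimal sketch of how the corrected rollback could read (a sketch only, reusing the index name and the config export the file already has):

import { Knex } from "knex";

export async function down(knex: Knex): Promise<void> {
  // DROP INDEX CONCURRENTLY does not block reads or writes, but it cannot run
  // inside a transaction block, which is why the transaction: false config matters.
  await knex.raw(`DROP INDEX CONCURRENTLY IF EXISTS idx_identity_access_tokens_idle`);
}

const config = { transaction: false };
export { config };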

Step-by-step proof

  1. A deployment is rolled back and Knex runs down() on the 20260416180000 migration.
  2. PostgreSQL executes DROP INDEX IF EXISTS idx_identity_access_tokens_idle.
  3. PostgreSQL acquires AccessExclusiveLock on identity_access_tokens.
  4. Every concurrent SELECT, INSERT, UPDATE, or DELETE on identity_access_tokens blocks immediately — no reads, no writes proceed.
  5. On a 64M-row table on throttled EBS I/O, the drop could take several minutes.
  6. All API endpoints that validate or refresh identity access tokens return errors or time out for the full duration.
  7. Had DROP INDEX CONCURRENTLY been used, only a brief ShareUpdateExclusiveLock would be taken at the start and end, and all DML would continue uninterrupted throughout.


const config = { transaction: false };
export { config };
5 changes: 4 additions & 1 deletion backend/src/keystore/keystore.ts
@@ -110,7 +110,10 @@ export const KeyStorePrefixes = {
CertDashboardStats: (projectId: string) => `cert-dashboard-stats:${projectId}` as const,
CertActivityTrend: (projectId: string, range: string) => `cert-activity-trend:${projectId}:${range}` as const,
RefreshTokenGrace: (sessionId: string) => `refresh-token-grace:${sessionId}` as const,
InsightsCache: (projectId: string, endpoint: string) => `insights-cache:${projectId}:${endpoint}` as const
InsightsCache: (projectId: string, endpoint: string) => `insights-cache:${projectId}:${endpoint}` as const,

FrequentResourceCleanUpLock: "frequent-resource-cleanup-lock" as const,
WeeklyResourceCleanUpLock: "weekly-resource-cleanup-lock" as const
};

export const KeyStoreTtls = {
6 changes: 6 additions & 0 deletions backend/src/queue/queue-service.ts
@@ -61,6 +61,7 @@ export enum QueueName {
AuditLogPrune = "audit-log-prune",
DailyResourceCleanUp = "daily-resource-cleanup",
FrequentResourceCleanUp = "frequent-resource-cleanup",
WeeklyResourceCleanUp = "weekly-resource-cleanup",
DailyExpiringPkiItemAlert = "daily-expiring-pki-item-alert",
DailyPkiAlertV2Processing = "daily-pki-alert-v2-processing",
PkiAlertV2Event = "pki-alert-v2-event",
@@ -116,6 +117,7 @@ export enum QueueJobs {
AuditLogPrune = "audit-log-prune-job",
DailyResourceCleanUp = "daily-resource-cleanup-job",
FrequentResourceCleanUp = "frequent-resource-cleanup-job",
WeeklyResourceCleanUp = "weekly-resource-cleanup-job",
DailyExpiringPkiItemAlert = "daily-expiring-pki-item-alert",
DailyPkiAlertV2Processing = "daily-pki-alert-v2-processing",
PkiAlertV2ProcessEvent = "pki-alert-v2-process-event",
@@ -499,6 +501,10 @@ export type TQueueJobTypes = {
name: QueueJobs.FrequentResourceCleanUp;
payload: undefined;
};
[QueueName.WeeklyResourceCleanUp]: {
name: QueueJobs.WeeklyResourceCleanUp;
payload: undefined;
};
[QueueName.PkiDiscoveryScan]:
| {
name: QueueJobs.PkiDiscoveryRunScan;
3 changes: 2 additions & 1 deletion backend/src/server/routes/index.ts
@@ -2204,7 +2204,8 @@ export const registerRoutes = async (
approvalRequestDAL,
approvalRequestGrantsDAL,
certificateRequestDAL,
scepTransactionDAL
scepTransactionDAL,
keyStore
});

const healthAlert = healthAlertServiceFactory({
backend/src/services/identity-access-token/identity-access-token-dal.ts
@@ -143,5 +143,91 @@
);
};

  return { ...identityAccessTokenOrm, findOne, removeExpiredTokens };
  // Deletes tokens that have been idle for longer than IDLE_THRESHOLD_DAYS.
  // "Idle" is COALESCE(accessTokenLastUsedAt, createdAt) — i.e. tokens that
  // have never been used fall back to their creation time. Known edge case:
  // a token that was used but whose accessTokenQueue update job failed
  // (removeOnFail: true) will keep accessTokenLastUsedAt = NULL and may be
  // deleted early. Equally, a token that is only ever renewed (never used to
  // auth) stays NULL here — accessTokenLastRenewedAt is not considered. Both
  // cases are rare at a 30-day threshold and the worst outcome is a forced
  // re-auth.
  const removeIdleTokens = async (tx?: Knex) => {
    logger.info(`${QueueName.WeeklyResourceCleanUp}: remove idle access tokens started`);

    const BATCH_SIZE = 5000;
    const MAX_RETRY_ON_FAILURE = 3;
    const QUERY_TIMEOUT_MS = 10 * 60 * 1000; // 10 minutes
    const IDLE_THRESHOLD_DAYS = 30;

    const dbConnection = tx || db;
    const nowResult = await dbConnection.raw<{ rows: Array<{ now: Date }> }>(`SELECT NOW() AT TIME ZONE 'UTC' as now`);
    const { now } = nowResult.rows[0];

    let deletedTokenIds: { id: string }[] = [];
    let numberOfRetryOnFailure = 0;
    let isRetrying = false;
    let totalDeletedCount = 0;

    // No AT TIME ZONE 'UTC' cast — COALESCE(timestamptz, timestamptz) is
    // immutable and the index expression matches this predicate as-is. The
    // expiration index uses AT TIME ZONE to produce a timestamp (no-tz) before
    // interval arithmetic; here we stay in timestamptz throughout so the cast
    // is unnecessary and omitting it keeps the index expression consistent.
    const getIdleTokensQuery = (dbClient: Knex | Knex.Transaction, nowTimestamp: Date) =>
      dbClient(TableName.IdentityAccessToken)
        .whereRaw(
          `COALESCE(
            "${TableName.IdentityAccessToken}"."accessTokenLastUsedAt",
            "${TableName.IdentityAccessToken}"."createdAt"
          ) < ?::timestamptz - make_interval(days => ?)`,
          [nowTimestamp, IDLE_THRESHOLD_DAYS]
        )
        .select("id");

    do {
      try {
        const deleteBatch = async (dbClient: Knex | Knex.Transaction) => {
          await dbClient.raw(`SET LOCAL random_page_cost = 1.1`);
          const idsToDeleteQuery = getIdleTokensQuery(dbClient, now).limit(BATCH_SIZE);
          return dbClient(TableName.IdentityAccessToken).whereIn("id", idsToDeleteQuery).del().returning("id");
        };

        if (tx) {
          // eslint-disable-next-line no-await-in-loop
          deletedTokenIds = await deleteBatch(tx);
        } else {
          // eslint-disable-next-line no-await-in-loop
          deletedTokenIds = await db.transaction(async (trx) => {
            await trx.raw(`SET LOCAL statement_timeout = ${QUERY_TIMEOUT_MS}`);
            return deleteBatch(trx);
          });
        }

        numberOfRetryOnFailure = 0;
        totalDeletedCount += deletedTokenIds.length;
      } catch (error) {
        numberOfRetryOnFailure += 1;
        logger.error(error, "Failed to delete a batch of idle identity access tokens on pruning");
      } finally {
        // eslint-disable-next-line no-await-in-loop
        await new Promise((resolve) => {
          setTimeout(resolve, 500);
        });
      }
      isRetrying = numberOfRetryOnFailure > 0;
    } while (deletedTokenIds.length > 0 || (isRetrying && numberOfRetryOnFailure < MAX_RETRY_ON_FAILURE));

P1: Bound retry loop by clearing stale delete results

If one batch deletes rows and a later batch throws (e.g., statement timeout or lock contention), deletedTokenIds keeps its previous non-empty value, so deletedTokenIds.length > 0 remains true and the loop never exits even after MAX_RETRY_ON_FAILURE is reached. In that failure mode the weekly worker can run forever, repeatedly logging errors and never releasing capacity for future scheduled runs.



    if (numberOfRetryOnFailure >= MAX_RETRY_ON_FAILURE) {
      logger.error(
        `IdentityAccessTokenIdlePrune: Pruning failed and stopped after ${MAX_RETRY_ON_FAILURE} consecutive retries.`
      );
    }

    logger.info(
      `${QueueName.WeeklyResourceCleanUp}: remove idle access tokens completed. Deleted ${totalDeletedCount} tokens.`
    );
  };

  return { ...identityAccessTokenOrm, findOne, removeExpiredTokens, removeIdleTokens };

Check failure on line 232 in backend/src/services/identity-access-token/identity-access-token-dal.ts


Claude / Claude Code Review

Infinite retry loop in removeIdleTokens when previous batch succeeded

Comment on lines +215 to +232

🔴 The do-while loop in removeIdleTokens can spin infinitely if any batch succeeds before a permanent DB failure. Once a batch returns N deleted IDs, deletedTokenIds remains non-empty forever because the catch block never resets it — so even after MAX_RETRY_ON_FAILURE consecutive failures the left-hand condition deletedTokenIds.length > 0 keeps the loop alive, bypassing the retry guard entirely and hammering the broken database every 500 ms. Fix: add deletedTokenIds = [] at the top of the do block (before the try). The identical pattern in removeExpiredTokens has the same pre-existing bug.

Extended reasoning...

What the bug is and how it manifests

removeIdleTokens (identity-access-token-dal.ts) uses a do-while loop controlled by:

while (deletedTokenIds.length > 0 || (isRetrying && numberOfRetryOnFailure < MAX_RETRY_ON_FAILURE));

deletedTokenIds is assigned inside the try block and never reset in the catch block. If any batch succeeds (sets deletedTokenIds to a non-empty array) and then the database becomes permanently unavailable, subsequent batches throw exceptions. The catch block increments numberOfRetryOnFailure but leaves deletedTokenIds holding the results of the last successful batch. Once numberOfRetryOnFailure >= MAX_RETRY_ON_FAILURE (3), the right-hand side of the OR becomes false — but the left side (deletedTokenIds.length > 0) remains true indefinitely. The loop never exits.

The specific code path that triggers it

  1. Batch 1 succeeds: deletedTokenIds = [5000 items], numberOfRetryOnFailure reset to 0.
  2. DB goes down.
  3. Batch 2 fails: numberOfRetryOnFailure = 1. deletedTokenIds unchanged. Condition: 5000 > 0 || (true && 1 < 3) → true.
  4. Batch 3 fails: numberOfRetryOnFailure = 2. Condition: 5000 > 0 || (true && 2 < 3) → true.
  5. Batch 4 fails: numberOfRetryOnFailure = 3. Condition: 5000 > 0 || (true && 3 < 3) → 5000 > 0 || false → true.
  6. Every subsequent batch fails: condition stays true forever.

Why existing code does not prevent it

The MAX_RETRY_ON_FAILURE guard is only effective when deletedTokenIds is empty at the time retries are exhausted (i.e., if the very first batch fails). It provides zero protection after any prior batch has succeeded with deletions, because there is no code path that clears deletedTokenIds on failure or at the start of each iteration.

What the impact would be

In the failure scenario the weekly cleanup job enters an infinite busy loop — one DB round-trip every 500 ms — against an already broken or overloaded database, worsening the outage and keeping the BullMQ worker occupied indefinitely. The Redis distributed lock (3-hour TTL) will also be held, preventing any other instance from picking up the job.

How to fix it

Reset deletedTokenIds to [] at the top of the do block, before the try:

do {
  deletedTokenIds = []; // add this line
  try {
    // ...
  } catch (error) {
    // ...
  }
} while (...);

This ensures that on any iteration where the DB throws, deletedTokenIds is empty at loop evaluation time. After MAX_RETRY_ON_FAILURE consecutive failures the right-hand side becomes false and the left-hand side (0 > 0) is also false, so the loop terminates as intended.

The same bug exists in removeExpiredTokens (a pre-existing issue), but removeIdleTokens introduces it fresh in this PR.
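For illustration, a self-contained sketch of the bounded loop shape (deleteBatch here is a hypothetical stand-in for the real batched delete; the inter-batch delay and logging are elided):

async function pruneInBatches(deleteBatch: () => Promise<string[]>): Promise<number> {
  const MAX_RETRY_ON_FAILURE = 3;
  let deletedIds: string[] = [];
  let retries = 0;
  let total = 0;

  do {
    // Reset before every attempt so a failed iteration cannot carry a stale
    // non-empty result into the loop condition.
    deletedIds = [];
    try {
      // eslint-disable-next-line no-await-in-loop
      deletedIds = await deleteBatch();
      retries = 0;
      total += deletedIds.length;
    } catch {
      retries += 1;
    }
  } while (deletedIds.length > 0 || (retries > 0 && retries < MAX_RETRY_ON_FAILURE));

  return total;
}

With this shape, three consecutive failures always terminate the loop, regardless of how many batches succeeded earlier.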

};
56 changes: 53 additions & 3 deletions backend/src/services/resource-cleanup/resource-cleanup-queue.ts
@@ -4,6 +4,7 @@
import { TScimServiceFactory } from "@app/ee/services/scim/scim-types";
import { TSnapshotDALFactory } from "@app/ee/services/secret-snapshot/snapshot-dal";
import { TKeyValueStoreDALFactory } from "@app/keystore/key-value-store-dal";
import { KeyStorePrefixes, TKeyStoreFactory } from "@app/keystore/keystore";
import { getConfig } from "@app/lib/config/env";
import { logger } from "@app/lib/logger";
import { JOB_SCHEDULER_PREFIX, QueueJobs, QueueName, TQueueServiceFactory } from "@app/queue";
@@ -23,7 +24,8 @@
type TDailyResourceCleanUpQueueServiceFactoryDep = {
auditLogDAL: Pick<TAuditLogDALFactory, "pruneAuditLog">;
auditLogService: Pick<TAuditLogServiceFactory, "checkPostgresAuditLogVolumeMigrationAlert">;
identityAccessTokenDAL: Pick<TIdentityAccessTokenDALFactory, "removeExpiredTokens">;
identityAccessTokenDAL: Pick<TIdentityAccessTokenDALFactory, "removeExpiredTokens" | "removeIdleTokens">;
keyStore: Pick<TKeyStoreFactory, "acquireLock">;
identityUniversalAuthClientSecretDAL: Pick<TIdentityUaClientSecretDALFactory, "removeExpiredClientSecrets">;
secretVersionDAL: Pick<TSecretVersionDALFactory, "pruneExcessVersions">;
secretVersionV2DAL: Pick<TSecretVersionV2DALFactory, "pruneExcessVersions">;
@@ -63,7 +65,8 @@
approvalRequestDAL,
approvalRequestGrantsDAL,
certificateRequestDAL,
scepTransactionDAL
scepTransactionDAL,
keyStore
}: TDailyResourceCleanUpQueueServiceFactoryDep) => {
const appCfg = getConfig();

@@ -113,15 +116,30 @@
    { name: QueueJobs.DailyResourceCleanUp }
  );

  // Hourly cleanup routine
  const CLEANUP_LOCK_TTL_MS = 3 * 60 * 60 * 1000; // 3 hours

  // Hourly cleanup routine. A distributed Redis lock prevents overlapping
  // runs across instances — when a previous run exceeds the cron interval,
  // the next tick skips instead of compounding DB load.
  queueService.start(QueueName.FrequentResourceCleanUp, async () => {
    let lock: Awaited<ReturnType<typeof keyStore.acquireLock>> | undefined;
    try {
      lock = await keyStore.acquireLock([KeyStorePrefixes.FrequentResourceCleanUpLock], CLEANUP_LOCK_TTL_MS, {
        retryCount: 0
      });
    } catch {
      logger.info(`${QueueName.FrequentResourceCleanUp}: another instance holds the lock, skipping this run`);
      return;
Comment on lines +130 to +132

P2: Re-throw non-contention lock acquisition failures

This catch treats every acquireLock exception as "another instance holds the lock" and returns successfully. When Redis is unavailable or returns an unexpected error, the cleanup job is silently skipped and BullMQ records success, so no retry/alert path runs and token cleanup can be missed for extended periods. Only lock-contention errors should be swallowed; other errors should be propagated.
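One possible shape, sketched against the hourly handler (isLockContentionError is a hypothetical helper; the real check depends on what error type keyStore.acquireLock surfaces when the lock is already held):

// Hypothetical predicate; adapt it to the concrete error the lock client
// throws on contention, as opposed to connectivity or unexpected failures.
const isLockContentionError = (err: unknown): boolean =>
  err instanceof Error && /lock/i.test(err.message);

let lock: Awaited<ReturnType<typeof keyStore.acquireLock>> | undefined;
try {
  lock = await keyStore.acquireLock([KeyStorePrefixes.FrequentResourceCleanUpLock], CLEANUP_LOCK_TTL_MS, {
    retryCount: 0
  });
} catch (err) {
  if (isLockContentionError(err)) {
    logger.info(`${QueueName.FrequentResourceCleanUp}: another instance holds the lock, skipping this run`);
    return;
  }
  // Propagate anything else so BullMQ marks the run failed and retries/alerts.
  throw err;
}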


    }
    try {
      logger.info(`${QueueName.FrequentResourceCleanUp}: queue task started`);
      await identityAccessTokenDAL.removeExpiredTokens();
      logger.info(`${QueueName.FrequentResourceCleanUp}: queue task completed`);
    } catch (error) {
      logger.error(error, `${QueueName.FrequentResourceCleanUp}: resource cleanup failed`);
      throw error;
    } finally {
      await lock.release().catch((err) => logger.warn(err, `${QueueName.FrequentResourceCleanUp}: failed to release lock`));

Check failure on line 142 in backend/src/services/resource-cleanup/resource-cleanup-queue.ts


GitHub Actions / Lint

Replace `.release()` with `⏎··········.release()⏎··········`
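What the linter wants is simply the chained calls broken across lines, roughly like this (the same fix applies to the identical warning on the weekly handler below):

await lock
  .release()
  .catch((err) => logger.warn(err, `${QueueName.FrequentResourceCleanUp}: failed to release lock`));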
    }
  });

@@ -131,6 +149,38 @@
    { pattern: appCfg.isDailyResourceCleanUpDevelopmentMode ? "*/5 * * * *" : "0 * * * *" },
    { name: QueueJobs.FrequentResourceCleanUp }
  );

  // Weekly cleanup routine. Drains idle access tokens that the hourly job's
  // TTL/revoked/uses-exhausted predicates cannot reach. Separate lock from
  // the hourly so a long-running hourly run does not starve the weekly job.
  queueService.start(QueueName.WeeklyResourceCleanUp, async () => {
    let lock: Awaited<ReturnType<typeof keyStore.acquireLock>> | undefined;
    try {
      lock = await keyStore.acquireLock([KeyStorePrefixes.WeeklyResourceCleanUpLock], CLEANUP_LOCK_TTL_MS, {
        retryCount: 0
      });
    } catch {
      logger.info(`${QueueName.WeeklyResourceCleanUp}: another instance holds the lock, skipping this run`);
      return;
    }
    try {
      logger.info(`${QueueName.WeeklyResourceCleanUp}: queue task started`);
      await identityAccessTokenDAL.removeIdleTokens();
      logger.info(`${QueueName.WeeklyResourceCleanUp}: queue task completed`);
    } catch (error) {
      logger.error(error, `${QueueName.WeeklyResourceCleanUp}: resource cleanup failed`);
      throw error;
    } finally {
      await lock.release().catch((err) => logger.warn(err, `${QueueName.WeeklyResourceCleanUp}: failed to release lock`));

Check failure on line 174 in backend/src/services/resource-cleanup/resource-cleanup-queue.ts


GitHub Actions / Lint

Replace `.release()` with `⏎··········.release()⏎··········`
    }
  });

  await queueService.upsertJobScheduler(
    QueueName.WeeklyResourceCleanUp,
    `${JOB_SCHEDULER_PREFIX}:${QueueJobs.WeeklyResourceCleanUp}`,
    { pattern: appCfg.isDailyResourceCleanUpDevelopmentMode ? "*/5 * * * *" : "0 3 * * 0" },
    { name: QueueJobs.WeeklyResourceCleanUp }
  );
};

return {