Feature/resource management by shubhamxdd · Pull Request #3 · shubhamxdd/pyq-appl

shubhamxdd · 2026-05-17T13:11:52Z

In this PR I've added resource/file upload to DigitalOcean Spaces using boto3,
and text extraction using an NVIDIA model via OpenRouter.

Summary by CodeRabbit

Release Notes

New Features
- Upload PDF and text files as resources with background processing
- Automated text extraction from scanned documents
- Resource management: list, delete, and retry extraction
UI/UX Improvements
- New Resources page for file uploads and management
- Updated app layout with sidebar navigation and theme toggle
- Real-time processing status with automatic updates

… OCR

…paces

… logging

… stability" This reverts commit 25fc5a3.

coderabbitai · 2026-05-17T13:12:03Z

Warning

Rate limit exceeded

@shubhamxdd has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 39 minutes and 26 seconds before requesting another review.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a1bc3036-666b-4f84-935d-ae57fd32b75d

📥 Commits

Reviewing files that changed from the base of the PR and between effc1a4 and 76410ae.

⛔ Files ignored due to path filters (1)

frontend/package-lock.json is excluded by !**/package-lock.json

📒 Files selected for processing (6)

.env.example
backend/app/config.py
backend/app/routers/resources.py
frontend/package.json
frontend/src/App.tsx
frontend/src/pages/Resources.tsx

📝 Walkthrough

This PR implements Phase 3 Resource Management by adding authenticated file upload endpoints, DigitalOcean Spaces storage integration, and background vision-based text extraction for PDFs and text files, alongside a React frontend dashboard for resource management with polling support.

Changes

Resource Management with Vision OCR

| Layer / File(s) | Summary |
| --- | --- |
| Storage configuration and service: `.env.example`, `backend/app/config.py`, `backend/app/services/storage.py`, `backend/requirements.txt` | Environment variables and Pydantic settings replace R2 with DigitalOcean Spaces credentials (`SPACES_ENDPOINT`, `SPACES_KEY`, `SPACES_SECRET`, `SPACES_BUCKET`, `SPACES_REGION`, `SPACES_PUBLIC_URL`); `MAX_OCR_PAGES` defaults to 12. `StorageService` wraps boto3 S3-compatible client with `upload_file()` (returns public URL or None) and `delete_file()` (returns True/False) methods. Dependencies `pypdfium2` and `pillow` added. |
| Backend API schemas and router setup: `backend/app/schemas/resource.py`, `backend/app/main.py`, `backend/app/routers/resources.py` (imports) | `ResourceBase` and `ResourceOut` Pydantic models define resource shape with id, user ownership, file URL, status, and creation time. `main.py` imports and mounts `resources.router` under `/api` alongside auth router. |
| Resource management endpoints: `backend/app/routers/resources.py` (endpoints) | `upload_resource` validates PDF/text, uploads to Spaces, creates DB record with `status="processing"`, enqueues extraction task. `list_resources` queries user-owned resources ordered by recency. `delete_resource` verifies ownership, removes from storage and database. `retry_extraction` resets status to `"processing"` and re-enqueues task. |
| Background extraction task: `backend/app/workers/tasks.py`, `backend/app/workers/arq_worker.py` | `extraction_task` downloads PDFs via HTTP (with CDN hostname fallback), processes up to `MAX_OCR_PAGES` with `pypdfium2`, renders pages to JPEG base64, sends each to OpenRouter vision API for OCR, aggregates text, and updates `extracted_text` and status to `"ready"`. Text files are fetched and stored directly. Unsupported types marked failed. Comprehensive error handling logs exceptions and attempts status recovery. |
| Frontend API client: `frontend/src/api/resources.ts` | `Resource` interface mirrors backend schema. `resourcesApi` wraps CRUD methods: `upload()` posts `FormData` with multipart encoding, `list()` fetches all, `delete(id)` removes, `retry(id)` retriggers extraction. |
| Frontend app setup and layout: `frontend/src/App.tsx`, `frontend/src/components/Layout.tsx` | React Query `QueryClientProvider` wraps app. Routing protects `/` and `/resources` with `Layout` wrapper. `Layout` component adds fixed sidebar with nav links, dark mode toggle (via `prefers-color-scheme`), and logout button; `Dashboard` component styling/labels updated. |
| Frontend resources management page: `frontend/src/pages/Resources.tsx` | Page component maintains file selection and upload state, fetches resources via React Query with polling (every 3 seconds while `processing`), defines mutations (upload/delete/retry) that invalidate cache on success. UI renders upload form with file input and type selector, and table showing filename, humanized type, color-coded status icons, formatted date, and conditional action buttons (retry for failed, delete always). |
| Documentation updates: `context/architecture.md`, `context/progress-tracker.md`, `context/project-overview.md` | Architecture marks storage as DigitalOcean Spaces; progress tracker updated to "Phase 3: Resource Management (In Review)" with sub-items, current goals focused on upload/OCR verification, and Phase 4/5 items. Project overview expands "In Scope" to explicitly include OCR/AI extraction and Spaces storage. |

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Frontend
  participant Backend_API
  participant Spaces
  participant ARQ_Worker
  participant OpenRouter
  participant Database
  User->>Frontend: Upload PDF/TXT
  Frontend->>Backend_API: POST /api/resources (multipart)
  Backend_API->>Spaces: upload_file(bytes)
  Spaces-->>Backend_API: public_url
  Backend_API->>Database: INSERT Resource (processing)
  Database-->>Backend_API: resource_id
  Backend_API->>ARQ_Worker: enqueue extraction_task(id)
  Backend_API-->>Frontend: ResourceOut
  Frontend->>Frontend: Poll /api/resources (3s)
  ARQ_Worker->>Spaces: Download file
  Spaces-->>ARQ_Worker: file bytes
  ARQ_Worker->>ARQ_Worker: Render pages to image
  ARQ_Worker->>OpenRouter: Vision API (per page)
  OpenRouter-->>ARQ_Worker: extracted text
  ARQ_Worker->>Database: UPDATE Resource (ready, text)
  Database-->>ARQ_Worker: ok
  Frontend->>Frontend: Display ready resource

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

shubhamxdd/pyq-appl#2: Establishes the base auth router and /api prefix mounting that this PR extends with the resources router.

Poem

🐰 Files now float in Spaces so bright,
Vision reads PDF and text in the night,
A sidebar appears with a toggle so neat,
Resources upload—the extraction's complete! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title 'Feature/resource management' directly summarizes the main changes: implementing resource management functionality including file uploads, OCR extraction, and related backend/frontend features across multiple files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

coderabbitai

Actionable comments posted: 13

🧹 Nitpick comments (2)

frontend/src/components/Layout.tsx (1)

26-27: ⚡ Quick win

Avoid sidebar links that immediately redirect away.

At Lines 26-27, /solver and /generator appear as active destinations, but current routing redirects both back to /. Consider hiding or disabling these nav items until those pages are actually available.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/src/components/Layout.tsx` around lines 26 - 27, The sidebar
currently exposes navigation entries with label 'PYQ Solver' and 'Paper Gen'
that point to paths '/solver' and '/generator' but those routes redirect to '/',
so update the Layout.tsx navigation config to hide or disable these items until
real routes exist: locate the nav array containing { label: 'PYQ Solver', icon:
Zap, path: '/solver' } and { label: 'Paper Gen', icon: FileEdit, path:
'/generator' } and either remove them, wrap them behind a feature flag/prop
(e.g., isSolverEnabled/isGeneratorEnabled) or set a disabled flag and render
them non-clickable with appropriate UI state so users are not redirected
unexpectedly.

backend/app/workers/tasks.py (1)

71-97: ⚡ Quick win

Reuse one httpx.AsyncClient for all OCR page calls.

Opening a new client for each page loses connection pooling and adds repeated setup overhead on the hot path.

Proposed refactor

-                for i in range(pages_to_process):
+                async with httpx.AsyncClient(timeout=90.0) as client:
+                    for i in range(pages_to_process):
                         print(f"DEBUG: Rendering Page {i+1}/{pages_to_process}...")
                         page = pdf[i]
                         # Render page to image
                         bitmap = page.render(scale=2) 
                         pil_image = bitmap.to_pil()
@@
-                    async with httpx.AsyncClient(timeout=90.0) as client:
-                        vision_response = await client.post(
+                        vision_response = await client.post(
                             "https://openrouter.ai/api/v1/chat/completions",

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/workers/tasks.py` around lines 71 - 97, The code opens a new
httpx.AsyncClient for each page OCR call (the async with httpx.AsyncClient(...)
block around the client.post that produces vision_response using img_base64),
which loses connection pooling; refactor by creating a single AsyncClient
instance once outside the per-page loop (or accept one as a parameter), reuse
that client.post(...) inside the loop to call the same OpenRouter endpoint for
each image, and close the client after the loop (or rely on context manager at
the higher scope) so connection pooling and performance are preserved.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/app/routers/resources.py`:
- Around line 54-61: The code currently commits the new Resource with
status="processing" before enqueuing the Redis job; instead, in upload_resource
(and the retry endpoint handler) create the Redis pool and call await
redis.enqueue_job('extraction_task', str(new_resource.id)) before calling
db.commit()/db.refresh(); wrap the enqueue in a try/except so that if
enqueue_job fails you rollback the DB (await db.rollback()) and return/raise an
error (e.g., HTTPException) so the record does not remain stuck in "processing";
only after successful enqueue call await db.commit() and await
db.refresh(new_resource). Ensure the same change is applied to both the
upload_resource flow and the retry endpoint flow that currently call
db.add(new_resource) / await db.commit() / await redis.enqueue_job(...).
- Around line 88-94: The DB row is deleted even if
storage_service.delete_file(object_name) fails; update the delete flow in the
function handling resource deletion to check the boolean return of
storage_service.delete_file(object_name) (use the computed object_name from
resource.file_url), and if it returns False abort the operation instead of
calling await db.delete(resource) / await db.commit(); return or raise an
appropriate error (e.g., HTTP error or custom exception) conveying the storage
deletion failure so the DB row is not orphaned.
- Around line 59-60: Your POST handlers (upload_resource and retry_extraction)
create a new ARQ pool via create_pool/RedisSettings.from_dsn and call
enqueue_job on it per request, causing connection churn; instead add a FastAPI
lifespan that creates the pool once (redis_pool = await create_pool(...)),
stores it as app.state.redis_pool, yields, and closes it on shutdown, then
update the handlers to retrieve the pool from app.state.redis_pool (or a simple
dependency that returns app.state.redis_pool) and call enqueue_job on that
shared pool rather than creating a new one per request.
- Around line 31-33: The current unbounded await file.read() (the line "content
= await file.read()") can OOM; change to a chunked read with a hard cap: define
MAX_UPLOAD_BYTES (e.g. configurable), read in a loop using await
file.read(CHUNK_SIZE) and accumulate into a buffer while tracking total_size,
and if total_size > MAX_UPLOAD_BYTES raise an HTTPException(413) (or return an
appropriate error) and abort processing; update the handler (the upload/resource
endpoint that uses the variable file) to use this pattern and document
MAX_UPLOAD_BYTES.
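
The chunked read with a hard cap described in the last item above can be sketched as follows. This is a minimal illustration, not the project's code: `read_capped`, `UploadTooLarge`, `FakeUpload`, and the 10 MiB / 64 KiB values are assumptions chosen for the example. It relies only on the file object exposing an async `read(size)` method, which is the shape of FastAPI's `UploadFile`.

```python
import asyncio

# Assumed values for illustration; the real cap would come from settings.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024
CHUNK_SIZE = 64 * 1024


class UploadTooLarge(Exception):
    """Hypothetical error raised when an upload exceeds the cap."""


async def read_capped(file, max_bytes=MAX_UPLOAD_BYTES, chunk_size=CHUNK_SIZE) -> bytes:
    """Read an upload in chunks and abort once it grows past max_bytes."""
    buffer = bytearray()
    while True:
        chunk = await file.read(chunk_size)
        if not chunk:  # empty bytes means end of file
            break
        buffer.extend(chunk)
        if len(buffer) > max_bytes:
            # At most max_bytes + chunk_size bytes are ever held in memory.
            raise UploadTooLarge(f"upload exceeds {max_bytes} bytes")
    return bytes(buffer)


class FakeUpload:
    """Stand-in with the async read(size) shape of an uploaded file."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0

    async def read(self, size: int = -1) -> bytes:
        if size < 0:
            size = len(self._data) - self._pos
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk
```

In the endpoint, the handler would catch `UploadTooLarge` and respond with HTTP 413, as the comment suggests.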

In `@backend/app/services/storage.py`:
- Around line 19-26: Change object uploads to private ACL instead of
public-read: in the code path that calls self.client.put_object (and uses
settings.SPACES_BUCKET, Key=object_name), set ACL='private' (or omit ACL) and
stop returning the direct public URL (settings.SPACES_PUBLIC_URL/...); instead
return the object key or a stable identifier and generate time-limited access
via self.client.generate_presigned_url('get_object', Params={'Bucket':
settings.SPACES_BUCKET, 'Key': object_name}, ExpiresIn=...) when the API needs
to serve a download. Also update the extraction task to read files with the
boto3 client (self.client.get_object / streaming body) using the same
SPACES_BUCKET and object_name rather than fetching the public HTTP URL.

In `@backend/app/workers/tasks.py`:
- Around line 114-117: The TXT extraction code using httpx.AsyncClient should
validate the HTTP response before storing content: after awaiting
client.get(resource.file_url) call, invoke response.raise_for_status() (or
otherwise check response.status_code == 200) and only then set
resource.extracted_text and resource.status = "ready"; ensure any exceptions
from response.raise_for_status() propagate or are handled similarly to the PDF
extraction path so error responses (4xx/5xx) are not stored as valid text.
- Around line 131-138: In the exception-recovery path update for Resource,
ensure the SQLAlchemy session is rolled back before performing any further
queries or commits: call await db.rollback() (or db.rollback() if sync) before
executing select(Resource).where(Resource.id == resource_id) and before
attempting to set res.status = "failed" and await db.commit(); reference the
existing symbols Resource, db, resource_id and the block that does result =
await db.execute(select(Resource)...) so the rollback happens first to clear the
failed transaction state.
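
The first item above (check the HTTP status before storing the body) can be sketched as follows. This is an illustration under stated assumptions: `ResourceRecord`, `store_text_response`, and `FakeResponse` are hypothetical names, and the function only assumes the response object offers `raise_for_status()` and `.text`, as `httpx.Response` does.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceRecord:
    """Hypothetical stand-in for the ORM Resource row."""
    extracted_text: Optional[str] = None
    status: str = "processing"


def store_text_response(resource, response) -> None:
    """Store the body only after the status check has passed."""
    # With httpx this raises httpx.HTTPStatusError on error responses,
    # so a 403/404 body is never saved as extracted text.
    response.raise_for_status()
    resource.extracted_text = response.text
    resource.status = "ready"


class FakeResponse:
    """Stand-in response; raises RuntimeError where httpx would raise."""

    def __init__(self, status_code: int, text: str):
        self.status_code = status_code
        self.text = text

    def raise_for_status(self) -> None:
        if not 200 <= self.status_code < 300:
            raise RuntimeError(f"HTTP {self.status_code}")
```

Because the exception propagates before any field is assigned, the task's existing error path can mark the resource as failed.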

In `@context/architecture.md`:
- Line 31: Update architecture.md so storage references are consistent: replace
all occurrences of "Cloudflare R2" and any other R2-specific wording (e.g.,
initialization, bucket naming, endpoint examples) with "DigitalOcean Spaces" and
corresponding Spaces-specific terms, and ensure the line that currently reads
"**DigitalOcean Spaces**: Original uploaded PDFs and generated PDF exports." is
the single canonical description; verify and update surrounding references at
the earlier "Cloudflare R2" mention and the later storage mention so they use
the same provider name and matching configuration terminology.

In `@context/project-overview.md`:
- Line 48: The project overview currently contradicts itself by listing
"OCR/AI-based text extraction for scanned documents" as in-scope while also
marking image/vision support out of scope; update the scope language to be
explicit and unambiguous (for example replace the line "OCR/AI-based text
extraction for scanned documents." with "Printed scanned PDFs (machine-printed
text) supported via OCR; handwriting recognition and broader image/vision
features are out of scope for the MVP."). Ensure the change appears near the
existing OCR/vision statements so readers clearly see the boundary and consider
adding a short parenthetical like "(handwriting deferred)" to the sentence to
prevent future confusion.

In `@frontend/src/api/resources.ts`:
- Around line 13-16: The upload function's parameter is currently declared as
type: string which weakens type-safety; change the upload signature to use the
Resource union type (e.g., Resource['type']) so only valid resource types are
accepted, update the function declaration for upload and any local references
expecting string to the narrowed type, and adjust any caller sites that pass
literals or variables to ensure they conform to Resource['type'] (or cast
appropriately) so the API contract is enforced at compile time.
- Around line 17-21: The POST call using api.post<Resource>('/resources/',
formData, { headers: { 'Content-Type': 'multipart/form-data' } }) should drop
the manual headers so the browser/Axios can set the multipart boundary
automatically; update the call that sends the formData (the api.post invocation
that assigns to response) to pass formData without the headers object (and
ensure no other wrapper adds Content-Type), leaving Axios to set the correct
Content-Type with boundary.

In `@frontend/src/pages/Resources.tsx`:
- Line 1: The file input retains its previous value after resetting state which
prevents re-selection of the same file; add a ref for the native input (e.g.,
uploadInputRef) on the <input type="file"> used in Resources.tsx and, in the
success/error/reset branches of the upload handler functions (the code paths
using setSelectedFile, setUploading, setUploadProgress around the upload logic
and the handlers near lines 24-27 and 79-83), set uploadInputRef.current.value =
'' (guarded for null) to clear the native input so selecting the same file fires
onChange again.
- Around line 123-136: The table can go blank when resources is undefined —
modify the conditional render in Resources.tsx around the isLoading / resources
checks to explicitly handle resources === undefined (and any error state) before
falling back to resources.length === 0 or resources.map; specifically, add a
branch like resources === undefined || fetchError that returns a <tr><td
colSpan={5}> row with a clear message (e.g., "Failed to load resources" or "No
data available") and appropriate styling, ensuring you still keep the existing
isLoading, resources?.length === 0, and resources?.map branches and use the same
colSpan={5} so table layout stays consistent.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0858e1a6-780e-4492-9a77-2ac99ccc55e8

📥 Commits

Reviewing files that changed from the base of the PR and between 153b9d3 and effc1a4.

⛔ Files ignored due to path filters (1)

frontend/bun.lock is excluded by !**/*.lock

📒 Files selected for processing (16)

.env.example
backend/app/config.py
backend/app/main.py
backend/app/routers/resources.py
backend/app/schemas/resource.py
backend/app/services/storage.py
backend/app/workers/arq_worker.py
backend/app/workers/tasks.py
backend/requirements.txt
context/architecture.md
context/progress-tracker.md
context/project-overview.md
frontend/src/App.tsx
frontend/src/api/resources.ts
frontend/src/components/Layout.tsx
frontend/src/pages/Resources.tsx

coderabbitai · 2026-05-17T13:16:55Z

+    redis = await create_pool(RedisSettings.from_dsn(settings.REDIS_URL))
+    await redis.enqueue_job('extraction_task', str(new_resource.id))


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

find backend -type f -name "*.py" | head -20

Repository: shubhamxdd/pyq-appl

Length of output: 662

🏁 Script executed:

cd backend && find . -type f -name "resources.py" | head -5

Repository: shubhamxdd/pyq-appl

Length of output: 90

🏁 Script executed:

wc -l backend/app/routers/resources.py

Repository: shubhamxdd/pyq-appl

Length of output: 100

🏁 Script executed:

sed -n '50,70p' backend/app/routers/resources.py

Repository: shubhamxdd/pyq-appl

Length of output: 751

🏁 Script executed:

sed -n '110,130p' backend/app/routers/resources.py

Repository: shubhamxdd/pyq-appl

Length of output: 499

🏁 Script executed:

cat -n backend/app/routers/resources.py | head -150

Repository: shubhamxdd/pyq-appl

Length of output: 4948

🏁 Script executed:

rg -n "create_pool|redis" backend/app --type py -A 2 -B 2

Repository: shubhamxdd/pyq-appl

Length of output: 1539

🏁 Script executed:

find backend -type f -name "*.py" -exec rg -l "lifespan|startup|app.state" {} \;

Repository: shubhamxdd/pyq-appl

Length of output: 97

🏁 Script executed:

cat backend/app/main.py

Repository: shubhamxdd/pyq-appl

Length of output: 701

🏁 Script executed:

rg -n "close|cleanup|app.state" backend/app --type py

Repository: shubhamxdd/pyq-appl

Length of output: 45

🏁 Script executed:

rg -n "arq|redis" backend/app/config.py

Repository: shubhamxdd/pyq-appl

Length of output: 45

🏁 Script executed:

cat backend/app/workers/arq_worker.py

Repository: shubhamxdd/pyq-appl

Length of output: 475

🌐 Web query:

ARQ Python library pool creation best practices async FastAPI

💡 Result:

When integrating the ARQ library with FastAPI, managing Redis connection pools effectively is critical for performance and stability. Below are the best practices for pool creation and management.

Centralized Pool Management

Avoid creating a new Redis pool for every request or task. Instead, initialize a single, long-lived connection pool that is shared across your application [1]. In FastAPI, the recommended approach is to use the lifespan event handler to create the pool on startup and close it on shutdown [2].

Example of Lifespan Management:

from contextlib import asynccontextmanager
from fastapi import FastAPI
from arq import create_pool
from arq.connections import RedisSettings

redis_pool = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: Initialize the pool
    global redis_pool
    redis_pool = await create_pool(RedisSettings(host="localhost", port=6379))
    yield
    # Shutdown: Close the pool
    await redis_pool.close()

app = FastAPI(lifespan=lifespan)

Handling Connection Limits

If you encounter errors related to connection limits (e.g., max_connections reached), you can explicitly configure the connection pool using redis-py's BlockingConnectionPool [3]. This is particularly useful for both the ARQ worker and the enqueuer to prevent crashes under high load [3].

from redis.asyncio import BlockingConnectionPool
from arq.connections import RedisSettings

# Configure with a specific pool
REDIS_SETTINGS = RedisSettings(connection_pool=BlockingConnectionPool(max_connections=10))

Worker Configuration

The ARQ worker should be configured to use the same Redis settings as your enqueuer to ensure consistency [4][5]. When running the worker, pass the settings class directly to the ARQ CLI or your worker entry point [5].

Key Considerations:

1. Avoid Blocking: Ensure all tasks defined for ARQ are async (async def) to maintain the non-blocking nature of the event loop [6][4].
2. Resource Cleanup: Always ensure that database sessions or other resources opened within an ARQ task are properly closed or returned to their respective pools [7][8]. Using context managers or ARQ's on_job_start/after_job_end hooks can help manage these lifecycles [7].
3. Resilience: Leverage ARQ's built-in retry mechanisms and exponential backoff for tasks that interact with external services or databases to handle transient failures gracefully [9].

Citations:

1: https://medium.com/@geetansh2k1/setting-up-and-using-an-async-redis-client-in-fastapi-the-right-way-0409ad3812e6

2: https://dev.to/meroline_lizlent/uvicorn-the-lightning-fast-asgi-server-powering-modern-python-apps-5a85

3: Allow connection_pool to be set in RedisSettings python-arq/arq#473

4: https://github.com/davidmuraya/fastapi-arq

5: https://oneuptime.com/blog/post/2026-03-31-redis-fastapi-background-tasks/view

6: https://arq-docs.helpmanual.io/index.html

7: https://wazaari.dev/blog/arq-sqlalchemy-done-right

8: https://stackoverflow.com/questions/70955332/how-to-correctly-use-sqlalchemy-within-fastapi-or-arq-for-mysql

9: https://davidmuraya.com/blog/fastapi-arq-retries/

Initialize Redis pool once at app startup, not per request.

Currently, each POST request handled by upload_resource (lines 59-60) and retry_extraction (lines 118-119) creates a fresh ARQ pool without closing it. Under load, this causes connection churn and risks exhausting Redis connections. Use FastAPI's lifespan context manager to initialize the pool once on startup, close it on shutdown, and inject it via a dependency.

@asynccontextmanager
async def lifespan(app: FastAPI):
    redis_pool = await create_pool(RedisSettings.from_dsn(settings.REDIS_URL))
    app.state.redis_pool = redis_pool
    yield
    await redis_pool.close()

app = FastAPI(lifespan=lifespan)

Then in routes, retrieve from app.state.redis_pool instead of creating new pools.
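
The lifecycle suggested above (create once, share across requests, close on shutdown) can be exercised without FastAPI or Redis. The sketch below uses only the standard library: `FakePool`, `create_fake_pool`, and `enqueue_extraction` are hypothetical stand-ins for the ARQ pool, `create_pool`, and the route handlers, and `SimpleNamespace` plays the role of the app object with its `state` attribute.

```python
import asyncio
from contextlib import asynccontextmanager
from types import SimpleNamespace


class FakePool:
    """Stand-in for an ARQ pool; counts how many pools get created."""
    created = 0

    def __init__(self):
        FakePool.created += 1
        self.jobs = []
        self.closed = False

    async def enqueue_job(self, name, *args):
        self.jobs.append((name, args))

    async def close(self):
        self.closed = True


async def create_fake_pool():
    # In the real app this would be arq's create_pool(RedisSettings...).
    return FakePool()


@asynccontextmanager
async def lifespan(app):
    # Startup: build the pool once and park it on app.state.
    app.state.redis_pool = await create_fake_pool()
    try:
        yield
    finally:
        # Shutdown: close the single shared pool.
        await app.state.redis_pool.close()


async def enqueue_extraction(app, resource_id: str):
    # Handlers reuse the shared pool instead of creating a new one.
    await app.state.redis_pool.enqueue_job("extraction_task", resource_id)


async def demo():
    app = SimpleNamespace(state=SimpleNamespace())
    async with lifespan(app):
        for resource_id in ("r1", "r2", "r3"):
            await enqueue_extraction(app, resource_id)
        pool = app.state.redis_pool
    return pool
```

Running `demo()` enqueues three jobs against one pool and closes it on exit; the per-request pattern in the current handlers would have created three pools and closed none.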

coderabbitai · 2026-05-17T13:16:55Z

+            self.client.put_object(
+                Bucket=settings.SPACES_BUCKET,
+                Key=object_name,
+                Body=file_content,
+                ACL='public-read',
+                ContentType=content_type
+            )
+            return f"{settings.SPACES_PUBLIC_URL}/{object_name}"


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

cat backend/app/services/storage.py

Repository: shubhamxdd/pyq-appl

Length of output: 1487

🏁 Script executed:

rg "put_object|upload|storage" backend/app --type py -l | head -20

Repository: shubhamxdd/pyq-appl

Length of output: 157

🏁 Script executed:

rg "public-read\|ACL" backend --type py -A 3 -B 3

Repository: shubhamxdd/pyq-appl

Length of output: 45

🏁 Script executed:

cat backend/app/routers/resources.py

Repository: shubhamxdd/pyq-appl

Length of output: 4101

🏁 Script executed:

cat backend/app/workers/tasks.py

Repository: shubhamxdd/pyq-appl

Length of output: 6629

Use private ACL with pre-signed URLs instead of public-read for uploaded resources.

Files are stored with ACL='public-read' (line 23), making them directly accessible to anyone with the URL regardless of application authentication. Though the API restricts listing/deletion to authenticated users, the URLs themselves (stored in the database and returned to clients) are not secret—if shared or leaked, anyone can access the file without authentication. This is particularly risky for user-uploaded PDFs and documents that may contain sensitive data.

Refactor to use private ACL and generate pre-signed URLs for temporary, authenticated access. Update the extraction task to use the boto3 client directly for internal file access rather than fetching via HTTP.

coderabbitai · 2026-05-17T13:16:55Z

- Long-context LLM via OpenRouter (No vector DB).
+- Long-context LLM via OpenRouter.
 - PDF and Plain Text file support.
+- OCR/AI-based text extraction for scanned documents.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify scope to remove vision/OCR contradiction.

Line 48 puts OCR/AI extraction for scanned docs in scope, but Line 55 marks image/vision support out of scope. Please tighten wording so the intended MVP boundary is unambiguous (e.g., “printed scanned PDFs supported, handwriting deferred”).

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@context/project-overview.md` at line 48, The project overview currently contradicts itself by listing "OCR/AI-based text extraction for scanned documents" as in-scope while also marking image/vision support out of scope; update the scope language to be explicit and unambiguous (for example replace the line "OCR/AI-based text extraction for scanned documents." with "Printed scanned PDFs (machine-printed text) supported via OCR; handwriting recognition and broader image/vision features are out of scope for the MVP."). Ensure the change appears near the existing OCR/vision statements so readers clearly see the boundary and consider adding a short parenthetical like "(handwriting deferred)" to the sentence to prevent future confusion.

coderabbitai · 2026-05-17T13:16:55Z

+  upload: async (file: File, type: string) => {
+    const formData = new FormData();
+    formData.append('file', file);
+    formData.append('type', type);


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Narrow the type parameter to the resource union.

At Line 13, type: string weakens type-safety and allows invalid values to compile. Use Resource['type'] to enforce the API contract at compile time.

Proposed fix

-  upload: async (file: File, type: string) => {
+  upload: async (file: File, type: Resource['type']) => {

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-  upload: async (file: File, type: string) => {
+  upload: async (file: File, type: Resource['type']) => {
     const formData = new FormData();
     formData.append('file', file);
     formData.append('type', type);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/src/api/resources.ts` around lines 13 - 16, The upload function's parameter is currently declared as type: string which weakens type-safety; change the upload signature to use the Resource union type (e.g., Resource['type']) so only valid resource types are accepted, update the function declaration for upload and any local references expecting string to the narrowed type, and adjust any caller sites that pass literals or variables to ensure they conform to Resource['type'] (or cast appropriately) so the API contract is enforced at compile time.

coderabbitai · 2026-05-17T13:16:55Z

+    const response = await api.post<Resource>('/resources/', formData, {
+      headers: {
+        'Content-Type': 'multipart/form-data',
+      },
+    });


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

For Axios v1.x in browsers, when sending a FormData instance, should Content-Type: multipart/form-data be omitted so the browser can add the boundary automatically?

💡 Result:

Yes, you should omit the Content-Type header when sending a FormData instance in the browser [1][2]. When you pass a FormData object as the data payload in Axios, the browser's underlying XHR or Fetch API automatically sets the Content-Type header to multipart/form-data and appends the necessary boundary string [1][2]. If you manually set the Content-Type header to multipart/form-data, you often omit the required boundary parameter, which causes the server to fail to parse the request [1][3][4].

Key points to remember:

- Automatic Handling: Axios detects the FormData instance and allows the browser to handle the header generation, including the unique boundary [2].
- Avoid Manual Overrides: Manually setting Content-Type: multipart/form-data can prevent the browser from injecting the correct boundary, leading to errors [1][4].
- Shortcut Methods: If you prefer to be explicit about the intent, Axios provides helper methods like postForm, putForm, and patchForm, which are designed to handle multipart/form-data requests automatically [5][6][7].
- Troubleshooting: If you find that the Content-Type is not being set correctly (e.g., it defaults to application/x-www-form-urlencoded), ensure you are not accidentally overriding it in your global Axios defaults or instance configuration [3][8][9]. If you must override a default header that is interfering, you can set the header to undefined or an empty string in your request config to allow Axios to re-evaluate the correct type [9].

Citations:

1: https://stackoverflow.com/questions/75742226/cant-set-content-type-header-for-axios-with-formdata

2: https://stackoverflow.com/questions/68643330/pass-data-to-service-in-axios/68643919

3: Form data - start boundary issue axios/axios#381

4: https://stackoverflow.com/questions/49579640/how-to-send-data-correct-axios-error-multipart-boundary-not-found

5: https://github.com/axios/axios/blob/v1.15.0/docs/pages/advanced/multipart-form-data-format.md

6: https://github.com/axios/axios/blob/v1.x/README.md

7: https://axios.rest/pages/advanced/multipart-form-data-format

8: Issue with difference in default behaviour between axios and axiosInstance when content-type is not set (specific case of multipart-/form-data) axios/axios#3857

9: Not automatically change the header content type to multipart form data axios/axios#5556

🏁 Script executed:

cat -n frontend/src/api/resources.ts | head -50

Repository: shubhamxdd/pyq-appl

Length of output: 1311

Remove manual Content-Type header when sending FormData.

When passing a FormData instance to Axios (line 17), manually setting 'Content-Type': 'multipart/form-data' supplies the header without the required boundary parameter, which can cause the server to fail to parse the multipart request. Axios and the browser set the correct header, including the boundary, when they detect FormData, so omit the headers object entirely.

Proposed fix

-    const response = await api.post<Resource>('/resources/', formData, {
-      headers: {
-        'Content-Type': 'multipart/form-data',
-      },
-    });
+    const response = await api.post<Resource>('/resources/', formData);

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-    const response = await api.post<Resource>('/resources/', formData, {
-      headers: {
-        'Content-Type': 'multipart/form-data',
-      },
-    });
+    const response = await api.post<Resource>('/resources/', formData);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/src/api/resources.ts` around lines 17 - 21, The POST call using api.post<Resource>('/resources/', formData, { headers: { 'Content-Type': 'multipart/form-data' } }) should drop the manual headers so the browser/Axios can set the multipart boundary automatically; update the call that sends the formData (the api.post invocation that assigns to response) to pass formData without the headers object (and ensure no other wrapper adds Content-Type), leaving Axios to set the correct Content-Type with boundary.
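
To see why the boundary parameter matters on the receiving side, here is a small standard-library Python sketch. It is not the project's backend code, and `multipart_boundary` is a hypothetical helper. It only shows that a bare `multipart/form-data` header carries no boundary for a parser to split the body on, while a header that includes one does.

```python
from email.message import Message
from typing import Optional


def multipart_boundary(content_type: str) -> Optional[str]:
    """Return the boundary parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param returns None when the parameter is absent.
    return msg.get_param("boundary")


# Header as set manually in the reviewed code: no boundary available.
print(multipart_boundary("multipart/form-data"))

# Header shaped like one a browser generates (boundary value is made up).
print(multipart_boundary("multipart/form-data; boundary=----abc123"))
```

A multipart parser needs that boundary string to find where each form field starts and ends, which is why leaving the header to the browser is the safer default.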

coderabbitai · 2026-05-17T13:16:55Z

@@ -0,0 +1,182 @@
+import { useState } from 'react';


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clear the native file input after successful upload.

At Lines 26-27, state resets, but the <input type="file"> still retains its previous value. Re-selecting the same file may not fire onChange, blocking immediate re-upload of the same file.

Proposed fix

-import { useState } from 'react';
+import { useRef, useState } from 'react';
@@
   const [file, setFile] = useState<File | null>(null);
+  const fileInputRef = useRef<HTMLInputElement>(null);
@@
     onSuccess: () => {
       queryClient.invalidateQueries({ queryKey: ['resources'] });
       setFile(null);
+      if (fileInputRef.current) fileInputRef.current.value = '';
     },
@@
           <input
+            ref={fileInputRef}
             type="file"
             accept=".pdf,.txt"

Also applies to: 24-27, 79-83

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/src/pages/Resources.tsx` at line 1, The file input retains its previous value after resetting state which prevents re-selection of the same file; add a ref for the native input (e.g., uploadInputRef) on the <input type="file"> used in Resources.tsx and, in the success/error/reset branches of the upload handler functions (the code paths using setSelectedFile, setUploading, setUploadProgress around the upload logic and the handlers near lines 24-27 and 79-83), set uploadInputRef.current.value = '' (guarded for null) to clear the native input so selecting the same file fires onChange again.

shubhamxdd added 10 commits May 17, 2026 16:11

feat: implement backend resource management with DO Spaces and Vision OCR

7dcd0fd

feat: implement frontend resource management UI and layout

33ef631

feat: complete Phase 3 - Resource Management with Vision OCR and DO Spaces

c65d8fb

fix: initialize QueryClientProvider in App.tsx

5c8f1ca

feat: add retry mechanism for failed resource processing

a16dd41

fix: switch to free vision model (Gemini 1.5 Flash) and improve error logging

27d9ec6

fix: switch to authenticated S3 downloads for extraction task stability

25fc5a3

Revert "fix: switch to authenticated S3 downloads for extraction task stability" (This reverts commit 25fc5a3.)

d6a65a6

fix: automatically bypass inactive CDN for extraction task

2f425e9

use nvidia model cause free hai

effc1a4

coderabbitai Bot reviewed May 17, 2026


shubhamxdd added 4 commits May 17, 2026 18:55

feat: add user notifications with react-hot-toast for resource actions

5e532ad

fix: ensure atomic database commit and background job enqueuing

8693850

fix: ensure DB record is preserved if cloud storage deletion fails

f198b7c

fix: handle error and undefined states in resources table

76410ae

shubhamxdd merged commit 2356425 into main May 17, 2026
1 check passed

This was referenced May 17, 2026

Feature/pyq solver #4

Merged

Feature/UI revamp shadcn #7

Merged

feat: Implement Sample Paper Generator with format detection and question generation #8

Merged

Phase 7 #12

Merged

Posthog analytics #14

Merged

		redis = await create_pool(RedisSettings.from_dsn(settings.REDIS_URL))
		await redis.enqueue_job('extraction_task', str(new_resource.id))
