Support cursor-based pagination for large GitHub repos #238

Description

@neuromechanist

Problem

GitHub's REST API limits page-based pagination to page 100 (10,000 items with per_page=100). Repositories with more than 10,000 issues/PRs (like mne-tools/mne-python with 13,000+) trigger an HTTP 422 error when requesting beyond page 100.

Current error:

HTTP 422 on page 101 for mne-tools/mne-python issues

This means we silently miss every issue/PR beyond the first 10,000.

Proposed Solution

Switch from page-based to cursor-based pagination, using either GitHub's GraphQL API or the REST API's Link header together with since/after parameters.

Option A: Use since parameter (REST API)

For issues/PRs, use the since parameter with an ISO 8601 timestamp to paginate by last-updated time instead of page number (since filters on when an item was last updated, not when it was created). This avoids the page limit entirely.
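A rough sketch of how Option A might look, under some assumptions: `fetch_page` is a hypothetical callable standing in for a request to `GET /repos/{owner}/{repo}/issues` with `state=all`, `sort=updated`, `direction=asc`, plus `since`, `page`, and `per_page`. The helper name, the dedup-by-number strategy, and the no-progress guard are illustrative and not taken from the existing code in github_sync.py. Whenever a window reaches the page cap, the loop moves `since` forward to the last timestamp seen and restarts from page 1.

```python
from typing import Callable, Dict, Iterator, List, Optional, Set

Issue = Dict[str, object]
# Hypothetical fetcher signature: (since, page, per_page) -> issues sorted by
# updated_at ascending. A real one would call the REST issues endpoint.
FetchPage = Callable[[Optional[str], int, int], List[Issue]]


def iter_issues_by_since(
    fetch_page: FetchPage,
    since: Optional[str] = None,
    per_page: int = 100,
    max_pages: int = 100,
) -> Iterator[Issue]:
    """Yield every issue by sliding a `since` window instead of going past max_pages."""
    seen: Set[object] = set()  # issue numbers already yielded
    while True:
        last_updated = since
        for page in range(1, max_pages + 1):
            items = fetch_page(since, page, per_page)
            for item in items:
                last_updated = item["updated_at"]
                # Windows can overlap at the boundary timestamp, so skip repeats.
                if item["number"] not in seen:
                    seen.add(item["number"])
                    yield item
            if len(items) < per_page:
                return  # short or empty page: nothing left to fetch
        # Page cap reached. Restart from page 1 with a later `since`.
        if last_updated == since:
            # Assumption: a full window sharing one timestamp cannot be advanced.
            raise RuntimeError("cannot advance `since`; window has a single timestamp")
        since = last_updated
```

Because the window is keyed on update time, an item edited during the sync can appear twice; this sketch keeps only the first copy, which may or may not be what the incremental sync logic wants.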

Option B: Switch to GraphQL API

GitHub's GraphQL API uses cursor-based pagination natively and has no page limit.
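A minimal sketch of Option B, assuming a hypothetical `run_query(query, variables)` helper that POSTs to GitHub's GraphQL endpoint and returns the `data` part of the response. The query uses the `issues` connection with `first`/`after` and reads `pageInfo.hasNextPage` and `pageInfo.endCursor`; the selected node fields are only an example. Note that GraphQL exposes issues and pull requests as separate connections, so PRs would need a similar second query.

```python
from typing import Callable, Dict, Iterator, Optional

ISSUES_QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: 100, after: $cursor, orderBy: {field: CREATED_AT, direction: ASC}) {
      nodes { number title updatedAt }
      pageInfo { hasNextPage endCursor }
    }
  }
}
"""

# Hypothetical transport: takes (query, variables), returns the response's "data" dict.
RunQuery = Callable[[str, Dict[str, Optional[str]]], dict]


def iter_issues_graphql(run_query: RunQuery, owner: str, name: str) -> Iterator[dict]:
    """Follow endCursor until hasNextPage is false, yielding each issue node."""
    cursor: Optional[str] = None  # None asks for the first page
    while True:
        data = run_query(ISSUES_QUERY, {"owner": owner, "name": name, "cursor": cursor})
        connection = data["repository"]["issues"]
        yield from connection["nodes"]
        page_info = connection["pageInfo"]
        if not page_info["hasNextPage"]:
            return
        cursor = page_info["endCursor"]
```

The cursor is opaque, so the loop never computes a page number and the page-100 cap does not come into play.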

Implementation

  • Modify src/knowledge/github_sync.py sync functions
  • Replace page=N iteration with cursor-based approach
  • Keep backward compatibility with existing incremental sync logic
  • Add tests for repos with >10,000 items
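For the last bullet, a test probably does not need a real 13,000-issue repository; an in-memory fake that imitates the page cap may be enough. The sketch below is one possible fixture. `PageLimitError`, `make_fake_fetcher`, and `naive_sync` are hypothetical names, and `naive_sync` is only a stand-in for page=N iteration, not the actual loop in github_sync.py.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List, Optional


class PageLimitError(Exception):
    """Stands in for the HTTP 422 reported on page 101 (hypothetical name)."""


def make_fake_fetcher(total_items: int, page_cap: int = 100) -> Callable:
    """Build fetch_page(since, page, per_page) over an in-memory issue list."""
    base = datetime(2020, 1, 1, tzinfo=timezone.utc)
    # One item per second, so updated_at values are distinct and sort as text.
    items: List[Dict[str, object]] = [
        {
            "number": i + 1,
            "updated_at": (base + timedelta(seconds=i)).strftime("%Y-%m-%dT%H:%M:%SZ"),
        }
        for i in range(total_items)
    ]

    def fetch_page(since: Optional[str], page: int, per_page: int = 100):
        if page > page_cap:
            raise PageLimitError(f"HTTP 422 on page {page}")
        # Assumption: `since` is treated as inclusive in this fake.
        window = [it for it in items if since is None or it["updated_at"] >= since]
        start = (page - 1) * per_page
        return window[start:start + per_page]

    return fetch_page


def naive_sync(fetch_page: Callable, per_page: int = 100) -> List[Dict[str, object]]:
    """Plain page=N iteration; raises once the fake's page cap is exceeded."""
    collected: List[Dict[str, object]] = []
    page = 1
    while True:
        batch = fetch_page(None, page, per_page)
        if not batch:
            return collected
        collected.extend(batch)
        page += 1
```

A regression test could assert that page-number iteration raises against `make_fake_fetcher(10_050)`, and that the replacement sync returns all 10,050 items from the same fixture.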

Context

Discovered during MNE community onboarding. mne-tools/mne-python has 13,000+ issues, exceeding GitHub's page-based pagination limit. The sync currently stops at 10,000 items silently (or errors on page 101).

Metadata

Assignees

No one assigned

Labels

  • P2 (Priority 2: Important, fix when possible)
  • enhancement (New feature or request)
  • operations (Operations, monitoring, and observability)
  • testing (Testing and quality assurance)
