19 changes: 19 additions & 0 deletions pipelines/kubeflow-pipeline.py
@@ -420,6 +420,24 @@ def github_rag_pipeline(
        github_token=github_token
    )

    issues_task = download_github_issues(
        repos=f"{repo_owner}/{repo_name}",
        labels="",
        state="open",
        max_issues_per_repo=50,
        github_token=github_token
    )
Comment on lines +423 to +429

Copilot AI Mar 22, 2026

The repos list for download_github_issues is hard-coded to kubeflow/kubeflow,kubeflow/pipelines, which makes github_rag_pipeline less reusable and inconsistent with the existing repo_owner/repo_name parameters. Consider adding a pipeline parameter (e.g., issues_repos) with this as the default, or deriving it from existing inputs.
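
One way to read this suggestion, as a minimal sketch: `issues_repos` is a hypothetical parameter name taken from the comment above, and `resolve_issue_repos` is a hypothetical helper that is not part of this PR. It assumes `repos` is a comma-separated string of `owner/name` entries, as in the value quoted above. In a KFP pipeline the inputs are likely placeholders at compile time, so logic like this would probably need to run inside the component rather than in the pipeline function body.

```python
def resolve_issue_repos(repo_owner: str, repo_name: str, issues_repos: str = "") -> str:
    """Return the comma-separated repos string for the issues download step.

    `issues_repos` is a hypothetical override; when it is empty, fall back
    to the existing repo_owner/repo_name inputs.
    """
    if issues_repos.strip():
        # Normalize "a/b, c/d" to "a/b,c/d" and drop empty entries.
        parts = [p.strip() for p in issues_repos.split(",")]
        return ",".join(p for p in parts if p)
    return f"{repo_owner}/{repo_name}"


# Falls back to the single repo the pipeline already knows about.
default_repos = resolve_issue_repos("kubeflow", "pipelines")
# An explicit list overrides the fallback.
custom_repos = resolve_issue_repos(
    "kubeflow", "pipelines", "kubeflow/kubeflow, kubeflow/pipelines"
)
```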

    issues_chunk_task = chunk_and_embed(
        github_data=issues_task.outputs["issues_data"],
        repo_name="kubeflow-issues",
        base_url="https://github.com",

Copilot AI Mar 22, 2026

For issues, download_github_issues emits records with path like issues/{repo}/{number} and also includes the real url (html_url). However, chunk_and_embed builds citation_url as f"{base_url}/{file_data['path']}" for non-doc paths, so passing base_url="https://github.com" here will generate invalid citation links (e.g., https://github.com/issues/...). Use the issue url field when present (or pass a base URL that matches the emitted paths, though this is tricky with multiple repos).

Suggested change
base_url="https://github.com",
base_url="",

Author

Good catch, thanks!

You're right that using html_url from the GitHub API would produce more accurate citations for issues.

To keep this PR focused on integrating issues ingestion without modifying existing chunk_and_embed behavior, I’ve kept the current approach unchanged.

I’m happy to open a follow-up PR to update chunk_and_embed to prefer html_url when available.
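
A rough sketch of what that follow-up could look like, under assumptions: `build_citation_url` is a hypothetical helper, not existing code in `chunk_and_embed`, and the record keys `path` and `url` are taken from the review comment above rather than verified against the component. The fallback mirrors the `f"{base_url}/{file_data['path']}"` expression quoted in that comment.

```python
def build_citation_url(file_data: dict, base_url: str) -> str:
    """Prefer the record's own URL; otherwise join base_url and path.

    Assumes issue records carry the GitHub html_url under the "url" key,
    as described in the review comment; file records may not have it.
    """
    url = file_data.get("url")
    if url:
        return url
    # Fallback: same construction the review comment attributes to
    # chunk_and_embed for non-doc paths.
    return f"{base_url}/{file_data['path']}"


# Illustrative records only; the values are made up for the example.
issue_record = {
    "path": "issues/kubeflow/pipelines/1",
    "url": "https://github.com/kubeflow/pipelines/issues/1",
}
file_record = {"path": "docs/intro.md"}

issue_citation = build_citation_url(issue_record, "https://github.com")
file_citation = build_citation_url(file_record, "https://example.com")
```

With this shape, the `base_url` passed for the issues task no longer affects issue citations, since each record supplies its own link.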

        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )

    issues_chunk_task.after(issues_task)

    # Chunk and embed the content
    chunk_task = chunk_and_embed(
        github_data=download_task.outputs["github_data"],
@@ -429,6 +447,7 @@ def github_rag_pipeline(
        chunk_overlap=chunk_overlap
    )


    # Store in Milvus
    store_task = store_milvus(
        embedded_data=chunk_task.outputs["embedded_data"],