18 changes: 16 additions & 2 deletions pipelines/kubeflow-pipeline.py
@@ -230,7 +230,7 @@ def chunk_and_embed(
     import re
     import torch
     from sentence_transformers import SentenceTransformer
-    from langchain.text_splitter import RecursiveCharacterTextSplitter
+    from langchain_text_splitter import RecursiveCharacterTextSplitter
Copilot AI Mar 22, 2026
chunk_and_embed now imports RecursiveCharacterTextSplitter from langchain_text_splitter, but the installed dependency in this component is langchain (see packages_to_install), and the repository requirements use langchain-text-splitters whose module name is typically langchain_text_splitters (plural). As written, this is likely to fail at runtime with ModuleNotFoundError. Align the import with the dependency you install (either revert to langchain.text_splitter when installing langchain, or install langchain-text-splitters and import from langchain_text_splitters).

Suggested change:
-from langchain_text_splitter import RecursiveCharacterTextSplitter
+from langchain.text_splitter import RecursiveCharacterTextSplitter

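The package-name vs module-name mismatch the comment describes can be sanity-checked with the stdlib alone. A minimal sketch (the `module_available` helper is illustrative, not part of the pipeline): the PyPI package `langchain-text-splitters` installs the module `langchain_text_splitters` (plural), so the singular `langchain_text_splitter` in this diff would not resolve.

```python
import importlib.util

def module_available(module_name: str) -> bool:
    """Return True if module_name resolves to an importable module,
    without actually importing it."""
    return importlib.util.find_spec(module_name) is not None

# Stdlib modules resolve; a misspelled top-level module name does not,
# which is exactly how the singular import would fail at runtime.
print(module_available("json"))                      # True
print(module_available("no_such_module_xyz123"))     # False
```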

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2', device=device)
@@ -420,6 +420,13 @@ def github_rag_pipeline(
github_token=github_token
)

+    issues_task = download_github_issues(
+        repos="kubeflow/kubeflow,kubeflow/pipelines",
+        labels="",
+        state="open",
+        max_issues_per_repo=50,
+        github_token=github_token
+    )
Comment on lines +423 to +429
Copilot AI Mar 22, 2026

The repos list for download_github_issues is hard-coded to kubeflow/kubeflow,kubeflow/pipelines, which makes github_rag_pipeline less reusable and inconsistent with the existing repo_owner/repo_name parameters. Consider adding a pipeline parameter (e.g., issues_repos) with this as the default, or deriving it from existing inputs.

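As a sketch of the parameterization this comment suggests: a resolver that prefers an explicit override and otherwise derives the list from the pipeline's existing inputs. The names `issues_repos` and `default_issues_repos` are hypothetical, not from the current pipeline.

```python
def default_issues_repos(repo_owner: str, repo_name: str,
                         issues_repos: str = "") -> str:
    """Resolve the repos list for download_github_issues: use the
    explicit issues_repos override when given, otherwise fall back
    to the repo the pipeline already targets via repo_owner/repo_name."""
    if issues_repos:
        return issues_repos
    return f"{repo_owner}/{repo_name}"

# Today's hard-coded list survives as an overridable default:
print(default_issues_repos("kubeflow", "kubeflow",
                           "kubeflow/kubeflow,kubeflow/pipelines"))
# No override: derive from the existing pipeline parameters.
print(default_issues_repos("kubeflow", "kubeflow"))
```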
# Chunk and embed the content
chunk_task = chunk_and_embed(
github_data=download_task.outputs["github_data"],
@@ -428,7 +435,14 @@ def github_rag_pipeline(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap
)

+    issues_chunk_task = chunk_and_embed(
+        github_data=issues_task.outputs["issues_data"],
+        repo_name="kubeflow-issues",
+        base_url="https://github.com",
Copilot AI Mar 22, 2026

For issues, download_github_issues emits records with path like issues/{repo}/{number} and also includes the real url (html_url). However, chunk_and_embed builds citation_url as f"{base_url}/{file_data['path']}" for non-doc paths, so passing base_url="https://github.com" here will generate invalid citation links (e.g., https://github.com/issues/...). Use the issue url field when present (or pass a base URL that matches the emitted paths, though this is tricky with multiple repos).

Suggested change:
-        base_url="https://github.com",
+        base_url="",

Author:

Good catch, thanks!

You're right that using html_url from the GitHub API would produce more accurate citations for issues.

To keep this PR focused on integrating issues ingestion without modifying existing chunk_and_embed behavior, I’ve kept the current approach unchanged.

I’m happy to open a follow-up PR to update chunk_and_embed to prefer html_url when available.
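The follow-up the author mentions could, as a rough sketch, prefer the API-provided link when one is present on the record. The `citation_url` helper and the exact record keys are assumptions about the data shape, not the current chunk_and_embed code.

```python
def citation_url(file_data: dict, base_url: str) -> str:
    """Prefer the record's real html_url (present on issue records from
    the GitHub API) over joining base_url with the synthetic
    issues/{repo}/{number} path, which would yield broken links like
    https://github.com/issues/..."""
    real_url = file_data.get("html_url")
    if real_url:
        return real_url
    return f"{base_url}/{file_data['path']}"

# Issue record (number 42 is illustrative): use the API-provided link.
issue = {"path": "issues/kubeflow/pipelines/42",
         "html_url": "https://github.com/kubeflow/pipelines/issues/42"}
# Doc record without html_url: keep the existing base_url + path join.
doc = {"path": "docs/started/installing-kubeflow.md"}

print(citation_url(issue, "https://github.com"))
print(citation_url(doc, "https://www.kubeflow.org"))
```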

+        chunk_size=chunk_size,
+        chunk_overlap=chunk_overlap
+    )
+    issues_chunk_task.after(issues_task)
# Store in Milvus
store_task = store_milvus(
embedded_data=chunk_task.outputs["embedded_data"],