extend indexing for apache #892

Open
chinyeungli wants to merge 14 commits into main from 631_extend_indexing_for_apache

chinyeungli commented Jun 29, 2026

Contributor

Reference: #631

I implemented the PURL logic based on package-url/purl-spec#834 (comment)

In short, the pipeline gets the file list from https://archive.apache.org/dist/zzz/find-ls2.txt.gz, filters it to keep only the files and paths we care about (archive-type files), and then loads the package metadata from https://projects.apache.org/json/foundation/projects.json.
It then assembles the information, constructs the PURLs, and passes them to "mine_and_publish_apache_packageurls" for mining/indexing.
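
For reviewers, the filtering step can be pictured with a minimal sketch like the one below. It is not the code in this PR: the suffix list and the helper names `is_archive` and `filter_archive_paths` are assumptions made for illustration, and it assumes the paths have already been extracted from the listing.

```python
from typing import Iterable, List

# Assumption: an illustrative set of "archive-type" suffixes.
# The actual list used by the pipeline may differ.
ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip", ".whl", ".jar")


def is_archive(path: str) -> bool:
    """Return True if the path ends with one of the archive suffixes."""
    return path.lower().endswith(ARCHIVE_SUFFIXES)


def filter_archive_paths(paths: Iterable[str]) -> List[str]:
    """Keep only archive files, dropping directories, signatures and checksums."""
    return [path for path in paths if is_archive(path)]


if __name__ == "__main__":
    sample = [
        "age/PG17/1.7.0/apache-age-1.7.0-src.tar.gz",
        "age/PG17/1.7.0/apache-age-1.7.0-src.tar.gz.asc",
        "airflow/2.11.2/",
    ]
    for kept in filter_archive_paths(sample):
        print(kept)
```

Matching on the end of the path means companion files such as `.asc` signatures are dropped without needing a separate exclusion list.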

The following are some of the printed base PURLs and their corresponding constructed PURLs from the "_mine_and_publish_packageurls" function:

==========================================================
Base: pkg:sid/apache.org/age/PG17
PURLS: ['pkg:sid/apache.org/age/PG17@1.7.0?file_name=apache-age-1.7.0-src.tar.gz']
==========================================================
Base: pkg:sid/apache.org/age/PG18
PURLS: ['pkg:sid/apache.org/age/PG18@1.7.0?file_name=apache-age-1.7.0-src.tar.gz']
==========================================================
Base: pkg:sid/apache.org/age/age-viewer
PURLS: ['pkg:sid/apache.org/age/age-viewer?download_url=https://archive.apache.org/dist/age/age-viewer/apache-age-viewer-1.0.0-rc2-incubating-src.tar.gz']
==========================================================
Base: pkg:sid/apache.org/airavata
PURLS: ['pkg:sid/apache.org/airavata@0.17?file_name=airavata-0.17-source-release.zip']
==========================================================
Base: pkg:sid/apache.org/airavata/custos
PURLS: ['pkg:sid/apache.org/airavata/custos@1.1?file_name=apache-airavata-custos-1.1-bin.tar.gz', 'pkg:sid/apache.org/airavata/custos@1.1?file_name=apache-airavata-custos-1.1-bin.zip', 'pkg:sid/apache.org/airavata/custos@1.1?file_name=custos-1.1-source-release.zip']
==========================================================
Base: pkg:sid/apache.org/airflow
PURLS: ['pkg:sid/apache.org/airflow@2.11.2?file_name=apache-airflow-2.11.2-source.tar.gz', 'pkg:sid/apache.org/airflow@2.11.2?file_name=apache_airflow-2.11.2-py3-none-any.whl', 'pkg:sid/apache.org/airflow@2.11.2?file_name=apache_airflow-2.11.2.tar.gz', 'pkg:sid/apache.org/airflow@3.2.2?file_name=apache_airflow-3.2.2-py3-none-any.whl', 'pkg:sid/apache.org/airflow@3.2.2?file_name=apache_airflow-3.2.2-source.tar.gz', 'pkg:sid/apache.org/airflow@3.2.2?file_name=apache_airflow-3.2.2.tar.gz', 'pkg:sid/apache.org/airflow@3.2.2?file_name=apache_airflow_core-3.2.2-py3-none-any.whl', 'pkg:sid/apache.org/airflow@3.2.2?file_name=apache_airflow_core-3.2.2.tar.gz']
==========================================================
Base: pkg:sid/apache.org/airflow/airflow-ctl
PURLS: ['pkg:sid/apache.org/airflow/airflow-ctl@0.1.5?file_name=apache_airflow_ctl-0.1.5-py3-none-any.whl', 'pkg:sid/apache.org/airflow/airflow-ctl@0.1.5?file_name=apache_airflow_ctl-0.1.5-source.tar.gz', 'pkg:sid/apache.org/airflow/airflow-ctl@0.1.5?file_name=apache_airflow_ctl-0.1.5.tar.gz']
==========================================================
Base: pkg:sid/apache.org/airflow/apache-airflow-mypy
PURLS: ['pkg:sid/apache.org/airflow/apache-airflow-mypy@0.1.0?file_name=apache_airflow_mypy-0.1.0-py3-none-any.whl', 'pkg:sid/apache.org/airflow/apache-airflow-mypy@0.1.0?file_name=apache_airflow_mypy-0.1.0.tar.gz']

Base is the versionless PURL of the package.
PURLS lists all PURLs collected or constructed for that specific base package.
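
The relationship between a base and its PURLs can be sketched as follows. This is only an illustration, not the implementation in this PR: `base_purl` and `group_by_base` are hypothetical helpers, and the string splitting assumes PURLs shaped like the ones printed above (no subpath).

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def base_purl(purl: str) -> str:
    """Strip the qualifiers and the version to get the versionless base PURL."""
    # Drop qualifiers first, since a download_url value could itself contain "@".
    without_qualifiers = purl.partition("?")[0]
    return without_qualifiers.partition("@")[0]


def group_by_base(purls: Iterable[str]) -> Dict[str, List[str]]:
    """Group full PURLs under their versionless base PURL."""
    grouped = defaultdict(list)
    for purl in purls:
        grouped[base_purl(purl)].append(purl)
    return dict(grouped)


if __name__ == "__main__":
    purls = [
        "pkg:sid/apache.org/airavata/custos@1.1?file_name=apache-airavata-custos-1.1-bin.zip",
        "pkg:sid/apache.org/airavata/custos@1.1?file_name=custos-1.1-source-release.zip",
        "pkg:sid/apache.org/airavata@0.17?file_name=airavata-0.17-source-release.zip",
    ]
    for base, members in group_by_base(purls).items():
        print("Base:", base)
        print("PURLS:", members)
```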

Note that when the qualifier is file_name, the download URL follows the common URL layout:

https://archive.apache.org/dist/{namespace}/{name}/{version}/{file_name}

If the actual download URL does not follow this layout (for example, the version does not start with a digit, or the URL contains special segments such as "sources" or "binaries"), the download URL cannot be reconstructed from the PURL alone, so the PURL uses a download_url qualifier instead.
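
To make the qualifier rule concrete, here is a rough sketch of the round trip between a download URL and its PURL. It is not the code in this PR: `build_purl` and `download_url_from_purl` are hypothetical names, and the sketch assumes that the path segments after `apache.org` in the PURL map directly to the directory path under `dist/`. The qualifier values are left unencoded to mirror the printed output above.

```python
DIST_ROOT = "https://archive.apache.org/dist/"
PURL_PREFIX = "pkg:sid/apache.org/"
# Assumption: segments that break the common layout, as named in the description.
SPECIAL_SEGMENTS = {"sources", "binaries"}


def build_purl(download_url: str) -> str:
    """Build a PURL, preferring file_name and falling back to download_url."""
    if not download_url.startswith(DIST_ROOT):
        raise ValueError(f"not an archive.apache.org dist URL: {download_url}")
    *dirs, file_name = download_url[len(DIST_ROOT):].split("/")
    if not dirs:
        raise ValueError(f"no package path in URL: {download_url}")
    has_special = bool(SPECIAL_SEGMENTS.intersection(dirs))
    # Common layout: {package path}/{version}/{file_name}, version starts with a digit.
    if len(dirs) >= 2 and dirs[-1][:1].isdigit() and not has_special:
        *package_path, version = dirs
        return f"{PURL_PREFIX}{'/'.join(package_path)}@{version}?file_name={file_name}"
    # Otherwise the URL cannot be rebuilt from the PURL, so carry it in full.
    return f"{PURL_PREFIX}{'/'.join(dirs)}?download_url={download_url}"


def download_url_from_purl(purl: str) -> str:
    """Recover the download URL from a PURL produced by build_purl."""
    base, _, qualifier = purl.partition("?")
    key, _, value = qualifier.partition("=")
    if key == "download_url":
        return value
    if key != "file_name":
        raise ValueError(f"unsupported qualifier in {purl}")
    path, _, version = base.partition("@")
    return f"{DIST_ROOT}{path[len(PURL_PREFIX):]}/{version}/{value}"


if __name__ == "__main__":
    url = "https://archive.apache.org/dist/airflow/2.11.2/apache_airflow-2.11.2.tar.gz"
    purl = build_purl(url)
    print(purl)
    print(download_url_from_purl(purl))
```

The fallback keeps every PURL resolvable: whenever the short form would lose information, the full URL travels with the PURL.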

JonoYang and others added 12 commits June 26, 2026 17:56
Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Chin Yeung Li <tli@nexb.com>
Signed-off-by: Chin Yeung Li <tli@nexb.com>
…e as similar as the debian.py #637

Signed-off-by: Chin Yeung Li <tli@nexb.com>
 - Constructing purls based on package-url/purl-spec#834 (comment)

Signed-off-by: Chin Yeung Li <tli@nexb.com>
Signed-off-by: Chin Yeung Li <tli@nexb.com>
Signed-off-by: Chin Yeung Li <tli@nexb.com>
Signed-off-by: Chin Yeung Li <tli@nexb.com>
Signed-off-by: Chin Yeung Li <tli@nexb.com>
…npm.py #631

Signed-off-by: Chin Yeung Li <tli@nexb.com>

2 participants