chandar-lab/CoPeP

CoPeP Public Release

This is the public release of CoPeP (continual pretraining for protein language models). It covers training, evaluation, and generation of the final paper plots, and is built on top of code from the AMPLIFY codebase.

What is included

  • continual_protein/: core training, continual methods, evaluation, plotting, scheduler, and utilities
  • amplify/: AMPLIFY modules required by this release
  • conf/pretrain/: Hydra experiment configs (training and evaluation)
  • conf/accelerate/: Accelerate/DeepSpeed runtime configs
  • scripts/python/pretrain.py: training entrypoint

Environment setup

  • Python: 3.12.x
  • Environment manager: uv
  • GPU runs with DeepSpeed require a system CUDA toolkit installation (including nvcc) and a valid CUDA_HOME path.

Install the dependencies:

uv sync

Run all commands with the managed environment:

uv run <command>

This repository includes both pyproject.toml and uv.lock for reproducible environments.

Runtime paths (environment variables)

If unset, local defaults are used.

  • CPEP_LOG_DIR: Hydra and runtime logs (including WandB local files)
  • CPEP_OUTPUT_DIR: training run directory root (config.yaml, checkpoints, model checkpoints)
  • CPEP_CHECKPOINT_DIR: base directory used by evaluation to resolve checkpoint_path
  • CPEP_EVAL_DIR: directory where evaluation JSON outputs are written
  • CPEP_HF_DATASET_REPO: Hugging Face dataset repo used by train/validation/eval configs

Example:

export CPEP_CHECKPOINT_DIR=/path/to/checkpoints
export CPEP_EVAL_DIR=/path/to/eval_results
export CPEP_LOG_DIR=/path/to/logs
export CPEP_HF_DATASET_REPO=chandar-lab/CoPeP
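
For scripts that sit alongside this release, the directory variables above can be read with a fallback when they are unset. The sketch below is illustrative only: resolve_runtime_dirs and the fallback values in ASSUMED_DEFAULTS are hypothetical and are not taken from the repository, whose actual local defaults may differ.

```python
import os
from pathlib import Path

# Hypothetical fallbacks for illustration only; the repository's real
# local defaults may be different.
ASSUMED_DEFAULTS = {
    "CPEP_LOG_DIR": "logs",
    "CPEP_OUTPUT_DIR": "outputs",
    "CPEP_CHECKPOINT_DIR": "outputs/checkpoints",
    "CPEP_EVAL_DIR": "outputs/evaluate",
}


def resolve_runtime_dirs(env=None):
    """Return each CPEP_* directory variable as a Path, using a fallback when unset or empty."""
    env = os.environ if env is None else env
    return {
        name: Path(env.get(name) or default)
        for name, default in ASSUMED_DEFAULTS.items()
    }
```

Calling resolve_runtime_dirs() with no arguments reads the current process environment; passing a plain dict makes the lookup easy to test.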

Hugging Face usage

Load tokenizer and published checkpoints

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained(
    "chandar-lab/AMPLIFY_120M",
    trust_remote_code=True,
)

model = AutoModel.from_pretrained(
    "chandar-lab/copep-checkpoints",
    subfolder="replay/task_5",
    trust_remote_code=True,
)

Load released dataset splits

import polars as pl
from datasets import load_dataset
from huggingface_hub import hf_hub_download

repo_id = "chandar-lab/CoPeP"

# 1) Load train split directly
train_ds = load_dataset(repo_id, split="train")

# 2) Download task index parquet (new approach used by training code)
task0_idx_path = hf_hub_download(
    repo_id=repo_id,
    repo_type="dataset",
    filename="splits/task_0.parquet",
)

# 3) Materialize examples by selecting train rows using row_idx
task0_rows = (
    pl.read_parquet(task0_idx_path, columns=["row_idx"])["row_idx"]
    .cast(pl.Int64)
    .to_list()
)
task0_examples = train_ds.select(task0_rows)

Training uses this same index mechanism internally:

  • Base data comes from load_dataset(<hf_repo_id>, split="train").
  • Task membership is defined by Hub files under splits/task_<k>.parquet.
  • The parquet row_idx column is used to select(...) rows from the train split.
  • Optional filtering for replay/continual methods is done by intersecting multiple index files.
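
The intersection step in the last bullet can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: intersect_row_indices is a hypothetical helper, and the two lists stand in for the row_idx columns of two index files.

```python
def intersect_row_indices(*index_lists):
    """Return the sorted row indices that appear in every given index list."""
    if not index_lists:
        return []
    common = set(index_lists[0])
    for indices in index_lists[1:]:
        common &= set(indices)
    return sorted(common)


# Toy stand-ins for the row_idx columns of two index files.
task_rows = [0, 3, 5, 8, 13]
filter_rows = [3, 4, 5, 13, 21]

shared_rows = intersect_row_indices(task_rows, filter_rows)
print(shared_rows)  # [3, 5, 13]
```

The resulting list can be passed to train_ds.select(...) in the same way as task0_rows in the loading example.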

Run experiments

1) Training with Hydra

Inspect available options:

uv run python scripts/python/pretrain.py --help

Run continual learning config:

uv run python scripts/python/pretrain.py \
  --config-name config_cl \
  job_name=cl_run \
  exp_name=cl_run

Select a specific continual method with method=<name> (available options are in conf/pretrain/method/, e.g. continual, replay, joint, gradient_ascent, random_labels, shrink_perturb, hare_tortoise):

uv run python scripts/python/pretrain.py \
  --config-name config_cl \
  method=replay \
  job_name=replay_run \
  exp_name=replay_run

Continual task workflow (task_num)

Continual runs are task-based: each run trains one task_num segment, saves checkpoints under .../task_<task_num>/, then you launch the next task.

Example for a 3-task workflow with 100000 steps per task (trainer.max_task_steps=100000):

# Task 0
uv run python scripts/python/pretrain.py \
  --config-name config_cl \
  method=replay \
  trainer.task_num=0 \
  trainer.current_step=0 \
  trainer.max_task_steps=100000 \
  trainer.max_steps=100000 \
  job_name=replay_run \
  exp_name=replay_run

# Task 1
uv run python scripts/python/pretrain.py \
  --config-name config_cl \
  method=replay \
  trainer.task_num=1 \
  trainer.current_step=100000 \
  trainer.max_task_steps=100000 \
  trainer.max_steps=200000 \
  job_name=replay_run \
  exp_name=replay_run

# Task 2
uv run python scripts/python/pretrain.py \
  --config-name config_cl \
  method=replay \
  trainer.task_num=2 \
  trainer.current_step=200000 \
  trainer.max_task_steps=100000 \
  trainer.max_steps=300000 \
  job_name=replay_run \
  exp_name=replay_run

Keep trainer.resume=True (default in config_cl) so each task resumes from the appropriate checkpoint state.

trainer.max_steps should be the cumulative step target at the end of the current task (matching the worker logic), while trainer.max_task_steps is the per-task budget (usually constant, except possibly for a final partial task).
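
The step arithmetic in the three commands above follows a simple pattern, sketched here as a helper that builds the trainer overrides for a given task. task_step_overrides is a hypothetical function, not part of the repository, and its handling of a final partial task (shrinking trainer.max_task_steps to the remaining steps) is an assumption.

```python
def task_step_overrides(task_num, steps_per_task, total_steps=None):
    """Build the trainer.* overrides for one task of a continual run.

    total_steps is an optional overall budget; when the last task would
    exceed it, the task is shortened (an assumption of this sketch).
    """
    if task_num < 0 or steps_per_task <= 0:
        raise ValueError("task_num must be >= 0 and steps_per_task must be > 0")
    current_step = task_num * steps_per_task      # steps completed by earlier tasks
    max_steps = current_step + steps_per_task     # cumulative target for this task
    if total_steps is not None:
        max_steps = min(max_steps, total_steps)
    max_task_steps = max_steps - current_step     # budget for this task alone
    if max_task_steps <= 0:
        raise ValueError("task starts at or beyond total_steps")
    return [
        f"trainer.task_num={task_num}",
        f"trainer.current_step={current_step}",
        f"trainer.max_task_steps={max_task_steps}",
        f"trainer.max_steps={max_steps}",
    ]


print(" ".join(task_step_overrides(1, 100000)))
```

For task 1 with 100000 steps per task, this reproduces the four trainer overrides of the Task 1 command above.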

2) Basic Accelerate launch (optional)

Use the portable debug runtime config (MULTI_CPU) for a simple launch example:

uv run accelerate launch \
  --config_file conf/accelerate/debug.yaml \
  scripts/python/pretrain.py \
  debug=True \
  job_name=accel_smoke \
  exp_name=accel_smoke \
  wandb.mode=disabled

For GPU/DeepSpeed runs, use one of the provided runtime configs in conf/accelerate/ (for example c2.yaml or node1x4.yaml) based on your hardware.

3) Evaluation

Inspect options:

uv run python -m continual_protein.evaluate.run_evaluation --help

Run evaluation on a checkpoint; checkpoint_path is resolved relative to CPEP_CHECKPOINT_DIR:

uv run python -m continual_protein.evaluate.run_evaluation \
  checkpoint_path=replay/task_5/mp_rank_00_model_states.pt \
  benchmarks='["peer","dgeb","uniprot_pppl","proteingym"]'

ProteinGym setup

ProteinGym evaluation expects both:

  1. DMS_substitutions.csv (the substitution manifest)
  2. A folder containing the referenced assay CSV files (DMS_ProteinGym_substitutions/)

Expected layout:

<PROTEINGYM_ROOT>/
  DMS_ProteinGym_substitutions/
    <assay_1>.csv
    <assay_2>.csv
    ...
  reference_files/
    DMS_substitutions.csv

You can pass these to the evaluation script in either of the following ways:

# Option A: set one base path and use config defaults under it
uv run python -m continual_protein.evaluate.run_evaluation \
  checkpoint_path=replay/task_5/mp_rank_00_model_states.pt \
  benchmarks='["proteingym"]' \
  data_home=<PROTEINGYM_ROOT_PARENT>

# Option B: override both paths explicitly
uv run python -m continual_protein.evaluate.run_evaluation \
  checkpoint_path=replay/task_5/mp_rank_00_model_states.pt \
  benchmarks='["proteingym"]' \
  proteingym.dms_folder=<PROTEINGYM_ROOT>/DMS_ProteinGym_substitutions \
  proteingym.substitution_path=<PROTEINGYM_ROOT>/reference_files/DMS_substitutions.csv
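
Before launching a ProteinGym evaluation, it can help to check that the expected layout is in place. The following is a minimal sketch using only the standard library; check_proteingym_layout is a hypothetical helper, not part of the repository. It checks only that the manifest and at least one assay CSV exist, not that every assay referenced in the manifest is present.

```python
from pathlib import Path


def check_proteingym_layout(root):
    """Return a list of problems found under root; an empty list means the layout looks complete."""
    root = Path(root)
    dms_folder = root / "DMS_ProteinGym_substitutions"
    manifest = root / "reference_files" / "DMS_substitutions.csv"

    problems = []
    if not manifest.is_file():
        problems.append(f"missing manifest: {manifest}")
    if not dms_folder.is_dir():
        problems.append(f"missing assay folder: {dms_folder}")
    elif not any(dms_folder.glob("*.csv")):
        problems.append(f"no assay CSV files in: {dms_folder}")
    return problems
```

A typical use is to print the returned problems and stop early if the list is non-empty, rather than discovering a missing file partway through evaluation.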

4) Final plot generation

Inspect options:

uv run python -m plotting.plots --help

Generate all final plots:

uv run python -m plotting.plots \
  --results-dir outputs/evaluate \
  --output-dir final_plots \
  --benchmark all

To reproduce the paper figures directly from the released results archive:

cd plotting/
unzip copep_results.zip
uv run python plots.py \
  --results-dir copep_results/ \
  --output-dir final_plots \
  --dms-path {path to DMS_substitutions.csv from ProteinGym}

Notes

  • PEER and DGEB evaluation branches rely on optional external benchmark stacks and are loaded lazily.
