CoPeP Public Release

This is the public release of CoPeP (continual pretraining for protein language models). It covers training, evaluation, and generation of the final paper plots, and is built on top of code from the AMPLIFY codebase.

What is included

continual_protein/: core training, continual methods, evaluation, plotting, scheduler, and utilities
amplify/: AMPLIFY modules required by this release
conf/pretrain/: Hydra experiment configs (training and evaluation)
conf/accelerate/: Accelerate/DeepSpeed runtime configs
scripts/python/pretrain.py: training entrypoint

Environment setup

Python: 3.12.x
Environment manager: uv
GPU runs with DeepSpeed require a system CUDA toolkit installation (including nvcc) and a valid CUDA_HOME path.

uv sync

Run all commands with the managed environment:

uv run <command>

This repository includes both pyproject.toml and uv.lock for reproducible environments.
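
Because GPU runs with DeepSpeed need nvcc and a valid CUDA_HOME, a quick pre-flight check can catch a missing toolkit before a long job is queued. The following is a minimal sketch, not part of this release; the function name check_cuda_toolchain and its messages are hypothetical.

```python
import os
import shutil
from pathlib import Path


def check_cuda_toolchain(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []

    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        problems.append("CUDA_HOME is not set")
    elif not Path(cuda_home).is_dir():
        problems.append(f"CUDA_HOME does not point to a directory: {cuda_home}")

    # nvcc may be found on PATH or under $CUDA_HOME/bin.
    nvcc_on_path = which("nvcc") is not None
    nvcc_in_home = bool(cuda_home) and (Path(cuda_home) / "bin" / "nvcc").exists()
    if not (nvcc_on_path or nvcc_in_home):
        problems.append("nvcc was not found on PATH or under CUDA_HOME/bin")

    return problems


if __name__ == "__main__":
    for problem in check_cuda_toolchain():
        print("warning:", problem)
```

Saved to a file of your choice, it can be run with uv run python <file> inside the managed environment.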

Runtime paths (environment variables)

If unset, local defaults are used.

CPEP_LOG_DIR: Hydra and runtime logs (including WandB local files)
CPEP_OUTPUT_DIR: training run directory root (config.yaml, checkpoints, model checkpoints)
CPEP_CHECKPOINT_DIR: base directory used by evaluation to resolve checkpoint_path
CPEP_EVAL_DIR: directory where evaluation JSON outputs are written
CPEP_HF_DATASET_REPO: Hugging Face dataset repo used by train/validation/eval configs

Example:

export CPEP_CHECKPOINT_DIR=/path/to/checkpoints
export CPEP_EVAL_DIR=/path/to/eval_results
export CPEP_LOG_DIR=/path/to/logs
export CPEP_HF_DATASET_REPO=chandar-lab/CoPeP
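
For scripts that sit alongside this release, the same variables can be read from Python with a fallback when they are unset. This is only a sketch: the fallback values in ASSUMED_DEFAULTS are hypothetical placeholders, and the release's real local defaults are defined in its configs and may differ.

```python
import os
from pathlib import Path

# Hypothetical fallbacks for illustration only; the real defaults
# are defined by the release's configs and may differ.
ASSUMED_DEFAULTS = {
    "CPEP_LOG_DIR": "logs",
    "CPEP_OUTPUT_DIR": "outputs",
    "CPEP_CHECKPOINT_DIR": "outputs/checkpoints",
    "CPEP_EVAL_DIR": "outputs/evaluate",
}


def resolve_runtime_paths(env=None):
    """Map each CPEP_* directory variable to a Path, using the assumed default if unset."""
    env = os.environ if env is None else env
    return {
        name: Path(env.get(name, default))
        for name, default in ASSUMED_DEFAULTS.items()
    }


if __name__ == "__main__":
    for name, path in resolve_runtime_paths().items():
        print(f"{name} -> {path}")
```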

Hugging Face usage

Load tokenizer and published checkpoints

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained(
    "chandar-lab/AMPLIFY_120M",
    trust_remote_code=True,
)

model = AutoModel.from_pretrained(
    "chandar-lab/copep-checkpoints",
    subfolder="replay/task_5",
    trust_remote_code=True,
)

Load released dataset splits

import polars as pl
from datasets import load_dataset
from huggingface_hub import hf_hub_download

repo_id = "chandar-lab/CoPeP"

# 1) Load train split directly
train_ds = load_dataset(repo_id, split="train")

# 2) Download task index parquet (new approach used by training code)
task0_idx_path = hf_hub_download(
    repo_id=repo_id,
    repo_type="dataset",
    filename="splits/task_0.parquet",
)

# 3) Materialize examples by selecting train rows using row_idx
task0_rows = (
    pl.read_parquet(task0_idx_path, columns=["row_idx"])["row_idx"]
    .cast(pl.Int64)
    .to_list()
)
task0_examples = train_ds.select(task0_rows)

Training uses this same index mechanism internally:

Base data comes from load_dataset(<hf_repo_id>, split="train").
Task membership is defined by Hub files under splits/task_<k>.parquet.
The parquet row_idx column is used to select(...) rows from the train split.
Optional filtering for replay/continual methods is done by intersecting multiple index files.
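
The intersection step in the last point can be illustrated with plain Python sets. This is a sketch of the idea only; the training code may implement it differently, and the helper name intersect_row_indices and the toy indices are made up for the example.

```python
def intersect_row_indices(*index_lists):
    """Return the sorted row indices that appear in every given index list."""
    if not index_lists:
        return []
    common = set(index_lists[0])
    for indices in index_lists[1:]:
        common &= set(indices)
    return sorted(common)


# Toy row_idx values standing in for the contents of two index files.
task_rows = [0, 2, 5, 7, 9]
filter_rows = [2, 3, 7, 9, 11]
selected = intersect_row_indices(task_rows, filter_rows)
print(selected)  # [2, 7, 9]
```

The resulting list can be passed to train_ds.select(...) in the same way as task0_rows above.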

Run experiments

1) Training with Hydra

Inspect available options:

uv run python scripts/python/pretrain.py --help

Run continual learning config:

uv run python scripts/python/pretrain.py \
  --config-name config_cl \
  job_name=cl_run \
  exp_name=cl_run

Select a specific continual method with method=<name> (available options are in conf/pretrain/method/, e.g. continual, replay, joint, gradient_ascent, random_labels, shrink_perturb, hare_tortoise):

uv run python scripts/python/pretrain.py \
  --config-name config_cl \
  method=replay \
  job_name=replay_run \
  exp_name=replay_run
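
To see which method names are available in a checkout, the config directory can be listed programmatically. The sketch below assumes each method is a single .yaml file directly under conf/pretrain/method/ (the usual Hydra config-group layout); the helper name list_methods is hypothetical.

```python
from pathlib import Path


def list_methods(method_dir="conf/pretrain/method"):
    """Return the config names usable as method=<name>, sorted alphabetically."""
    return sorted(p.stem for p in Path(method_dir).glob("*.yaml"))


if __name__ == "__main__":
    # Run from the repository root so the relative path resolves.
    print(list_methods())
```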

Continual task workflow (task_num)

Continual runs are task-based: each run trains one task_num segment, saves checkpoints under .../task_<task_num>/, then you launch the next task.

Example of a 3-task workflow with 100000 steps per task (trainer.max_task_steps=100000):

# Task 0
uv run python scripts/python/pretrain.py \
  --config-name config_cl \
  method=replay \
  trainer.task_num=0 \
  trainer.current_step=0 \
  trainer.max_task_steps=100000 \
  trainer.max_steps=100000 \
  job_name=replay_run \
  exp_name=replay_run

# Task 1
uv run python scripts/python/pretrain.py \
  --config-name config_cl \
  method=replay \
  trainer.task_num=1 \
  trainer.current_step=100000 \
  trainer.max_task_steps=100000 \
  trainer.max_steps=200000 \
  job_name=replay_run \
  exp_name=replay_run

# Task 2
uv run python scripts/python/pretrain.py \
  --config-name config_cl \
  method=replay \
  trainer.task_num=2 \
  trainer.current_step=200000 \
  trainer.max_task_steps=100000 \
  trainer.max_steps=300000 \
  job_name=replay_run \
  exp_name=replay_run

Keep trainer.resume=True (default in config_cl) so each task resumes from the appropriate checkpoint state.

trainer.max_steps is the cumulative step target at the end of the current task (matching the worker logic), whereas trainer.max_task_steps is the per-task budget, which is usually constant except possibly for a final partial task.
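
The step arithmetic behind these overrides can be written down as a small helper. This is a sketch under two assumptions: every task gets the same budget, and a final partial task is simply capped at an overall budget (the hypothetical total_steps argument). The helper task_overrides is not part of the release.

```python
def task_overrides(task_num, steps_per_task, total_steps=None):
    """Compute the trainer.* step overrides for one task of a continual run."""
    current_step = task_num * steps_per_task
    max_steps = current_step + steps_per_task
    if total_steps is not None:
        # Assumption: a final partial task stops at the overall budget.
        max_steps = min(max_steps, total_steps)
    return {
        "trainer.task_num": task_num,
        "trainer.current_step": current_step,
        "trainer.max_task_steps": max_steps - current_step,
        "trainer.max_steps": max_steps,
    }


# Print the overrides for the 3-task example with 100000 steps per task.
for task in range(3):
    overrides = task_overrides(task, steps_per_task=100_000)
    print(" ".join(f"{key}={value}" for key, value in overrides.items()))
```

Each printed line can be appended to the pretrain.py command for the corresponding task.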

2) Basic Accelerate launch (optional)

Use the portable debug runtime config (MULTI_CPU) for a simple launch example:

uv run accelerate launch \
  --config_file conf/accelerate/debug.yaml \
  scripts/python/pretrain.py \
  debug=True \
  job_name=accel_smoke \
  exp_name=accel_smoke \
  wandb.mode=disabled

For GPU/DeepSpeed runs, use one of the provided runtime configs in conf/accelerate/ (for example c2.yaml or node1x4.yaml) based on your hardware.

3) Evaluation

Inspect options:

uv run python -m continual_protein.evaluate.run_evaluation --help

Run evaluation on a checkpoint file whose path is given relative to CPEP_CHECKPOINT_DIR:

uv run python -m continual_protein.evaluate.run_evaluation \
  checkpoint_path=replay/task_5/mp_rank_00_model_states.pt \
  benchmarks='["peer","dgeb","uniprot_pppl","proteingym"]'
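
When several checkpoints need to be evaluated, the command can be assembled in Python and passed to subprocess.run. The sketch below assumes that every checkpoint follows the <method>/task_<k>/mp_rank_00_model_states.pt pattern of the example above; the helper name build_eval_command is hypothetical.

```python
import json


def build_eval_command(method, task_num, benchmarks):
    """Build the argument list for one evaluation run (suitable for subprocess.run)."""
    # Assumed checkpoint layout, copied from the example command.
    checkpoint = f"{method}/task_{task_num}/mp_rank_00_model_states.pt"
    # Hydra list override; an argument list needs no extra shell quoting.
    benchmark_list = json.dumps(list(benchmarks), separators=(",", ":"))
    return [
        "uv", "run", "python", "-m", "continual_protein.evaluate.run_evaluation",
        f"checkpoint_path={checkpoint}",
        f"benchmarks={benchmark_list}",
    ]


if __name__ == "__main__":
    print(build_eval_command("replay", 5, ["peer", "dgeb"]))
```

Calling subprocess.run(build_eval_command(...), check=True) in a loop over task numbers would then evaluate one checkpoint per task.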

ProteinGym setup

ProteinGym evaluation expects both:

DMS_substitutions.csv (the substitution manifest)
A folder containing the referenced assay CSV files (DMS_ProteinGym_substitutions/)

Expected layout:

<PROTEINGYM_ROOT>/
  DMS_ProteinGym_substitutions/
    <assay_1>.csv
    <assay_2>.csv
    ...
  reference_files/
    DMS_substitutions.csv

You can pass these to the evaluation script in either of the following ways:

# Option A: set one base path and use config defaults under it
uv run python -m continual_protein.evaluate.run_evaluation \
  checkpoint_path=replay/task_5/mp_rank_00_model_states.pt \
  benchmarks='["proteingym"]' \
  data_home=<PROTEINGYM_ROOT_PARENT>

# Option B: override both paths explicitly
uv run python -m continual_protein.evaluate.run_evaluation \
  checkpoint_path=replay/task_5/mp_rank_00_model_states.pt \
  benchmarks='["proteingym"]' \
  proteingym.dms_folder=<PROTEINGYM_ROOT>/DMS_ProteinGym_substitutions \
  proteingym.substitution_path=<PROTEINGYM_ROOT>/reference_files/DMS_substitutions.csv
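
Before launching a ProteinGym evaluation, the expected layout can be verified with a short check. This is a sketch only: check_proteingym_layout is a hypothetical helper, and it checks just the two items described above (the manifest and a non-empty assay folder).

```python
from pathlib import Path


def check_proteingym_layout(root):
    """Return the missing pieces under <PROTEINGYM_ROOT>; an empty list means the layout looks complete."""
    root = Path(root)
    missing = []

    manifest = root / "reference_files" / "DMS_substitutions.csv"
    if not manifest.is_file():
        missing.append(str(manifest))

    assay_dir = root / "DMS_ProteinGym_substitutions"
    if not assay_dir.is_dir():
        missing.append(str(assay_dir))
    elif not any(assay_dir.glob("*.csv")):
        missing.append(f"{assay_dir} contains no assay CSV files")

    return missing
```

It does not verify that every assay referenced in DMS_substitutions.csv is present, since that depends on the manifest's column names.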

4) Final plot generation

Inspect options:

uv run python -m plotting.plots --help

Generate all final plots:

uv run python -m plotting.plots \
  --results-dir outputs/evaluate \
  --output-dir final_plots \
  --benchmark all

To reproduce the paper figures directly from the released results archive:

cd plotting/
unzip copep_results.zip
uv run python plots.py \
  --results-dir copep_results/ \
  --output-dir final_plots \
  --dms-path <path to DMS_substitutions.csv from ProteinGym>

Notes

PEER and DGEB evaluation branches rely on optional external benchmark stacks and are loaded lazily.
