This is the public release of the code for continual pretraining of protein language models (CoPeP): training, evaluation, and final paper plot generation. It builds on code from the AMPLIFY codebase.
- continual_protein/: core training, continual methods, evaluation, plotting, scheduler, and utilities
- amplify/: AMPLIFY modules required by this release
- conf/pretrain/: Hydra experiment configs (training and evaluation)
- conf/accelerate/: Accelerate/DeepSpeed runtime configs
- scripts/python/pretrain.py: training entrypoint
- Python: 3.12.x
- Environment manager: uv
- GPU runs with DeepSpeed require a system CUDA toolkit installation (including nvcc) and a valid CUDA_HOME path.
uv sync

Run all commands with the managed environment:
uv run <command>

This repository includes both pyproject.toml and uv.lock for reproducible environments.
If these environment variables are unset, local defaults are used.
- CPEP_LOG_DIR: Hydra and runtime logs (including WandB local files)
- CPEP_OUTPUT_DIR: training run directory root (config.yaml, checkpoints, model checkpoints)
- CPEP_CHECKPOINT_DIR: base directory used by evaluation to resolve checkpoint_path
- CPEP_EVAL_DIR: directory where evaluation JSON outputs are written
- CPEP_HF_DATASET_REPO: Hugging Face dataset repo used by train/validation/eval configs
Example:
export CPEP_CHECKPOINT_DIR=/path/to/checkpoints
export CPEP_EVAL_DIR=/path/to/eval_results
export CPEP_LOG_DIR=/path/to/logs
export CPEP_HF_DATASET_REPO=chandar-lab/CoPeP

from transformers import AutoTokenizer, AutoModel
# The tokenizer is loaded from the AMPLIFY 120M repo
tokenizer = AutoTokenizer.from_pretrained(
    "chandar-lab/AMPLIFY_120M",
    trust_remote_code=True,
)
# The model weights are loaded from one subfolder of the checkpoints repo
model = AutoModel.from_pretrained(
    "chandar-lab/copep-checkpoints",
    subfolder="replay/task_5",
    trust_remote_code=True,
)

import polars as pl
from datasets import load_dataset
from huggingface_hub import hf_hub_download
repo_id = "chandar-lab/CoPeP"
# 1) Load train split directly
train_ds = load_dataset(repo_id, split="train")
# 2) Download task index parquet (new approach used by training code)
task0_idx_path = hf_hub_download(
    repo_id=repo_id,
    repo_type="dataset",
    filename="splits/task_0.parquet",
)
# 3) Materialize examples by selecting train rows using row_idx
task0_rows = (
    pl.read_parquet(task0_idx_path, columns=["row_idx"])["row_idx"]
    .cast(pl.Int64)
    .to_list()
)
task0_examples = train_ds.select(task0_rows)

Training uses this same index mechanism internally:
- Base data comes from load_dataset(<hf_repo_id>, split="train").
- Task membership is defined by Hub files under splits/task_<k>.parquet.
- The parquet row_idx column is used to select(...) rows from the train split.
- Optional filtering for replay/continual methods is done by intersecting multiple index files.
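The intersection mentioned in the last bullet can be illustrated with a minimal sketch. It assumes each index file has already been read into a plain list of row_idx values (as in step 3 of the loading example). The helper name intersect_row_indices is hypothetical and is not taken from the training code, which may implement this step differently.

```python
from functools import reduce


def intersect_row_indices(index_lists):
    """Return the sorted row indices that appear in every input list."""
    if not index_lists:
        return []
    # Start from the first list and keep only indices shared with each later list.
    common = reduce(
        lambda acc, idx: acc & set(idx),
        index_lists[1:],
        set(index_lists[0]),
    )
    return sorted(common)


# Toy stand-ins for the row_idx columns of two index files (not real data).
task_rows = [0, 2, 4, 6, 8]
filter_rows = [4, 5, 6, 7, 8, 9]
selected = intersect_row_indices([task_rows, filter_rows])
print(selected)  # [4, 6, 8]
```

The resulting list could then be passed to train_ds.select(...) in the same way as task0_rows.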
Inspect available options:
uv run python scripts/python/pretrain.py --help

Run the continual learning config:
uv run python scripts/python/pretrain.py \
  --config-name config_cl \
  job_name=cl_run \
  exp_name=cl_run

Select a specific continual method with method=<name> (available options are in conf/pretrain/method/, e.g. continual, replay, joint, gradient_ascent, random_labels, shrink_perturb, hare_tortoise):
uv run python scripts/python/pretrain.py \
  --config-name config_cl \
  method=replay \
  job_name=replay_run \
  exp_name=replay_run

Continual runs are task-based: each run trains one task_num segment and saves checkpoints under .../task_<task_num>/; you then launch the next task.
Example for a 3-task workflow with steps_per_task=100000:
# Task 0
uv run python scripts/python/pretrain.py \
  --config-name config_cl \
  method=replay \
  trainer.task_num=0 \
  trainer.current_step=0 \
  trainer.max_task_steps=100000 \
  trainer.max_steps=100000 \
  job_name=replay_run \
  exp_name=replay_run
# Task 1
uv run python scripts/python/pretrain.py \
  --config-name config_cl \
  method=replay \
  trainer.task_num=1 \
  trainer.current_step=100000 \
  trainer.max_task_steps=100000 \
  trainer.max_steps=200000 \
  job_name=replay_run \
  exp_name=replay_run
# Task 2
uv run python scripts/python/pretrain.py \
  --config-name config_cl \
  method=replay \
  trainer.task_num=2 \
  trainer.current_step=200000 \
  trainer.max_task_steps=100000 \
  trainer.max_steps=300000 \
  job_name=replay_run \
  exp_name=replay_run

Keep trainer.resume=True (default in config_cl) so each task resumes from the appropriate checkpoint state.
trainer.max_steps should be the cumulative target for the current task boundary (matching worker logic), while trainer.max_task_steps is the per-task budget (usually constant except possibly the final partial task).
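The step arithmetic in the 3-task example can be written out as a small sketch. It assumes a constant per-task budget (no partial final task). The helper task_overrides is hypothetical and only reproduces the override values shown above; it is not part of the repository.

```python
def task_overrides(task_num, steps_per_task):
    """Compute the step-related overrides for one task with a constant per-task budget."""
    if task_num < 0 or steps_per_task <= 0:
        raise ValueError("task_num must be >= 0 and steps_per_task must be > 0")
    return {
        "trainer.task_num": task_num,
        # Steps already completed by all earlier tasks.
        "trainer.current_step": task_num * steps_per_task,
        # Budget for this task alone.
        "trainer.max_task_steps": steps_per_task,
        # Cumulative step count at the end of this task.
        "trainer.max_steps": (task_num + 1) * steps_per_task,
    }


# Print the overrides for a 3-task workflow with 100000 steps per task.
for k in range(3):
    overrides = task_overrides(k, 100_000)
    print(" ".join(f"{key}={value}" for key, value in overrides.items()))
```

Each printed line matches the trainer.* overrides of the corresponding command in the example.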
Use the portable debug runtime config (MULTI_CPU) for a simple launch example:
uv run accelerate launch \
  --config_file conf/accelerate/debug.yaml \
  scripts/python/pretrain.py \
  debug=True \
  job_name=accel_smoke \
  exp_name=accel_smoke \
  wandb.mode=disabled

For GPU/DeepSpeed runs, use one of the provided runtime configs in conf/accelerate/ (for example c2.yaml or node1x4.yaml) based on your hardware.
Inspect options:
uv run python -m continual_protein.evaluate.run_evaluation --help

Run evaluation from a checkpoint path relative to CPEP_CHECKPOINT_DIR:
uv run python -m continual_protein.evaluate.run_evaluation \
  checkpoint_path=replay/task_5/mp_rank_00_model_states.pt \
  benchmarks='["peer","dgeb","uniprot_pppl","proteingym"]'

ProteinGym evaluation expects both:
- DMS_substitutions.csv (the substitution manifest)
- A folder containing the referenced assay CSV files (DMS_ProteinGym_substitutions/)
Expected layout:
<PROTEINGYM_ROOT>/
  DMS_ProteinGym_substitutions/
    <assay_1>.csv
    <assay_2>.csv
    ...
  reference_files/
    DMS_substitutions.csv
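Before launching an evaluation, the layout above can be sanity-checked with a short script. This is a minimal sketch: the function check_proteingym_layout is hypothetical and not part of the repository, and it only checks that the folder and manifest exist, not that every assay listed in the manifest is present.

```python
from pathlib import Path


def check_proteingym_layout(root):
    """Return a list of problems found; an empty list means the layout looks right."""
    root = Path(root)
    dms_folder = root / "DMS_ProteinGym_substitutions"
    manifest = root / "reference_files" / "DMS_substitutions.csv"
    problems = []
    if not dms_folder.is_dir():
        problems.append(f"missing folder: {dms_folder}")
    elif not any(dms_folder.glob("*.csv")):
        problems.append(f"no assay CSV files in: {dms_folder}")
    if not manifest.is_file():
        problems.append(f"missing manifest: {manifest}")
    return problems
```

Calling check_proteingym_layout("/path/to/proteingym_root") and printing the returned list shows which part of the layout, if any, is missing.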
You can pass these to the evaluation script in either of the following ways:
# Option A: set one base path and use config defaults under it
uv run python -m continual_protein.evaluate.run_evaluation \
  checkpoint_path=replay/task_5/mp_rank_00_model_states.pt \
  benchmarks='["proteingym"]' \
  data_home=<PROTEINGYM_ROOT_PARENT>
# Option B: override both paths explicitly
uv run python -m continual_protein.evaluate.run_evaluation \
  checkpoint_path=replay/task_5/mp_rank_00_model_states.pt \
  benchmarks='["proteingym"]' \
  proteingym.dms_folder=<PROTEINGYM_ROOT>/DMS_ProteinGym_substitutions \
  proteingym.substitution_path=<PROTEINGYM_ROOT>/reference_files/DMS_substitutions.csv

Inspect options:
uv run python -m plotting.plots --help

Generate all final plots:
uv run python -m plotting.plots \
  --results-dir outputs/evaluate \
  --output-dir final_plots \
  --benchmark all

To reproduce the paper figures directly from the released results archive:
cd plotting/
unzip copep_results.zip
uv run python plots.py \
  --results-dir copep_results/ \
  --output-dir final_plots \
  --dms-path {path to DMS_substitutions.csv from ProteinGym}

- PEER and DGEB evaluation branches rely on optional external benchmark stacks and are loaded lazily.
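Lazy loading of an optional dependency generally follows the pattern sketched here: the import happens only when the benchmark is requested, so a missing package does not affect other benchmarks. The helper load_optional and the use of json as a stand-in module are illustrative assumptions; the repository's own loading code may differ.

```python
import importlib


def load_optional(module_name, benchmark):
    """Import module_name only at the point where the benchmark is requested."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"The '{benchmark}' benchmark needs the optional package "
            f"'{module_name}', which is not installed."
        ) from exc


# "json" stands in for an optional benchmark package; it is imported on first use.
json_module = load_optional("json", benchmark="demo")
print(json_module.dumps({"loaded": True}))
```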