DOMINO learns a minimal sufficient representation of a target domain from limited reference examples via soft prompt tuning and contrastive learning.
DOMINO trains two types of learnable continuous soft tokens:
- Domain-level soft tokens: Shared across all reference samples, capturing generalizable domain patterns.
- Sample-level soft tokens: Unique per sample, encoding sample-specific information.
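The two token types can be pictured as a shared prefix plus an optional per-sample extension. The sketch below is illustrative only and uses plain Python lists instead of tensors. The sizes, the helper names `init_tokens` and `build_prefix`, and the ordering (domain tokens first, then sample tokens) are assumptions, not taken from the DOMINO code.

```python
import random

# All sizes below are hypothetical and chosen only for illustration.
EMBED_DIM = 4
NUM_DOMAIN_TOKENS = 3
NUM_SAMPLE_TOKENS = 2
NUM_REFERENCE_SAMPLES = 5


def init_tokens(count, dim, rng):
    """Return `count` randomly initialised vectors of width `dim`."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(count)]


def build_prefix(domain_tokens, sample_tokens=None):
    """Concatenate the shared tokens with optional per-sample tokens.

    The domain-first ordering is an assumption made for this sketch.
    """
    prefix = list(domain_tokens)
    if sample_tokens is not None:
        prefix += sample_tokens
    return prefix


rng = random.Random(0)

# One set of domain-level tokens, shared by every reference sample.
domain_tokens = init_tokens(NUM_DOMAIN_TOKENS, EMBED_DIM, rng)

# One set of sample-level tokens per reference sample.
sample_tokens = {
    sample_id: init_tokens(NUM_SAMPLE_TOKENS, EMBED_DIM, rng)
    for sample_id in range(NUM_REFERENCE_SAMPLES)
}

# Prefix when training on reference sample 2: shared plus its own tokens.
train_prefix = build_prefix(domain_tokens, sample_tokens[2])

# Prefix that carries only the shared, domain-level tokens.
domain_only_prefix = build_prefix(domain_tokens)
```

In a real model these vectors would be trainable embeddings prepended to the input embeddings of a frozen LLM, which is how soft prompt tuning is usually set up.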
The training objective consists of two parts:
- Domain-level learning: Trains the domain-level soft tokens to reconstruct all reference samples, capturing common features and patterns across the domain.
- Contrastive learning: Each sample has its own private soft tokens. The model learns to use each sample's private tokens only for that sample. This forces the domain-level tokens to focus on shared knowledge, not sample-specific details.
The final loss balances both objectives. During synthesis, only domain-level tokens are used, producing diverse in-domain samples rather than memorized copies of the reference data.
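One way to read the two-part objective is as a reconstruction term plus a matching term. The sketch below is a guess at the general shape, not the loss implemented in `domino/contrastive/model.py`: it assumes an InfoNCE-style cross-entropy over private-token choices and a hypothetical scalar `weight` for balancing. The inputs are negative log-likelihoods (NLLs) that a real model would compute.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def domain_loss(domain_nll):
    """Mean NLL of each reference sample given only the domain-level tokens."""
    return sum(domain_nll) / len(domain_nll)


def contrastive_loss(pair_nll):
    """Cross-entropy that rewards matching each sample to its own private tokens.

    pair_nll[i][j] is the NLL of sample i when conditioned on the private
    tokens of sample j. Row i is treated as a classification problem whose
    correct class is j == i (an assumed formulation).
    """
    total = 0.0
    for i, row in enumerate(pair_nll):
        log_probs = log_softmax([-nll for nll in row])
        total -= log_probs[i]
    return total / len(pair_nll)


def total_loss(domain_nll, pair_nll, weight=0.5):
    """Weighted sum of both terms; `weight` is a hypothetical balancing knob."""
    return domain_loss(domain_nll) + weight * contrastive_loss(pair_nll)


# Toy values: each sample is reconstructed best with its own private tokens.
domain_nll = [2.0, 2.5, 1.5]
matched_pairs = [
    [0.5, 3.0, 3.0],
    [3.0, 0.5, 3.0],
    [3.0, 3.0, 0.5],
]
loss = total_loss(domain_nll, matched_pairs)
```

Under this formulation the contrastive term shrinks as private tokens become useful only for their own sample, which leaves the shared tokens to carry whatever is common across the domain.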
DOMINO/
├── domino/
│ ├── contrastive/ # Core method: public+private soft token contrastive learning
│ │ ├── model.py # PublicPrivateContrastiveModel
│ │ ├── train.py # Training script (HuggingFace Trainer)
│ │ ├── dataset.py # Dataset loader
│ │ └── generate.py # Synthetic data generation (transformers/vLLM)
│ │
│ ├── soft_prompt/ # Baseline: shared public soft prompt tuning
│ │ └── model.py, train.py, dataset.py, generate.py
│ │
│ ├── pipeline/ # Synthetic data quality control pipeline
│ │ ├── code_generation/ # Quality assessment, response generation, filtering
│ │ └── code_execution/ # Quality assessment, response generation, correctness check
│ │
│ ├── evaluation/ # LiveCodeBench evaluation harness
│ │ ├── code_generation/ # Pass@k computation, sandboxed testing
│ │ └── code_execution/ # Execution correctness, metrics
│ │
│ └── utils/ # Embedding extraction, statistics, LLM query
│
├── scripts/ # Example training and pipeline scripts
├── configs/ # DeepSpeed ZeRO configurations
└── examples/ # Visualization and analysis scripts
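The evaluation harness listed in the tree computes Pass@k. For readers unfamiliar with the metric, here is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n generations of which c pass. Whether `domino/evaluation` uses exactly this estimator is an assumption, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k drawn samples is correct.

    n: number of generated solutions for a problem
    c: number of those solutions that pass all tests
    k: number of solutions drawn without replacement
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect solutions exist,
    # in which case every draw of k contains a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark score is then typically the mean of `pass_at_k` over all problems.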
- Python 3.10+
- PyTorch 2.0+
- transformers
- vLLM
- DeepSpeed
- tree-sitter + tree-sitter-python
- scikit-learn, matplotlib, numpy, scipy
- pyext (for code generation evaluation)
Install dependencies:
pip install torch transformers vllm deepspeed tree-sitter tree-sitter-python scikit-learn matplotlib numpy scipy pyext datasets
Prepare your reference dataset in JSONL format:
{"question_content": "Write a function to ...", ...}
cd scripts
bash train_contrastive.sh
This trains public (domain-level) and private (sample-level) soft tokens via contrastive learning on the reference dataset.
bash pipeline_codegen.sh
This generates synthetic code problems, filters them by quality, and produces instruction-response pairs.
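Before launching the scripts above, it can help to confirm that the reference file is well-formed JSONL and that every record carries a `question_content` field. The checker below is a small sketch written for this README, not part of the repository. Only `question_content` comes from the example record; any other fields the loader expects are not covered here.

```python
import json


def check_jsonl(lines, required_field="question_content"):
    """Return (line_number, problem) pairs for lines that fail validation."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(record, dict) or not record.get(required_field):
            problems.append((number, f"missing or empty '{required_field}'"))
    return problems


# In-memory demo; the second and third lines are deliberately broken.
sample_lines = [
    '{"question_content": "Write a function to ..."}',
    '{"title": "no question here"}',
    "not json",
]
problems = check_jsonl(sample_lines)
```

To check a file on disk, pass an open file handle, since iterating over it yields one line at a time.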
Use the generated instruction-response pairs to fine-tune a base LLM using your preferred framework (e.g., LLaMA-Factory).
python -m domino.evaluation.code_generation.evaluate \
--data_path ./data/test.jsonl \
--model_path ./path/to/your/model
License: Apache 2.0
