Hello, and thank you very much for open-sourcing your work.
I was able to reproduce a PPL close to the paper's results using your Llama-3-8B script. However, when I switched the model to Qwen-2-7B without any other changes, the PPL became abnormally high (PPL=243978.58). From my experiments, the problem seems to be related to k_bits: setting k_bits to 16 (as in the script below) gives a normal PPL (PPL=8.07).
I wonder if there are any recommended parameter adjustments or code changes specifically for the Qwen series models. My test script is as follows:
python main.py \
--model_path ~/models/Qwen2-7B/ \
--model_name qwen-2-7b \
--output_dir ./log/tmp \
--wbits 4 \
--input_bits 4 \
--input_mode static \
--v_bits 4 \
--k_bits 16 \
--kv_group_size 128 \
--kv_mode static \
--mse_init \
--pre_rotate \
--down_online_had \
--qk_online_had \
--set_prefixed_tokens \
--eval_ppl \
--eval_tasks piqa,arc_easy,arc_challenge,hellaswag,winogrande \
--save_quant_dir ./pre_quantized_models/tmp
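To make the question more concrete: one possible explanation, which is only my assumption and which I have not verified against this codebase or the Qwen-2-7B key cache, is that a few large-magnitude key channels dominate the per-group scale at low bit widths. The NumPy sketch below illustrates that effect on synthetic data. It is not the repository's quantizer; `fake_quantize`, `relative_error`, and the synthetic `keys` tensor are hypothetical names chosen for illustration, and the group size of 128 simply mirrors `--kv_group_size 128`.

```python
import numpy as np


def fake_quantize(x, bits, group_size=128):
    """Symmetric per-group round-to-nearest quantize/dequantize (illustrative only)."""
    groups = x.reshape(-1, group_size)
    qmax = 2 ** (bits - 1) - 1
    # One scale per group, set by the largest absolute value in that group.
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(groups / scale), -qmax - 1, qmax)
    return (q * scale).reshape(x.shape)


def relative_error(x, bits, cols=slice(1, None), group_size=128):
    """Relative L2 error after quantization, measured on the selected columns."""
    xq = fake_quantize(x, bits, group_size)
    diff = (x - xq)[:, cols]
    return float(np.linalg.norm(diff) / np.linalg.norm(x[:, cols]))


rng = np.random.default_rng(0)
keys = rng.standard_normal((64, 128))  # synthetic stand-in, NOT real key states
keys_outlier = keys.copy()
keys_outlier[:, 0] *= 100.0            # hypothetical outlier channel

# Error is measured on the non-outlier channels (columns 1..127) only.
err_4 = relative_error(keys, bits=4)
err_4_outlier = relative_error(keys_outlier, bits=4)
err_16_outlier = relative_error(keys_outlier, bits=16)

print(f"4-bit,  no outlier:   {err_4:.4f}")
print(f"4-bit,  with outlier: {err_4_outlier:.4f}")
print(f"16-bit, with outlier: {err_16_outlier:.4f}")
```

If something like this applies to Qwen-2-7B keys, it would be consistent with k_bits 16 being fine while lower settings break, but I would appreciate confirmation of whether that is actually the cause here.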
Thank you very much for your time and any guidance you can provide.