Is it possible to smooth FC2 input? #108

@MaxwellWjj

Hi Developers,

I recently tried applying SmoothQuant to Qwen3-1.7B and found that FC2 (down_proj in Qwen-style naming) is not included in the smoothed layers. However, the static per-tensor scaling factor for this layer's input can become extremely large when no smoothing is applied:

Layer 0: {'q_proj_input': 0.009227362204724409, 'o_proj_input': 0.021776574803149606, 'gate_input': 0.010765255905511811, 'down_input': 0.15748031496062992}

Layer 1: {'q_proj_input': 0.008427657480314961, 'o_proj_input': 0.011441929133858268, 'gate_input': 0.015071358267716535, 'down_input': 1.236220472440945}

Layer 2: {'q_proj_input': 0.009781003937007874, 'o_proj_input': 0.018331692913385825, 'gate_input': 0.023375984251968504, 'down_input': 133.03937007874015}

...

Layer 26: {'q_proj_input': 0.03297244094488189, 'o_proj_input': 2.031496062992126, 'gate_input': 0.022637795275590553, 'down_input': 11.21259842519685}

Layer 27: {'q_proj_input': 0.03641732283464567, 'o_proj_input': 3.0078740157480315, 'gate_input': 0.035679133858267716, 'down_input': 23.433070866141733}

Here, down_input is the per-tensor scale of the down_proj input (the equivalent of fc2 in OPT). The down_input scale grows rapidly over the first layers (0.16, 1.24, then 133.04 by layer 2) and remains large in the last layers, which means the activation outliers at this point are severe. As a result, the final perplexity is not acceptable: it rises from about 16 for the original model to about 90 after quantization.
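To give a sense of magnitude, the down_input scales listed above can be converted back into the activation range they imply. This is a minimal sketch that assumes symmetric INT8 per-tensor quantization with scale = absmax / 127; the calibration code is not shown in this issue, so that formula is an assumption, although the reported values are consistent with it. The helper name `implied_absmax` is hypothetical.

```python
# Assumption: symmetric INT8 per-tensor quantization, scale = absmax / 127.
QMAX = 127

# down_input scales copied from the layers reported above.
down_input_scales = {
    0: 0.15748031496062992,
    1: 1.236220472440945,
    2: 133.03937007874015,
    26: 11.21259842519685,
    27: 23.433070866141733,
}


def implied_absmax(scale, qmax=QMAX):
    """Largest activation magnitude that the given scale can represent."""
    return scale * qmax


for layer, scale in down_input_scales.items():
    absmax = implied_absmax(scale)
    # With round-to-nearest, any |x| below scale / 2 is quantized to zero.
    print(f"layer {layer}: scale={scale:.4f}, "
          f"implied absmax ~ {absmax:.1f}, "
          f"values below {scale / 2:.3f} round to 0")
```

Under this assumption, layer 2 covers a range of roughly ±16896 with a step of about 133, so ordinary activations of magnitude below about 66 all collapse to zero.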

If smoothing could be applied to this layer as well, I believe the results would improve considerably. Have you tried implementing this? Many thanks!
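One possible way to smooth this input, sketched here under stated assumptions rather than taken from the SmoothQuant code: in a gated MLP the down_proj input is `act(gate_proj(x)) * up_proj(x)`, so a per-channel divisor can be folded into the rows of the up_proj weight and the inverse multiplied into the matching columns of the down_proj weight. The NumPy example below uses random weights, SiLU as the activation, and the SmoothQuant-style factor `s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)`. The function names `gated_mlp` and `smooth_down_proj` are hypothetical, and a real model would also need calibration data to collect `act_absmax`.

```python
import numpy as np


def silu(x):
    return x / (1.0 + np.exp(-x))


def gated_mlp(x, w_gate, w_up, w_down):
    # Weights use the nn.Linear layout: (out_features, in_features).
    h = silu(x @ w_gate.T) * (x @ w_up.T)  # this is the down_proj input
    return h, h @ w_down.T


def smooth_down_proj(w_up, w_down, act_absmax, alpha=0.5, eps=1e-5):
    """Fold a per-channel smoothing factor into up_proj and down_proj."""
    # Per-input-channel weight range of down_proj (one value per column).
    w_absmax = np.abs(w_down).max(axis=0)
    s = (np.maximum(act_absmax, eps) ** alpha
         / np.maximum(w_absmax, eps) ** (1.0 - alpha))
    # Dividing row j of up_proj by s_j divides channel j of the down_proj
    # input by s_j; multiplying column j of down_proj by s_j undoes it.
    return w_up / s[:, None], w_down * s[None, :], s


rng = np.random.default_rng(0)
hidden, inter, tokens = 8, 16, 32
x = rng.standard_normal((tokens, hidden))
w_gate = rng.standard_normal((inter, hidden))
w_up = rng.standard_normal((inter, hidden))
w_up[3] *= 50.0  # inject an artificial outlier channel
w_down = rng.standard_normal((hidden, inter))

h, y = gated_mlp(x, w_gate, w_up, w_down)
act_absmax = np.abs(h).max(axis=0)  # stand-in for calibration statistics

w_up_s, w_down_s, s = smooth_down_proj(w_up, w_down, act_absmax)
h_s, y_s = gated_mlp(x, w_gate, w_up_s, w_down_s)

print("max |down_proj input| before:", np.abs(h).max())
print("max |down_proj input| after: ", np.abs(h_s).max())
print("outputs match:", np.allclose(y, y_s))
```

The rewrite is exact in floating point up to rounding because the scaling sits on the up_proj branch of the elementwise product, not inside the nonlinearity. Whether it recovers perplexity on Qwen3-1.7B is not something this sketch can show.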
