feat: add onerec 3b performance optimization and support old model. #1286
dcbb93e to b11c793
Code Review
This pull request enhances the NPU OneRec block layer by introducing ACLNN-based attention linear support, a decoder prefill-only execution mode, and improved handling for fused FFN and MoE expert weights. It also adds a compatibility path for specific checkpoint prefixes and updates the MoE attention mask logic. Review feedback highlights a critical memory safety issue where a local tensor is referenced via a dangling pointer in the variant pack. Furthermore, multiple style guide violations were noted, specifically regarding the use of auto for simple types, missing vector reserve() calls, and the need for parameter annotations on constant arguments.
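To illustrate the dangling-pointer finding, here is a minimal sketch. It assumes the variant pack holds non-owning pointers to tensors; `Tensor`, `VariantPack`, `RequestContext`, and `build_variant_pack` are hypothetical stand-ins, not the types or functions in this PR. The sketch also shows the style points raised above: explicit types instead of `auto` for simple values, a `reserve()` call before filling a vector, and an annotation on a constant argument.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the real tensor and variant-pack types.
struct Tensor {
  std::vector<float> data;
};

struct VariantPack {
  std::vector<const Tensor*> inputs;  // non-owning pointers
};

// Problematic pattern (left as a comment because using the pack afterwards
// would be undefined behavior):
//
//   void build_bad(VariantPack& pack) {
//     Tensor mask;                   // local, destroyed at end of scope
//     pack.inputs.push_back(&mask);  // pointer dangles after return
//   }

// One possible fix: the tensor lives in storage that outlives the pack.
// Copying is disabled so the stored pointer into `mask` cannot be duplicated
// into a context whose `mask` lives at a different address.
struct RequestContext {
  RequestContext() = default;
  RequestContext(const RequestContext&) = delete;
  RequestContext& operator=(const RequestContext&) = delete;

  Tensor mask;
  VariantPack pack;
};

void build_variant_pack(RequestContext& ctx,
                        const std::vector<Tensor>& inputs,
                        int64_t seq_len) {
  assert(seq_len >= 0);
  ctx.mask.data.assign(static_cast<std::size_t>(seq_len), /*value=*/1.0f);

  ctx.pack.inputs.clear();
  ctx.pack.inputs.reserve(inputs.size() + 1);  // final size is known up front
  for (const Tensor& tensor : inputs) {
    ctx.pack.inputs.push_back(&tensor);
  }
  ctx.pack.inputs.push_back(&ctx.mask);  // valid for as long as ctx is alive
}
```

The caller must keep both `inputs` and the `RequestContext` alive until the operation that consumes the pack has finished.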
/gemini review
Code Review
This pull request updates the OneRec NPU implementation to support ACLNN-based attention linear operations, introduces a prefill-only execution path for decoders, and enhances weight loading and merging for MoE layers, including support for fused expert weights and dynamic quantization. Feedback highlights a critical thread-safety concern regarding the storage of request-specific data as member variables in the layer implementation, which could lead to data corruption in concurrent environments. Additionally, several style guide violations were noted, specifically the improper use of 'auto' for simple types and the use of plain 'int' instead of fixed-width integers.
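The thread-safety concern can be sketched as follows. This is an illustration under assumptions, not the PR's code: `Request` and `DecoderLayer` are hypothetical names, and the arithmetic is a placeholder for real layer work. The idea is that the layer keeps only immutable state as members, while request-specific values are passed in per call and held in locals, so concurrent calls cannot overwrite each other's data. Fixed-width integers replace plain `int`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-request inputs.
struct Request {
  std::vector<int32_t> token_ids;
  int64_t seq_len;
};

class DecoderLayer {
 public:
  explicit DecoderLayer(int32_t scale) : scale_(scale) {}

  // Racy pattern (comment only): caching request data in a member means a
  // second concurrent caller can overwrite it before the first one reads it.
  //
  //   void forward_bad(const Request& request) {
  //     seq_len_ = request.seq_len;  // shared across all callers
  //     ...
  //   }

  // forward() is const: it reads only immutable layer state, and all
  // request-specific values stay in locals owned by this call.
  std::vector<int32_t> forward(const Request& request) const {
    const int64_t seq_len = request.seq_len;
    std::vector<int32_t> output;
    output.reserve(static_cast<std::size_t>(seq_len));
    for (int64_t i = 0; i < seq_len; ++i) {
      output.push_back(request.token_ids[static_cast<std::size_t>(i)] * scale_);
    }
    return output;
  }

 private:
  const int32_t scale_;  // set once at construction, never mutated
};
```

Because `forward()` does not write to any member, two requests can share one layer instance without locking, provided the members themselves are not modified elsewhere.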