Thanks to Sasha Rush, Llama 2 now has a one-file Rust implementation: a port of Andrej Karpathy's llama2.c. It already supports the following features:
- 4-bit GPTQ quantization (a minimal sketch follows this list)
- SIMD support for fast CPU inference
- Grouped-Query Attention (needed for the bigger Llamas)
- Memory mapping for instant loading of the 70B model (sketched further below)
- Static size checks, no pointers
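To make the quantization bullet concrete, here is a minimal sketch of 4-bit group quantization in the storage shape GPTQ-style weights typically use: eight 4-bit values packed into each u32, with one f32 scale shared per group. The GROUP_SIZE constant and the simple round-to-nearest quantizer are illustrative assumptions; GPTQ itself picks the quantized values with an error-compensating procedure, and llama2.rs's exact layout may differ.

```rust
const GROUP_SIZE: usize = 64; // illustrative; real group sizes vary (e.g. 128)

struct Quantized {
    packed: Vec<u32>, // eight 4-bit weights per u32
    scales: Vec<f32>, // one scale per GROUP_SIZE weights
}

fn quantize(weights: &[f32]) -> Quantized {
    assert_eq!(weights.len() % GROUP_SIZE, 0);
    let mut packed = Vec::with_capacity(weights.len() / 8);
    let mut scales = Vec::with_capacity(weights.len() / GROUP_SIZE);
    for group in weights.chunks(GROUP_SIZE) {
        // Map the largest magnitude in the group onto the 4-bit range [-8, 7].
        let max = group.iter().fold(0.0f32, |m, w| m.max(w.abs()));
        let scale = if max == 0.0 { 1.0 } else { max / 7.0 };
        scales.push(scale);
        for chunk in group.chunks(8) {
            let mut word = 0u32;
            for (i, &w) in chunk.iter().enumerate() {
                // Round to [-8, 7], then store shifted by +8 as an unsigned nibble.
                let q = ((w / scale).round().clamp(-8.0, 7.0) as i32 + 8) as u32;
                word |= q << (4 * i);
            }
            packed.push(word);
        }
    }
    Quantized { packed, scales }
}
```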
Even though the project is clearly still in its early stages, it is already impressive: with both models quantized using GPTQ, it reaches 7.9 tokens/sec on Llama 2 7B and 0.9 tokens/sec on Llama 2 70B.
On X (Twitter…), Sasha claimed he could run the 70B version of Llama 2 using nothing more than his laptop's CPU. It is, of course, very slow (about 5 tokens per minute); on an Intel i9 you can get a much faster 1 token/sec.
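The "instant" loading that makes a 70B checkpoint usable on a laptop comes from memory mapping: the OS maps the file into the process's address space and pages weights in lazily, only as they are touched. A minimal sketch, assuming the memmap2 crate and a hypothetical checkpoint file name (not the project's actual loader):

```rust
use memmap2::Mmap;
use std::fs::File;

fn main() -> std::io::Result<()> {
    // Map the checkpoint; nothing is read from disk yet.
    let file = File::open("llama2-70b.bin")?; // hypothetical path
    let weights = unsafe { Mmap::map(&file)? };

    // Pages are faulted in on demand: reading the first weight here
    // pulls in a single page, not the whole multi-gigabyte file.
    let first = f32::from_le_bytes(weights[0..4].try_into().unwrap());
    println!("first weight: {first}");
    Ok(())
}
```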
If you are familiar with Rust, the code is worth reading: it is full of ideas for handling the quantization and dequantization of LLMs efficiently.
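One such idea, sketched here under the same assumed packed layout as above rather than as the project's actual kernel: never materialize the dequantized matrix. Unpack the nibbles inside the matrix-vector product itself and factor the per-group scale out of the inner loop, leaving a tight multiply-accumulate that the compiler can auto-vectorize with SIMD.

```rust
const GROUP_SIZE: usize = 64; // must match the quantizer's group size

/// Dot product of one 4-bit quantized row with an f32 activation vector.
/// `row` packs eight nibbles per u32; `scales` has one entry per GROUP_SIZE weights.
fn quantized_dot(row: &[u32], scales: &[f32], x: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    // GROUP_SIZE / 8 u32 words hold one group's worth of weights.
    for (g, words) in row.chunks(GROUP_SIZE / 8).enumerate() {
        let xs = &x[g * GROUP_SIZE..(g + 1) * GROUP_SIZE];
        let mut group_acc = 0.0f32;
        for (w, &word) in words.iter().enumerate() {
            for i in 0..8 {
                // Unpack one nibble and shift it back to the signed range [-8, 7].
                let q = ((word >> (4 * i)) & 0xF) as i32 - 8;
                group_acc += q as f32 * xs[w * 8 + i];
            }
        }
        // Apply the scale once per group instead of once per weight.
        acc += scales[g] * group_acc;
    }
    acc
}
```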