Effort

With Effort you can adjust smoothly – and in real time – how many calculations to perform during LLM inference.

At 50% effort it matches the speed of regular matrix multiplications on Apple Silicon chips; at 25% effort it is twice as fast while still retaining most of the output quality.

You can also freely choose to skip loading the least important weights.
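As a rough illustration of the idea – not the project's actual implementation, which relies on precomputed weight orderings and custom kernels – a matrix-vector product can be approximated by performing only the fraction of multiplications that matter most. The function name `effort_matvec` and the column-selection heuristic (picking columns by input magnitude) are illustrative assumptions:

```python
import numpy as np

def effort_matvec(W, x, effort=0.5):
    """Approximate W @ x using only a fraction ('effort') of the multiplications.

    Toy sketch: keep the columns of W paired with the largest-magnitude
    entries of x and skip the rest. This only illustrates the idea of
    trading computation for accuracy at inference time.
    """
    k = max(1, int(round(effort * x.size)))    # number of input entries to keep
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the top-k |x| values
    return W[:, idx] @ x[idx]                  # partial matrix-vector product

# At effort=1.0 this reduces to the exact product.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
x = rng.standard_normal(16)
assert np.allclose(effort_matvec(W, x, effort=1.0), W @ x)
```

Lowering `effort` skips the smallest contributions first, which is why most of the quality survives even at a fraction of the compute.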

It is currently implemented for Mistral, and it should work just as well for other models. No retraining is needed – only conversion to a different format and some precomputation. — Read More

#nlp