LLM Zoo at Home: LLaMA & Alpaca

I guess it was not only my social media stream that was flooded with GPT-4 news in the last few days. The results are impressive but not surprising to me. A combination of events that I consider more important didn’t get that much attention: The LLM LLaMA runs on a Raspberry Pi, and Stanford students fine-tuned the model for $100.


  • 2023-02-24: Facebook released multiple LLMs called LLaMA - the smallest one with 7B parameters.
  • 2023-03-10: The llama.cpp open-source project on GitHub allows running the model with 4-bit quantized weights on the CPU of a standard PC or Apple hardware.
  • 2023-03-12: llama.cpp runs on a Raspberry Pi 4.
  • 2023-03-13: Stanford University students released Alpaca, training data to fine-tune the LLaMA model for instruction following. They achieved results comparable to OpenAI’s text-davinci-003 model. Fine-tuning the model costs around $100.
  • 2023-03-17: The Alpaca dataset was used to fine-tune the BLOOM model; the result is called AlpacOOM.


Now we are able to run LLMs on our machines, deploy them on our clusters, and maybe soon also use them on our mobile devices. All of that can be done on our own hardware. Applications that forbid sending data to an external LLM service are now possible. Some more details:

The LLaMA model and the llama.cpp project showed that running an LLM on one’s own hardware is technically possible. Building llama.cpp is simple and doesn’t require additional dependencies besides the C++ compiler and the standard library. Only the CPU is used, so there are no special hardware requirements - you only need 4GB RAM. One should not expect wonders - the performance on a Raspberry Pi 4 could be better. 4GB of RAM is also what any PCIe graphics card of the last four years offers.
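Why 4-bit quantization makes the 4GB figure plausible can be checked with a rough back-of-the-envelope estimate. The sketch below only counts the raw weights; it ignores the per-block scaling factors that quantization formats store as well as runtime buffers, so the real footprint is somewhat higher. The function name is made up for this illustration.

```python
def weight_memory_gib(num_parameters: float, bits_per_weight: float) -> float:
    """Memory needed for the raw weights alone, in GiB (naive estimate)."""
    total_bytes = num_parameters * bits_per_weight / 8
    return total_bytes / 2**30


PARAMS_7B = 7e9  # the smallest LLaMA model has 7B parameters

# Compare 16-bit weights with 4-bit quantized weights.
for bits in (16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(PARAMS_7B, bits):.1f} GiB")
```

With 16-bit weights the 7B model needs roughly 13 GiB, with 4-bit weights only about 3.3 GiB, which is why it fits on a machine with 4GB of RAM.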

The important aspect of Alpaca is not just that it’s comparable to OpenAI text-davinci-003. It showed that it’s possible to fine-tune an LLM for an affordable price.

The license of LLaMA, and therefore also of the derived Alpaca, limits usage to research and doesn’t allow commercial use. BLOOM, on the other hand, doesn’t have that limitation.

Context vs. Training

Being able to fine-tune a model allows us to look at a very common problem from a different perspective: Injecting our own data, documents, and knowledge into an LLM.

It’s not possible to change a hosted LLM, so everyone injects their own data as context in the prompt. But the prompt has a token limit. There are smart ways to work around that problem. Libraries like LlamaIndex index documents and dynamically build the context. There is now a 32K token version of GPT-4, but API calls have to be paid per token.
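The idea of building the context dynamically can be illustrated with a toy sketch. This is not the LlamaIndex API; all function names are hypothetical. It also makes two simplifying assumptions: relevance is scored by plain word overlap, and one whitespace-separated word counts as one token. Real libraries use embeddings and a proper tokenizer.

```python
def count_tokens(text: str) -> int:
    """Crude token estimate: one token per whitespace-separated word (assumption)."""
    return len(text.split())


def overlap_score(question: str, chunk: str) -> int:
    """Number of distinct lowercase words shared by question and chunk."""
    return len(set(question.lower().split()) & set(chunk.lower().split()))


def build_context(question: str, chunks: list[str], token_budget: int) -> list[str]:
    """Greedily pick the best-matching chunks that still fit into the budget."""
    ranked = sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        if overlap_score(question, chunk) == 0:
            continue  # skip chunks that share no word with the question
        cost = count_tokens(chunk)
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected


chunks = [
    "llama.cpp runs LLaMA with 4-bit quantized weights on the CPU",
    "Alpaca is a fine-tuned LLaMA model for instruction following",
    "BLOOM is an LLM without the research-only license restriction",
]
question = "Which model is fine-tuned for instruction following"
context = build_context(question, chunks, token_budget=12)
prompt = "\n".join(context) + "\n\nQuestion: " + question
print(prompt)
```

Only the chunks that fit into the token budget end up in the prompt, so every call still pays for the injected context on top of the question itself.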

But now we are able to adapt the model to our needs! Training needs to be paid only once. A fine-tuned model reduces the number of tokens that have to be processed during inference. At some point, the training costs are paid off by the lower inference costs.

Here is an example calculation. I didn’t spend much time searching for correct numbers as they would be outdated very fast.

  • Average tokens per call: 200
  • Average tokens per context: 1000
  • Costs per token: $0.002/1K
  • Training costs: $100


[Chart: costs of the two approaches, context vs. training, over the number of calls]

As you can see, at 50 000 calls, the costs are the same, and the additional training step pays off afterward.
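The break-even point can be reproduced with a few lines of Python using the numbers from the list above. The sketch assumes that the fine-tuned model is billed at the same per-token price and needs no injected context at all; both are simplifications made for this example.

```python
TOKENS_PER_CALL = 200           # average tokens per call
TOKENS_PER_CONTEXT = 1000       # average tokens of injected context
COST_PER_TOKEN = 0.002 / 1000   # $0.002 per 1K tokens
TRAINING_COST = 100.0           # one-time fine-tuning cost in $


def context_cost(calls: int) -> float:
    """Total cost when the context is sent along with every call."""
    return calls * (TOKENS_PER_CALL + TOKENS_PER_CONTEXT) * COST_PER_TOKEN


def training_cost(calls: int) -> float:
    """Total cost when the model is fine-tuned once and no context is sent."""
    return TRAINING_COST + calls * TOKENS_PER_CALL * COST_PER_TOKEN


# Each call saves the cost of the context tokens; training is paid off
# once these savings add up to the training cost.
break_even_calls = round(TRAINING_COST / (TOKENS_PER_CONTEXT * COST_PER_TOKEN))
print(f"break-even at {break_even_calls} calls")

for calls in (10_000, 50_000, 100_000):
    print(f"{calls:>7} calls: context ${context_cost(calls):.2f}, "
          f"training ${training_cost(calls):.2f}")
```

Every call without context saves $0.002, so the $100 training cost is recovered after 50 000 calls, where both approaches cost $120 in total.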

Transferring data from a knowledge graph could be even more efficient. The paper “Mass-Editing Memory in a Transformer” shows how triples can be edited in the weights of a transformer-based LLM.

