Inference at the Edge: Running a Large Language Model Chatbot on Consumer Hardware

Overview

Generative artificial intelligence and conversational chatbots like ChatGPT have made headlines in recent months. These virtual assistants sound nearly human because they are trained on extremely large datasets of real human conversations and other sources that pair inputs with desired outputs. Examples of proprietary chatbots available as online services include OpenAI’s ChatGPT Plus, Microsoft’s Bing, and Google’s Bard. Meanwhile, hobbyists and open source enthusiasts are experimenting with ways to run size-optimized, open-source versions of large language models that have been tuned for instruction-based interaction and that run locally on consumer-grade hardware.

Foundation Large Language Models

Meta’s LLaMA Model

These open-source efforts began in earnest with Meta’s release of LLaMA, a large foundation language model built on a transformer architecture and released in four sizes: 7, 13, 30 and 65 billion parameters.1 One of Meta’s key observations, weighing the tradeoff between training compute budget and inference budget, is that “given a target level of performance, the preferred model is not the fastest to train but the fastest at inference”.2 Meta therefore trained its models on more tokens than is typical in order to obtain the best possible performance at different inference budgets. For example, Meta states that LLaMA-13B outperforms GPT-3 on most benchmarks even though it is 10x smaller.

Like other large language models, LLaMA was trained on text from 20 languages drawn from a mixture of publicly available sources, including English CommonCrawl, C4, Github, Wikipedia, Gutenberg, ArXiv and Stack Exchange.3 The LLaMA model was then benchmarked on both zero-shot and few-shot tasks. It was also assessed across both free-form generation and multiple choice tasks, including common sense reasoning, closed book question answering, reading comprehension, mathematical reasoning, code generation, multi-task language understanding, bias, toxicity and misinformation.

Although Meta released LLaMA to researchers and the academic community, its use is limited to noncommercial research. This has led a number of companies and individual researchers to pursue alternatives that are open-sourced and capable of being commercialized.

Together’s RedPajama Model

RedPajama is an effort by Together and several universities to reproduce the LLaMA training dataset as a fully open-sourced project. The effort covers the pre-training data, the base models, and instruction-tuning data and models. RedPajama’s dataset draws on sources similar to LLaMA’s, with a comparable number of total tokens (1.2 trillion). In the future, it may be possible to combine RedPajama’s base model with Dolly’s instructions to produce a fully open-sourced, instruction fine-tuned, large language model.

Instruction Fine-Tuning the Foundation LLM4

Stanford Alpaca Instructions

A team of students at Stanford University initially released a proof of concept, Alpaca 7B, that fine-tuned the LLaMA 7B model on 52K instruction-following demonstrations. These were generated in the style of the Self-Instruct framework, which enables pretrained models to improve their capabilities by bootstrapping off their own prior output. The five-person Stanford team used 175 human-written instruction pairs as self-instruct seed tasks. They then prompted text-davinci-003 to generate 52K instruction-following examples and used those examples to fine-tune the LLaMA 7B model.5 This resulted in an online chatbot that performed qualitatively similarly to OpenAI’s text-davinci-003. After the Stanford team released their dataset, the open source community released a number of derivatives that scrubbed and extended the original Alpaca instruction set using low-rank adaptation (LoRA).
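
To make the data format concrete, the sketch below shows what a single Alpaca-style instruction record and its prompt template look like. The field names and template text follow the published Alpaca data format, but this is a simplified illustration rather than the Stanford team’s actual training code.

    # A single Alpaca-style instruction-following record (simplified sketch).
    record = {
        "instruction": "Summarize the following paragraph in one sentence.",
        "input": "Large language models are trained on very large text corpora...",
        "output": "LLMs learn language patterns from massive text datasets.",
    }

    # Fixed prompt template wrapped around each record before fine-tuning.
    PROMPT_WITH_INPUT = (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n"
        "### Input:\n{input}\n\n"
        "### Response:\n"
    )

    prompt = PROMPT_WITH_INPUT.format(instruction=record["instruction"], input=record["input"])
    target = record["output"]  # the model is trained to produce this continuation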

Vicuna Instructions

Another team of researchers from UC Berkeley, CMU, Stanford and UCSD, inspired by the Meta LLaMA and Stanford Alpaca projects, created a new fine-tuned model called Vicuna. This model was trained on roughly 70K user-shared ChatGPT conversations collected from ShareGPT.com. The team also tweaked the Alpaca training scripts to handle multi-round conversations and long sequences, producing an improved fine-tuned model based on LLaMA. The team claims that Vicuna is on par with Bard and achieves roughly 90% of the quality of ChatGPT, as evaluated by GPT-4.
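
As a rough sketch of what the multi-round handling involves, the snippet below flattens a ShareGPT-style conversation into a single training string. The role labels and separators are illustrative assumptions, not the exact format used by the Vicuna training scripts.

    # Minimal sketch: flatten a multi-turn ShareGPT-style conversation into one
    # training string so the model learns multi-round dialogue structure.
    # Role labels and separators are illustrative assumptions.
    conversation = [
        {"from": "human", "value": "What is 4-bit quantization?"},
        {"from": "gpt", "value": "It stores each weight in 4 bits to save memory."},
        {"from": "human", "value": "Does it hurt accuracy?"},
        {"from": "gpt", "value": "Usually only slightly, depending on the method."},
    ]

    ROLES = {"human": "USER", "gpt": "ASSISTANT"}

    def build_training_text(turns):
        parts = [f"{ROLES[t['from']]}: {t['value']}" for t in turns]
        return "\n".join(parts)

    print(build_training_text(conversation))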

Koala Instructions

Berkeley Artificial Intelligence Research (BAIR) created an alternative fine-tuned model called Koala. Similar to Vicuna, Koala used roughly 60K ShareGPT dialogues. It also used the HC3 corpus, which contains roughly 60K human answers and 27K ChatGPT answers. Koala further incorporated components from the Open Instruction Generalist, Stanford Alpaca, Anthropic HH, OpenAI WebGPT, and OpenAI Summarization datasets.6 For the datasets that carry human ratings, responses were conditioned on positive or negative markers (e.g. whether an answer was rated ‘helpful’), and the resulting models were evaluated by roughly 100 human raters.

Finally, the BAIR team compared a model fine-tuned only on the ChatGPT-derived data (the ShareGPT dialogues) against a model fine-tuned on both that data and the open source datasets. The researchers were surprised to find that the ChatGPT-only model performed at least as well as the model trained on the larger mix. They concluded that the ChatGPT data is of such high quality that adding roughly twice as much open source data did not lead to an improvement. They hypothesized that effective instruction models can be distilled from larger, more powerful models, provided that the instructions are representative of the prompts users provide in real life.

Dolly 2.0 Instructions

Databricks took a different approach and released a fully open-sourced, human-generated instruction dataset of 15K prompt-response pairs written by its employees. These were optimized for natural expression across a wide range of tasks including Q&A, brainstorming, classification, creative writing, information extraction, and summarization.
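
For readers who want to inspect the data themselves, the dataset is published on the Hugging Face Hub. A minimal sketch using the datasets library (assuming the dataset id databricks/databricks-dolly-15k and its instruction, context, response and category fields) might look like this:

    # Minimal sketch: inspect the Dolly 2.0 instruction dataset.
    # Assumes the Hugging Face `datasets` library and the dataset id
    # "databricks/databricks-dolly-15k" with instruction/context/response/category fields.
    from datasets import load_dataset

    dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

    print(len(dolly))              # ~15,000 records
    example = dolly[0]
    print(example["category"])     # e.g. "open_qa", "brainstorming", ...
    print(example["instruction"])  # the human-written prompt
    print(example["context"])      # optional supporting passage (may be empty)
    print(example["response"])     # the human-written answer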

Llama.cpp

Running a large language model locally is optimal for privacy, security, cost and experimentation. While smaller, local models currently fall short of larger, proprietary online systems, their accuracy and capabilities are rapidly advancing. As LLaMA demonstrates, smaller models may soon be able to match the performance of larger closed-source platforms through carefully selected training data.

A group of open-source developers created llama.cpp, a program that enables users to run not only the 7B but also the larger LLaMA models, as well as the Alpaca, Vicuna and Koala variants, on consumer-grade hardware. For example, Alpaca 7B can run on an M1 MacBook Pro with only 8GB of memory, and the 30B model can run on a Mac Studio (M1 Max) with 64GB of memory. Importantly, llama.cpp provides a real-time chatbot experience that is free and runs without any Internet connection.
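
As an illustration of how little code a local chat loop requires, the sketch below uses the community llama-cpp-python bindings (an assumption on my part; the llama.cpp command-line tools achieve the same thing) to load a 4-bit quantized model file and generate a reply. The model path is a placeholder.

    # Minimal sketch: run a 4-bit quantized LLaMA-family model locally using the
    # community llama-cpp-python bindings (a separate project from llama.cpp itself).
    # The model path below is a placeholder for a quantized model file on disk.
    from llama_cpp import Llama

    llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_ctx=512)

    prompt = (
        "### Instruction:\nExplain 4-bit quantization in one sentence.\n\n"
        "### Response:\n"
    )
    result = llm(prompt, max_tokens=128, temperature=0.7, stop=["### Instruction:"])

    print(result["choices"][0]["text"].strip())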

Other Open Source Research Efforts

Separate from Dolly, there is an effort underway, sponsored by the Open Assistant project, to create a larger human-generated set of instructions. All of these efforts will eventually result in virtual assistants that produce more accurate results for a broader array of instructions, including support for more languages, under more permissive licenses.

Meanwhile, researchers are continuing to investigate how to shrink the size of locally stored models while maintaining output quality. Current implementations use 4-bit quantization, but alternatives may enable substantial further reductions in size with similar perplexity.7 This will enable either less capable devices to run these large language model chatbots (for a given parameter count) or a particular machine to run a larger model.
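
To make the savings concrete, here is a back-of-the-envelope estimate of weight-storage requirements at different precisions. It ignores activations, the KV cache and runtime overhead, so real memory usage will be somewhat higher.

    # Back-of-the-envelope weight-memory estimate for LLaMA-sized models.
    # Ignores activations, KV cache and runtime overhead, so real usage is higher.
    BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

    for params_b in (7, 13, 30, 65):          # model sizes in billions of parameters
        row = [f"{params_b}B:"]
        for name, nbytes in BYTES_PER_PARAM.items():
            gib = params_b * 1e9 * nbytes / 2**30
            row.append(f"{name} ~{gib:.1f} GiB")
        print("  ".join(row))

    # A 7B model drops from roughly 13 GiB at fp16 to about 3.3 GiB at 4 bits,
    # which is why it can fit on an 8GB laptop.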



Updated on April 24th, 2023


  1. Meta trained LLaMA 65B and LLaMA 33B on 1.4 trillion tokens, and LLaMA 7B and 13B on one trillion tokens (tokens are pieces of words).

  2. See LLaMA: Open and Efficient Foundation Language Models.

  3. Additional details are in the whitepaper, “LLaMA: Open and Efficient Foundation Language Models.”

  4. For detailed video reviews of each of these models, I recommend watching Sam Witteveen’s YouTube Channel.

  5. Alpaca fine-tuned LLaMA using a weakly supervised, knowledge-distillation approach (training on outputs from a stronger model).

  6. Like Alpaca, both Vicuna and Koala are weakly supervised, knowledge-distillation-based models.

  7. See also, SparseGPT and GPTQ.