Understanding Size Tradeoffs with Generative Models - How to Select the Right Model (GPT4 vs LLAMA2)?
Issue #13 | With multiple LLM providers (GPT4, PaLM) and 1000s of OSS Models, how do you select the right model?
Research and practice have shown that scaling models in terms of parameter count can significantly improve task performance. For instance, GPT-2 had 1.5 billion parameters (Radford et al. 2019), while its successor, GPT-3, had 175 billion parameters (Brown et al. 2020). Training these larger models on more data and compute yields further performance improvements (see the Chinchilla scaling laws). However, the increase in model size also leads to higher compute and storage needs, making it infeasible for many businesses to deploy these models on their own infrastructure.
Naturally, these challenges have prompted research into the development of “small” generative models, with early results indicating that strategies in data curation, model fine-tuning, and memory/compute optimization (e.g., LLAMA, GPTQ, GGML) can produce competitive models with only a fraction of the parameters.
Despite these advances, there remain performance gaps between large and small LLMs (figure above), as well as business constraints that may preclude the use of large models, making it unclear when it is appropriate to use a small LLM versus a large cloud-hosted model API.
This post aims to provide some guidance and covers the following:
Categorizes generative models into small and large size classes, based on parameter count and the ability to fine-tune/deploy each model on a single GPU.
Discusses small generative models (LLAMA et al.), current approaches to building them (data curation for performant base models, fine-tuning optimizations, inference optimizations), and the advantages they offer (privacy and compliance, domain adaptation, latency/cost, scaling and uptime, data efficiency, interpretability).
Proposes a simple framework for choosing which model to use based on the functional and non-functional requirements of your business problem.
Note: The article assumes familiarity with LLMs and uses the LLAMA model series as the canonical example of small models (i.e., it does not explore derivatives such as Vicuna, FreeWilly, etc.).
‘Small’ Self-Hosted Generative Models
We categorize ‘small generative models’ as models that can be fine-tuned and deployed on a single consumer GPU or CPU, making them more practical for solving business problems at scale. This categorization can be better understood by examining the memory requirements of neural network models. Generally, each parameter in a model, if stored as a 16-bit floating point number, requires 2 bytes of memory (16-bit is usually sufficient precision for most use cases). For instance, a model with 1 billion parameters would need about 2GB of memory. Considering that single GPU devices typically have between 8GB and 80GB of memory (e.g., NVIDIA A100 GPUs), our definition of small generative models includes models with roughly 4 billion to 40 billion parameters. Advances in model quantization (GPTQ, Frantar et al. 2023; 4-bit precision, Dettmers and Zettlemoyer 2023) and parameter-efficient fine-tuning (LORA, Hu et al. 2021; QLORA, Dettmers, Pagnoni, et al. 2023) make slightly larger models, e.g., LLAMA-2 70B (Touvron, Martin, et al. 2023), feasible in the single-GPU regime.
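To make the arithmetic concrete, here is a minimal sketch of the weights-only memory estimate described above; it deliberately ignores activations and the KV cache, which add overhead in practice.

```python
# Rough sanity check on model memory footprints (weights only; the KV cache,
# activations, and any optimizer state add more on top of this).
def weight_memory_gb(n_params_billion: float, bits_per_param: int = 16) -> float:
    bytes_per_param = bits_per_param / 8
    return n_params_billion * 1e9 * bytes_per_param / 1e9  # decimal gigabytes

for name, params in [("LLAMA-2 7B", 7), ("LLAMA-2 13B", 13), ("LLAMA-2 70B", 70)]:
    for bits in (16, 8, 4):
        print(f"{name} @ {bits}-bit: ~{weight_memory_gb(params, bits):.0f} GB")
```

At 16-bit precision a 7B model already needs ~14GB for weights alone, while 4-bit quantization brings a 70B model down to roughly 35GB, which is why quantization pushes larger models into the single-GPU regime.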
Approaches to building Small LLMs
A survey of recent papers suggests there are 3 main approaches to building small generative models - data curation, finetuning base models, and memory/compute optimization.
Data Curation (Base Models) focuses on refining the content of the dataset used to train base models as the route to highly capable models with small parameter counts. The LLAMA model series, for example, demonstrated that a significantly scaled and curated dataset, combined with an extended training period, can produce competitive results: a 13B LLAMA model outperformed GPT-3, a 175B-parameter model, on most benchmarks. Another instance is the Phi-1 model, which showed that training a small (~1.3B parameter) code generation model on a meticulously curated dataset, composed of textbook-quality data and synthetic exercises, can achieve performance competitive with much larger models on code generation benchmarks.
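As a toy illustration of the kind of filtering such pipelines perform, here is a minimal heuristic sketch. Real curation pipelines (like those behind LLAMA or Phi-1) are far more sophisticated, using quality classifiers, fuzzy deduplication, and synthetic data generation; the thresholds below are illustrative only.

```python
import hashlib

# Minimal sketch of heuristic data curation for a pretraining corpus:
# drop short fragments, markup-heavy text, and exact duplicates.
def curate(documents):
    seen, kept = set(), []
    for doc in documents:
        text = doc.strip()
        if len(text) < 200:                                  # drop very short fragments
            continue
        alpha_ratio = sum(c.isalpha() for c in text) / len(text)
        if alpha_ratio < 0.7:                                # drop boilerplate/markup-heavy text
            continue
        digest = hashlib.md5(text.lower().encode()).hexdigest()
        if digest in seen:                                   # exact-duplicate removal
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```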
Finetuning Optimization focuses on techniques to upskill an existing small model via fine-tuning. ORCA, for instance, proposes the curation of “process supervision” examples: a smaller model is trained on reasoning traces and explanations generated by a larger model. The authors find that this approach improves the reasoning capabilities of the smaller model, with results on par with much larger models. Similar approaches are used in models such as Vicuna, Alpaca, and FreeWilly.
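A minimal sketch of what generating such process-supervision data might look like is shown below. `teacher_complete` is a hypothetical stand-in for whichever teacher-model API you use; the prompt and record format are illustrative, not ORCA's exact recipe.

```python
# Sketch: prompt a larger "teacher" model to explain its reasoning step by step,
# then store (instruction, explanation-rich response) pairs as fine-tuning data
# for a smaller "student" model.
SYSTEM_PROMPT = (
    "You are a helpful assistant. Think step by step and justify your answer "
    "before giving the final result."
)

def build_training_example(instruction: str, teacher_complete) -> dict:
    response = teacher_complete(system=SYSTEM_PROMPT, user=instruction)
    return {"system": SYSTEM_PROMPT, "instruction": instruction, "output": response}

# The resulting records are then used for standard supervised fine-tuning of the student.
```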
Inference Optimization (Memory and Compute) explores approaches that reduce memory and compute requirements with minimal loss in performance. This includes efforts to quantize foundation models so they can run on consumer GPU and CPU devices. One example is GPTQ, which proposes a weight quantization (3-4 bit) method based on approximate second-order information that is both highly accurate and highly efficient, enabling single-GPU inference for very large models (e.g., a 175B model). Other related GPU quantization work includes Dettmers, Svirschevski, et al. (SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression) and Dettmers and Zettlemoyer (The Case for 4-Bit Precision: k-Bit Inference Scaling Laws). Approaches such as GGML offer similar quantization and optimizations for CPU devices.
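To give a feel for how accessible quantized inference has become, here is a sketch of loading a model with 4-bit weight quantization. This assumes the Hugging Face transformers + bitsandbytes stack (NF4 quantization, not the GPTQ algorithm itself) and access to the LLAMA-2 weights; GPTQ and GGML workflows use their own dedicated tooling.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-hf"  # assumes you have access to these weights

# 4-bit (NF4) quantization config: weights are stored in 4 bits, compute in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("The advantages of small language models are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0], skip_special_tokens=True))
```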
Inference optimization tools, such as the NVIDIA Triton server, provide a suite of optimizations (e.g., dynamic batching, multiple model instances, TensorRT for quantization, layer/tensor fusion, kernel tuning, etc.) to decrease latency and increase throughput for a deployed model. DeepSpeed also offers similar serving optimizations such as model parallelism, customized kernels, and quantization methods.

Model Merging: An interesting (and slightly surprising) new trend for improving LLM performance (in addition to finetuning) is to combine the weights of existing pretrained models. Tools like MergeKit [1] make this process more straightforward. The merge method can be as straightforward as spherical linear interpolation (SLERP) between the weights of multiple models.
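The core of a SLERP merge fits in a few lines. The sketch below interpolates two state dicts with identical architectures; production tools like MergeKit handle many more details (layer-wise interpolation factors, tokenizer alignment, alternative merge methods), so treat this as an illustration of the math only.

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors."""
    a, b = w_a.flatten().float(), w_b.flatten().float()
    a_n, b_n = a / (a.norm() + eps), b / (b.norm() + eps)
    omega = torch.arccos(torch.clamp(torch.dot(a_n, b_n), -1.0, 1.0))  # angle between weights
    if omega.abs() < 1e-4:                       # nearly parallel: fall back to plain LERP
        merged = (1 - t) * a + t * b
    else:
        merged = (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)
    return merged.reshape(w_a.shape).to(w_a.dtype)

def merge_state_dicts(sd_a: dict, sd_b: dict, t: float = 0.5) -> dict:
    # Assumes both models share the same architecture and parameter names.
    return {k: slerp(sd_a[k], sd_b[k], t) for k in sd_a}
```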
Advantages of Self-Hosted, Small Models
Working with self-hosted small models implies access to the low-level weights, layers, and mechanisms of the model (e.g., decoding and sampling strategies). This additional level of control can yield several benefits.
Privacy, Security, and Compliance: Some applications may have stringent data privacy, security, and compliance requirements. For instance, there may be constraints on the allowed sources of data used to train or fine-tune the model, or on how inference data can be processed or stored, that prevent the use of third-party APIs. Self-hosted solutions can address these issues by providing control over most of the model-building steps.
Finetuning, Domain Adaptation, and Rapid Research Prototyping: Adapting models to your specific task may take multiple forms, including fine-tuning, decoding schemes, prompting schemes, etc. Small models are amenable to multiple fine-tuning regimes that can adapt base models to your business context. Emerging techniques for parameter-efficient fine-tuning (PEFT), such as LORA (Hu et al. 2021) and QLORA (Dettmers, Pagnoni, et al. 2023), make this process feasible for most teams once the right data has been assembled. Reinforcement Learning from Human Feedback (RLHF) has also proved to be a valuable tool for aligning model output with human intent and values.
(I definitely recommend reading a good explainer series on RLHF and how it works.)
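As a sketch of how lightweight the PEFT route has become, here is a minimal LoRA setup using the peft library. The rank, alpha, and target modules are illustrative choices for a LLAMA-style model and would be tuned per architecture.

```python
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLAMA-style models
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model's parameters
# The wrapped model can then be trained with a standard Trainer / training loop.
```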
Tractable fine-tuning also enables rapid research and innovation, both within the community and inside organizations. In addition, access to model weights enables low-level control such as constrained decoding, which can adapt model behavior to specific tasks (a minimal sketch follows below).

Model Optimization, Cost and Latency: Small models can be optimized to run efficiently on existing CPU or GPU infrastructure. This may include techniques for inference quantization, GPU kernel optimizations, etc. While these optimizations require deep ML engineering expertise, they can yield cost and latency benefits in the long run.
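To illustrate the low-level control mentioned above, here is a minimal sketch of constrained decoding: at each generation step, the model's logits are masked so that only an allowed set of token ids (e.g., tokens valid in a custom grammar or a fixed label set) can be sampled. Real constrained decoders track grammar state across steps; the token ids below are hypothetical.

```python
import torch

def constrain_logits(logits: torch.Tensor, allowed_token_ids: list[int]) -> torch.Tensor:
    # Everything outside the allowed set gets -inf, so it can never be sampled.
    mask = torch.full_like(logits, float("-inf"))
    mask[..., allowed_token_ids] = 0.0
    return logits + mask

# Example: restrict the next token to the ids for "yes" / "no".
logits = torch.randn(1, 32000)   # fake vocabulary-sized logits for illustration
allowed = [3869, 694]            # hypothetical token ids for "yes" / "no"
next_id = torch.argmax(constrain_logits(logits, allowed), dim=-1)
```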
Scaling and Uptime: Owning the deployment stack provides more control over system scaling and uptime. This is particularly relevant for larger organizations with experienced engineering teams and strong product SLA requirements, as it helps avoid API timeouts.
Data Efficiency: Large-scale models like ChatGPT/GPT-4 are impressive due to their expertise across many domains. However, depending on your business or product objectives, you might only need a fraction of these models’ capabilities, meaning they are overpowered and hence data-inefficient for your use case. Moreover, if your business operates in a niche domain, it’s likely that you possess more specific knowledge than a multi-domain expert like ChatGPT. In this scenario, you could derive more value from a system design that heavily leverages your domain expertise, for example by fine-tuning a smaller model on your dataset or investing in a retrieval-augmented generation (RAG) setup.
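A retrieval-augmented setup is conceptually simple. The sketch below retrieves the most similar domain documents by cosine similarity and stuffs them into the prompt; `embed` and `generate` are hypothetical stand-ins for your embedding model and your (small, possibly fine-tuned) generative model.

```python
import numpy as np

def retrieve(query: str, docs: list[str], embed, k: int = 3) -> list[str]:
    # Rank documents by cosine similarity to the query embedding.
    doc_vecs = np.stack([embed(d) for d in docs])
    q = embed(query)
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-8)
    return [docs[i] for i in np.argsort(-sims)[:k]]

def rag_answer(query: str, docs: list[str], embed, generate) -> str:
    context = "\n\n".join(retrieve(query, docs, embed))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return generate(prompt)
```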
Interpretability: Access to the weights and layers of a generative model provides opportunities to implement interpretability techniques that are faithful to the model’s data generation behavior and are therefore more reliable (for example, gradient-based techniques).
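As one example of what weight-level access enables, here is a sketch of a simple gradient-based saliency attribution: the gradient of the predicted token's score with respect to the input embeddings indicates which input tokens most influenced the prediction. It assumes a Hugging Face-style causal LM that exposes `get_input_embeddings()` and accepts `inputs_embeds`.

```python
import torch

def token_saliency(model, tokenizer, text: str) -> list[tuple[str, float]]:
    model.eval()
    enc = tokenizer(text, return_tensors="pt")
    # Embed the inputs ourselves so we can take gradients with respect to them.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds, attention_mask=enc["attention_mask"]).logits
    top_logit = logits[0, -1].max()                # score of the most likely next token
    top_logit.backward()
    scores = embeds.grad.norm(dim=-1).squeeze(0)   # one saliency score per input token
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return list(zip(tokens, scores.tolist()))
```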
Caveats of Small Models
When trained on similarly sized datasets, small models simply have less capacity than larger models to store information. This means they struggle to learn efficient knowledge-compression mechanisms or to model complex dependencies across data (which underpin emergent reasoning capabilities). As a result, we observe more severe hallucination (compression artifacts), limited reasoning capabilities (Gudibande et al. 2023, The False Promise of Imitating Proprietary LLMs), and limited instruction-following abilities. In addition, smaller models tend to have smaller context windows, limiting their ability to process lengthy text: LLAMA supports 2k-4k tokens, while models like GPT-4 and Claude support 32k-100k tokens. Addressing these limitations is an active area of research (as seen in ORCA and Phi-1).
Cloud-Hosted ‘Large’ Generative Models
Currently, the most competitive generative models, e.g., GPT-4 and ChatGPT, are very large, with hundreds of billions of parameters, and require GPU clusters for fine-tuning and inference on a single model. As a result, these models tend to be accessible only as APIs through providers (e.g., OpenAI, Google, Microsoft, Anthropic, Cohere) that have the resources to train, optimize, and host them. We categorize these as cloud-hosted large models. At the time of writing, the top-performing model is GPT-4 (OpenAI 2023); while exact details of its size are not available, it is believed to be a Mixture of Experts (Y. Zhou et al. 2022) model with around 1.8 trillion parameters.
Choosing the Right Generative AI Model
With multiple Generative AI model providers (e.g., OpenAI, Google offering large models like GPT4, PaLM as cloud services), and thousands of small open-source models, choosing the right model can be challenging.
For most teams, creating a generative model from scratch is neither feasible nor recommended due to the technical expertise, data acquisition costs for fine-tuning, and GPU training costs required, particularly for the pretraining step (pre-training even a small model like LLAMA is estimated to cost between 5M and 20M USD). Beyond the choice of training a model, there are several other decisions to be made. Should you use an API or self-host a model? Should you use an open-source model as-is or invest in fine-tuning it? These questions are often a source of confusion for teams looking to adopt generative models.
Functional Requirements: The first step in the ML workflow is to translate the business requirement into a generative modeling task and data modality. For instance, if the business requirement is to generate a document summary, the problem can be framed as a summarization task within text modality.
For tasks with high precision requirements (e.g., strong instruction-following or factual accuracy), larger models, which tend to perform better, are recommended; examples include generating code from natural language instructions or extracting insights from patient records. Some tasks may be niche, such as generating code in custom programming languages or libraries; in this case, fine-tuning a small model on custom data may yield better performance. Fine-tuning small models also offers efficiency benefits, especially for tasks that do not inherently require the broad world knowledge or facts that larger models encode.

Non-Functional Requirements: Non-functional requirements like privacy, security, compliance, uptime, latency, and cost can also influence the choice of model. Some requirements, such as security, data compliance, and privacy, may prevent the use of third-party hosted APIs (e.g., sensitive client data cannot be sent to systems outside an organization’s boundaries). Other requirements, such as uptime, latency, and cost, may be better addressed by carefully optimized self-hosted model infrastructure (e.g., optimized CPU inference).
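A toy sketch of these heuristics is shown below, purely to make the decision framework concrete; real decisions weigh many more factors and trade-offs than this encodes.

```python
def suggest_model(task_precision: str, niche_domain: bool,
                  data_must_stay_onprem: bool, latency_or_cost_critical: bool) -> str:
    # Non-functional constraints (privacy/compliance) tend to dominate the decision.
    if data_must_stay_onprem:
        return "self-hosted small model (fine-tune if the task is niche)"
    if niche_domain:
        return "fine-tuned small model, possibly with retrieval augmentation"
    if task_precision == "high":
        return "large cloud-hosted model API (e.g., GPT-4 class)"
    if latency_or_cost_critical:
        return "optimized self-hosted small model"
    return "start with a cloud API, then benchmark small models on your task"

print(suggest_model("high", niche_domain=False,
                    data_must_stay_onprem=True, latency_or_cost_critical=False))
```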
Task-Specific Benchmarks: Although there are established benchmarks like MMLU (Hendrycks et al. 2020) that evaluate the performance of LLMs across various tasks, these might not fully capture the proficiency of a model on your distinct task. For instance, a model that performs impressively on MMLU might struggle with instruction-following tasks (e.g., rewriting text from the perspective of a specific character). Therefore, it’s essential to develop a benchmark that accurately represents your specific task and evaluate different models against it. You can do this by compiling a set of task-representative examples and testing the performance of various models against them. Arguably, building such a benchmark is the most important step in choosing the right model.
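In its simplest form, such a benchmark is just a handful of representative examples, a scoring function, and a loop over candidate models. The sketch below uses crude keyword checks for scoring and a hypothetical `generate_fn` callable wrapping whichever model or API you are evaluating; real benchmarks use many more examples and stronger graders (human review or LLM-as-judge).

```python
EXAMPLES = [
    {"prompt": "Rewrite 'the meeting is cancelled' as a pirate would say it.",
     "check": lambda out: "arr" in out.lower() or "matey" in out.lower()},
    {"prompt": "Summarize: 'Q3 revenue grew 12% driven by enterprise sales.'",
     "check": lambda out: "12%" in out},
]

def run_benchmark(generate_fn) -> float:
    # Fraction of task-representative examples the model handles correctly.
    passed = sum(1 for ex in EXAMPLES if ex["check"](generate_fn(ex["prompt"])))
    return passed / len(EXAMPLES)

# scores = {name: run_benchmark(fn) for name, fn in candidate_models.items()}
```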
Conclusion - Will Small Models Rule the Future?
In this article, we have explored the differences between small generative models and large cloud-hosted models, and how their performance and properties can affect their suitability for different business problems. We have discussed the advantages of using self-hosted small models, which include privacy, security, compliance, fine-tuning, domain adaptation, prototyping, model optimization, cost, latency, scaling, uptime, data efficiency, and interpretability. We also touched upon the caveats of small models, such as the limitations in their knowledge compression mechanisms, hallucination, limited context lengths and instruction-following abilities.
So … will small models rule the future?
In my opinion the short answer is No.
As always, the more nuanced answer is that it depends on your business use case. Small models offer many advantages (and I expect a floodgate of innovation given LLAMA 2’s recent commercially permissive license). On the other hand, very large models (e.g., GPT-4) continue to provide the best overall performance across tasks.
By translating business requirements into generative modeling tasks, considering non-functional aspects like privacy and cost, and creating custom benchmarks to evaluate performance on specific tasks, businesses can make informed decisions on which model best fits their needs.
References
[1] MergeKit: Tools for merging pretrained large language models. https://lnkd.in/gC3JZQmP