At this point, I hope most of my audience has had some experience with the publicly available large language models - either running the software yourself, paying for a subscription to one of the many online services, or trying any of the free and beta solutions currently out there. For the most part, these large language models are by definition large - billions of parameters, often trained on vast amounts of unstructured language data. Across most of the industry, the parameter count is treated as a proxy for the capability of these models - the more data you train with, and the more parameters in the design, the wider the scope of information these general models can hold, recall, or generate. However that's not always the case, and there's one big problem with this market right now: hallucinations.
This week, startup Lamini published a paper showcasing a new fundamental methodology that decreases hallucinations in LLMs by a conservative 95%. Lamini is headed up by co-founders CEO Sharon Zhou (PhD and faculty in Gen AI from Andrew Ng's group, MIT award-winning Gen AI research, largest Gen AI Coursera courses) and CTO Greg Diamos (NVIDIA/CUDA architect, 14000+ citations, AI scaling laws, MLPerf co-founder), and the company broke the mold by being one of the first to offer fine-tuning as a service for LLMs. What made them different was a preference for AMD Instinct MI200/MI300 GPUs, even with one of NVIDIA's Tensor Core architects as a co-founder. The company completed its Series A in early 2024 with $25m in funding, with lead investors Amplify Partners and First Round Capital; other investors include Lip-Bu Tan, Andrej Karpathy, and Andrew Ng. Lamini already counts Fortune 500 companies as customers, and offers per-GPU licensed middle-layer software as well as cloud inference services.
The Problem of Hallucinations
Large language models right now fit into the category of 'generative AI' - you feed it a prompt of tokens/words, and you get some tokens/words back. But what you get back is generated based on the input, and due to the probabilistic functions in the design, the output is 'generated' and can appear to give you detail on topics that were part of the training dataset but have been abstracted away into an embedding space inside the model. For example, the concept of 'parent' could be embedded as a vector between 'son' and 'father', and a similar vector could also be used to describe a country that has changed its name.
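As a toy illustration of that embedding-space idea (with made-up 3-D vectors rather than anything learned by a real model, which would use hundreds or thousands of dimensions), a relationship can be represented as an offset between vectors:

```python
# Toy illustration of relationships as directions in an embedding space.
# The 3-D vectors below are invented for this example.
import numpy as np

emb = {
    "son":      np.array([0.9, 0.1, 0.2]),
    "father":   np.array([0.9, 0.8, 0.2]),
    "daughter": np.array([0.1, 0.1, 0.3]),
    "mother":   np.array([0.1, 0.8, 0.3]),
}

# The 'parent of' concept is captured by the offset between child and parent.
parent_offset = emb["father"] - emb["son"]

# Applying the same offset to 'daughter' should land near 'mother'.
candidate = emb["daughter"] + parent_offset

def nearest(vec, table):
    # Cosine similarity against every known word; return the best match.
    return max(table, key=lambda w: np.dot(vec, table[w]) /
               (np.linalg.norm(vec) * np.linalg.norm(table[w])))

print(nearest(candidate, emb))  # -> "mother" in this toy setup
```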
However, models hallucinate. It's not necessarily confined to large models either - hallucination is a native byproduct of how generative AI is built. Ultimately this is where the model gives the wrong information, or creates a relationship in that embedding space that shouldn't exist, resulting in erroneous output.
The problem of hallucinations derives from a number of areas, but I'll pick two here. First is simply facts - large general models are poor at holding facts. They're good at concepts and explaining concepts, but asking a general model about a person's birthday is often a no-go area. The reason is that in the dataset, even if the right answer is the most likely, there will be lots of similar pieces of information which could be chosen as part of the response from the model. A good example here is when I asked a general Llama2-7B model for AMD CEO Lisa Su's birthday - it got the year correct, but the date was actually the date attributed to the discovery of the transistor. Lisa Su is closely linked with chips and transistors, and so in the embedding space that date was chosen as a likely candidate to fit the answer. The model hallucinated.
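To illustrate the mechanics (with entirely made-up candidate answers and scores, not the real distribution inside any model), here's a toy sketch of how a decoder sampling from a probability distribution can return a related-but-wrong fact a worrying fraction of the time:

```python
# Toy sketch: when the correct answer is only slightly more likely than a
# related-but-wrong one, sampling will regularly pick the wrong one.
# Candidates and logits are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
candidates = ["correct birthday", "transistor discovery date", "another plausible date"]
logits = np.array([2.2, 2.0, 0.5])   # correct answer only narrowly ahead

probs = np.exp(logits) / np.exp(logits).sum()
for c, p in zip(candidates, probs):
    print(f"{c}: p = {p:.2f}")

# Sample the answer a few times, as a generative decoder would:
print(rng.choice(candidates, size=5, p=probs))
```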
Second comes from how these general models are trained. The dataset may be public information - correct or incorrect (cough, Reddit, Wikipedia), or even contradictory - but these models are designed to give you an answer, right or wrong. Unless the question is caught in the guard rails of 'don't answer questions about this topic', almost all language models are predisposed to give answers, regardless of whether they're actually correct. This applies not only to facts, but also to concepts that weren't directly in the dataset but may be derived from it. To a given model, LiDAR and RADAR might look interchangeable, or the number 10 million might carry the same weight as 3 million - which makes a lot of difference if you're using a model for employment contracts.
Part of the issue with these models is that general training data is just that - general. A well-formed dataset (which most aren't) will provide output at a similar level across many topics. The loss function (a measure of error, where lower numbers are better) across a wide array of tests will typically come out similar regardless of the topic in the test. So hallucinations can occur across many different concepts in the model, regardless of the parameter count of the model. Typically, training a large model on a dataset from scratch is a one-shot event, simply because the dataset is massive and the cost to train on that data is immense - we're fast approaching billions of dollars for the largest models today already, and that's before the cost of the GPUs.
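For reference, the loss in question is typically the cross-entropy of the model's next-token predictions. A minimal sketch, using a made-up four-word vocabulary and hand-picked probabilities:

```python
# Cross-entropy loss for next-token prediction over a toy 4-word vocabulary.
# Lower is better: a model that puts high probability on the correct token
# gets a loss near zero; a confidently wrong model gets a large loss.
import numpy as np

vocab = ["paris", "london", "berlin", "rome"]
correct = vocab.index("paris")

def cross_entropy(probs, target):
    return -np.log(probs[target])

confident_right = np.array([0.97, 0.01, 0.01, 0.01])
unsure          = np.array([0.40, 0.30, 0.20, 0.10])
confident_wrong = np.array([0.01, 0.97, 0.01, 0.01])

for name, p in [("confident & right", confident_right),
                ("unsure", unsure),
                ("confident & wrong", confident_wrong)]:
    print(f"{name}: loss = {cross_entropy(p, correct):.3f}")

# Averaging this loss over a broad, general dataset hides the fact that
# specific facts can still be predicted badly (i.e. hallucinated).
```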
There are a number of ways to help deal with hallucinations that are already put into practice.
The first choice is to have a domain-specific model, which is only trained on the data it needs. This has some edge-case quirks and won't be able to generalise outside of its field very well, and it still runs into the same issue of not knowing which facts might be contextually related. The embedding of multiple dates from the dataset against one topic can easily get confusing.
An initial way of dealing with hallucinations was to engage in co-prompting. This involves pairing the user prompt with related, accurate material in the background. For example, an assistant designed to provide support for a given product could be co-prompted with all the relevant PDFs or databases of information about that product (or the company's products) alongside whatever the user asks. The model can be designed to treat the co-prompt as a higher standard of accuracy compared to the generalised information, however it still relies on the model selecting the co-prompt as the source of the right answer. It also requires the model to accept inputs of thousands, or perhaps millions, of tokens, which increases the compute requirements for any inference design by several orders of magnitude, making it cost ineffective in the long run - especially if the co-prompt is multi-modal (images, audio, or video instead of text). Results were better than raw models, but still lacked precision.
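A minimal sketch of the co-prompting idea, assuming placeholder reference documents and a hypothetical generate() call rather than any specific product API:

```python
# Minimal sketch of co-prompting: the user's question is paired with curated
# reference text before it ever reaches the model. The reference documents
# and the generate() call are placeholders, not a specific product's API.

reference_docs = [
    "Product X supports firmware versions 2.1 through 3.4.",
    "Warranty claims for Product X must be filed within 24 months.",
]

def build_co_prompt(user_question: str) -> str:
    context = "\n".join(f"- {doc}" for doc in reference_docs)
    return (
        "Answer using ONLY the reference material below. "
        "If the answer is not in the material, say so.\n\n"
        f"Reference material:\n{context}\n\n"
        f"Question: {user_question}\nAnswer:"
    )

prompt = build_co_prompt("How long is the warranty on Product X?")
# response = generate(prompt)  # every reference token is re-processed on
#                              # every request, which is what drives the cost up
print(prompt)
```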
Next we have fine-tuning. This is similar to building a domain-specific model from scratch, but instead we start with a general model and fine-tune some of the embedding tables on known, curated data. Fine-tuned models get some of the way there - it's how we get ChatGPT from GPT-3, after all. In fine-tuning there is a preference for the correct data, and the model can still generalise across a number of topics as it came from a general model; however, there are markets where fine-tuning simply isn't accurate enough for the use case. Fine-tuning can also be computationally intensive.
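For the curious, here is a bare-bones sketch of what fine-tuning on curated data looks like, assuming the Hugging Face Transformers library, a small stand-in checkpoint (gpt2), and a couple of invented Q&A pairs - real pipelines add proper datasets, batching, and evaluation:

```python
# Bare-bones fine-tuning loop on a handful of curated text pairs.
# The checkpoint and the Q&A strings are stand-ins for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any small base model
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

curated = [
    "Q: What GPUs does our training cluster use? A: Example Accelerator Z100.",
    "Q: What is the support SLA? A: Four business hours.",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(3):
    for text in curated:
        batch = tok(text, return_tensors="pt", padding=True)
        # Using the inputs as labels gives the standard next-token loss.
        out = model(**batch, labels=batch["input_ids"])
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```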
One common tactic mentioned in the industry today is RAG, or retrieval-augmented generation. This is similar to co-prompting, but it changes the way the model accesses the data. Instead of the data being attached to the user prompt, it sits in a validated database that the model can draw on to assist in generating output. This means that a legal model, for example, can have databases of cases on hand, from which specific ones can be retrieved to provide answers and context. RAG has been shown to vary in performance, because it still relies on data outside of the model's embedding. It can be as bad as co-prompting, or as good as the best fine-tuning.
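A minimal sketch of the RAG flow, using a crude bag-of-words similarity as a stand-in for a proper embedding model, and a hypothetical generate() call:

```python
# Minimal RAG sketch: score a small document store against the query,
# retrieve the closest passages, and build a grounded prompt.
# The documents are invented and generate() is a placeholder.
from collections import Counter
from math import sqrt

documents = [
    "Case 2021-114: the court ruled the non-compete clause unenforceable.",
    "Case 2019-087: liquidated damages capped at 10% of contract value.",
    "Case 2023-042: email acceptance constitutes a binding agreement.",
]

def embed(text: str) -> Counter:
    # Crude bag-of-words 'embedding' for illustration only.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2):
    q = embed(query)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

query = "Is a non-compete clause enforceable?"
context = "\n".join(retrieve(query))
prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
# response = generate(prompt)  # the model still decides whether to trust the context
print(prompt)
```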
We should also tip a nod here to the concept of experts in language models. Mixture of Experts (MoE) models rely on multiple optimized smaller models, each with a more refined and specialised dataset, and then a hierarchical routing function (or tree) to send information to specific experts to get a relevant answer. Most of the online super LLMs use MoE structures to help improve accuracy, and an additional benefit is in performance and cost - Mixtral 8x7B, for example, is nominally a ~47 billion parameter MoE, but an average input only activates around 13 billion of those parameters per token, reducing compute and memory requirements while still giving output that is competitive with, or better than, much larger dense models.
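A condensed sketch of how top-k expert routing works inside an MoE layer - toy sizes for illustration, not any production model's configuration:

```python
# Sketch of top-k routing in a mixture-of-experts layer: a small gating
# network scores the experts per token, and only the top-scoring experts
# run, so far fewer parameters are active than the model nominally holds.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):             # token loop for clarity, not speed
            for w, e in zip(weights[t], idx[t]):
                out[t] += w * self.experts[int(e)](x[t])
        return out

layer = TinyMoELayer()
tokens = torch.randn(5, 64)
print(layer(tokens).shape)  # torch.Size([5, 64]); only 2 of 8 experts ran per token
```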
All of these techniques work on the principle that generalised knowledge, when trained with sufficient data or in the correct way, increases accuracy, decreases hallucinations, and minimises the loss function (as described above). After general training, the loss function is further reduced by fine-tuning, RAG, or MoE. However they all suffer from the fact that, even with MoE, the overall goal is reducing the average loss function across the whole array of knowledge with verified data.
A modern LLM not only has to be general but, for a lot of commercial use cases, has to hold specific knowledge. This is where Lamini comes in, with a claimed ability to almost completely remove hallucinations on given topics. The method is itself interesting, but it also raises a question that might define how machine learning compute profiles change in future - perhaps sizably, in the same way transformers changed them compared to previous convolutional neural networks.
The Lamini-1 Solution: Memory Tuning
In a research paper published on 13th June 2024, Lamini detailed Memory Tuning - an aggressive way to embed specific data into models even as small as 3 billion parameters. The problem, as CTO Greg Diamos explained it to me, is one of methodology, but it also means going against some historic ML optimization thinking.
Lamini Memory Tuning takes the concept of MoE and turbocharges it in a very specific way. Each expert is routed to an adapter which is tuned on curated data at a rate 100x that of regular fine-tuning. The tuning is easier than fine-tuning because these are adapters (like LoRA) rather than a full embedding table of weights to optimize. As a result, each adapter can hold raw, random-string style information as part of its own dataset, and because it gets trained on that at a 100x rate, it stays there. Over the entire model this creates a '1-million-way' mixture of experts, so to speak, which Lamini is coining a Mixture of Memory Experts (MoME).
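To make that concrete, here is a conceptual sketch of the idea as I understand it from the description above - a frozen base layer, a bank of tiny LoRA-style adapters acting as memory experts, and a router that picks one per input. This is my own illustrative code with invented sizes, not Lamini's implementation:

```python
# Conceptual sketch only, based on the description above - not Lamini's code.
# A frozen base projection is augmented with a bank of tiny low-rank adapters
# ("memory experts"); a router selects one per input, and only that adapter
# is trained (aggressively, towards near-zero loss) on its slice of facts.
import torch
import torch.nn as nn

class MemoryExpertLayer(nn.Module):
    def __init__(self, d_model=64, n_experts=1000, rank=4):
        super().__init__()
        self.base = nn.Linear(d_model, d_model)
        self.base.requires_grad_(False)          # general knowledge stays frozen
        self.router = nn.Linear(d_model, n_experts)
        # Low-rank adapter pairs: cheap to store and cheap to train.
        self.down = nn.Parameter(torch.randn(n_experts, d_model, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(n_experts, rank, d_model))

    def forward(self, x):                        # x: (batch, d_model)
        expert = self.router(x).argmax(dim=-1)   # one memory expert per input
        delta = torch.stack([
            x[i] @ self.down[e] @ self.up[e] for i, e in enumerate(expert)
        ])
        return self.base(x) + delta

layer = MemoryExpertLayer()
print(layer(torch.randn(3, 64)).shape)  # torch.Size([3, 64])
```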
Simply put, it's like putting hard facts into your model.
There’s a debate in ML as to how many hard facts a large model can actually know with near certainty. It’s well beyond the scope of this article, but a fun rabbit hole to go down.
If we go back to that loss function concept from earlier, this looks very much like overfitting specific domain knowledge into the model. The loss function for that concept becomes orders of magnitude better, at practically zero expense to the general knowledge of the model. Now the model can recall exact data - in practice this could be information on a company's product portfolio, a help desk dealing with support documentation, or even language models dealing with code. The fact that it works effectively on low-billion parameter models will help bring MoME to edge use cases.
I mentioned earlier that in order to do this, some of the conventional thinking about model training had to go out the window. In the ML world, there's usually a reluctance to 'over-fit' data, with the expectation that it will ruin the general reasoning of the rest of the model. Ultimately the perception is that a model can only hold 'so much' data (akin to the internet being a series of tubes), and so by over-fitting you end up degrading performance in other areas. In my discussion with Greg, he said Lamini's methodology has practically zero effect on the rest of the model. This is important, as LLMs have to retain general reasoning, though for domain-specific MoME it's less of an issue.
Lamini's explanation of MoME on their website is a really good read, and it also explains how the computational requirements for this are much lower than regular fine-tuning. This is because the knowledge domains being optimized to remove hallucinations are narrow by definition - you're not re-tuning the whole embedding table in a full sweep, but a super-small portion many dozens of times. What isn't mentioned, and what I put to the team, is whether they've considered what it does to the computational profile of inference.
In ML, we saw a big shift in computational requirements moving from convolutional neural networks (CNNs) and computer vision over to transformers. Transformers were a big breakthrough, but they changed the compute and memory requirements of these models. Computer hardware built specifically to optimize for CNNs was often left behind when it came to transformers, because it lacked the additional compute functionality required, or it didn't have the right ratio of compute to memory capacity and memory bandwidth to keep the pipeline fully utilized. On the inference side this matters, especially given that, longer term, the revenue from inference is expected to eclipse training costs by many orders of magnitude.
The question is whether, compared to a standard model - say a Llama3-8B - a new Llama3-8B+1MxMoME (that's with 1 million memory experts) has a substantially different compute profile, enough to drive a shift in computer architecture. The answer is that the research still needs to be done. If there's one thing in the AI space that might cause an upheaval among the silicon players, it's another transformer-like evolution of the market - and whether any of the hardware manufacturers see it coming and/or can pivot quickly to supporting it at speed and at scale.
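As a very rough back-of-envelope (my own assumptions on adapter rank, hidden size, and expert count - not Lamini's published configuration), the storage-versus-active-compute balance shifts like this:

```python
# Rough back-of-envelope, using assumed figures rather than Lamini's:
# compare a dense 8B model with a hypothetical version carrying one million
# small memory-expert adapters, of which only one is active per token.
d_model = 4096
rank = 8
n_experts = 1_000_000

dense_params = 8e9
adapter_params_each = 2 * d_model * rank          # low-rank down + up projection
total_adapter_params = n_experts * adapter_params_each

active_dense = dense_params                       # every weight used per token
active_mome = dense_params + adapter_params_each  # plus just one adapter

print(f"Extra parameters to store:  {total_adapter_params / 1e9:.1f}B")
print(f"Active per token (dense):   {active_dense / 1e9:.2f}B")
print(f"Active per token (+MoME):   {active_mome / 1e9:.2f}B")
# Storage grows a lot while per-token compute barely moves - a very different
# ratio of memory capacity and bandwidth to FLOPs than a dense model.
```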
Lamini says that its Memory Tuning / MoME capability has already been deployed with a number of customers, including a Fortune 500 company now experiencing 10x fewer hallucinations in text-to-SQL code generation.
For further details, we have these links:
Case study: http://www.lamini.ai/blog/llm-text-to-sql
Paper: https://github.com/lamini-ai/Lamini-Memory-Tuning/blob/main/research-paper.pdf
Model weights: https://huggingface.co/engineering-lamini/lamini-1-random
To finish up, I must point out something I didn’t know until I’d already written most of this article. I’m being told 'MoME’ is pronounced ‘mommy’. Do with that information what you will. I’m British, so unless there’s a ‘mummy’ down the line, it makes no sense to me 😁