Efficiency meets performance: comparing open-source LLMs - DBRX, Jamba, Qwen

The last week of March 2024 was a frenzy of LLM releases.

Databricks/Mosaic launched DBRX as a new SOTA open model.

AI21 launched Jamba, combining the best of both worlds: Transformer + Mamba.

Alibaba's Qwen team released Qwen1.5-MoE-A2.7B.

Around the same time, Mistral announced the release of their 7B v0.2 base model, mainly for a hackathon they ran in San Francisco: https://twitter.com/MistralAILabs/status/1771670765521281370

As you can see, four major models landed in the open-source community from four different companies - AI21, Alibaba, Databricks, and Mistral - all competing not only against each other but also against other SOTA open-source models like Llama 2 and proprietary models like Claude 3, GPT-3.5, and GPT-4. Some of them even beat those models on benchmarks or come close to the performance of the proprietary ones.

In this article, I compare three of these models against each other to see which one fares better. Have a read!

Benchmarks:

Before we get started comparing these newly minted models, let's first understand the benchmarks that all of these model creators report against - in plain, non-scientific language. There are many benchmarks, but for brevity I am explaining five of them:

1. MMLU (Massive Multitask Language Understanding):

MMLU is a comprehensive benchmark that evaluates the performance of advanced AI language models across 57 different tasks, covering a wide range of subjects like humanities, social sciences, and STEM. The benchmark uses multiple-choice questions sourced from real exams, such as AP, SAT, GRE, and professional licensing exams, to test the models' ability to understand and apply knowledge without any specific training or fine-tuning.

Imagine you're taking a challenging exam that covers a wide range of topics, such as math, science, history, and literature. The MMLU benchmark is designed to test a model's ability to understand and solve problems across all these different areas.

Now, if you were to take this exam and get 73.7% of the questions right, you'd be doing pretty well! It would mean that you have a strong, well-rounded knowledge base and problem-solving skills.

Similarly, when DBRX scores 73.7% on the MMLU benchmark, it demonstrates that the model has a robust understanding of language and can apply this knowledge effectively to answer questions and solve problems across a broad range of subjects. This score suggests that DBRX could be a powerful tool for tasks that require general language understanding and reasoning abilities.
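
To make that scoring concrete, here is a minimal sketch of how a multiple-choice accuracy number like 73.7% is computed. The questions and the ask_model() function below are hypothetical stand-ins, not the real MMLU data or a real model call.

```python
# Minimal sketch of multiple-choice accuracy scoring (MMLU-style).
# The questions and ask_model() are hypothetical stand-ins.

questions = [
    {"prompt": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": "B"},
    {"prompt": "At sea level, water boils at?", "choices": ["50C", "75C", "100C", "150C"], "answer": "C"},
]

def ask_model(prompt: str, choices: list[str]) -> str:
    """Placeholder for a real model call that returns 'A', 'B', 'C' or 'D'."""
    return "B"  # a real harness would query the LLM here

correct = sum(ask_model(q["prompt"], q["choices"]) == q["answer"] for q in questions)
accuracy = correct / len(questions)
print(f"MMLU-style accuracy: {accuracy:.1%}")  # DBRX reports 73.7% on the real benchmark
```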

2. Programming (HumanEval):

HumanEval is a benchmark that assesses the ability of advanced AI language models to write functional Python code based on human-written programming problems.

Imagine you're learning to code and your teacher gives you a set of 100 coding exercises to complete. These exercises cover various aspects of programming, such as writing functions, loops, and conditional statements. The HumanEval benchmark is essentially a collection of such coding problems designed to test a model's ability to generate working code.

Now, if you were to attempt these 100 coding exercises and successfully solve 70 of them, you'd be doing quite well! It would demonstrate that you have a solid grasp of programming concepts and can apply them to solve real-world coding problems.

When a model like DBRX scores 70.1% on the HumanEval benchmark, it means the model can generate code that solves a significant portion of the programming problems in the benchmark. This is impressive because it shows that DBRX can not only understand and analyze code but also create new code that works.

In practical terms, this suggests that DBRX could be a valuable tool for tasks like code generation, code completion, and even automated programming assistance. It could potentially help developers write code more efficiently and accurately by providing intelligent suggestions and automating routine coding tasks.
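
To show what this looks like in practice, here is a small sketch of a HumanEval-style task: the model receives a function signature plus docstring, generates a body, and the result is checked against unit tests. The problem and the "generated" solution below are invented for illustration and are not taken from the actual benchmark.

```python
# A HumanEval-style task: the model is given a signature and docstring,
# must produce a working body, and hidden tests decide pass/fail.
# This problem and solution are invented for illustration.

PROBLEM = '''
def running_max(numbers: list[int]) -> list[int]:
    """Return a list where element i is the maximum of numbers[:i+1]."""
'''

# Imagine this body was generated by the model:
GENERATED_BODY = '''
    result, current = [], None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result
'''

namespace = {}
exec(PROBLEM + GENERATED_BODY, namespace)  # assemble and load the candidate

# Unit tests decide pass/fail; the benchmark score is the fraction of problems solved.
assert namespace["running_max"]([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
print("candidate passed -> counts toward the HumanEval score")
```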

3. Math (GSM8K):

GSM8K (Grade School Math 8K) is a benchmark that evaluates the mathematical problem-solving abilities of AI language models.

Imagine you have a math workbook filled with 100 word problems that cover a range of topics, like algebra, geometry, and probability. These problems are designed to test your ability to understand the question, identify the relevant information, and then solve the problem step by step. The GSM8K benchmark is essentially a collection of such math word problems, but they're meant to evaluate a model's mathematical problem-solving skills.

Now, if you were to work through this math workbook and successfully solve around 67 out of the 100 problems, you'd be doing pretty well! It would show that you have a strong foundation in math and can apply your knowledge to solve real-world problems.

Similarly, when DBRX scores 66.9% on the GSM8K benchmark, it demonstrates that the model can understand and solve a significant portion of these math word problems. This is impressive because it means that DBRX can not only comprehend the natural language description of the problem but also perform the necessary mathematical reasoning to arrive at the correct solution.

In practical terms, this suggests that DBRX could be a useful tool for tasks that involve mathematical problem-solving, such as helping students with their math homework, assisting educators in creating and solving math problems, or even aiding professionals in fields like finance or engineering where mathematical reasoning is crucial.
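
Here is a rough sketch of how GSM8K-style grading typically works: the model writes out its reasoning step by step, and only the final number it produces is compared against the gold answer. The word problem and the model output below are invented examples.

```python
import re

# Sketch of GSM8K-style grading: only the final number in the model's
# step-by-step answer is compared with the gold answer.
# The problem and model_output are invented examples.

problem = (
    "A bakery sells muffins for $3 each. Priya buys 4 muffins and pays "
    "with a $20 bill. How much change does she get?"
)

model_output = (
    "4 muffins cost 4 * 3 = 12 dollars. "
    "Change is 20 - 12 = 8 dollars. The answer is 8."
)

gold_answer = 8

# Take the last number in the model's response as its final answer.
numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
predicted = float(numbers[-1]) if numbers else None

print("correct" if predicted == gold_answer else "incorrect")
```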

4. HellaSwag:

HellaSwag is a benchmark that assesses the ability of AI language models to understand and reason about complex, real-world situations described in natural language.

Imagine you're reading a collection of short stories, but each story is missing its ending. Your task is to choose the most appropriate or likely ending from a set of options provided. The HellaSwag benchmark works similarly, but it's designed to test a model's ability to understand context and make reasonable inferences based on the given information.

For example, consider this simple scenario:

"Raj went to the beach for a swim. He brought his towel, sunscreen, and _____."

Options: a) a book, b) a pizza, c) a snowboard, d) a beach ball

If you were to choose the most likely ending for this scenario, you'd probably pick option (a) or (d) because they make the most sense given the context of going to the beach for a swim.

Now, if you were to complete 100 such scenarios and choose the correct ending for 89 of them, you'd be doing exceptionally well! It would demonstrate that you have a strong understanding of different situations and can make logical inferences based on the given context.

Similarly, when DBRX scores 89% on the HellaSwag benchmark, it means that the model can comprehend various scenarios and choose the most appropriate continuation or ending for a vast majority of them. This is impressive because it shows that DBRX has a deep understanding of context and can reason about different situations in a way that aligns with human expectations.

In practical terms, this suggests that DBRX could be a valuable tool for tasks that involve understanding and generating coherent narratives, such as story writing, content creation, or even conversational AI systems that need to maintain context across multiple turns of dialogue.
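
Under the hood, the model typically scores each candidate ending (for example, by the likelihood it assigns to the ending given the context), and the highest-scoring option counts as its answer. Here is a toy sketch of that selection step; score_ending() is a placeholder with made-up numbers, not a real model call.

```python
# Toy sketch of HellaSwag-style selection: score every candidate ending
# and pick the highest-scoring one. score_ending() is a placeholder.

context = "Raj went to the beach for a swim. He brought his towel, sunscreen, and"
endings = ["a book.", "a pizza.", "a snowboard.", "a beach ball."]
gold_index = 3  # "a beach ball." is the labelled continuation in this toy example

def score_ending(context: str, ending: str) -> float:
    """Placeholder: a real harness would return the LLM's log-likelihood."""
    toy_scores = {"a book.": -4.1, "a pizza.": -7.9,
                  "a snowboard.": -9.2, "a beach ball.": -3.5}
    return toy_scores[ending]

prediction = max(range(len(endings)), key=lambda i: score_ending(context, endings[i]))
print("correct" if prediction == gold_index else "incorrect")
```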

5. ARC-challenge:

ARC (AI2 Reasoning Challenge) is a benchmark that evaluates the reasoning and knowledge application capabilities of AI language models in the context of grade-school science questions.

Think of it this way: imagine you're studying for a really tough science exam that covers a wide range of topics, like biology, chemistry, and physics. The questions on this exam are designed to test your understanding of complex scientific concepts and your ability to apply that knowledge to solve problems. The ARC-challenge benchmark is essentially a collection of such challenging science questions, but they're meant to evaluate a model's scientific reasoning capabilities.

Now, if you were to take this difficult science exam and correctly answer around 69 out of 100 questions, you'd be doing really well! It would demonstrate that you have a strong grasp of scientific principles and can apply that knowledge to answer complex questions.

Similarly, when DBRX scores 68.9% on the ARC-challenge benchmark, it means that the model can understand and correctly answer a significant portion of these challenging scientific questions. This is impressive because it shows that DBRX has a deep understanding of various scientific concepts and can reason about them in a way that leads to correct answers.

In practical terms, this suggests that DBRX could be a powerful tool for tasks that involve scientific reasoning and problem-solving. For example, it could potentially assist students in learning and understanding complex scientific topics, help researchers analyze and interpret scientific data, or even aid in the discovery of new scientific insights by connecting ideas across different domains.

Models:

In my review, I am only comparing DBRX, Jamba, and Qwen, as benchmark information is publicly available only for these three models. I will update this article to include the new Mistral model once more data is available.

DBRX:

DBRX is a powerhouse model that optimizes the standard Transformer architecture with the mixture-of-experts (MoE) approach, resulting in impressive efficiency gains and strong performance across a wide range of benchmarks. The model shines in tasks involving language understanding, reasoning, and code generation, consistently achieving top scores on benchmarks like MMLU, HellaSwag, and HumanEval. DBRX's ability to match the performance of larger dense models while using 4 times less pretraining compute and delivering 2-3 times faster inference makes it a highly attractive option for various applications. Its versatility and efficiency make DBRX a go-to choice for many natural language processing tasks.
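
Since both DBRX and Qwen1.5-MoE-A2.7B lean on mixture-of-experts, here is a toy sketch of what an MoE layer does: a router sends each token to a small subset of experts, so only a fraction of the layer's parameters are active for any given token. This is a simplified illustration with made-up dimensions, not DBRX's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy mixture-of-experts layer (not DBRX's actual implementation):
# a router picks the top-k experts per token, so only a fraction of
# the layer's parameters are used for each token.

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        gate_logits = self.router(x)
        weights, chosen = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = chosen[:, slot] == e
                if mask.any():  # route only the tokens assigned to expert e
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(5, 64)
print(ToyMoELayer()(tokens).shape)  # torch.Size([5, 64])
```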

Jamba:

Jamba is an innovative model that pushes the boundaries of efficiency and long-context processing by combining Transformer, Mamba, and mixture-of-experts (MoE) layers in a novel hybrid architecture. Its standout feature is the ability to handle extremely long sequences, offering a massive 256K token context window while still fitting 140K tokens on a single GPU. This makes Jamba particularly well-suited for tasks that require processing and understanding long-form content, such as document summarization or extended conversations. The model's ability to deliver 3 times faster inference on long sequences compared to similarly-sized models further highlights its efficiency advantages. Jamba's unique architecture and long-context capabilities make it an exciting option for tackling challenging tasks that require processing and understanding large amounts of information.
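
If you want to try Jamba's long-context abilities yourself, a minimal sketch with Hugging Face transformers might look like the following. This assumes the public checkpoint ai21labs/Jamba-v0.1, a transformers version with Jamba support, and enough GPU memory for the context you actually send; report.txt is a hypothetical long document.

```python
# Hedged sketch: long-context prompting with Jamba via Hugging Face transformers.
# Assumes the "ai21labs/Jamba-v0.1" checkpoint and sufficient GPU memory.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

long_document = open("report.txt").read()  # hypothetical local file with a long document
prompt = f"{long_document}\n\nSummarize the key points of the document above:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=300)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```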

Qwen1.5-MoE-A2.7B:

Qwen1.5-MoE-A2.7B is an impressive model that demonstrates the power of the mixture-of-experts (MoE) architecture. Its standout feature is the ability to match the performance of much larger 7B models while using only a third of the active parameters. This makes Qwen1.5-MoE-A2.7B incredibly efficient, reducing training costs by 75% and speeding up inference by 1.74 times compared to its larger counterpart. Despite its smaller size, the model maintains competitive performance across various benchmarks and excels in multilingual tasks. Qwen1.5-MoE-A2.7B showcases the potential of MoE architectures to deliver high performance with improved efficiency.

Comparison:

Now, with a basic understanding of the benchmarks and a quick introduction to each model out of the way, let's compare them and see how they perform against each other. (Each provider compares itself against other open-source and proprietary models in its own way - you can review those comparisons on their respective web pages.)

[Table: benchmark comparison of the three open-source models]

Observations:

  1. Qwen1.5-MoE-A2.7B demonstrates competitive performance on the MMLU and GSM8K benchmarks given its smaller size, though it still lags behind DBRX Instruct and Jamba on these tasks.

  2. On the HumanEval benchmark for code generation, Qwen1.5-MoE-A2.7B achieves a score of 34.2, which is lower than DBRX Instruct's impressive 70.1.

  3. Qwen1.5-MoE-A2.7B reports a multilingual benchmark score of 40.8, a capability not highlighted by Databricks (DBRX) or AI21 (Jamba). Based on my past experience with AI21, I am going to assume that Jamba is multilingual in at least five other languages, though I don't have a benchmark to compare against.

  4. DBRX Instruct maintains its lead on most benchmarks, showcasing strong performance across various tasks such as language understanding (MMLU), reasoning (HellaSwag, WinoGrande), and code generation (HumanEval).

  5. Jamba continues to demonstrate competitive performance, particularly on reasoning tasks like HellaSwag, WinoGrande, and PIQA.

All three models leverage the MoE architecture in different ways to improve efficiency and performance. DBRX excels in overall benchmark performance and efficiency gains, Jamba shines in handling long context tasks with its unique hybrid architecture, and Qwen1.5-MoE-A2.7B stands out for its ability to match larger model performance with fewer active parameters.

Use cases

But what about real-life scenarios? Which model should you use based on these benchmarks? In practice you may have hundreds of different tasks, but let me cover three that most companies perform: 1) chatbot, 2) questions and answers, 3) document summary:

chatbot:

  • Jamba could be a great choice for chat applications, especially if the conversations involve long context or require maintaining context over extended exchanges. Its hybrid architecture and ability to handle long sequences efficiently make it well-suited for this task.

  • DBRX's strong performance across various benchmarks, including reasoning and language understanding, could also make it a solid contender for chat applications.

questions and answers:

  • DBRX's leading performance on benchmarks like MMLU (language understanding) and WinoGrande (reasoning) suggests it could be a top choice for question-answering tasks. Its efficiency gains and fast inference times could also be beneficial for handling high volumes of questions.

  • Qwen1.5-MoE-A2.7B's competitive performance on MMLU and its multilingual capabilities could make it a good fit for question-answering tasks, particularly if the questions are in multiple languages.

document summary:

  • Jamba's ability to handle long context tasks with its massive context window and efficient processing of long sequences could be a significant advantage for document summarization. It can potentially process and summarize longer documents more effectively.

  • DBRX's overall strong performance and efficiency gains could also make it a viable option for document summarization tasks, especially if the documents are relatively shorter.

It's important to note that my suggestions are based on general strengths observed from their benchmarks. In practice, the best model for each task would depend on factors such as the specific dataset, the desired output format, and the performance metrics prioritized (e.g., accuracy, efficiency, inference speed).

Ideally, you would want to test each model on a representative sample of the actual task data and compare their performance to make a more informed decision.
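
As a starting point, a quick-and-dirty harness for that kind of side-by-side test might look like this. The Hugging Face model IDs are the publicly listed checkpoints at the time of writing (verify them, their licenses, and your hardware budget before running), the task prompts are placeholders for your own data, and some checkpoints may additionally require trust_remote_code=True.

```python
# Hedged sketch: run a few of your own task prompts through each candidate model.
# Model IDs are the publicly listed checkpoints at the time of writing; verify them,
# check license/gating, and note some may need trust_remote_code=True.

from transformers import AutoModelForCausalLM, AutoTokenizer

candidates = [
    "databricks/dbrx-instruct",
    "ai21labs/Jamba-v0.1",
    "Qwen/Qwen1.5-MoE-A2.7B-Chat",
]

task_samples = [  # replace with a representative slice of your real task data
    "Summarize: Our Q3 revenue grew 12% while costs stayed flat...",
    "Answer using the policy below: What is the refund window? ...",
]

for model_id in candidates:
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    for prompt in task_samples:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=200)
        completion = tokenizer.decode(
            output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        print(f"--- {model_id} ---\n{completion}\n")
```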

Conclusion

In conclusion, the emergence of these three models - Qwen1.5-MoE-A2.7B, DBRX, and Jamba - represents a significant milestone in the development of large language models. Each model brings unique strengths and innovations to the table, pushing the boundaries of what's possible in terms of performance, efficiency, and specialized capabilities.

Qwen1.5-MoE-A2.7B showcases the power of the mixture-of-experts architecture to deliver high performance with a fraction of the parameters, DBRX demonstrates impressive versatility and efficiency gains across a wide range of benchmarks, and Jamba introduces a novel hybrid architecture that excels in processing long-form content.

As we continue to explore and refine these models, it's clear that they have the potential to revolutionize various applications. The efficiency gains and specialized capabilities offered by these models make them increasingly accessible and practical for real-world deployment. As the field of AI continues to evolve at a rapid pace, we can expect to see even more impressive advancements in the coming months. The open-source nature of these models encourages collaboration and innovation within the AI community, paving the way for further breakthroughs.

What do you all think?
