Is 3 Llama better than 2? Better be!
A new baby has been born in the META family, exactly 9 months after the last one! META released their Llama 2 family of models on July 18th, 2023, and now the Llama 3 models (at least 2 models; one more is still training!) were released yesterday, April 18th, 2024.
We have been well aware of Mark Zuckerberg’s intention to create open-source AGI (Artificial General Intelligence) since he shared updates on Instagram about 13 weeks ago. These two models in the Llama 3 family are part of META’s iterative approach to reaching open-source AGI in a few years.
In this post, I will review what META released and how it compares to their previous babies, the Llama 2 models. Additionally, I have started a tracker where I track the scale and performance of various models, starting with the 10 most promising open-source and closed-source models.
Models:
The architecture of the model remains more or less the same decoder-only transformer architecture (read more on decoder-only transformer architecture). They have created a tokenizer with a vocabulary of 128K tokens, which gives them improved model performance. The release covers two sizes, 8B and 70B, each as a base model and an Instruct model:
meta-llama/Meta-Llama-3-8B
meta-llama/Meta-Llama-3-8B-Instruct
meta-llama/Meta-Llama-3-70B
meta-llama/Meta-Llama-3-70B-Instruct
Datasets:
They used more than 15 trillion tokens to train the Llama 3 models.
In comparison, Llama 2 used a mere 2 trillion tokens for its training.
In layman’s terms, 2 trillion tokens are approx. 20 million books, so with 15 trillion tokens in Llama 3, we are looking at the equivalent of a whopping 150 million books.
The New York Public Library has approx. 45 million research items (including books), so to train these Llama 3 models, META cobbled together more than 3 New York Public Libraries’ worth of information. I wonder where this data is coming from… It’s definitely not coming from you and me using Facebook, WhatsApp and Instagram; at least META is saying publicly that they are not using their user data to train these models.
They do include high-quality non-English data covering over 30 languages, which makes up 5% of the training data (750B tokens).
Does that mean this is a multi-lingual model? No. It’s an English-only model. But you can fine-tune this model to make it multi-lingual.
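The book and library comparisons in this section are simple ratios, so here is a minimal sketch of the arithmetic. It assumes the rough figures quoted in this post (2 trillion tokens is about 20 million books, and about 45 million items in the New York Public Library); the function and variable names are just for illustration.

```python
# Back-of-the-envelope conversion from training tokens to "books".
# Assumption (from this post): 2 trillion tokens ~ 20 million books,
# which works out to roughly 100,000 tokens per book.
TOKENS_PER_BOOK = 2e12 / 20e6

# Assumption (from this post): ~45 million items in the New York Public Library.
NYPL_ITEMS = 45e6


def tokens_to_books(tokens: float) -> float:
    """Convert a token count into an approximate number of books."""
    return tokens / TOKENS_PER_BOOK


def books_to_libraries(books: float) -> float:
    """Express a number of books in 'New York Public Libraries'."""
    return books / NYPL_ITEMS


llama3_tokens = 15e12
llama3_books = tokens_to_books(llama3_tokens)
non_english_tokens = 0.05 * llama3_tokens  # 5% of the training data

print(f"Llama 3 book equivalent: {llama3_books / 1e6:.0f} million books")
print(f"Library equivalent: {books_to_libraries(llama3_books):.1f} libraries")
print(f"Non-English tokens: {non_english_tokens / 1e9:.0f} billion")
```

Change TOKENS_PER_BOOK if you prefer a different estimate of book length; every other number scales with it.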
Compute:
META used their world-class 24K GPU clusters to train these models.
They trained it on 16K GPUs simultaneously. I think this is the largest GPU count publicly known for a single model training run.
From a performance point of view, they were able to achieve 400 TFLOPS per GPU (H100-80G), which is 40% utilization, as the NVIDIA H100 has roughly 1000 TFLOPS of FP16 performance. This is a very good utilization factor - in fact there are only a handful of engineers who are able to extract such performance from GPUs at this scale. Bravo!
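For anyone who wants to redo the utilization math, here is a small sketch. It uses the figures from this post (400 TFLOPS achieved per GPU against a peak of roughly 1000 TFLOPS), and it assumes "16K GPUs" means 16,000 for the aggregate number; the exact GPU count is not stated here.

```python
# Rough GPU utilization check, using the figures quoted in this post.
ACHIEVED_TFLOPS_PER_GPU = 400.0   # reported training throughput per GPU
PEAK_TFLOPS_PER_GPU = 1000.0      # approximate H100 FP16 peak used in this post
NUM_GPUS = 16_000                 # assumption: "16K" taken as 16,000


def utilization(achieved: float, peak: float) -> float:
    """Fraction of peak throughput actually achieved."""
    return achieved / peak


def aggregate_exaflops(tflops_per_gpu: float, num_gpus: int) -> float:
    """Total cluster throughput in EFLOPS (1 EFLOPS = 1,000,000 TFLOPS)."""
    return tflops_per_gpu * num_gpus / 1e6


util = utilization(ACHIEVED_TFLOPS_PER_GPU, PEAK_TFLOPS_PER_GPU)
total = aggregate_exaflops(ACHIEVED_TFLOPS_PER_GPU, NUM_GPUS)

print(f"Utilization: {util:.0%}")
print(f"Aggregate throughput: {total:.1f} EFLOPS")
```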
Cost:
The 8B model was trained with 1.3M GPU hours and the 70B model with 6.4M GPU hours. At $12.5 per GPU hour (on-demand cost at AWS), the 8B model would have cost about USD $16 million and the 70B model about USD $80 million (not bad compared to Google’s Gemini Ultra, which cost USD $191 million!!). This is just the compute cost to train the models, not including any experiments, data preparation cost and, the biggest of all, employee cost.
Let’s look at the employee cost: based on public information, about 270 people have been credited with building these Llama 3 models, including Mark Zuckerberg himself. With an average salary of $250K/year over 9 months, we are looking at about USD $50 million to deliver this baby (9 months!) - wow! just wow!!
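Here is the same cost estimate as a small script, so you can plug in your own rates. The GPU hours, the $12.5 per GPU hour rate, the headcount and the average salary are the rough assumptions from this post, not official META figures.

```python
# Rough training-cost estimate using the assumptions from this post.
GPU_HOUR_RATE_USD = 12.5          # assumed on-demand price per GPU hour
GPU_HOURS = {"8B": 1.3e6, "70B": 6.4e6}

HEADCOUNT = 270                   # credited contributors (approximate)
AVG_SALARY_USD = 250_000          # assumed average yearly salary
PROJECT_MONTHS = 9


def compute_cost(gpu_hours: float, rate: float = GPU_HOUR_RATE_USD) -> float:
    """Compute-only cost of a training run in USD."""
    return gpu_hours * rate


def people_cost(headcount: int, salary: float, months: int) -> float:
    """Salary cost in USD for a team over a number of months."""
    return headcount * salary * months / 12


for name, hours in GPU_HOURS.items():
    print(f"Llama 3 {name} compute: ${compute_cost(hours) / 1e6:.2f}M")

team = people_cost(HEADCOUNT, AVG_SALARY_USD, PROJECT_MONTHS)
print(f"Team cost over {PROJECT_MONTHS} months: ${team / 1e6:.1f}M")
```

With these inputs the compute comes to $16.25M for 8B and $80M for 70B, and the team cost to roughly $50.6M.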
Performance:
Clearly Llama 3 has an edge over the Llama 2 models - in some cases a significant gain of 15 to 20 percentage points when we compare general benchmarks such as MMLU, etc. (read more about what all these benchmarks are)
I believe there are two reasons for the improved performance: 1) a 7.5x increase in the training data, so the model has learned a lot more than its predecessor. 2) the smaller model grew from 7B to 8B. That extra 1B parameters is a big jump, about 14%, so that should contribute to the increase.
In comparison to non-META models, in the open-source game, Llama 3 8B has an edge over its 7B cousins from Mistral and Google. Similarly, Llama 3 70B fares well compared to its cousins, Google Gemini Pro 1.0 and Mixtral 8x22B (which was just released two days ago!)
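A quick sanity check of those two ratios, assuming the nominal numbers (2T to 15T training tokens, and 7B to 8B parameters for the smaller model) rather than exact counts:

```python
# Sanity check of the data and parameter growth between Llama 2 and Llama 3.
# Nominal figures only; exact token and parameter counts may differ.

def growth_factor(old: float, new: float) -> float:
    """How many times larger 'new' is than 'old'."""
    return new / old


def percent_increase(old: float, new: float) -> float:
    """Relative increase from 'old' to 'new', in percent."""
    return (new - old) / old * 100


data_growth = growth_factor(2e12, 15e12)     # training tokens
param_increase = percent_increase(7e9, 8e9)  # smaller model size

print(f"Training data: {data_growth:.1f}x more tokens")
print(f"Smaller model: {param_increase:.1f}% more parameters")
```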
Other things:
The context window of the base models has doubled from 4K tokens to 8K tokens. This context size seems very small in comparison to closed-source models, where 200K is standard and some go up to 1M tokens. But IYKYK: with Llama 3 being an open model (will talk about the license in a sec), you can extend the context length as much as you like - kind of an after-market hack 🙂
On the license, it still remains a META special, with permission to use for research and commercial purposes, as long as you don’t have over 700 million monthly active users. Most enterprises won’t come anywhere near that limit, but with Llama 3 there are certain changes to the terms, such as a push for “built with Llama 3” branding as part of the Meta brand strategy.
Another thing they missed last time was a simultaneous release. If I remember correctly, Llama 2 launched on Hugging Face and then Microsoft, and on AWS either in parallel or a day after. But this time, it launched everywhere at the same time - great job SageMaker team and Ankur Mehrotra on getting this out on the release date itself.
First-hand unboxing:
Along with the launch of Llama 3, they also enabled chat on their meta.ai website, powered by the Llama 3 70B model. In my test, I was satisfied with the result (read: very good for a free model)
Here is my result with meta.ai (it covered the gist of what Cerebras does, with some minor omissions)
Here is my result with the same prompt for Claude Opus (it missed the model service offering, which was covered by Llama 3):
LLM Tracker:
With so many models being released every week, it’s not easy for me to track what’s happening in the world, and I am sure the same goes for many of you. So, to track their relative performance, I have started an LLM tracker. For now, I am tracking the MMLU scores of the 10 or so most performant models, and as we get more, I will keep updating it.
Conclusion:
It's amazing to see that META can deliver back-to-back higher-quality models in less than 12 months. They are still training their biggest model, with 400 billion parameters. That may break many other records. With all the advancement happening in open-source LLMs, we are not far behind the closed-source ones, and the rivalry has just begun. All this means the future is bright and full of artificial intelligence driven advancement! What do you think?