The Cost of AI: Should You Build or Buy Your Foundation Model?

With the advancement of generative AI across industries, leaders at many organizations are under pressure to deliver generative AI functionality to their users - internal and external. Like any other technology, generative AI tech stacks are deep, wide, and offered by multiple providers. Depending on the use case, leaders have three choices:

1. Buy: Like a SaaS model, purchase it as a subscription or pay-as-you-go service.

2. Build: You can build it from scratch.

3. Neither: You can leverage someone else's model and build on top of it.

In this article, I describe some mental models to use when considering these choices. Additionally, I go over some high-level cost estimates for operating in each of these areas.

When making decisions about whether to build or buy foundation models, consider the following factors:

  1. Time-to-market: how soon you would like to see the functionality deployed to production (not just a proof of concept). Are your CEO and board asking you to be ready by next week or next month?

  2. Skills: does your team have the required science skills, or could you quickly hire or outsource them? You don't need hundreds of such people, but you do need a few. If BloombergGPT and the Falcon LLMs are any clue, their teams included fewer than 10 people (including leaders!).

  3. $$$: how much money do you want to spend? Do you want to incur opex or capex?

  4. IP: Do you want to own the intellectual property of the foundation model?

  5. Security and privacy: You may not want to expose any of your data to third-party model providers for security and privacy reasons. (Biased side note: Amazon Bedrock does not use your data to train its models, and your data is always isolated per customer - so your data privacy and security remain intact.)

Buy:

Let’s take the scenario of buying a service on a subscription, pay-as-you-go basis.

A word of caution before you decide to go this route: when you use a third-party service to perform your generative AI tasks (e.g. summarization, text generation), you are required to send certain information (e.g. your organization’s data) to the provider in the form of a prompt. Not all providers will keep your data private by default - they may use the data you send to re-train their foundation models. As an enterprise, you want to be cautious about what data you expose to such providers, as it may often contain your own proprietary information.

Ok, with security covered, let’s first see how one would use the subscription model. Pricing typically depends on the tasks you want to perform. For example, suppose your users want to summarize a large article because they only want to read a summary rather than the whole piece. In that case, you build a mechanism for your users to send the article to the model/service, and within a few seconds the model sends them back a summarized version. This is depicted in the picture below, where the top portion of the text is the sample article (in this case 563 words) and the bottom portion, in bullet points, is the summarized version of that article (in this case 91 words).

So what does it cost you to perform this simple task of summarization?
$0.00138 - less than a penny!

The calculation assumes you are using OpenAI’s GPT-3.5 Turbo model with a 4K context. Model providers use token-based pricing (not words - typically 750 English words ~= 1,000 tokens), and the tokens you send as the prompt are priced differently from the response you receive from the model. Coming back to our example: 563 words ~= 750 tokens, so the prompt costs $0.001125 (input at $0.0015/1K tokens), and the 91-word summary ~= 125 tokens, so the response costs $0.00025 (output at $0.002/1K tokens).
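The arithmetic above can be sketched in a few lines. This is a rough check only, assuming the GPT-3.5 Turbo (4K) list prices quoted above and the 750-words-per-1,000-tokens rule of thumb:

```python
# Illustrative cost check for the summarization example above.
# Assumes GPT-3.5 Turbo 4K pricing: $0.0015/1K input tokens, $0.002/1K output.
input_tokens, output_tokens = 750, 125     # ~563-word article, ~91-word summary
input_price, output_price = 0.0015, 0.002  # dollars per 1K tokens

cost = (input_tokens / 1000) * input_price + (output_tokens / 1000) * output_price
print(f"${cost:.6f}")  # fractions of a cent per article
```

Actual costs will vary with the tokenizer's real token counts and any price changes by the provider.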

What if your users are required to review various financial documents, such as public companies’ quarterly or annual reports? Per the American Accounting Association, the mean annual report runs about 55,000 words (~75K tokens). Rough calculation tells me it costs around $0.12 to summarize one annual report. There are 58,200 publicly listed companies in the world; if your users need to summarize all of their annual reports, it would take only about $6,730. Do they also need to summarize quarterly reports? Let’s add another $5,000 for the three additional quarterly reports (assuming a quarterly report is 1/4 the size of an annual one).

Consider that users may also need earnings call transcript summarization and sentiment analysis of management discussions to understand the emotional tenor and key points. A typical earnings call is 45 to 60 minutes long, with roughly 7,500-10,000 spoken words. Doing similar math as above, covering all publicly listed companies would cost approx. $1,250 per quarter for summarization and another $1,250 for sentiment analysis.

For around $14,000, you can let your users summarize the entire corpus of the world’s financial reports and summarize/generate sentiment on earnings call transcripts - just by using generative AI as a service, without building your own models.
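The whole-corpus estimate above can be reproduced with back-of-the-envelope math. The sketch below counts input tokens only (summaries are small relative to their inputs) at the GPT-3.5 Turbo input rate, so it lands slightly under the rounded figures in the text:

```python
# Rough whole-corpus cost estimate, input tokens only at $0.0015/1K tokens.
INPUT_PRICE = 0.0015 / 1000            # dollars per input token
COMPANIES = 58_200                     # publicly listed companies worldwide

annual_tokens = 75_000                 # ~55,000-word annual report
annuals = COMPANIES * annual_tokens * INPUT_PRICE

quarterly_tokens = annual_tokens / 4   # assume quarterlies are 1/4 the size
quarterlies = COMPANIES * 3 * quarterly_tokens * INPUT_PRICE

call_tokens = 12_000                   # ~7,500-10,000 spoken words per call
calls = COMPANIES * call_tokens * INPUT_PRICE        # one quarter of calls
sentiment = calls                      # same transcripts, second pass

total = annuals + quarterlies + calls + sentiment
print(f"annual: ${annuals:,.0f}  quarterly: ${quarterlies:,.0f}  "
      f"calls+sentiment: ${calls + sentiment:,.0f}  total: ${total:,.0f}")
```

Adding output tokens and using the upper word-count estimates pushes the total toward the ~$14,000 cited above.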

Using foundation models in a consume-as-a-service mode can significantly reduce time-to-market, enabling production deployment in weeks rather than months. However, as use cases expand, licensing external services will accrue ongoing costs over time.

P.S. If you decide to buy the service, my foundation model (as you can tell by now) is biased towards Amazon Bedrock, where you can easily build a multitude of generative AI applications in a simple, fast, and secure way.

Build:

Now, let’s take a scenario of you building a foundation model.

Building/Training large-scale foundation models from scratch or pre-training them requires oceans of data, computing power, and financial resources.

When it comes to the price of building/pre-training a model, we need to consider multiple factors: 1) fixed price (to train the model), 2) variable price (to serve the model), 3) the time and cost of the skilled scientists and engineers who build and evaluate such models, and 4) the ability to process or acquire the data.

Fixed price: Many large models have been trained in recent history where we have some insight into the pre-training cost. For example, Google’s PaLM model training is estimated to have cost around $12M - though that was over a year ago! Things are moving fast in the LLM world: newer hardware, newer research, and better ways to train models are likely to reduce the time and cost to pre-train models.

More recently, in the LLaMA 2 paper (July 2023), the META AI team highlighted that they trained multiple models (7B, 13B, 34B, and 70B) using a total of 3.3M GPU hours. These models outperform some of the other open-source models by a fair margin. For our calculations, we will use their biggest model - 70 billion parameters - which took 1,720,320 hours on NVIDIA A100-80G GPUs to train. The cheapest price for a single NVIDIA A100-80G (outside of data centers) as of today is $2.21/hr. Using that as our baseline, it would have cost META at least ~$3.8M to train the 70B-parameter model.
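As a sanity check, the fixed-cost figure is just the paper's GPU-hour count times the quoted on-demand price (real contracted GPU prices would differ):

```python
# LLaMA 2 70B pre-training cost at an illustrative on-demand A100-80G rate.
gpu_hours = 1_720_320      # A100-80G hours reported in the LLaMA 2 paper
price_per_hour = 2.21      # cheapest single-GPU price quoted above, $/hr

training_cost = gpu_hours * price_per_hour
print(f"${training_cost:,.0f}")  # roughly $3.8M
```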

Variable price: In the earlier case of buying a service from OpenAI, they charge by tokens. When you build your own model, you will need to host/serve it either in your data centers or somewhere on the cloud. When running models internally, costs are typically billed by hardware/software usage hours rather than by tokens processed. Let’s still assume that you can procure the A100-80G at $2.21/hr, but that you procure an entire host (which comes with 8 GPU cards), so your base cost to host is $17.68/hr. The maximum sequence length you can process for inference depends on a few factors, such as batch size, compute and memory usage, model parallelism, and sequence length. In general, the memory required to hold the model for inference is about 2 bytes per parameter (in our case, 2 × 70B = 140 GB). The compute requirement is a combination of prompt size (tokens) and the number of parameters.

The NVIDIA A100-80G can handle 312 TFLOPs and has 2 TB/s of memory bandwidth. Even if we assume FLOPs utilization maxes out at 70%, we get ~200 TFLOPs per GPU. The host costs $17.68/hr, which comes down to $0.0049/sec. The FLOPs requirement for LLaMA 2 70B is roughly 140 GFLOPs/token (~2 FLOPs per parameter). With that, the throughput in tokens/sec would look like:

8 × 200×10^12 FLOPs/sec ÷ 140×10^9 FLOPs/token ≈ 11.43×10^3 tokens/sec

That gives us $0.0049/sec ÷ 11,430 tokens/sec ≈ $0.000429 per 1K tokens.

(The computation of the per-token price is inspired by great work performed here.)

Using this as our base price, summarizing one 55,000-word (~75K-token) annual report comes down to about $0.03, and summarizing the annual reports of all 58,200 public companies to approx. $1,900. The rest of the calculations are summarized in the table below:
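The per-report and per-corpus figures follow directly from the hardware-derived token price, as this short sketch shows (same assumptions as above; no extra serving infrastructure included):

```python
# Self-hosted LLaMA 2 70B cost for the annual-report workload, at the
# ~$0.000429/1K-token hardware price derived above.
price_per_token = 0.000429 / 1000
annual_tokens = 75_000               # ~55,000-word annual report
companies = 58_200

one_annual = annual_tokens * price_per_token
all_annuals = companies * one_annual
print(f"one report: ${one_annual:.3f}, all companies: ${all_annuals:,.0f}")
```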

Remember, in the above table, the LLaMA 2 price reflects only the hardware cost under the stated assumptions; if any of those assumptions change, the cost would vary. It also doesn’t include any additional services you may need to run the LLaMA 2 model - though those would be trivial.

People/skills: To build a model like LLaMA 2, you will need highly skilled scientists and engineers. META credits 67 people on its team for building the four LLaMA 2 models. For simple math, I’ll assume each model was built by 17 people - that’s still a high number.

We have seen other model creators build large models (the 50B model from Bloomberg, the 40B Falcon models from TII) with teams of fewer than 10 people - partly because both organizations chose to use a managed service offering such as Amazon SageMaker, which removes most of the infrastructure-management heavy lifting and training orchestration.

Regardless, let’s assume the average salary of each person at META is $250,000/year and that they were occupied for 3 months building the model; the approximate people cost comes down to ~$1M.
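That staffing estimate is simple pro-rated arithmetic (salaries and team size here are the assumed round numbers from the text, not reported figures):

```python
# People cost: 17 people at an assumed $250K/year, occupied for 3 months.
people, annual_salary, months = 17, 250_000, 3
people_cost = people * annual_salary * months / 12
print(f"${people_cost:,.0f}")  # just over $1M
```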

Time-to-market: The META team calls out that they trained the LLaMA 2 models between January 2023 and July 2023, so it takes on average 3-6 months to build and train such a large model. Your time-to-market will increase if you prefer to build rather than buy a service.

In summary, building a custom foundation model similar to LLaMA 2 for summarization and sentiment analysis of public companies could cost around $3.8 million in fixed expenses, $4,000 in variable inference costs, and $1 million for staffing, totaling approximately $4.8 million. However, the benefit of building a powerful general-purpose model like LLaMA 2 is that it enables numerous additional tasks beyond those initial use cases, requiring only incremental variable costs for more inferences. In this scenario, periodic retraining of the model would likely be necessary to keep it current and may incur additional expenses. The substantial upfront investment for developing a custom foundation model can be justified if the model is versatile enough to amortize the fixed costs across many production applications.
(The above estimates do not account for any data processing requirements, as those can vary widely depending on each organization's unique data and workflows.)

What about neither?

Rather than building from scratch or buying services from a provider, you could fine-tune a pre-trained model using novel approaches (if permitted by the license terms) or enhance a model with an external knowledge source:

  1. RAG (Retrieval Augmented Generation): RAG is an approach to enhance large language models with an external knowledge source to improve their accuracy and factual grounding.

  2. Fine-tuning:

    1. additional pre-training

    2. Task-specific tuning/domain-specific tuning

    3. LoRA (Low-Rank Adaptation)

    4. QLoRA (Quantized Low-Rank Adaptation)

    5. Prefix tune

    6. etc.
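To make the RAG idea above concrete, here is a deliberately toy sketch: retrieve the most relevant document for a question, then prepend it to the prompt. Everything here is hypothetical illustration - real systems use embedding models and a vector store rather than word overlap, and `build_prompt`'s output would be sent to an actual LLM:

```python
# Toy RAG sketch: word-overlap retrieval plus prompt assembly.
def score(question: str, doc: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(question.lower().split()) & set(doc.lower().split()))

def retrieve(question: str, corpus: list[str]) -> str:
    """Pick the single highest-scoring document from the corpus."""
    return max(corpus, key=lambda doc: score(question, doc))

def build_prompt(question: str, corpus: list[str]) -> str:
    """Ground the model by prepending the retrieved context to the question."""
    context = retrieve(question, corpus)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

corpus = [
    "Amazon Bedrock offers foundation models as a managed service.",
    "LLaMA 2 70B was trained on about 1.7M A100 GPU hours.",
]
prompt = build_prompt("How many GPU hours did LLaMA 2 70B training take?", corpus)
print(prompt)
```

The payoff is that the model answers from your knowledge source instead of only its training data, which improves accuracy and factual grounding without retraining.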

Conclusion:

The decision of whether to build a foundation model in-house, fine-tune a pre-trained model, or consume an external API depends on many factors including cost, staffing, risk tolerance, IP ownership, and available data. This article highlighted the investments required for both building and licensing foundation model services to accomplish certain tasks. However, the race to develop improved foundation models is far from over, as companies continue building ever-larger and more accurate models with greater efficiencies. Organizations must weigh factors such as open sourcing versus commercialization when evaluating newly developed models. Ultimately, leaders need to carefully consider the tradeoffs around building, buying, or subscribing in order to determine the best path forward for their foundation model needs.
