If you plan to or have already started your AI transformation journey for your business or products, it is key to note the possible cost pitfalls your business might encounter while implementing Generative AI solutions.
In this article, I will outline those pitfalls and how you can achieve AI transformation at an optimal cost for your business.
Understanding Gen AI Costing Model
For large language models (LLMs), interactions are often served via Application Programming Interfaces (APIs) provided by several vendors such as OpenAI, Anthropic, Amazon Bedrock, Groq, and others.
- This API approach has democratized access to these AI models, making it easy for businesses to integrate them into their solutions. However, understanding how these API calls are priced is essential to avoid cost pitfalls.
- Messages sent to and from these LLMs are converted into input and output tokens. Tokenization is usually one of the first pre-processing steps performed during Natural Language Processing (NLP) by an AI model. It involves breaking down sentences into sub-units called tokens before other processes like stemming and lemmatization.
- AI providers offer pricing plans per million tokens (mTok).
For example, as of now, OpenAI’s GPT-4 model API calls cost $2.5/mTok for input tokens and $10/mTok for output tokens. Output tokens generally cost more than input tokens because LLMs utilize significantly more computing resources when generating responses.
Cost-Saving Methods
1. Caching
If you are building your AI solution internally, leveraging the LLM’s built-in prompt or context caching is highly effective. OpenAI, for instance, automatically implements caching for input tokens exceeding 1024 tokens. Other providers like Anthropic and Google also have caching mechanisms detailed in their API documentation.
Caching means that if 10 users ask your AI assistant or chatbot the same question, the response from the first request can be reused for the other 9 users. This approach eliminates the need for 9 additional API calls, thereby significantly reducing costs.
Additionally, you can enhance caching with a distributed framework like Redis. Alternatively, using an AI gateway such as Cloudflare’s AI Gateway at the edge of your deployment provides built-in caching and rate-limiting functionalities.
2. Better Prompting
Optimizing your base prompts using prompt fine-tuning tools can substantially reduce input token length.
Furthermore, depending on your use case, always include rules in your input prompts instructing the LLM to summarize its output in the most concise and understandable manner.
3. Do Not Reinvent the Wheel
If a solution already exists and is affordable, there is no need to rebuild it. Small and medium-sized businesses aiming to adopt Generative AI quickly, especially without strong engineering teams, should explore tools like Amazon Q (AWS), Google’s NotebookLM, or Microsoft’s Co-pilot. These tools lower the barrier to entry and deliver quick wins.
For businesses with highly sensitive data requirements, consider implementing on-premise AI solutions like AnythingLLM with Ollama. This allows interaction with a Small Language Model (SLM) running on a CPU within your office, without requiring internet access. Although SLMs offer a lower level of intelligence compared to LLMs, they are often sufficient for most business use cases and come at a much lower infrastructure cost.
Conclusion
Optimizing costs for your AI transformation project is critical to its success. Cost-effective solutions not only make implementation more feasible but also help win over stakeholders in organizations where resistance may hinder adoption.
Explore the available tools to accelerate adoption and always review the data usage policies of the solutions you choose.