Just as humans communicate using words, Large Language Models (LLMs) communicate using tokens. A token represents a small unit of text, roughly ¾ of an English word on average, depending on language and encoding. For example, a 1,000-word document may translate to 1,300–1,400 tokens when processed by an LLM. As enterprise adoption of GenAI accelerates, one crucial factor is often overlooked: token efficiency. Frontier models such as GPT, Gemini, and Claude charge based on the total input + output tokens per request.
In other words: Tokens have become the new currency of AI.
To keep AI budgets predictable, organizations must learn to optimize token usage without compromising quality. Below are three key pillars for reducing token consumption:
Pillar 1: Prompt Optimization: Reduce Iterations and Output Overhead
Strengthen Prompting Skills — Reduce Iterations, Save Tokens
Many users assume prompting is the same as everyday conversation. It isn’t.
Prompting is structured communication: clarity and precision reduce unnecessary follow-up queries, directly lowering token usage.
A practical framework is RISEN, which adds discipline to prompting:
- R — Role
- I — Instructions
- S — Steps
- E — Expectations
- N — Narrowing (Nuances)
Example Prompt: Recipe Generation
Role: Act as a beginner-friendly home cook.
Instructions: Create a recipe for chocolate chip cookies using basic pantry ingredients.
Steps:
- List ingredients with measurements.
- Describe mixing and baking steps.
- Add tips for common mistakes.
Expectations: Keep the output under 300 words and suitable for a 10-year-old.
Narrowing: Max 20 minutes prep time; exclude nuts and advanced tools.
This structure consistently yields clear, high-quality results with fewer revisions, reducing token usage.
Other frameworks like RTF (Role, Task, Format) and CoT (Chain-of-Thought prompting) are equally useful. The fewer the iterations, the fewer the tokens and the lower your GenAI cost.
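The RISEN structure above lends itself to a simple template. The sketch below, with an illustrative helper name and string layout (not part of the framework itself), assembles the five parts into one prompt so every request follows the same disciplined shape:

```python
# Sketch: assembling a RISEN prompt from its five parts.
# The function name and exact string layout are assumptions for illustration.
def risen_prompt(role, instructions, steps, expectations, narrowing):
    step_lines = "\n".join(f"- {s}" for s in steps)
    return (
        f"Role: {role}\n"
        f"Instructions: {instructions}\n"
        f"Steps:\n{step_lines}\n"
        f"Expectations: {expectations}\n"
        f"Narrowing: {narrowing}"
    )

prompt = risen_prompt(
    role="Act as a beginner-friendly home cook.",
    instructions="Create a recipe for chocolate chip cookies using basic pantry ingredients.",
    steps=[
        "List ingredients with measurements.",
        "Describe mixing and baking steps.",
        "Add tips for common mistakes.",
    ],
    expectations="Keep the output under 300 words and suitable for a 10-year-old.",
    narrowing="Max 20 minutes prep time; exclude nuts and advanced tools.",
)
```

Standardizing prompts this way keeps every request complete on the first attempt, which is exactly where the iteration savings come from.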
Use Few-Shot Examples Sparingly
Few-shot prompting is powerful but token-expensive.
Reduce token usage by:
- Providing one high-quality example instead of several
- Using zero-shot prompting with clear rules
- Letting the model infer patterns through descriptions instead of examples
Savings: Hundreds of tokens per call
Use Strict Output Formats to Avoid Regeneration
Ambiguous output leads to retries, which doubles token usage.
Add constraints like:
- “Return valid JSON only.”
- “Follow this exact schema.”
- “Respond in bullet points under 100 tokens.”
Fewer retries → fewer tokens.
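One way to enforce "valid JSON only" in practice is to validate the reply before accepting it, capping retries at one. This is a sketch: `call_model` is a placeholder for your actual LLM client, and the required keys are an example schema, not a fixed API.

```python
import json

# Sketch: validate model output before accepting it, so a malformed reply
# triggers at most one retry instead of an open-ended back-and-forth.
# `call_model` is a placeholder for your actual LLM client function.
REQUIRED_KEYS = {"name", "age", "city"}  # example schema

def parse_strict_json(raw, required=REQUIRED_KEYS):
    """Return the parsed object, or None if it violates the schema."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not required.issubset(obj):
        return None
    return obj

def ask_with_one_retry(call_model, prompt):
    for attempt in range(2):  # original call + at most one retry
        result = parse_strict_json(call_model(prompt))
        if result is not None:
            return result
    raise ValueError("Model did not return valid JSON after one retry")
```

Bounding retries turns "regenerate until it looks right" into a fixed, predictable token cost per request.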
Pillar 2: Payload Optimization — Minimize Input Tokens
Move Beyond JSON — Explore Compact Data Formats like TOON and VSC
JSON is widely used for structured data, but it is token-heavy due to repetition of keys, quotes, and formatting. LLMs do not require human-friendly syntax; they only need machine-readable structure. This has led to compact, emerging formats such as TOON and VSC, designed to minimize token usage.
Example: Same Data in Three Formats
JSON (Most Token Heavy)
{
  "name": "Arun",
  "age": 34,
  "city": "Chennai",
  "skills": ["Python", "SQL", "Azure"]
}
TOON (Token-Oriented Object Notation)
name: Arun
age: 34
city: Chennai
skills: Python, SQL, Azure
VSC (Very Simple Compact)
Arun;34;Chennai;Python|SQL|Azure
Formats like VSC can reduce tokens by 30–70% for simple payloads.
However, TOON and VSC are still evolving and may not support deeply nested or complex structures yet.
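The flattening itself is trivial to do in code. The sketch below converts the record above into the VSC-style layout and estimates the saving; the ~4-characters-per-token ratio is a rough rule of thumb for English text, not an exact tokenizer count:

```python
import json

# Sketch: flatten a simple record into the compact VSC-style layout shown
# above and estimate the token saving with a rough heuristic.
record = {"name": "Arun", "age": 34, "city": "Chennai",
          "skills": ["Python", "SQL", "Azure"]}

def to_vsc(rec):
    """Join values with ';', and list items with '|', dropping all keys."""
    parts = []
    for value in rec.values():
        if isinstance(value, list):
            parts.append("|".join(map(str, value)))
        else:
            parts.append(str(value))
    return ";".join(parts)

def rough_tokens(text):
    return max(1, len(text) // 4)  # heuristic: ~4 characters per token

json_text = json.dumps(record)
vsc_text = to_vsc(record)  # "Arun;34;Chennai;Python|SQL|Azure"
saving = 1 - rough_tokens(vsc_text) / rough_tokens(json_text)
```

Note the trade-off this makes explicit: the compact form drops the keys, so both sides must agree on field order in advance, which is why such formats suit flat, well-known payloads rather than deeply nested ones.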
Use Model Context Windows Wisely (Avoid Overfeeding History)
LLMs charge for every token, including conversation history. A common mistake is passing the entire previous conversation into every new request.
How to optimize:
- Pass only the minimum required context.
- Replace long transcripts with summaries.
- Store long-term memory externally (e.g., vector database) and retrieve only relevant chunks.
Example:
Instead of sending a 2,000-word meeting transcript, compress it into a 200-token summary. Token reduction: 70–90%
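A minimal version of this discipline is a token budget on history: keep only the most recent messages that fit, and let everything older live in a stored summary or external memory. The sketch below uses the same ~4-characters-per-token heuristic; the message shape (`{"content": ...}`) is an assumption for illustration:

```python
# Sketch: cap the conversation history sent with each request. Older turns
# are dropped (in practice they would be replaced by a stored summary or
# retrieved from external memory); only the most recent messages that fit
# the budget are kept.
def rough_tokens(text):
    return max(1, len(text) // 4)  # heuristic: ~4 characters per token

def trim_history(messages, budget_tokens=500):
    """Keep the most recent messages whose combined size fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):          # walk newest-first
        cost = rough_tokens(msg["content"])
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))             # restore chronological order
```

With a fixed budget, per-request input cost stops growing with conversation length.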
Offload Computation to Code Instead of the Model
LLMs should not perform tasks that normal programming handles efficiently:
- Sorting
- Filtering
- Basic math
- Data reformatting
These operations bloat prompts and consume unnecessary tokens.
Example:
Don’t send a 300-token table to sort; sort it in code and pass only the final result.
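As a concrete sketch of this idea, the filtering and sorting below happen in ordinary code at zero token cost, and only the short result reaches the prompt (`rows` stands in for the 300-token table mentioned above):

```python
# Sketch: sort and filter locally, then pass only the compact result to the
# model instead of the full table.
rows = [
    {"name": "Arun", "score": 88},
    {"name": "Meera", "score": 95},
    {"name": "Vikram", "score": 72},
]

# Done in code: filtering and sorting cost zero tokens.
top = sorted((r for r in rows if r["score"] >= 80),
             key=lambda r: r["score"], reverse=True)

# Only this short summary goes into the prompt.
summary = ", ".join(f'{r["name"]} ({r["score"]})' for r in top)
prompt = f"Write one congratulatory sentence for: {summary}"
```

The model now sees a handful of tokens of prepared data and spends its capacity on the part only it can do: the language.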
Pillar 3: Execution Optimization — System-Level Efficiencies
These techniques ensure your AI architecture and workflow make optimal use of tokens.
Use Smaller Models for Non-Critical Tasks
Not every task requires GPT-4, Claude Opus, or Gemini Ultra.
Use lighter models when:
- Tasks are routine: extraction, classification, summarization
- High reasoning or creativity isn’t required
- You need high-frequency or real-time responses
Using a lighter model directly reduces cost: the larger the model, the more expensive each token.
Example:
Use GPT-4o mini / Gemini Flash for data extraction instead of a premium model.
Cost savings can reach up to 95%.
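In practice this becomes a routing rule in front of your LLM calls. The sketch below is illustrative: the task categories and model names are placeholders for whatever tiers your provider offers, not a fixed mapping.

```python
# Sketch: route routine tasks to a cheaper model tier. The categories and
# model names here are illustrative placeholders.
ROUTES = {
    "extraction":     "gpt-4o-mini",  # routine -> light model
    "classification": "gpt-4o-mini",
    "summarization":  "gpt-4o-mini",
    "reasoning":      "gpt-4o",       # complex -> premium model
}

def pick_model(task_type):
    """Default to the light model unless the task needs deep reasoning."""
    return ROUTES.get(task_type, "gpt-4o-mini")
```

Defaulting to the cheap tier and escalating only when needed inverts the common pattern of sending everything to the premium model.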
Cache Intermediate Results (LLM Caching)
Many enterprise prompts repeat the same supporting information:
- Long product descriptions
- User profiles
- Company policies
- Guardrails and instructions
Cache results and reuse them instead of re-sending.
Cache hit rates of 60–80% can dramatically reduce recurring token costs.
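The core mechanism is small: key responses by a hash of the prompt and pay for the tokens once. This is a minimal in-process sketch; a production system would use a shared store such as Redis plus an expiry policy, and `call_model` is again a placeholder for your LLM client.

```python
import hashlib

# Sketch: a simple in-process response cache keyed by a hash of the prompt.
# Identical prompts are answered once; repeats are served from the cache
# at zero token cost.
_cache = {}

def cached_call(call_model, prompt):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)  # pay for tokens once
    return _cache[key]                    # repeats are free
```

Hashing the prompt rather than storing it verbatim keeps cache keys short and uniform regardless of prompt length.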
Conclusion: Token Optimization Is Now a Core AI Competency
As GenAI becomes deeply embedded in enterprise workflows, token efficiency is no longer a technical detail; it is a strategic differentiator. Organizations that master token optimization will deploy AI faster, smarter, and significantly cheaper than those who don't. Just as cloud cost governance evolved into a discipline, token governance will define the next era of AI maturity.
By refining prompts, adopting compact formats, choosing the right model sizes, and eliminating unnecessary computation, enterprises can unlock maximum value at minimum cost. In a world where every token counts, optimization isn't optional; it's a competitive advantage.
Shammy Narayanan is the Vice President of Platform, Data, and AI at Welldoc. Holding 11 cloud certifications, he combines deep technical expertise with a strong passion for artificial intelligence. With over two decades of experience, he focuses on helping organizations navigate the evolving AI landscape. He can be reached at shammy45@gmail.com
