Google has rolled out a significant cost-saving enhancement for its Gemini API, introducing implicit caching for its Gemini 2.5 Pro and Gemini 2.5 Flash models.
This ‘always on’ system automatically identifies and reuses common prefixes in API requests, lowering costs for developers by up to 75% on repetitive prompt data and passing the savings directly to users without requiring any manual cache setup.
The initiative aims to make using Google’s powerful generative AI models more economically accessible, particularly for applications that repeatedly process large, recurring context, such as extensive system instructions or lengthy documents.
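As a rough illustration of how the headline figure plays out in practice, the sketch below assumes a purely hypothetical per-token input price and a 75% discount on the cached portion of a prompt; the numbers are illustrative, not Google’s published pricing.

```python
# Back-of-the-envelope savings estimate for implicit caching.
# PRICE_PER_TOKEN and the 75% discount on cached tokens are assumptions for
# illustration only; consult Google's Gemini API pricing for real figures.

PRICE_PER_TOKEN = 1.0e-6   # hypothetical input price (USD per token)
CACHED_DISCOUNT = 0.75     # headline "up to 75%" discount on cached tokens

prefix_tokens = 50_000     # stable system instructions + long document
question_tokens = 500      # per-request, user-specific portion

full_cost = (prefix_tokens + question_tokens) * PRICE_PER_TOKEN
cached_cost = (prefix_tokens * PRICE_PER_TOKEN * (1 - CACHED_DISCOUNT)
               + question_tokens * PRICE_PER_TOKEN)

print(f"without cache hit: ${full_cost:.4f}")
print(f"with cache hit:    ${cached_cost:.4f}")
print(f"savings:           {1 - cached_cost / full_cost:.1%}")  # ~74% in this setup
```

The larger the share of the prompt taken up by the repeated prefix, the closer the per-request savings approach the 75% ceiling.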
This new automated feature complements the existing explicit caching mechanism, which Google first introduced in May 2024. While explicit caching offers a path to guaranteed cost reductions, it requires developers to manually configure and manage the cached content. Implicit caching, by contrast, operates without direct intervention; Google notes that it passes “cache cost savings directly to developers without the need to create an explicit cache.”
To optimize for these automatic savings, Google advises developers to structure their prompts by placing stable, common content at the beginning, followed by variable elements like user-specific questions.
The company also specified minimum token counts for a request to be eligible for implicit caching: 1,024 tokens for Gemini 2.5 Flash and 2,048 tokens for Gemini 2.5 Pro. Developers using Gemini 2.5 models will now see a `cached_content_token_count` in the API response’s usage metadata, indicating how many tokens were served from the cache and billed at the reduced rate.
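A minimal sketch of this prompt layout, assuming the google-genai Python SDK and a `GEMINI_API_KEY` environment variable; the placeholder document text, the `ask` helper, and the example question are illustrative, not code from Google’s announcement.

```python
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Stable, repeated content goes first so successive requests share a common
# prefix the implicit cache can recognize (>= 1,024 tokens on 2.5 Flash,
# >= 2,048 tokens on 2.5 Pro, per the announced minimums).
COMMON_PREFIX = (
    "You are a contract-review assistant. Apply the firm's style guide.\n\n"
    "<full text of the lengthy reference document goes here>"
)

def ask(question: str) -> str:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=f"{COMMON_PREFIX}\n\nQuestion: {question}",  # variable part last
    )
    usage = response.usage_metadata
    # cached_content_token_count reports how many prompt tokens were served
    # from the cache and billed at the reduced rate (may be 0/None on a miss).
    print("cached tokens:", usage.cached_content_token_count)
    print("prompt tokens:", usage.prompt_token_count)
    return response.text

print(ask("Which clauses govern early termination?"))
```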
This move is a direct response to developer feedback on the complexities and sometimes unexpected costs of the earlier manual caching system for Gemini 2.5 Pro.
How Implicit and Explicit Caching Compare
The official Gemini API documentation further clarifies that implicit caching is enabled by default on these models and requires no setup from developers. Beyond prompt structure, sending requests with similar prefixes in quick succession also increases the likelihood of a cache hit.
For scenarios that demand guaranteed cost savings, the explicit caching API remains a viable option and supports both Gemini 2.5 and 2.0 models. This method lets users define the specific content to be cached and set a time to live (TTL) that dictates the storage duration, with a default applied if none is specified. Billing for explicit caching depends on the number of cached tokens and the chosen TTL. As Google AI for Developers explains, “At certain volumes, using cached tokens is lower cost than passing in the same corpus of tokens repeatedly.”
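For comparison, here is a sketch of the explicit route, again assuming the google-genai Python SDK; the display name, TTL value, placeholder document, and example prompt are illustrative choices.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Explicitly cache the large, reusable context once, with a chosen TTL.
# The cached content must meet the model's minimum token requirement.
cache = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        display_name="contract-review-context",  # illustrative name
        system_instruction="You are a contract-review assistant.",
        contents=["<full text of the lengthy reference document goes here>"],
        ttl="3600s",  # storage duration; billing depends on cached tokens and TTL
    ),
)

# Subsequent requests reference the cache instead of resending the corpus.
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Summarize the termination clauses.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```

Unlike the implicit path, this approach trades setup and cache-management effort for a guaranteed discount on the cached tokens.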
Contextualizing Cost-Saving Measures in AI
Google’s introduction of implicit caching reflects a broader industry-wide effort to enhance the efficiency and reduce the financial barriers associated with deploying large-scale AI models.
Other companies are tackling these challenges from various angles. IBM Research, for example, recently introduced its Bamba-9B-v2 model, a hybrid transformer-SSM architecture designed to address the computational demands of traditional transformers, particularly through KV cache reduction. Raghu Ganti from IBM highlighted that for Bamba, “Everything comes back to the KV cache reduction… More throughput, lower latency, longer context length.”
In the realm of training efficiency, Alibaba’s ZeroSearch framework offers a method to train LLMs for information retrieval by simulating search engine interactions, which, according to a scientific paper, can cut API-related training costs by up to 88%. This approach, however, requires GPU servers to run the simulation.
Another efficiency strategy comes from Rice University and xmad.ai, whose DFloat11 technique provides roughly 30% lossless compression of LLM weights. This method focuses on reducing model memory requirements without altering the output, a crucial factor for applications where bit-for-bit accuracy is paramount, thereby avoiding the “complexities that some end-users would prefer to avoid” associated with lossy quantization.
Furthering KV Cache Optimization and Future Directions
Sakana AI has also contributed to memory optimization with its Neural Attention Memory Models (NAMMs), designed to improve transformer efficiency by up to 75%. NAMMs dynamically prune less critical tokens from the KV cache during inference, which is particularly beneficial for managing long context windows. The system uses neural networks trained via evolutionary optimization, an approach the Sakana AI researchers describe as follows: “Evolution inherently overcomes the non-differentiability of our memory management operations, which involve binary ‘remember’ or ‘forget’ outcomes.”
While Google claims up to 75% cost savings with its new implicit caching, third-party verification of these figures is not yet available, and actual savings will vary depending on specific usage patterns.
The earlier manual caching system drew criticism for sometimes being difficult to use and occasionally leading to higher-than-expected costs. With those considerations in mind, the automated nature of implicit caching is a clear step toward simpler cost management for developers building on Gemini. Opentools described the feature as “groundbreaking.”