
🔧 Stream LLM Responses from Cache


Category: 🔧 Programming
🔗 Source: dev.to

LLMs become more expensive as your app consumes more tokens. Portkey's AI gateway lets you cache LLM responses and serve users from the cache to cut costs. The best part: cached responses can now be streamed.

Streams are an efficient way to work with large responses because:

  • They reduce the perceived latency for users of your app.
  • Your app doesn't have to buffer the entire response in memory.

Let's walk through how to deliver cached responses to your app as a stream, chunk by chunk. Every time Portkey serves a request from the cache, you save on token costs.

With streaming and caching enabled, we will make a chat completion call to OpenAI through Portkey.

Import and instantiate the Portkey client.

import Portkey from "portkey-ai";

// Instantiate the Portkey client with semantic caching enabled.
const portkey = new Portkey({
  apiKey: process.env.PORTKEYAI_API_KEY,    // your Portkey API key
  virtualKey: process.env.OPENAI_API_KEY,   // virtual key referencing the provider credentials stored in Portkey's vault
  config: {
    cache: {
      mode: "semantic"
    }
  }
});
  • apiKey: Sign up for Portkey and copy your API key.
  • virtualKey: Securely store your provider key in the vault and reference it with a Virtual Key.
  • config: Pass configurations to enable caching.
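
If you prefer exact-match caching over semantic matching, the same cache block accepts a different mode. Here is a minimal sketch, assuming the simple cache mode and a max_age field (in seconds) from Portkey's config format; check the Portkey docs for the exact field names:

const portkeyExactCache = new Portkey({
  apiKey: process.env.PORTKEYAI_API_KEY,
  virtualKey: process.env.OPENAI_API_KEY,
  config: {
    cache: {
      mode: "simple",   // exact-match caching instead of semantic matching
      max_age: 3600     // assumed field: expire cached entries after one hour
    }
  }
});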

Our app will list the tasks to help with planning a birthday party.

// Prompt asking the model to break party planning down into tasks.
const messages = [
    {
        role: "system",
        content: "You are a very good program manager and have organised many events before. You break every task into simple steps so that others can pick them up.",
    },
    {
        role: "user",
        content: "Help me plan a birthday party for my 8 yr old kid?",
    },
];

Portkey follows the same signature as OpenAI's SDK, so enabling streamed responses is as simple as passing the stream: true option.

try {
    // Request a chat completion with streaming enabled; cached responses are streamed too.
    const response = await portkey.chat.completions.create({
        messages,
        model: "gpt-3.5-turbo",
        stream: true,
    });
    // Write each chunk to the console as soon as it arrives.
    for await (const chunk of response) {
        process.stdout.write(chunk.choices[0]?.delta?.content || "");
    }
} catch (error) {
    console.error("Request failed:", error);
}

You can iterate over the response object, processing each chunk and presenting it to the user as soon as it arrives.
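
If you also need the full text once the stream finishes, for example to store it or render it in one piece, you can accumulate the chunks while printing them. A minimal sketch, written as a variant of the loop above; the chunks array is just an illustrative name:

const chunks = [];
for await (const chunk of response) {
    const delta = chunk.choices[0]?.delta?.content || "";
    chunks.push(delta);            // keep each piece for later
    process.stdout.write(delta);   // show it to the user immediately
}
const fullText = chunks.join("");  // the complete response once the stream ends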

Here's a tip: you can force a fresh completion (skipping a cache hit) by passing cacheForceRefresh in the request options.

const response = await portkey.chat.completions.create({
    messages,
    model: "gpt-3.5-turbo",
    stream: true,
}, {
    cacheForceRefresh: true   // bypass the cache and fetch a fresh completion
});
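
In practice this is useful when a user explicitly asks to regenerate an answer and you want a fresh completion instead of the cached one.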

Combined with caching, streaming gives your users a smoother experience while keeping your app's memory usage efficient.

Put this into practice today!
