How to Continue a Chat Session with Ollama in Your App

Learn how to maintain conversation context across multiple turns when integrating Ollama's chat API into your application, so that multi-turn dialogues keep their history from one request to the next.

How Ollama's Chat API Handles Message History

Understanding the /api/chat Endpoint Structure

POST /api/chat
Content-Type: application/json

{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "What's the weather like today?"
    }
  ],
  "stream": false
}

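This request represents a single-turn conversation. The model field selects the model, messages holds the conversation history, and stream controls whether the reply arrives as one JSON object (false) or as a sequence of partial chunks (true, the default). The endpoint is stateless: Ollama does not remember earlier requests, so context is maintained only by sending the full history in the messages array with every call.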

The Role of the Messages Array in Maintaining Context

{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "How do I declare a variable in JavaScript?"
    },
    {
      "role": "assistant",
      "content": "You can declare variables using let, const, or var. Use const for values that won't change, let for values that will, and var for older code compatibility."
    },
    {
      "role": "user",
      "content": "What's the difference between let and const?"
    }
  ]
}

Because the earlier turns are sent in order, the model can interpret the final question against its own previous answer. In application code, this amounts to keeping one array and appending both sides of every exchange:

const messages = [];

function addUserMessage(content) {
  messages.push({ role: "user", content });
}

function addAssistantMessage(content) {
  messages.push({ role: "assistant", content });
}

async function sendMessage(userInput) {
  addUserMessage(userInput);
  
  const response = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3.2",
      messages: messages,
      stream: false
    })
  });
  
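  if (!response.ok) {
    // Remove the unanswered user message so a retry does not duplicate it
    messages.pop();
    throw new Error(`Ollama request failed with status ${response.status}`);
  }

  const data = await response.json();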
  addAssistantMessage(data.message.content);
  return data.message.content;
}
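
The sendMessage function above sets stream to false, so the whole reply arrives in data.message.content. With streaming enabled, the reply is split across newline-delimited JSON objects, and the text stored in the history should be the concatenation of the partial message.content values. The following is a minimal sketch of that step; collectStreamedReply is a hypothetical helper name, and it assumes the streamed body has already been read into a string of complete lines.

```javascript
// Joins the partial message.content fields of a streamed /api/chat response.
// Assumption: ndjsonText holds complete lines, one JSON object per line.
function collectStreamedReply(ndjsonText) {
  let reply = "";
  for (const line of ndjsonText.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines between objects
    const chunk = JSON.parse(line);
    if (chunk.message && typeof chunk.message.content === "string") {
      reply += chunk.message.content;
    }
    if (chunk.done) break; // the object with done: true ends the reply
  }
  return reply;
}

// Example with a hand-written stream of three objects
const sampleStream = [
  '{"message":{"role":"assistant","content":"Hel"},"done":false}',
  '{"message":{"role":"assistant","content":"lo"},"done":false}',
  '{"message":{"role":"assistant","content":""},"done":true}'
].join("\n");

console.log(collectStreamedReply(sampleStream));
```

A real streaming client would more likely read response.body incrementally and display each chunk as it arrives. Either way, only the joined text should be appended to the history as a single assistant message.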

Context Window Limitations and Token Management

As the conversation grows, the messages array consumes more tokens. A long dialogue can reach the model's context limit quickly, especially with verbose responses.

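const MAX_MESSAGES = 10;

function getMessagesForRequest() {
  if (messages.length <= MAX_MESSAGES) {
    return messages;
  }
  // Keep system prompt + last N non-system messages.
  // Filtering first prevents the system prompt from appearing twice
  // when it also falls inside the most recent N entries.
  const systemMsg = messages.find(m => m.role === "system");
  const recent = messages
    .filter(m => m.role !== "system")
    .slice(-MAX_MESSAGES);
  return systemMsg ? [systemMsg, ...recent] : recent;
}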

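import { get_encoding } from "tiktoken";

// Assumes num_ctx has been raised to about 128K for this model;
// Ollama's default context length is typically much smaller.
const MAX_TOKENS = 120000; // Leave room for response

// tiktoken ships OpenAI encodings only and does not recognize "llama3.2",
// so cl100k_base is used here as an estimate, not an exact Llama token count.
const encoder = get_encoding("cl100k_base");

function truncateToTokenLimit(allMessages) {
  let tokens = 0;
  const truncated = [];

  // Process in reverse (newest first)
  for (let i = allMessages.length - 1; i >= 0; i--) {
    // +4 is a rough allowance for per-message formatting overhead
    const msgTokens = encoder.encode(allMessages[i].content).length + 4;
    if (tokens + msgTokens > MAX_TOKENS) break;
    truncated.unshift(allMessages[i]);
    tokens += msgTokens;
  }

  return truncated;
}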
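
A token budget such as MAX_TOKENS above is only meaningful if it stays below the context length Ollama actually applies to the request. That length can be set per request through the num_ctx entry of the options object. The sketch below shows one way to build such a request body; buildChatBody and its numCtx parameter are hypothetical names, and 8192 is an arbitrary example value, not a recommendation.

```javascript
// Builds a non-streaming /api/chat request body with an explicit context length.
// numCtx is a hypothetical parameter name; it is passed through as the
// num_ctx option of the request.
function buildChatBody(model, history, numCtx) {
  return {
    model,
    messages: history,
    stream: false,
    options: { num_ctx: numCtx }
  };
}

const exampleBody = buildChatBody(
  "llama3.2",
  [{ role: "user", content: "Hi" }],
  8192
);
console.log(JSON.stringify(exampleBody, null, 2));
```

Larger context lengths increase memory use, so the value is usually chosen together with the truncation budget rather than set to the model's maximum by default.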

Summarizing older messages into a brief note before discarding them is another approach. It preserves essential context while staying within token limits.
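
A minimal sketch of the summarization approach follows. It is split into two pure steps: building the messages that ask the model for a summary, and replacing the older part of the history with the resulting note. Both function names, the prompt wording, and the keepRecent parameter are assumptions made for illustration. The sketch also assumes the history contains only user and assistant messages, with any system prompt added separately.

```javascript
// Builds the messages for a request that asks the model to summarize
// the older part of a conversation. The prompt wording is an assumption.
function buildSummaryRequest(olderMessages) {
  const transcript = olderMessages
    .map(m => `${m.role}: ${m.content}`)
    .join("\n");
  return [
    {
      role: "system",
      content: "Summarize the conversation below in a few sentences. " +
        "Keep names, decisions, and open questions."
    },
    { role: "user", content: transcript }
  ];
}

// Returns a new history in which everything except the last keepRecent
// messages is replaced by a single summary note.
function applySummary(history, summaryText, keepRecent) {
  if (history.length <= keepRecent) return history.slice();
  const recent = history.slice(history.length - keepRecent);
  const note = {
    role: "system",
    content: `Summary of earlier conversation: ${summaryText}`
  };
  return [note, ...recent];
}
```

In use, the output of buildSummaryRequest would be sent to /api/chat like any other messages array, and the reply text passed to applySummary as summaryText. The summary itself costs one extra model call, so it is typically triggered only when the history approaches the token budget.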

The appropriate strategy depends on your use case. For a simple chatbot, truncation works well. For applications where users refer back to earlier parts of the conversation, such as document analysis tools, token-aware management or summarization is advisable.

Include a system message with any behavioral instructions at the start of the array. It remains constant across requests and influences the model's persona, but it also counts toward the token budget.
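
As a small illustration, the helper below places a system message at the start of a history and makes sure it appears exactly once. The name withSystemPrompt and the example prompt text are hypothetical.

```javascript
// Returns a new history whose first entry is the system message.
// Existing system messages are dropped so the prompt appears exactly once.
function withSystemPrompt(history, systemPrompt) {
  const dialogue = history.filter(m => m.role !== "system");
  return [{ role: "system", content: systemPrompt }, ...dialogue];
}

const prepared = withSystemPrompt(
  [{ role: "user", content: "How do I declare a variable in JavaScript?" }],
  "You are a concise programming tutor."
);
console.log(prepared.map(m => m.role).join(", "));
```

Applying the helper just before each request, rather than storing the system message in the history, keeps it out of reach of any truncation step that trims old messages.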

Implementing Session Continuation in Your App

Appending New Messages to the Existing Conversation

The straightforward approach is to keep a message array on the server, appending each new user input and the model's response. When a user sends a message, add it to the history, call the Ollama API with the full array, then store the assistant's reply back into the same array.

// In-memory session storage (use Redis or a database in production)
const sessions = new Map();

async function handleChat(sessionId, userMessage) {
  // Get or initialize session history
  if (!sessions.has(sessionId)) {
    sessions.set(sessionId, []);
  }
  
  const history = sessions.get(sessionId);
  
  // Append user's new message
  history.push({ role: 'user', content: userMessage });
  
  // Call Ollama with full conversation history
  const response = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3.2',
      messages: history,
      stream: false
    })
  });
  
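  if (!response.ok) {
    // Roll back the unanswered user message so a retry does not duplicate it
    history.pop();
    throw new Error(`Ollama request failed with status ${response.status}`);
  }

  const data = await response.json();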
  
  // Append assistant's response to history
  history.push({ role: 'assistant', content: data.message.content });
  
  return data.message.content;
}

This pattern works for simple applications, but the history array grows with every exchange, and long conversations eventually hit the context window limit. Two common remedies are a sliding window that retains only the most recent N messages and a summary that condenses older exchanges into a short note.

Managing Session State on the Client Side

For frontend clients, you need a strategy to store session data. localStorage is sufficient for quick demos, but production applications typically use more robust storage such as IndexedDB, or synchronize the state with a backend service.
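
For the quick-demo case, a minimal sketch of persisting the message history might look as follows. The functions take the storage object as a parameter, so the same code works with window.localStorage in a browser or with any object that offers getItem and setItem. The function names and the key prefix are hypothetical.

```javascript
const KEY_PREFIX = "chat-session:"; // hypothetical key prefix

// Serializes the history and stores it under a per-session key.
function saveSession(storage, sessionId, history) {
  storage.setItem(KEY_PREFIX + sessionId, JSON.stringify(history));
}

// Loads a stored history, falling back to an empty conversation
// when nothing is stored or the stored value cannot be parsed.
function loadSession(storage, sessionId) {
  const raw = storage.getItem(KEY_PREFIX + sessionId);
  if (raw === null) return [];
  try {
    const parsed = JSON.parse(raw);
    return Array.isArray(parsed) ? parsed : [];
  } catch {
    return []; // corrupted entry: start a fresh conversation
  }
}
```

In a browser this would be called as saveSession(window.localStorage, "demo", messages) after each exchange and loadSession(window.localStorage, "demo") on page load. localStorage holds only strings and has a limited quota, which is one reason longer histories are better kept in IndexedDB or on a backend.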