igardev edited this page Jun 2, 2026 · 39 revisions

Version 0.0.48 is released (01.06.2026)

What is new

Thanks to @danielrobbins we have the following improvements:

  • Fix chat context budgeting and overflow handling
    This is the main fix. The extension now behaves as a modern llama.cpp client: it trusts and uses the live runtime information the server exposes, including calls to the server for real token counts when possible. It uses the full server API surface and fails correctly when the budget is wrong.

  • Fix empty streamed chat responses after compaction
    This fixes a streaming bug where a valid final response could be dropped and show up as empty.

  • Polish empty-response diagnostics UI
    This improves the user-facing error message and logging when the provider gets an empty response (a clear reason rather than a generic error).

  • Handle reasoning-only empty chat responses
    This handles the case where a reasoning model uses up the whole response budget internally and returns no visible text, which previously caused an error.

  • Raise chat output cap to quarter context window
    This replaces an overly small fixed output limit with a more reasonable cap based on the model’s context size: the output cap is 0.25 of the context window, which is reasonable as an emergency failsafe.

  • Log per-turn chat token usage
    This adds per-turn token usage logging so it is easier to see what each request and response is actually consuming.

  • Fix shared-context model metadata display
    This fixes the context size shown in the VS Code model UI so shared-context llama.cpp models no longer appear to have roughly double their real window size (the original PR fixed this to not report a 12K context window).
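
The quarter-context output cap described above can be sketched roughly as follows. This is only an illustration, not the extension's actual code: the function names chatOutputCap and clampMaxTokens are hypothetical, and rounding down to a whole token is an assumption.

```typescript
// Hypothetical sketch of the "output cap = 0.25 of context window" rule.
// Names and rounding behaviour are assumptions, not the extension's code.
const OUTPUT_CAP_FRACTION = 0.25;

// Returns the maximum number of output tokens for a given context window.
function chatOutputCap(contextWindowTokens: number): number {
  if (!Number.isFinite(contextWindowTokens) || contextWindowTokens <= 0) {
    throw new RangeError("context window must be a positive number of tokens");
  }
  // Assumption: round down so the cap never exceeds a quarter of the window.
  return Math.floor(contextWindowTokens * OUTPUT_CAP_FRACTION);
}

// Applies the cap to a caller-requested limit; falls back to the cap itself
// when no limit was requested.
function clampMaxTokens(
  requested: number | undefined,
  contextWindowTokens: number
): number {
  const cap = chatOutputCap(contextWindowTokens);
  return requested === undefined ? cap : Math.min(requested, cap);
}
```

Under these assumptions, a model with an 8192-token context window would get an output cap of 2048 tokens, and a smaller requested limit is passed through unchanged.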

Version 0.0.47 is released (04.05.2026)

What is new

  • Multiline field for Edit with AI
  • Qwen3.5 models added as predefined (2B, 4B, 9B) - good for tools and chat
  • The API key (if needed and provided) is now used when fetching the list of models while adding an OpenAI Compatible model

Version 0.0.46 is released (29.04.2026)

What is new

llama.vscode can now provide models for VS Code Copilot:

  1. Start a tools model from llama-vscode (local or external)
  2. In VS Code Copilot, open the models list -> Other Models -> Manage Models
  3. Make the models you want to use visible by clicking to the left of the model name (all models available from the application serving the tools model are shown)
  4. Select the desired model in Copilot and start using it

Copilot tools that are not needed can be unchecked to reduce the context size when a local model is used.

Version 0.0.45 is released (04.03.2026)

What is new

  • Configurable debounce for inline completion requests - setting debounce_ms. llama-vscode waits debounce_ms after a keystroke before sending a request to the LLM for inline code completion. If another keystroke arrives in the meantime, the request for the previous keystroke is cancelled. Useful on low-end hardware to avoid triggering code completion on every keystroke.

  • The "Extension is updated" notification is shown only on version change, not on every setting change (as before)
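
The debounce behaviour controlled by debounce_ms can be sketched as below. This is a minimal illustration under stated assumptions, not the extension's implementation: makeDebouncedRequester, Schedule and sendRequest are hypothetical names, and "cancelled" is modelled here as dropping a request that has not been sent yet.

```typescript
// Hypothetical sketch of keystroke debouncing; not the extension's code.
type Cancel = () => void;
type Schedule = (fn: () => void, delayMs: number) => Cancel;

// Default scheduler backed by the global timer functions.
const timerSchedule: Schedule = (fn, delayMs) => {
  const handle = setTimeout(fn, delayMs);
  return () => clearTimeout(handle);
};

// Returns a keystroke handler that sends a request only after debounceMs
// have passed without another keystroke.
function makeDebouncedRequester(
  debounceMs: number,
  sendRequest: (text: string) => void,
  schedule: Schedule = timerSchedule
): (text: string) => void {
  let cancelPending: Cancel | undefined;
  return (text: string) => {
    // A new keystroke cancels the request scheduled for the previous one.
    if (cancelPending) {
      cancelPending();
    }
    cancelPending = schedule(() => {
      cancelPending = undefined;
      sendRequest(text);
    }, debounceMs);
  };
}
```

The scheduler is injectable only so the sketch can be exercised without real timers; with the default scheduler, calling the returned handler on every keystroke results in a single request once typing pauses for debounceMs.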

Version 0.0.44 is released (03.03.2026)

What is new

  • Subagents implemented (with the delegate_task tool) - each agent that has "Available as Subagent" checked can now be used as a subagent

  • New agent - Unit Test Writer

  • New tool - create_agent

  • New agent - Agent creator

  • Files SOUL.md and USER.md (if available in the project root) will be added to the context

Setup instructions for llama.cpp server

More details about llama.cpp server

Features
