Home

Version 0.0.48 is released (01.06.2026)

What is new

Thanks to @danielrobbins we have the following improvements:

Fix chat context budgeting and overflow handling
This is the main fix. The extension now behaves as a modern llama.cpp client that trusts and uses the live runtime information the server makes available, including calls to the server for real token counts when possible. It uses the full server API surface and fails correctly when the budget is wrong.
Fix empty streamed chat responses after compaction
This fixes a streaming bug where a valid final response could be dropped and show up as empty.
Polish empty-response diagnostics UI
This improves the user-facing error message and logging when the provider receives an empty response (a clear reason rather than a generic error).
Handle reasoning-only empty chat responses
This handles the case where a reasoning model uses up the whole response budget internally and returns no visible text, which previously caused an error.
Raise chat output cap to quarter context window
This removes an overly small fixed output limit and replaces it with a more reasonable cap based on the model’s context size (the output cap is 0.25 of the context window, which is reasonable as an emergency failsafe).
Log per-turn chat token usage
This adds per-turn token usage logging so it is easier to see what each request and response is actually consuming.
Fix shared-context model metadata display
This fixes the context size shown in the VS Code model UI so that shared-context llama.cpp models no longer appear to have roughly double their real window size (the original PR fixed this so that it does not report a 12K context window).
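
As a rough illustration of the quarter-window output cap and the budget check described in the notes above, the sketch below computes the cap and tests whether a prompt still fits. The function names and the `n_ctx`, `prompt_tokens`, and `requested_output` parameters are assumptions made for this example, not the extension's actual code. In practice the prompt token count would come from the server's own tokenization where that is available.

```python
def output_cap(n_ctx: int) -> int:
    """Cap output at one quarter of the context window (emergency failsafe)."""
    if n_ctx <= 0:
        raise ValueError("n_ctx must be positive")
    return n_ctx // 4


def fits_budget(prompt_tokens: int, n_ctx: int, requested_output: int) -> bool:
    """Hypothetical helper: does the prompt plus the capped output fit the window?"""
    # Never reserve more output space than the quarter-window cap allows.
    reserved = min(requested_output, output_cap(n_ctx))
    return prompt_tokens + reserved <= n_ctx
```

With an 8192-token window the cap is 2048 tokens, so a 6000-token prompt fits, while a 7000-token prompt would need compaction or should fail with a clear reason.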

Version 0.0.47 is released (04.05.2026)

What is new

Multiline field for Edit with AI
Qwen3.5 models added as predefined (2B, 4B, 9B) - good for tools and chat
The API key (if needed and provided) is now used when fetching the list of models while adding an OpenAI Compatible model
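
For readers unfamiliar with how an API key is passed when listing models, here is a minimal sketch of such a request against an OpenAI-compatible endpoint. The helper name `build_models_request` and the base URL are hypothetical. The `/v1/models` path and the `Authorization: Bearer` header follow the usual OpenAI-compatible convention, and some servers may expect a different base path. The request is only built here, not sent.

```python
import urllib.request
from typing import Optional


def build_models_request(base_url: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a GET request for the model list."""
    headers = {"Accept": "application/json"}
    if api_key:
        # Only attach the key when one was provided.
        headers["Authorization"] = f"Bearer {api_key}"
    url = f"{base_url.rstrip('/')}/v1/models"
    return urllib.request.Request(url, headers=headers, method="GET")
```

Passing the resulting object to `urllib.request.urlopen` would perform the actual call.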

Version 0.0.46 is released (29.04.2026)

What is new

llama.vscode can now provide models for VS Code Copilot:

Start the tools model from llama-vscode (local or external)
In VS Code Copilot, open the models list -> Other Models -> Manage Models
Make the models you want to use visible by clicking to the left of the model name (all models available from the application serving the tools model are shown)
Select the desired model in Copilot and start using it

Copilot tools that are not needed can be unchecked to reduce the context size if a local model is used.

Version 0.0.45 is released (04.03.2026)

What is new

Configurable debounce for inline completion requests (setting debounce_ms). llama-vscode waits debounce_ms after a keystroke before sending a request to the LLM for inline code completion. If another keystroke arrives in the meantime, the request for the previous keystroke is cancelled. Useful on low-end hardware to avoid triggering code completion on every keystroke.
The "Extension is updated" notification is shown only on version change, not on every setting change (as it was before)
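
The debounce behaviour described above can be modelled with a small pure function. This is only a sketch of the idea, not the extension's implementation: the function name `requests_sent` is made up for illustration, timestamps are assumed to be sorted and in milliseconds, and a keystroke arriving exactly when the wait expires is assumed not to cancel the earlier request.

```python
def requests_sent(keystrokes_ms: list[int], debounce_ms: int) -> list[int]:
    """Return the keystroke timestamps that would actually trigger a request."""
    sent = []
    for i, t in enumerate(keystrokes_ms):
        is_last = i == len(keystrokes_ms) - 1
        # A keystroke survives only if no newer keystroke arrives
        # before its debounce window has elapsed.
        if is_last or keystrokes_ms[i + 1] - t >= debounce_ms:
            sent.append(t)
    return sent
```

For example, with keystrokes at 0, 50, 100 and 500 ms and `debounce_ms` set to 150, only the keystrokes at 100 and 500 ms lead to a request. With `debounce_ms` set to 0, every keystroke does.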

Version 0.0.44 is released (03.03.2026)

What is new

Subagents implemented (with the delegate_task tool) - each agent that has "Available as Subagent" checked can now be used as a subagent
New agent - Unit Test Writer
New tool - create_agent
New agent - "Agent creator"
Files SOUL.md and USER.md (if available in the project root) will be added to the context
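
One way to picture the SOUL.md and USER.md handling is the sketch below, which gathers whichever of the two files exist in the project root into a single text block. The function name, the header format, and the joining scheme are assumptions for illustration. How the extension actually formats this context is not specified here.

```python
from pathlib import Path

CONTEXT_FILES = ("SOUL.md", "USER.md")


def collect_context_files(project_root: str) -> str:
    """Concatenate the optional context files found in the project root."""
    parts = []
    for name in CONTEXT_FILES:
        path = Path(project_root) / name
        # Missing files are simply skipped; both files are optional.
        if path.is_file():
            parts.append(f"# {name}\n{path.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)
```

If neither file is present the function returns an empty string, so nothing extra is added to the context.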
