# LM Studio REST API v1 Integration

## Overview

The benchmark tool now supports LM Studio's native REST API v1 (`/api/v1/*`) in addition to the existing Python SDK/CLI mode. This enables advanced features such as stateful chats, parallel requests, and more precise metrics.
## New Features

### 1. REST API Mode (`--use-rest-api`)

- Uses `/api/v1/chat` for inference instead of the Python SDK
- Stateful chat management (`response_id` tracking)
- Detailed stats in the response (TTFT, tokens/s, tokens in/out)
- Streaming events for more accurate measurement
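For orientation, here is a minimal sketch of calling the endpoint directly with Python's `requests`. The request payload fields (`model`, `messages`) are assumptions inferred from the examples in this document; the response shape matches the stats example shown later.

```python
import requests

BASE_URL = "http://localhost:1234"  # default LM Studio server port

# Assumed request body; check the LM Studio docs for the authoritative schema.
payload = {
    "model": "qwen/qwen3-4b",
    "messages": [{"role": "user", "content": "Hello!"}],
}

resp = requests.post(f"{BASE_URL}/api/v1/chat", json=payload, timeout=120)
resp.raise_for_status()

data = resp.json()
print(data["text"])                        # generated text
print(data["stats"]["tokens_per_second"])  # per-request throughput
```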
### 2. Model Management via API

- `GET /api/v1/models`: list models with capabilities (vision, tool use)
- `POST /api/v1/models/load`: explicit load with configuration
- `POST /api/v1/models/unload`: explicit unload
- `POST /api/v1/models/download`: download a model via the API
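As a rough sketch, these endpoints can be exercised directly. The `models` key in the listing matches the capabilities example later in this document; the body of the load request is an assumption, not a confirmed schema.

```python
import requests

BASE_URL = "http://localhost:1234"

# List available models with their capability flags
models = requests.get(f"{BASE_URL}/api/v1/models", timeout=30).json()["models"]
for m in models:
    print(m["key"], m["capabilities"])

# Explicitly load the first model; the request body shape is assumed
requests.post(
    f"{BASE_URL}/api/v1/models/load",
    json={"model": models[0]["key"]},
    timeout=300,
).raise_for_status()
```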
### 3. Improved Capabilities Detection

- Vision support: `capabilities.vision` flag from the API
- Tool calling: `capabilities.trained_for_tool_use` flag
- Use the `--only-vision` or `--only-tools` filters
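A small sketch of the same filtering done client-side, using the capability flags exactly as they appear in the `/api/v1/models` example later in this document:

```python
import requests

def filter_models(base_url: str = "http://localhost:1234"):
    """Split the model list into vision-capable and tool-capable sets."""
    models = requests.get(f"{base_url}/api/v1/models", timeout=30).json()["models"]
    vision = [m["key"] for m in models if m["capabilities"].get("vision")]
    tools = [m["key"] for m in models if m["capabilities"].get("trained_for_tool_use")]
    return vision, tools
```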
### 4. Parallel Inference (LM Studio 0.4.0+)

- `--n-parallel N`: max concurrent predictions (default: 4)
- `--unified-kv-cache`: optimizes VRAM usage for parallel requests
- Continuous batching support (llama.cpp 2.0+)
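Client-side, parallel inference just means keeping several predictions in flight at once; the concurrency limit itself is enforced server-side by `--n-parallel`. A sketch with a thread pool, reusing the assumed `/api/v1/chat` payload from above:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://localhost:1234"
PROMPTS = [f"Count to {n}" for n in range(1, 9)]

def ask(prompt: str) -> float:
    """Send one chat request and return its measured throughput."""
    resp = requests.post(
        f"{BASE_URL}/api/v1/chat",
        json={
            "model": "qwen/qwen3-4b",
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["stats"]["tokens_per_second"]

# Keep up to 8 predictions in flight, matching e.g. --n-parallel 8
with ThreadPoolExecutor(max_workers=8) as pool:
    for tps in pool.map(ask, PROMPTS):
        print(f"{tps:.1f} tok/s")
```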
### 5. API Authentication

- `--api-token TOKEN`: permission key for protected servers
- Config: `lmstudio.api_token` in `config/defaults.json`
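A sketch of attaching the token to every request. The `Authorization: Bearer` header scheme is an assumption; check the LM Studio server docs if your deployment expects a different header.

```python
import requests

session = requests.Session()
# Assumed header scheme for the permission key
session.headers["Authorization"] = "Bearer lms_..."  # your API token

resp = session.get("http://localhost:1234/api/v1/models", timeout=30)
resp.raise_for_status()
```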
## Usage

### Basic usage (REST API mode)

```bash
# REST API with default settings
./run.py --use-rest-api --limit 1

# With API token
./run.py --use-rest-api --api-token "your-token-here" --limit 1

# With parallel requests (LM Studio 0.4.0+)
./run.py --use-rest-api --n-parallel 8 --unified-kv-cache --limit 1
```
### Filter by capabilities

```bash
# Test only vision-capable models
./run.py --use-rest-api --only-vision --runs 2

# Test only tool-calling models
./run.py --use-rest-api --only-tools --runs 2
```
### Config file (persistent)

`config/defaults.json`:

```json
{
  "lmstudio": {
    "host": "localhost",
    "ports": [1234, 1235],
    "api_token": "your-token-here",
    "use_rest_api": true
  }
}
```

Then simply:

```bash
./run.py --limit 1  # automatically uses the REST API settings from the config
```
## Comparison: SDK vs. REST API
| Feature | SDK/CLI Mode | REST API Mode |
|---|---|---|
| Model Loading | lms load CLI | POST /api/v1/models/load |
| Inference | lmstudio.llm() | POST /api/v1/chat |
| Stats | SDK stats object | Detailed response stats |
| Streaming | SDK stream | SSE stream (Server-Sent Events) |
| Parallel Requests | ❌ | ✅ (with --n-parallel) |
| Stateful Chats | ❌ | ✅ (response_id tracking) |
| Capabilities | Metadata parsing | Native API fields |
| Authentication | ❌ | ✅ (permission keys) |
## API Response Format

### Dashboard summary API (`/api/dashboard/stats`)
The web dashboard now exposes additional summary fields for quick visual analysis of benchmark history. The endpoint is consumed by the Home and Results views to render KPI cards and charts.
New response fields:
- `speed_summary`: `min`, `p50`, `avg`, `p95`, `max` tokens/s
- `top_models_extended`: top 10 models by speed (model, quantization, speed, VRAM, architecture)
- `quantization_distribution`: count per quantization
- `architecture_distribution`: count per architecture
- `efficiency_top`: top models ranked by `tokens_per_sec_per_gb`
Example (excerpt):
```json
{
  "speed_summary": {
    "min": 22.44,
    "p50": 48.17,
    "avg": 51.26,
    "p95": 86.11,
    "max": 93.88
  },
  "top_models_extended": [
    {
      "model_name": "qwen/qwen3-4b@q4_k_m",
      "quantization": "q4_k_m",
      "speed": 93.88,
      "vram_mb": "6144",
      "architecture": "qwen3"
    }
  ],
  "quantization_distribution": {
    "q4_k_m": 22,
    "q5_k_m": 13
  }
}
```
### `/api/v1/chat` stats

```json
{
  "text": "... generated text ...",
  "stats": {
    "tokens_in": 42,
    "tokens_out": 128,
    "time_to_first_token_ms": 234.5,
    "total_time_ms": 1523.8,
    "tokens_per_second": 84.02
  }
}
```
### `/api/v1/models` capabilities

```json
{
  "models": [
    {
      "key": "llava-1.6-vicuna-7b-q4_k_m",
      "capabilities": {
        "vision": true,
        "trained_for_tool_use": false
      }
    },
    {
      "key": "qwen-2.5-coder-14b-instruct-q5_k_m",
      "capabilities": {
        "vision": false,
        "trained_for_tool_use": true
      }
    }
  ]
}
```
## Implementation details

### New files

- `core/client.py`: REST API client with wrapper functions
  - `LMStudioRESTClient`: main class
  - `ModelInfo`, `ModelCapabilities`, `ChatStats`: data classes
  - `is_vision_model()`, `is_tool_model()`: helpers
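For illustration, two of these data classes might look roughly like this; the field names are taken from the API examples above, not from the actual source file.

```python
# Hypothetical sketch of the data classes in core/client.py; field names
# mirror the /api/v1/chat and /api/v1/models examples in this document.
from dataclasses import dataclass

@dataclass
class ModelCapabilities:
    vision: bool = False
    trained_for_tool_use: bool = False

@dataclass
class ChatStats:
    tokens_in: int
    tokens_out: int
    time_to_first_token_ms: float
    total_time_ms: float
    tokens_per_second: float
```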
### Modified files

- `cli/benchmark.py`:
  - `_run_inference()`: dispatcher (SDK vs. REST)
  - `_run_inference_rest()`: REST-based inference
  - `_run_inference_sdk()`: SDK-based inference (renamed)
  - `_load_model_rest()`, `_unload_model_rest()`: REST model management
- `config/defaults.json`: added `api_token` and `use_rest_api` fields
- `core/config.py`: new config fields in `BASE_DEFAULT_CONFIG`
### CLI flags

```text
--use-rest-api       Enable REST API mode
--api-token TOKEN    API permission token
--n-parallel N       Max parallel predictions (REST only)
--unified-kv-cache   Unified KV cache (REST only)
```
## Troubleshooting

### Server unreachable

```bash
# Check whether LM Studio is running
curl http://localhost:1234/

# Health check via the CLI
lms server status
```
### API token errors

```bash
# Generate a token in Settings > Server,
# then save it in the config or pass it via the CLI
./run.py --use-rest-api --api-token "lms_..."
```
### REST vs. SDK performance
- REST: more precise stats, more features
- SDK: slightly faster (direct Python access)
- For benchmarking, REST is recommended (better metrics)
## Additional REST Client Features

### 1. Download Progress Tracking

The REST client now supports real-time download progress monitoring:

```python
from core.client import LMStudioRESTClient

client = LMStudioRESTClient()

def on_progress(status):
    if status["state"] == "downloading":
        print(f"Progress: {status['progress'] * 100:.1f}%")

# Wait for the download to complete, with progress updates
success = client.download_model(
    model_key="qwen/qwen3-1.7b",
    wait_for_completion=True,
    progress_callback=on_progress,
)
```
API: Polls `/api/v1/models/download/status` every 2 seconds until completion.
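A sketch of what that polling loop could look like internally. The endpoint is the one named above, but the status payload fields and terminal states (`completed`, `failed`) are assumptions extrapolated from the callback example:

```python
import time

import requests

def wait_for_download(base_url: str = "http://localhost:1234") -> bool:
    """Poll the download-status endpoint until a terminal state is reached."""
    while True:
        status = requests.get(
            f"{base_url}/api/v1/models/download/status", timeout=10
        ).json()
        if status["state"] == "completed":   # assumed terminal state
            return True
        if status["state"] == "failed":      # assumed terminal state
            return False
        time.sleep(2)  # poll every 2 seconds, as described above
```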
### 2. MCP Integration

Model Context Protocol (MCP) servers can now be attached to chat requests:

```python
# LM Studio v1 API format
mcp_integrations = [
    {
        "type": "ephemeral_mcp",
        "server_label": "filesystem",
        "server_url": "http://localhost:3001/mcp",
    }
]

result = client.chat_stream(
    messages=[{"role": "user", "content": "List files in /tmp"}],
    model="qwen/qwen3-4b",
    mcp_integrations=mcp_integrations,
)
```

Note: Requires a running MCP server. Integrations are passed in the request's `integrations` array field.
### 3. Stateful Chat History

Enable multi-turn conversations with automatic `response_id` tracking:

```python
client = LMStudioRESTClient()

# First message
result1 = client.chat_stream(
    messages=[{"role": "user", "content": "What is 2+2?"}],
    model="qwen/qwen3-4b",
    use_stateful=True,
)
# response_id is stored automatically

# Second message - automatically includes previous_response_id,
# so the server can maintain conversation context
result2 = client.chat_stream(
    messages=[{"role": "user", "content": "Add 3 to that."}],
    model="qwen/qwen3-4b",
    use_stateful=True,
)

# Reset state when starting a new conversation
client.reset_stateful_chat()
```

API: Extracts `response_id` from the `chat.end` event and sends `previous_response_id` in subsequent requests.
### 4. Response Caching

Identical requests are cached in memory for instant responses:

```python
client = LMStudioRESTClient(enable_cache=True)

# First request - hits the API
result1 = client.chat_stream(
    messages=[{"role": "user", "content": "Count to 5"}],
    model="qwen/qwen3-4b",
    temperature=0.5,
)
# Time: ~0.5 s

# Second identical request - served from the in-memory cache
result2 = client.chat_stream(
    messages=[{"role": "user", "content": "Count to 5"}],
    model="qwen/qwen3-4b",
    temperature=0.5,
)
# Time: near-instant

# Cache management
cache_size = len(client._RESPONSE_CACHE)  # check cache size
cleared = client.clear_cache()            # clear all cached responses
```

Cache key: MD5 hash of `(messages, model, temperature)`.
Bypassed: when using `use_stateful=True` or `mcp_integrations` (non-deterministic).
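For illustration, such a cache key could be derived as below; the exact serialization used by `LMStudioRESTClient` is not shown in this document, so treat this as a sketch only.

```python
import hashlib
import json

def cache_key(messages, model, temperature) -> str:
    """Derive a deterministic MD5 key from the request parameters."""
    payload = json.dumps(
        {"messages": messages, "model": model, "temperature": temperature},
        sort_keys=True,  # stable ordering so identical requests hash equally
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```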