LM Studio Benchmark Documentation
Welcome to the LM Studio Benchmark documentation! This tool helps you measure and compare token/s performance across all your locally installed LLM models and their quantizations.
What is this?
A Python benchmark tool for LM Studio with a modern web dashboard that:
- Automatically tests all local LLM models and quantizations
- Measures token/s speeds with warmup and multiple runs
- Exports results in JSON, CSV, PDF, and interactive HTML formats
- Detects GPU capabilities (NVIDIA, AMD, Intel) and monitors VRAM usage
- Provides a web dashboard with live charts and filtering options
- Includes Linux tray controls with live status icons and quick actions
Quick Links
- Quickstart Guide — Get started in 5 minutes
- Configuration Reference — All CLI arguments and config file options
- Architecture Documentation — System architecture with Mermaid diagrams, including testing architecture
- REST API Integration — Advanced features with LM Studio API v1
- Hardware Monitoring — GPU, CPU, RAM tracking
- LLM Metadata Guide — Model capabilities and metadata
- User Data & Configuration — XDG directory structure and config management
- Agent Integration — How to integrate with LM Studio Agents
Features at a Glance
✅ Multi-model benchmarking with intelligent GPU offload
✅ Vision & tool-calling model detection
✅ Progressive VRAM management (automatic fallback)
✅ Caching system (skip already-tested models)
✅ Filter by quantization, architecture, params, context length
✅ Live web dashboard with 27 themes
✅ Linux tray controller with dynamic benchmark status icons
✅ REST API mode with parallel inference support
✅ Download progress tracking, MCP integration, stateful chats
✅ Response caching with 10,000x+ speedup for repeated prompts
Getting Started
Check out the Quickstart Guide to begin benchmarking your models!
🚀 Quick Start Guide - LM Studio Benchmark Tool
Installation
cd ~/LM-Studio-Bench
# 1) Preview setup (no changes)
./setup.sh --dry-run
# 2) Prepare system + Python environment (recommended)
./setup.sh
# 3) Activate virtual environment
source .venv/bin/activate
If you skip setup.sh, use this manual fallback:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
🌐 Web Dashboard (Recommended)
Start Web UI
./run.py --webapp
✅ Opens browser automatically at http://localhost:8080
✅ Live streaming of benchmark output via WebSocket
✅ Browse all cached results with interactive tables
✅ System info (GPU model detection, LM Studio health, hardware details)
✅ Dark mode by default with 27 theme options
✅ All CLI parameters available as web form with tooltips
✅ Advanced filtering (quantization, architecture, size, context-length)
✅ Separate logs:
~/.local/share/lm-studio-bench/logs/webapp_*.log and
~/.local/share/lm-studio-bench/logs/benchmark_*.log
✅ Linux tray control with dynamic status icon and quick actions
Dashboard Features:
- Start Benchmark: Configure and run benchmarks from web interface
- Filter by quantization, architecture, parameter size
- Rank results by speed, efficiency, TTFT, or VRAM
- Set hardware limits (max GPU temp, max power draw)
- Tooltip help for all options
- System Info: OS, Kernel, CPU, GPU (with detailed model names)
- LM Studio Health: Live healthcheck status (HTTP API + CLI fallback)
- Live Output: Real-time streaming with colored logs and progress
- Results Browser: Filter and sort all cached benchmark results
- Export: Download JSON/CSV/PDF/HTML reports
- Network Access: Access from other devices on same network
Linux Tray Control
When GTK/AppIndicator dependencies are installed, a tray controller starts with the web app.
- Dynamic status icon:
- Gray: idle
- Green: running
- Yellow: paused
- Red: API unreachable/error
- Smart controls:
- Start enabled in idle/error states
- Pause/Stop enabled only in running/paused states
- Auto refresh: status and controls refresh every 3 seconds
- Quit behavior: tray Quit triggers a graceful full shutdown
Network Access
# Access dashboard from other devices
http://your-ip:8080
# Example:
http://192.168.1.100:8080
💻 Command Line (CLI)
Simple Benchmark (All Models)
./run.py
✅ Tests all installed models with 3 runs each (~1-2 hours)
✅ Automatically saves results to ~/.local/share/lm-studio-bench/results/
✅ Clean output with emoji icons and formatted model lists
✅ Detailed logs saved to
~/.local/share/lm-studio-bench/logs/benchmark_YYYYMMDD_HHMMSS.log
Monitor Logs in Real-Time
# Watch benchmark execution
tail -f ~/.local/share/lm-studio-bench/logs/benchmark_*.log
# Watch web dashboard
tail -f ~/.local/share/lm-studio-bench/logs/webapp_*.log
# Search for errors
grep ERROR ~/.local/share/lm-studio-bench/logs/benchmark_*.log
Quick Test (3 NEW Models)
./run.py --limit 3 --runs 1
✅ Fast test with 3 NEW untested models (~5-10 minutes)
✅ Already tested models automatically loaded from cache
✅ Limit applies ONLY to new models, all cached models included
Development Mode (Fastest)
./run.py --dev-mode
✅ Automatically selects smallest model
✅ Single run for quick validation (~30 seconds)
✅ Perfect for testing changes
Test Single Model
./run.py --limit 1 --runs 1
✅ Single model benchmark (~1-2 minutes)
Advanced Features
1️⃣ Hardware Profiling (6 Live Charts)
Enable Complete Hardware Monitoring:
./run.py --enable-profiling --runs 1 --limit 3
Monitored Metrics:
- 🌡️ GPU Temperature (°C)
- ⚡ GPU Power (W)
- 💾 GPU VRAM (GB)
- 🧠 GPU GTT (GB) - AMD only
- 🖥️ System CPU usage (%)
- 💾 System RAM usage (GB)
✅ All metrics are displayed live in the WebApp
✅ 6 interactive Plotly.js charts with Min/Max/Avg stats
✅ Moving average for stable RAM curves
✅ Each metric is measured every second
With Safety Limits:
./run.py --enable-profiling --max-temp 85 --max-power 350
✅ Interrupts benchmark when limits are exceeded
2️⃣ AMD GTT Support (Shared System RAM)
Enable GTT (Default):
./run.py --limit 3
✅ Automatically uses VRAM + GTT (e.g. 2GB VRAM + 46GB GTT = 48GB)
✅ Enables larger models on AMD APUs/iGPUs
✅ Shown in logs: "💾 Memory: 0.4GB VRAM + 44.7GB GTT = 45.1GB total"
Disable GTT (VRAM-only):
./run.py --disable-gtt --limit 3
✅ Only uses dedicated VRAM
✅ More conservative offload levels
✅ Useful for benchmarking VRAM-only performance
3️⃣ Filtering Models
By Quantization:
./run.py --quants q4,q5 --limit 5
By Architecture:
./run.py --arch llama,mistral --limit 5
By Parameter Size:
./run.py --params 7B,8B --limit 5
By Context Length:
./run.py --min-context 32000 --limit 3
By Model Size:
./run.py --max-size 10 --limit 5
Vision Models Only:
./run.py --only-vision --runs 1
Regex-based Filtering (Include):
# Only Qwen or Phi models
./run.py --include-models "qwen|phi" --runs 1
# Only Llama 7B models
./run.py --include-models "llama.*7b" --runs 1
# Only Q4 quantizations
./run.py --include-models ".*q4.*" --runs 1
Regex-based Filtering (Exclude):
# Exclude uncensored models
./run.py --exclude-models "uncensored" --runs 1
# Exclude Q2 and Q3 quantizations
./run.py --exclude-models "q2|q3" --runs 1
# Exclude all vision models
./run.py --exclude-models ".*vision.*" --runs 1
Combined Filters (AND logic):
# Include llama, exclude q2, only tools
./run.py --include-models "llama" --exclude-models "q2" --only-tools --runs 1
# Vision models, 7B params, max 12GB
./run.py --only-vision --params 7B --max-size 12 --runs 1
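Under the hood, the include/exclude patterns behave like standard regex filters combined with AND logic. A minimal sketch in Python of how such filtering can work (case-insensitive matching against the model key is an assumption here, not a confirmed implementation detail):

import re

def filter_models(model_keys, include=None, exclude=None):
    # Apply --include-models / --exclude-models style regex filters (AND logic)
    selected = []
    for key in model_keys:
        if include and not re.search(include, key, re.IGNORECASE):
            continue  # fails the include pattern
        if exclude and re.search(exclude, key, re.IGNORECASE):
            continue  # matches the exclude pattern
        selected.append(key)
    return selected

models = ["llama-2-7b@q4_k_m", "llama-2-7b@q2_k", "qwen3-8b@q4_k_m"]
print(filter_models(models, include="llama", exclude="q2"))  # ['llama-2-7b@q4_k_m']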
4️⃣ Ranking & Sorting
Sort by Efficiency (Default: Speed):
./run.py --limit 5 --rank-by efficiency
Sort by TTFT (Lower = Better):
./run.py --limit 5 --rank-by ttft
Sort by VRAM Usage (Lower = Better):
./run.py --limit 5 --rank-by vram
5️⃣ Cache Management
View Cached Results:
./run.py --list-cache
✅ Shows all cached models with performance metrics
Force Retest (Ignore Cache):
./run.py --retest --limit 3
✅ Re-runs benchmarks even if cached
Regenerate Reports from Database:
./run.py --export-only
✅ Generates JSON/CSV/PDF/HTML from cached results in <1s
✅ No benchmarking - instant report generation
✅ Supports all filters (--params, --quants, --arch, etc.)
Examples:
# All cached models
./run.py --export-only
# Only 7B models from cache
./run.py --export-only --params 7B
# Q4 quantizations with historical comparison
./run.py --export-only --quants q4 --compare-with latest
Export Cache as JSON:
./run.py --export-cache my_backup.json
✅ Exports entire cache database
Cache Behavior:
- First run: Tests all models (~2 hours for 20 models)
- Second run: Loads from cache (~1 second!)
- Automatic invalidation on parameter changes (prompt, context, temperature)
- Shows "X of Y models cached" before starting
6️⃣ Historical Comparison & Trends
Compare with Latest Benchmark:
./run.py --limit 3 --runs 1 --compare-with latest
📊 Shows performance delta (%) vs previous run
Compare with Specific Benchmark:
./run.py --limit 3 --runs 1 --compare-with benchmark_results_20260104_170000.json
7️⃣ Custom Configuration
Adjust Number of Runs:
./run.py --runs 5 --limit 2
Custom Context Length:
./run.py --context 4096 --limit 2 --runs 1
Custom Prompt:
./run.py -P "Your custom prompt here" --limit 2 --runs 1
8️⃣ Presets (Fast Setup)
Show available presets:
./run.py --list-presets
Load a built-in preset:
# Default presets (readonly)
./run.py --preset default_classic # Classic benchmark (default)
./run.py --preset default_compatibility_test # Capability-driven test
# Other presets
./run.py --preset quick_test
./run.py --preset high_quality
./run.py --preset resource_limited
Load preset and override values:
./run.py --preset quick_test --runs 2 --context 2048
./run.py --preset default_classic --runs 5 --context 4096
Backwards Compatibility:
./run.py --preset default # Automatically loads default_classic
Notes:
- Default presets include explicit values for all benchmark form fields, so preset comparisons do not show null values for missing keys.
- default_classic is optimized for full model benchmarking (3 runs).
- default_compatibility_test (alias: default_compatability_test) is optimized for focused capability testing (1 run).
- Capability-driven runs over many installed models continue when a single model fails to load; the failed model is logged and skipped.
- Embedding models are retried automatically without KV-cache offload if LM Studio rejects that load option.
- Legacy keys in imported/user presets are normalized automatically (context_length/top_k/top_p/min_p -> current key names).
📊 Output Formats
Each benchmark generates 4 files:
JSON Format
{
"model_name": "qwen/qwen3-8b",
"quantization": "q4_k_m",
"avg_tokens_per_sec": 8.15,
"tokens_per_sec_per_gb": 1.74,
"speed_delta_pct": -0.2,
...
}
✅ Structured data for analysis
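The exported JSON can be post-processed directly; this small example assumes the file contains a list of result objects with the fields shown above:

import json

with open("benchmark_results_20260104_170000.json") as f:
    results = json.load(f)  # assumed: a list of result dicts

# Rank by throughput, fastest first
for r in sorted(results, key=lambda r: r["avg_tokens_per_sec"], reverse=True)[:10]:
    print(f"{r['model_name']} ({r['quantization']}): {r['avg_tokens_per_sec']:.2f} tok/s")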
CSV Format
model_name,quantization,avg_tokens_per_sec,tokens_per_sec_per_gb,speed_delta_pct
qwen/qwen3-8b,q4_k_m,8.15,1.74,-0.2
✅ Excel/Sheets compatible
PDF Report
- Model rankings (sortable)
- Best-of-Quantization analysis
- Quantization comparison tables (Q4 vs Q5 vs Q6)
- Performance statistics & percentiles
- Delta display (Δ% column)
HTML Report (Interactive Plotly)
- Bar chart: Top 10 models
- Scatter plot: Size vs Performance
- Scatter plot: Efficiency analysis
- NEW: Trend chart showing performance over time
- Summary statistics with gradient backgrounds
📈 Feature Showcase
Example: Complete Analysis
./run.py \
--quants q4,q5,q6 \
--limit 5 \
--runs 1 \
--rank-by efficiency \
--compare-with latest
Output:
- ✅ Filters to 5 models with 3 quantizations each
- ✅ Ranks by efficiency (Tokens/s per GB)
- ✅ Shows delta vs previous benchmark
- ✅ Generates all 4 export formats
- ✅ Includes percentile statistics (P50, P95, P99)
- ✅ Shows quantization comparison
- ✅ Displays performance trends if history available
🎯 Key Metrics
| Metric | Description | Unit |
|---|---|---|
| Speed | Throughput | tokens/s |
| Efficiency | Speed per GB model size | tokens/s/GB |
| TTFT | Time to First Token | ms |
| Delta | Change vs previous | % |
| VRAM | Memory used | MB |
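As a worked example of the Efficiency metric, the values from the JSON sample above (8.15 tok/s at 1.74 tok/s/GB) imply a model size of roughly 4.7 GB:

avg_tokens_per_sec = 8.15      # Speed, from the JSON sample
tokens_per_sec_per_gb = 1.74   # Efficiency, from the JSON sample

model_size_gb = avg_tokens_per_sec / tokens_per_sec_per_gb
print(f"Implied model size: {model_size_gb:.2f} GB")  # ~4.68 GB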
📁 File Structure
results/
├── benchmark_results_20260104_170000.json
├── benchmark_results_20260104_170000.csv
├── benchmark_results_20260104_170000.pdf
└── benchmark_results_20260104_170000.html
🐛 Troubleshooting
No models found
- Ensure LM Studio is installed and running
- Check lms ls --json output
Server not responding
- Start LM Studio server manually
- Check ~/.lmstudio/server-logs/
Permission denied on results/
mkdir -p results/
chmod 755 results/
🔗 Related Files
- FEATURES.md - Complete feature list
- PLAN.md - Implementation roadmap
- requirements.txt - Python dependencies
- errors.log - Debug information
Version: 1.0 (Phases 1-4 Complete) | Updated: 2026-01-04
Configuration Reference
Complete documentation of all CLI arguments and configuration options for the LM Studio Benchmark Tool.
Table of Contents
Overview
The benchmark tool can be configured in three ways:
- Project Defaults: config/defaults.json (in Git)
- User Configuration: ~/.config/lm-studio-bench/defaults.json (optional overrides)
- CLI Arguments: Override all config values
Priority: CLI Arguments > User Config > Project Defaults > Hard-coded Defaults
Configuration Files
Project Configuration (config/defaults.json)
The project configuration file contains all default settings for the benchmark. This file is shipped with the project and tracked in Git.
Location: <project_root>/config/defaults.json
User Configuration (~/.config/lm-studio-bench/defaults.json)
Optional user-specific configuration overrides. Only specify fields you want to customize.
Location: ~/.config/lm-studio-bench/defaults.json
Example (minimal user config):
{
"num_runs": 5,
"lmstudio": {
"use_rest_api": true
}
}
This overrides only num_runs and use_rest_api, all other values come from project defaults.
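A minimal sketch of how such nested overrides can be merged on top of the project defaults (the tool's actual config loader may differ in detail):

def deep_merge(base: dict, override: dict) -> dict:
    # Recursively overlay user config values on top of project defaults
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

project = {"num_runs": 3, "lmstudio": {"use_rest_api": False, "host": "localhost"}}
user = {"num_runs": 5, "lmstudio": {"use_rest_api": True}}
print(deep_merge(project, user))
# {'num_runs': 5, 'lmstudio': {'use_rest_api': True, 'host': 'localhost'}}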
Complete Structure
{
"prompt": "Is the sky blue?",
"context_length": 2048,
"num_runs": 3,
"retest": false,
"enable_profiling": false,
"lmstudio": {
"host": "localhost",
"ports": [1234, 1235],
"api_token": null,
"use_rest_api": true
},
"inference": {
"temperature": 0.1,
"top_k_sampling": 40,
"top_p_sampling": 0.9,
"min_p_sampling": 0.05,
"repeat_penalty": 1.2,
"max_tokens": 256
},
"load": {
"n_gpu_layers": -1,
"n_batch": 512,
"n_threads": -1,
"flash_attention": true,
"rope_freq_base": 10000,
"rope_freq_scale": 1.0,
"use_mmap": true,
"use_mlock": false,
"kv_cache_quant": "f16"
}
}
Field Descriptions
Basic Settings
| Field | Type | Default | Description |
|---|---|---|---|
prompt | string | "Is the sky blue?" | Default test prompt for all benchmarks |
context_length | integer | 2048 | Context length in tokens |
num_runs | integer | 3 | Number of measurements per model/quantization |
retest | boolean | false | Ignore cache and benchmark all selected models again |
enable_profiling | boolean | false | Enable temperature/power monitoring |
LM Studio Server (lmstudio)
| Field | Type | Default | Description |
|---|---|---|---|
host | string | "localhost" | LM Studio server hostname |
ports | array | [1234, 1235] | Ports for server discovery (tries both) |
api_token | string/null | null | API permission token (REST API authentication) |
use_rest_api | boolean | true | Use REST API v1 instead of SDK/CLI |
Inference Parameters (inference)
| Field | Type | Default | Description |
|---|---|---|---|
temperature | float | 0.1 | Sampling temperature (0.0-2.0, low=deterministic) |
top_k_sampling | integer | 40 | Top-K sampling (limits choice to K most likely tokens) |
top_p_sampling | float | 0.9 | Top-P / Nucleus sampling (cumulative probability) |
min_p_sampling | float | 0.05 | Min-P sampling (minimum probability threshold) |
repeat_penalty | float | 1.2 | Repeat penalty (prevents repetitions, 1.0=off) |
max_tokens | integer | 256 | Maximum output tokens |
Load Config (load)
| Field | Type | Default | Description |
|---|---|---|---|
n_gpu_layers | integer | -1 | GPU layers (-1=auto/all, 0=CPU only, >0=specific) |
n_batch | integer | 512 | Batch size for prompt processing |
n_threads | integer | -1 | CPU threads (-1=auto/all) |
flash_attention | boolean | true | Flash attention (faster computation) |
rope_freq_base | float | 10000 | RoPE frequency base |
rope_freq_scale | float | 1.0 | RoPE frequency scaling |
use_mmap | boolean | true | Memory mapping (faster model load) |
use_mlock | boolean | false | Memory locking (prevents swapping) |
kv_cache_quant | string | "f16" | KV cache quantization (f32/f16/q8_0/q4_0/etc.) |
Preset Defaults and Compatibility
The tool includes two readonly default presets:
default_classic - Classic Benchmark Mode
Default preset for standard model benchmarking. Contains explicit values for all benchmark
fields to avoid null values in preset comparisons.
- benchmark_mode: classic
- preset_mode: classic
- runs: 3
- context: 2048
- Capability fields (agent_model, agent_capabilities, agent_max_tests): null
Backwards Compatibility: Loading --preset default automatically loads default_classic.
default_compatibility_test - Capability-Driven Test Mode
Default preset for focused capability testing of a single model.
Alias: The legacy name default_compatability_test is accepted as an alias
for this preset for backward compatibility.
- benchmark_mode: capability
- preset_mode: capability
- runs: 1
- context: 2048
- agent_model: qwen2.5-7b-instruct
- agent_capabilities: general_text,reasoning
- agent_max_tests: 10
- No null values - all fields have explicit defaults
Compatibility mapping is applied automatically when loading and comparing presets with legacy keys:
- context_length -> context
- num_runs -> runs
- top_k -> top_k_sampling
- top_p -> top_p_sampling
- min_p -> min_p_sampling
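A sketch of this normalization, assuming a simple flat key rename (the actual preset loader may handle nested structures differently):

LEGACY_KEY_MAP = {
    "context_length": "context",
    "num_runs": "runs",
    "top_k": "top_k_sampling",
    "top_p": "top_p_sampling",
    "min_p": "min_p_sampling",
}

def normalize_preset(preset: dict) -> dict:
    # Rename legacy preset keys to their current names, leaving others untouched
    return {LEGACY_KEY_MAP.get(key, key): value for key, value in preset.items()}

print(normalize_preset({"num_runs": 2, "context_length": 2048, "top_k": 40}))
# {'runs': 2, 'context': 2048, 'top_k_sampling': 40}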
CLI Arguments
All CLI arguments override the corresponding values from both config files.
Basic Options
--runs, -r (integer)
Number of measurements per model/quantization.
./run.py --runs 1 # Fast: only 1 measurement
./run.py --runs 5 # Accurate: 5 measurements (average)
Default: 3
--context, -c (integer)
Context length in tokens.
./run.py --context 4096 # 4K context
./run.py --context 32768 # 32K context
Default: 2048
--list-presets
List all available presets (readonly + user presets) and exit.
./run.py --list-presets
--preset, -p (string)
Load a preset before parsing all remaining CLI arguments.
If omitted, default_classic is used. The legacy alias default still
loads default_classic automatically.
./run.py --preset quick_test
./run.py --preset high_quality --runs 3
./run.py --preset default_classic
./run.py --preset default_compatability_test
Built-in readonly presets:
- default_classic
- default_compatability_test
- default (alias for default_classic)
- quick_test
- high_quality
- resource_limited
Readonly preset names cannot be saved, deleted, or imported as user presets.
This restriction also applies to the legacy alias default.
For capability-driven runs across many models, individual model load failures are logged and skipped so the benchmark can continue with the remaining models.
--prompt, -P (string)
Default test prompt.
./run.py --prompt "Explain machine learning"
./run.py -P "Explain machine learning"
Default: "Is the sky blue?"
--limit, -l (integer)
Maximum number of models to test.
./run.py --limit 1 # Only 1 model (usually smallest)
./run.py --limit 5 # First 5 models
Default: None (all models)
--dev-mode
Development mode: Automatically tests the smallest model with 1 run.
./run.py --dev-mode # Equivalent to --limit 1 --runs 1
Default: false
Filter Options
--only-vision
Test only models with vision capability (multimodal).
./run.py --only-vision --runs 2
Default: false
--only-tools
Test only models with tool-calling support.
./run.py --only-tools --runs 2
Default: false
--quants (string)
Test only specific quantizations (comma-separated).
./run.py --quants "q4,q5,q6" # Only Q4/Q5/Q6
./run.py --quants "q8" # Only Q8
Default: None (all quants)
--arch (string)
Test only specific architectures (comma-separated).
./run.py --arch "llama,mistral" # Only Llama and Mistral
./run.py --arch "qwen" # Only Qwen
Default: None (all architectures)
--params (string)
Test only specific parameter sizes (comma-separated).
./run.py --params "3B,7B,8B" # 3B, 7B and 8B models
./run.py --params "1B" # Only 1B models
Default: None (all sizes)
--min-context (integer)
Minimum context length in tokens.
./run.py --min-context 32000 # Only models with ≥32K context
Default: None (no minimum)
--max-size (float)
Maximum model size in GB.
./run.py --max-size 10.0 # Only models ≤10GB
./run.py --max-size 5.0 # Only models ≤5GB
Default: None (no limit)
--include-models (string)
Only test models matching the regex pattern.
./run.py --include-models "llama.*7b" # All 7B Llama models
./run.py --include-models "qwen|phi" # Qwen OR Phi
Default: None (all models)
--exclude-models (string)
Exclude models matching the regex pattern.
./run.py --exclude-models ".*uncensored.*" # No uncensored models
./run.py --exclude-models "test|exp" # No test/experimental
Default: None (no exclusions)
--compare-with (string)
Compare with previous results.
./run.py --compare-with "20260104_172200.json"
./run.py --compare-with "latest" # Latest result
Default: None (no comparison)
--rank-by (choice)
Sort results by metric.
Options: speed, efficiency, ttft, vram
./run.py --rank-by speed # By tokens/s
./run.py --rank-by efficiency # By tokens/s per GB VRAM
./run.py --rank-by ttft # By Time to First Token
./run.py --rank-by vram # By VRAM usage (low→high)
Default: speed
Cache Management
--retest
Ignore cache and retest all models.
./run.py --retest # Overwrites old results
Default: false (uses cache if available)
--list-cache
Display all cached models and exit.
./run.py --list-cache
Output: Table with all cache entries
--export-cache (string)
Export cache contents as JSON.
./run.py --export-cache "cache_export.json"
Exits the program after export.
--export-only
Generate reports from cache without new tests.
./run.py --export-only # Creates JSON/CSV/PDF/HTML
Default: false
Hardware Profiling
--enable-profiling
Enable hardware profiling (GPU temp & power).
./run.py --enable-profiling
Default: false
--max-temp (float)
Maximum GPU temperature in °C (warning).
./run.py --enable-profiling --max-temp 80.0
Default: None (no warning)
--max-power (float)
Maximum GPU power draw in Watts (warning).
./run.py --enable-profiling --max-power 400.0
Default: None (no warning)
--disable-gtt
Disable GTT (Shared System RAM) for AMD GPUs.
./run.py --disable-gtt # Only dedicated VRAM
Default: false (GTT enabled)
Note: Only relevant for AMD iGPUs (e.g., Radeon 890M).
Inference Parameters
All override values from config files:
--temperature (float)
./run.py --temperature 0.7 # More creative responses
./run.py --temperature 0.0 # Deterministic
--top-k, --top-k-sampling (integer)
./run.py --top-k 50
--top-p, --top-p-sampling (float)
./run.py --top-p 0.95
--min-p, --min-p-sampling (float)
./run.py --min-p 0.05
--repeat-penalty (float)
./run.py --repeat-penalty 1.3
--max-tokens (integer)
./run.py --max-tokens 512
Load Config (Performance Tuning)
All override values from config files:
--n-gpu-layers (integer)
./run.py --n-gpu-layers -1 # All layers on GPU (default)
./run.py --n-gpu-layers 0 # CPU only
./run.py --n-gpu-layers 20 # First 20 layers on GPU
--n-batch (integer)
./run.py --n-batch 1024 # Larger batches (faster)
./run.py --n-batch 128 # Smaller batches (less VRAM)
--n-threads (integer)
./run.py --n-threads -1 # Auto (default)
./run.py --n-threads 8 # 8 CPU threads
--flash-attention / --no-flash-attention
./run.py --flash-attention # Enabled (default)
./run.py --no-flash-attention # Disabled
--rope-freq-base (float)
./run.py --rope-freq-base 10000.0
--rope-freq-scale (float)
./run.py --rope-freq-scale 1.0
--use-mmap / --no-mmap
./run.py --use-mmap # Enabled (default)
./run.py --no-mmap # Disabled
--use-mlock
./run.py --use-mlock # Enabled (prevents swapping)
--kv-cache-quant (choice)
Options: f32, f16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
./run.py --kv-cache-quant q8_0 # 8-bit quantization (saves VRAM)
./run.py --kv-cache-quant f16 # Half-precision (balanced)
Default: null (model default)
REST API Mode
Uses LM Studio REST API v1 instead of Python SDK/CLI.
--use-rest-api
./run.py --use-rest-api --limit 1
Benefits:
- More detailed stats (TTFT, tok/s)
- Stateful chats (response_id tracking)
- Parallel requests (continuous batching)
- MCP integration
- Response caching
Default: false (uses SDK/CLI)
--api-token (string)
API permission token for REST API authentication.
./run.py --use-rest-api --api-token "lms_your_token_here"
Default: null (no token, server must be open)
Create: LM Studio → Settings → Server → Generate Token
--n-parallel (integer)
Max parallel predictions per model (REST API only).
./run.py --use-rest-api --n-parallel 8
Default: 4
Requirement: LM Studio 0.4.0+, continuous batching support
--unified-kv-cache
Enable unified KV cache (REST API only).
./run.py --use-rest-api --unified-kv-cache --n-parallel 8
Benefit: Optimizes VRAM for parallel requests
Default: false
Examples
Quick Test of One Model
./run.py --limit 1 --runs 1
# Or shorter:
./run.py --dev-mode
All 7B Llama Models with Q4/Q5/Q6 Quants
./run.py --include-models "llama.*7b" --quants "q4,q5,q6" --runs 2
Vision Models Only with Hardware Profiling
./run.py --only-vision --enable-profiling --max-temp 80.0 --max-power 400.0
REST API with Parallel Requests
./run.py --use-rest-api --n-parallel 8 --unified-kv-cache --limit 5
Export Without New Tests
./run.py --export-only
Custom Inference Parameters
./run.py --temperature 0.7 --top-p 0.95 --max-tokens 512 --limit 3
Preset Workflow
./run.py --list-presets
./run.py --preset quick_test
./run.py --preset resource_limited --max-size 10 --runs 2
Performance Tuning (VRAM-optimized)
./run.py --n-batch 128 --kv-cache-quant q8_0 --limit 5
Manage Cache
./run.py --list-cache # Display cache contents
./run.py --export-cache "backup.json" # Export cache
./run.py --retest --limit 1 # Ignore cache
Configuration Priority
- CLI Arguments (highest priority)
- User Config (~/.config/lm-studio-bench/defaults.json)
- Project Config (config/defaults.json)
- Hard-coded Defaults (in code)
Example:
# User config has "num_runs": 5
# Project config has "num_runs": 3
./run.py --runs 1 # → uses 1 (CLI overrides)
./run.py # → uses 5 (from user config)
Tips & Best Practices
1. Persistent REST API Config
If you mainly use REST API:
config/defaults.json:
{
"lmstudio": {
"use_rest_api": true,
"api_token": "lms_your_token"
}
}
Then simply:
./run.py --limit 1 # automatically uses REST API
2. VRAM Optimization
When VRAM is limited:
./run.py --kv-cache-quant q8_0 --n-batch 128 --max-size 10.0
3. Fast Development
./run.py --dev-mode # Tests only smallest model with 1 run
4. Reproducible Benchmarks
./run.py --temperature 0.0 --runs 5 --retest
5. Hardware Monitoring
./run.py --enable-profiling --max-temp 80.0 --max-power 400.0
Logging Configuration
The benchmark tool generates timestamped log files for debugging and monitoring.
Log File Locations
logs/
├── benchmark_YYYYMMDD_HHMMSS.log # Benchmark execution logs
└── webapp_YYYYMMDD_HHMMSS.log # Web dashboard logs
Log Format
Each log entry follows this format:
YYYY-MM-DD HH:MM:SS,mmm - LEVEL - LEVEL_ICON message
2026-03-22 13:35:32,445 - INFO - ℹ️ Starting benchmark...
Log Levels
The tool uses standard Python logging levels:
| Level | Usage | Examples |
|---|---|---|
INFO | General information and progress | Model loading, benchmark completion, hardware metrics |
WARNING | Non-fatal issues and fallbacks | GPU tool missing, using CLI fallback, skipped models |
ERROR | Runtime errors requiring attention | Model load failure, API unavailable, VRAM exceeded |
Level Icons
Each log level also gets an automatic icon prefix:
| Level | Icon |
|---|---|
DEBUG | 🐛 |
INFO | ℹ️ |
WARNING | ⚠️ |
ERROR | ❌ |
CRITICAL | 🔥 |
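A minimal sketch of adding such icon prefixes with Python's standard logging module (the tool's actual formatter may differ):

import logging

LEVEL_ICONS = {"DEBUG": "🐛", "INFO": "ℹ️", "WARNING": "⚠️", "ERROR": "❌", "CRITICAL": "🔥"}

class IconFormatter(logging.Formatter):
    # Prefix each message with the icon matching its log level
    def format(self, record: logging.LogRecord) -> str:
        record.msg = f"{LEVEL_ICONS.get(record.levelname, '')} {record.getMessage()}"
        record.args = None  # args are already rendered into msg
        return super().format(record)

handler = logging.StreamHandler()
handler.setFormatter(IconFormatter("%(asctime)s - %(levelname)s - %(message)s"))
logging.basicConfig(level=logging.INFO, handlers=[handler])
logging.getLogger(__name__).info("Starting benchmark...")
# 2026-03-22 13:35:32,445 - INFO - ℹ️ Starting benchmark...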
Hardware Metrics in Logs
When hardware profiling is enabled (--enable-profiling), metrics appear with emoji indicators:
🌡️ GPU Temp: 42°C
⚡ GPU Power: 125W
💾 GPU VRAM: 8.2GB
🧠 GPU GTT: 0.0GB
🖥️ CPU: 35.2%
💾 RAM: 18.5GB
Third-Party Library Logging
The following libraries have suppressed debug output for cleaner logs:
| Library | Level | Reason |
|---|---|---|
httpx | WARNING | HTTP client noise |
lmstudio | WARNING | SDK debug output |
urllib3 | WARNING | HTTP library noise |
websockets | WARNING | WebSocket protocol noise |
Viewing Logs
Real-time monitoring:
# Watch benchmark execution
tail -f ~/.local/share/lm-studio-bench/logs/benchmark_*.log
# Watch web dashboard
tail -f ~/.local/share/lm-studio-bench/logs/webapp_*.log
Search and filter:
# Find errors
grep ERROR ~/.local/share/lm-studio-bench/logs/benchmark_*.log
# Find warnings
grep WARNING ~/.local/share/lm-studio-bench/logs/benchmark_*.log
# Find specific model errors
grep "model_name_pattern" \
~/.local/share/lm-studio-bench/logs/benchmark_*.log
# Count log entries by level
grep -c INFO ~/.local/share/lm-studio-bench/logs/benchmark_*.log
grep -c ERROR ~/.local/share/lm-studio-bench/logs/benchmark_*.log
See Also
- QUICKSTART.md - Quick start guide
- REST_API_FEATURES.md - REST API details
- HARDWARE_MONITORING_GUIDE.md - Hardware profiling
- LLM_METADATA_GUIDE.md - Metadata & capabilities
Hardware Monitoring Live Charts - Guide
✅ Status: Fully Implemented with GPU Detection
Hardware monitoring is now fully functional with stable live charts for all metrics and improved GPU model detection.
Monitoring logic is shared in tools/hardware_monitor.py and used by both
classic benchmark flows and capability-driven agent flows.
📊 Implemented Metrics
GPU Detection and Model Info
The system automatically detects all installed GPUs:
- NVIDIA GPUs
  - Detection: nvidia-smi --query-gpu=name
  - VRAM: nvidia-smi --query-gpu=memory.total
  - Temperature: nvidia-smi --query-gpu=temperature.gpu
  - Power: nvidia-smi --query-gpu=power.draw
- AMD GPUs
  - rocm-smi detection: rocm-smi --showproductname
  - Device ID mapping: lspci -d 1002:{device_id}
  - Example: 1002:150e → "Radeon Graphics (Ryzen 9 7950X3D)"
  - rocm-smi search path: /usr/bin, /usr/local/bin, /opt/rocm-*/bin/
  - VRAM: rocm-smi --showmeminfo vram
  - GTT: rocm-smi --showmeminfo gtt
  - Temperature: rocm-smi --showtemp
- iGPU detection
  - Extract from CPU string: regex r'Radeon\s+(\d+[A-Za-z]*)'
  - Shows integrated Radeon graphics separately
  - Prevents redundancy with dedicated GPUs
GPU Metrics
- 🌡️ GPU Temperature (°C) - Red
  - NVIDIA: nvidia-smi --query-gpu=temperature.gpu
  - AMD: rocm-smi --showtemp
  - Intel: intel-gpu-top (if available)
- ⚡ GPU Power (W) - Blue
  - NVIDIA: nvidia-smi --query-gpu=power.draw
  - AMD: rocm-smi (Current Socket Graphics Package Power)
  - Intel: alternative measurement methods
- 💾 GPU VRAM Usage (GB) - Green
  - NVIDIA: nvidia-smi --query-gpu=memory.used
  - AMD: rocm-smi --showmeminfo vram (in bytes)
- 🧠 GPU GTT Usage (GB) - Purple
  - AMD only: rocm-smi --showmeminfo gtt
  - System RAM that is used as VRAM
  - Example: 2GB VRAM + 46GB GTT = 48GB effective
System Metrics (with --enable-profiling)
- 🖥️ CPU Usage (%) - Orange
  - psutil.cpu_percent(interval=0.1)
  - 0-100% range
  - System-wide, not per process
- 💾 System RAM Usage (GB) - Cyan
  - psutil.virtual_memory().used
  - Smoothing: moving average over 3 samples
  - Prevents spikes from cache/buffer fluctuations
  - Very stable curves
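A minimal sketch of the CPU/RAM sampling described above, including the 3-sample moving average used to smooth the RAM curve (assumes psutil is installed and one sample per second):

import time
from collections import deque
import psutil

ram_window = deque(maxlen=3)  # moving average over the last 3 samples

for _ in range(5):  # the real monitor loops until the benchmark finishes
    cpu_percent = psutil.cpu_percent(interval=0.1)             # system-wide CPU usage
    ram_window.append(psutil.virtual_memory().used / 1024**3)  # RAM used, in GB
    ram_smoothed = sum(ram_window) / len(ram_window)
    print(f"🖥️ CPU: {cpu_percent:.1f}%")
    print(f"💾 RAM: {ram_smoothed:.1f}GB")
    time.sleep(1)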
🔧 Activation
Hardware monitoring is automatically enabled with:
# WebApp with hardware monitoring
./run.py --webapp
# CLI with hardware monitoring
./run.py --enable-profiling
# Only with specific models
./run.py --limit 2 --enable-profiling
📝 Logger Output
When --enable-profiling is active, the benchmark prints metrics every second:
🌡️ GPU Temp: 45.3°C
⚡ GPU Power: 125.5W
💾 GPU VRAM: 8.2GB
🧠 GPU GTT: 0.0GB
🖥️ CPU: 35.2%
💾 RAM: 18.5GB
These outputs are:
- ✅ Saved in ~/.local/share/lm-studio-bench/logs/benchmark_YYYYMMDD_HHMMSS.log
- ✅ Visualized as charts
🎯 Data Flow
Backend (cli/benchmark.py / agents/benchmark.py)
↓
Shared Module (tools/hardware_monitor.py)
↓
HardwareMonitor._monitor_loop()
├─ _get_temperature()
├─ _get_power_draw()
├─ _get_vram_usage()
├─ _get_gtt_usage()
├─ _get_cpu_usage()
└─ _get_ram_usage()
↓
logger.info() → stdout + log file
↓
WebApp Backend (app.py)
├─ _consume_output() Task (blocking readline)
├─ parse_hardware_metrics() (Regex patterns)
└─ hardware_history dict
↓
WebSocket
└─ Sends every 2 seconds (last 60 entries)
↓
Frontend (dashboard.html.jinja)
└─ 6 Plotly.js charts with live updates
Before each profiling run, HardwareMonitor.start() calls
_reset_measurements(). This clears prior temperature, power, VRAM, GTT,
CPU and RAM samples, so chart data and exported min/max/avg values only
reflect the current run.
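The parse_hardware_metrics() step in the flow above can be pictured as a set of regexes applied to each streamed log line; a hedged sketch (the actual patterns in app.py may differ):

import re

METRIC_PATTERNS = {
    "gpu_temp_c": re.compile(r"GPU Temp:\s*([\d.]+)°C"),
    "gpu_power_w": re.compile(r"GPU Power:\s*([\d.]+)W"),
    "gpu_vram_gb": re.compile(r"GPU VRAM:\s*([\d.]+)GB"),
    "cpu_percent": re.compile(r"CPU:\s*([\d.]+)%"),
    "ram_gb": re.compile(r"(?<!V)RAM:\s*([\d.]+)GB"),  # lookbehind avoids matching "VRAM"
}

def parse_metric_line(line: str) -> dict:
    # Extract any hardware metrics present in a single log line
    return {
        name: float(match.group(1))
        for name, pattern in METRIC_PATTERNS.items()
        if (match := pattern.search(line))
    }

print(parse_metric_line("🌡️ GPU Temp: 45.3°C"))  # {'gpu_temp_c': 45.3}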
🐛 Fixes and Optimizations
Fix 1: rocm-smi 7.0.1 Format Change
Problem: rocm-smi changed its output format
Solution: a regex parser extracts the last number from the line
match = re.search(r'[\d.]+\s*$', line.strip())
Fix 2: Logger Routing
Problem: hardware data did not appear in log files
Solution: print() → logger.info() for stdout + file
All hardware metrics are logged using Python's standard logging module:
logger.info(f"🌡️ GPU Temp: {temp:.1f}°C")
logger.info(f"💾 Memory: {vram_mb:.1f}MB VRAM + {gtt_mb:.1f}MB GTT")
This ensures metrics appear in all of:
- stdout - real-time display in the terminal
- log files - ~/.local/share/lm-studio-bench/logs/benchmark_YYYYMMDD_HHMMSS.log for a permanent record
- WebApp - streamed via WebSocket to the dashboard
Fix 3: WebApp Output Streaming
Problem: WebApp showed only 10% of the hardware data
Solution: asyncio.wait_for() → blocking readline() in executor
Fix 4: RAM Monitoring Spikes
Problem: RAM chart jumped between 1.8GB and 28.3GB
Solution: moving average over 3 samples → very stable curve
Fix 5: Runtime Counter Does Not Stop
Problem: runtime counter continued after benchmark end
Solution: clearInterval(uptimeInterval) on completion
Fix 6: WebApp Initialization Race Conditions
Problem: links were not interactive, light mode on startup
Solution: 3x DOMContentLoaded events → 1x consolidated event
📊 Chart Properties
All charts update every 2 seconds with:
- Min/Max/Avg statistics - real-time calculation
- Last 60 data points - about 2 minutes of history
- Responsive design - adapts to window size
- Dark mode - default for all charts
- Hover tooltips - show exact values on hover
LM Studio CLI - Available LLM Metadata with GPU Analysis
📋 Quick Reference
Main metadata query commands
lms ls --json # All downloaded models with metadata
lms ps --json # Currently loaded models
lms status # Server status + model size
lms version # LM Studio version
🎯 GPU Support and Hardware Requirements
Automatic GPU detection in the benchmark
The benchmark system automatically detects all your GPUs and specs:
NVIDIA GPUs:
- Automatic detection via
nvidia-smi - VRAM size recorded for offload optimization
- Temperature and power are monitored
AMD GPUs (rocm-smi):
- Detailed device ID mapping for GPU model names
- VRAM and GTT memory are tracked separately
- rocm-smi search paths: /usr/bin, /usr/local/bin, /opt/rocm-*/bin/
iGPU detection:
- Radeon iGPUs are extracted from the CPU string
- Regex pattern: Radeon\s+(\d+[A-Za-z]*)
- Shows, for example, "Radeon 890M (Ryzen 9 7950X3D)" separately
📊 Full Metadata Fields (15 fields per model)
Category 1: Model identification (5 fields)
| Field | Type | Example | Description |
|---|---|---|---|
type | string | "llm" | Model type (llm, embedding) |
modelKey | string | "mistralai/ministral-3-3b" | Unique model ID |
displayName | string | "Ministral 3 3B" | Display name |
publisher | string | "mistralai" | Model publisher/developer |
path | string | "mistralai/ministral-3-3b" | Local storage path |
Category 2: Technical specifications (4 fields)
| Field | Type | Example | Description |
|---|---|---|---|
architecture | string | "mistral3", "gemma3", "llama" | Model architecture |
format | string | "gguf" | File format (GGUF, etc.) |
paramsString | string | "3B", "7B", "13B" | Parameter size |
sizeBytes | number | 2986817071 | Size in bytes |
Category 3: Model capabilities (3 fields)
| Field | Type | Example | Description |
|---|---|---|---|
vision | boolean | true / false | Can process images? |
trainedForToolUse | boolean | true / false | Supports tool calling? |
maxContextLength | number | 131072, 262144 | Maximum context length in tokens |
Category 4: Quantization and variants (4 fields)
| Field | Type | Example | Description |
|---|---|---|---|
quantization.name | string | "Q4_K_M", "Q8_0", "F16" | Quantization method |
quantization.bits | number | 4, 8, 16 | Bits per weight |
variants | array | [@q4_k_m, @q8_0] | All available quantizations |
selectedVariant | string | "mistralai/ministral-3-3b@q4_k_m" | Current selection |
🔍 Practical Examples with Your Models
Example 1: List vision models
lms ls --json | jq '.[] | select(.vision == true) | {displayName, paramsString, maxContextLength}'
Output:
• Gemma 3 4B (4B) - 131072 tokens
• Ministral 3 3B (3B) - 262144 tokens
• Qwen3 Vl 8B (8B) - 262144 tokens
The command uses the jq filter shown above.
Example 2: Tool-calling models only
lms ls --json | jq '.[] | select(.trainedForToolUse == true) | .displayName'
Example 3: Sort models by size
lms ls --json | jq 'sort_by(.sizeBytes) | .[] | {displayName, sizeGB: (.sizeBytes/1024/1024/1024*100|round/100)}'
Example 4: Models with large context length (≥128k tokens)
lms ls --json | jq '.[] | select(.maxContextLength >= 131072) | {modelKey, maxContextLength}'
Example 5: Model architecture distribution
lms ls --json | jq -r '.[] | .architecture' | sort | uniq -c
🐍 Python SDK Access
SDK methods for metadata queries
import lmstudio
# 1. Fetch all downloaded models
models = lmstudio.list_downloaded_models()
for model in models:
print(f"Model: {model.model_key}")
print(f" Size: {model.info.sizeBytes / 1024**3:.2f} GB")
print(f" Vision: {model.info.vision}")
print(f" Maximum context length: {model.info.maxContextLength} tokens")
print(f" Architecture: {model.info.architecture}")
print()
# 2. Currently loaded models
loaded_models = lmstudio.list_loaded_models()
for llm in loaded_models:
print(f"Loaded: {llm.identifier}")
# 3. Filter models
vision_models = [m for m in models if m.info.vision]
print(f"Vision models: {len(vision_models)}")
# 4. Sort by size
large_models = sorted(models, key=lambda m: m.info.sizeBytes, reverse=True)[:3]
for model in large_models:
print(f"{model.info.displayName}: {model.info.sizeBytes / 1024**3:.2f} GB")
💡 Common Use Cases
Use case 1: Quick performance tests
Filter only small models < 1GB for fast benchmarks:
lms ls --json | jq '.[] | select(.sizeBytes < 1000000000) | .modelKey'
Use case 2: Long-form processing
Models with large context for document analysis:
lms ls --json | jq '.[] | select(.maxContextLength >= 100000) | .displayName'
Use case 3: Image processing
Multi-modal models for vision tasks:
lms ls --json | jq '.[] | select(.vision == true) | .modelKey'
Use case 4: Tool integration
Models with function calling for agent systems:
lms ls --json | jq '.[] | select(.trainedForToolUse == true) | .displayName'
Use case 5: Quantization comparison
All available quantizations for a model:
lms ls "google/gemma-3-1b" --json | jq '.variants[]'
🎯 Benchmarking with Metadata
Integration into benchmark scripts:
import subprocess
import json
# Load model metadata
result = subprocess.run(
['lms', 'ls', '--json'],
capture_output=True,
text=True,
check=False
)
models = json.loads(result.stdout)
# Filter for benchmarking
benchmark_candidates = [
m for m in models
if m['sizeBytes'] < 5e9 # < 5GB
and m['vision'] is False # Text only
]
print(f"Benchmark candidates: {len(benchmark_candidates)}")
for model in benchmark_candidates:
print(f" - {model['displayName']} ({model['paramsString']})")
📝 Tips and Tricks
Convert size
# Bytes to GB
python3 -c "print(f'{2986817071/1024**3:.2f} GB')" # Output: 2.78 GB
JSON pretty print
lms ls --json | jq '.' | less
Quick statistics
# Average model size
lms ls --json | jq '[.[].sizeBytes] | add / length / 1024 / 1024 / 1024'
# Largest model
lms ls --json | jq 'max_by(.sizeBytes) | .displayName'
# Models per architecture
lms ls --json | jq 'group_by(.architecture) | map({architecture: .[0].architecture, count: length})'
🔗 Related Commands
lms status # Server status (shows loaded models too)
lms version # LM Studio version
lms load <model> # Load a model
lms unload --all # Unload all models
Troubleshooting
No output for lms ls --json
- Ensure the LM Studio server is running: lms server start
- Check for port conflicts
jq not installed
- Install: sudo apt install jq (Linux) or brew install jq (macOS)
- Alternative: use Python parsing
Unlimited output
- Use | head -n 5 to limit output
- Or pipe to less for paging: | less
User Data & Configuration Locations
This project follows the XDG Base Directory Specification for storing user data and configuration.
Directory Structure
Project Directory
The project directory contains read-only defaults and optional compatibility locations:
<project>/
├── config/
│ └── defaults.json # Project defaults (in Git)
├── results/ # Optional: legacy/manual compatibility location
└── logs/ # Optional: legacy/manual debug location
User Directories (XDG Standard)
User-specific data is stored in standard XDG locations:
~/.config/lm-studio-bench/
├── defaults.json # User configuration overrides (optional)
└── presets/
├── my_fast_test.json # User preset example
└── my_quality.json # User preset example
~/.local/share/lm-studio-bench/results/
├── benchmark_results_<timestamp>.json
├── benchmark_results_<timestamp>.csv
├── benchmark_results_<timestamp>.pdf
├── benchmark_results_<timestamp>.html
├── benchmark_cache.db # SQLite benchmark cache
├── model_metadata.db # Model metadata cache
└── metadata/
└── <model_id>/
└── metadata.json # Optional per-model metadata fallback
~/.local/share/lm-studio-bench/logs/
├── benchmark_<timestamp>.log
├── benchmark_latest.log # Symlink to newest benchmark log
├── webapp_<timestamp>.log
├── webapp_latest.log # Symlink to newest webapp log
├── runapp_<timestamp>.log
├── runapp_latest.log # Symlink to newest launcher log
├── trayapp_<timestamp>.log
└── trayapp_latest.log # Symlink to newest tray log
Configuration Loading
Configuration is loaded with the following priority:
- CLI Arguments (highest priority)
- User Config (~/.config/lm-studio-bench/defaults.json)
- Project Config (config/defaults.json)
- Hard-coded Defaults (in code)
Example
Project (config/defaults.json):
{
"num_runs": 3,
"context_length": 2048,
"lmstudio": {
"use_rest_api": false
}
}
User (~/.config/lm-studio-bench/defaults.json):
{
"num_runs": 5,
"lmstudio": {
"use_rest_api": true
}
}
Result (merged configuration):
{
"num_runs": 5, // User override
"context_length": 2048, // Project default
"lmstudio": {
"use_rest_api": true // User override
}
}
With CLI:
./run.py --runs 10 --context 4096
Final configuration:
- num_runs: 10 (CLI)
- context_length: 4096 (CLI)
- use_rest_api: true (User config)
Creating User Configuration
Step 1: Create Config Directory
mkdir -p ~/.config/lm-studio-bench
Step 2: Create User Config File
nano ~/.config/lm-studio-bench/defaults.json
Step 3: Add Your Overrides
Only include fields you want to override:
{
"num_runs": 5,
"context_length": 4096,
"inference": {
"temperature": 0.7
}
}
Important: You only need to specify fields you want to change. All other values will use project defaults.
Directory Initialization
On first run, the tool automatically:
- Creates user data directories (~/.config/... and ~/.local/share/...)
- Places new results in ~/.local/share/lm-studio-bench/results/
- Places runtime logs in ~/.local/share/lm-studio-bench/logs/
Note: Legacy files in project-local results/ are not automatically
moved. If you still use that location, move them manually to the XDG path.
Benefits of XDG Structure
For Users
- ✅ Persistent User Settings: Configuration survives project updates
- ✅ Cleaner Project Directory: User data separated from code
- ✅ Standard Locations: Follows Linux conventions
- ✅ Easy Backups: Backup ~/.local/share/lm-studio-bench/ and ~/.config/lm-studio-bench/
- ✅ Multi-User Support: Each user has their own data
For Developers
- ✅ No Git Conflicts: User data not in version control
- ✅ Clean Updates: git pull doesn't affect user data
- ✅ Portable: Project directory can be moved/deleted without losing user data
Environment Variables
You can override paths with environment variables:
# Override config directory
export XDG_CONFIG_HOME="$HOME/my-configs"
# Override data directory
export XDG_DATA_HOME="$HOME/my-data"
# Now config is in: $HOME/my-configs/lm-studio-bench/defaults.json
# Now results are in: $HOME/my-data/lm-studio-bench/results/
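A minimal sketch of resolving these locations in Python, honoring the XDG variables above with their standard fallbacks:

import os
from pathlib import Path

def xdg_config_home() -> Path:
    return Path(os.environ.get("XDG_CONFIG_HOME", Path.home() / ".config"))

def xdg_data_home() -> Path:
    return Path(os.environ.get("XDG_DATA_HOME", Path.home() / ".local" / "share"))

config_file = xdg_config_home() / "lm-studio-bench" / "defaults.json"
results_dir = xdg_data_home() / "lm-studio-bench" / "results"
print(config_file)   # e.g. ~/.config/lm-studio-bench/defaults.json
print(results_dir)   # e.g. ~/.local/share/lm-studio-bench/results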
FAQ
Q: Where are my benchmark results stored?
A: ~/.local/share/lm-studio-bench/results/
If you pass --output-dir, report files (JSON/CSV/HTML/PDF) are written there.
The SQLite cache databases still live in the user results directory.
Q: Where are the SQLite databases stored?
A:
~/.local/share/lm-studio-bench/results/benchmark_cache.db~/.local/share/lm-studio-bench/results/model_metadata.db
Q: Where do I put custom configuration?
A: ~/.config/lm-studio-bench/defaults.json
Only include fields you want to override from project defaults.
Q: Where are user presets stored?
A: ~/.config/lm-studio-bench/presets/
Built-in readonly presets (default_classic,
default_compatibility_test, default as a legacy alias,
quick_test, high_quality, resource_limited) are not stored as
files.
Readonly preset names cannot be overwritten or deleted by user presets,
including the alias default.
Q: What happens to my old results?
A: They are not auto-migrated from legacy project-local folders.
Move them manually to ~/.local/share/lm-studio-bench/results/.
Q: Can I use the old config/defaults.json?
A: Yes! It's still used as project defaults. User config in ~/.config/ overrides it.
Q: How do I reset to project defaults?
A: Delete your user config:
rm ~/.config/lm-studio-bench/defaults.json
Q: How do I backup my data?
A: Backup these directories:
# Configuration
tar -czf lms-bench-config.tar.gz ~/.config/lm-studio-bench/
# Results and cache
tar -czf lms-bench-data.tar.gz ~/.local/share/lm-studio-bench/
Q: What about logs?
A: Logs are stored in:
~/.local/share/lm-studio-bench/logs/
This includes benchmark, web app, tray, and launcher logs.
See Also
- Configuration Reference - All configuration options
- Architecture Documentation - System design
- XDG Base Directory Spec - Standard specification
LM Studio REST API v1 Integration
Overview
The benchmark tool now supports LM Studio's native REST API v1 (/api/v1/*)
in addition to the existing Python SDK/CLI mode. This enables advanced
features such as stateful chats, parallel requests, and more precise metrics.
New Features
1. REST API Mode (--use-rest-api)
- Uses /api/v1/chat for inference instead of the Python SDK
- Stateful chat management (response_id tracking)
- Detailed stats in the response (TTFT, tokens/s, tokens in/out)
- Streaming events for more accurate measurement
2. Model Management via API
- GET /api/v1/models - list with capabilities (vision, tool-use)
- POST /api/v1/models/load - explicit load with configuration
- POST /api/v1/models/unload - explicit unload
- POST /api/v1/models/download - download model via API
3. Improved Capabilities Detection
- Vision support: capabilities.vision flag from the API
- Tool calling: capabilities.trained_for_tool_use flag
- Use the --only-vision or --only-tools filters
4. Parallel Inference (LM Studio 0.4.0+)
- --n-parallel N - max concurrent predictions (default: 4)
- --unified-kv-cache - optimizes VRAM usage for parallel requests
- Continuous batching support (llama.cpp 2.0+)
5. API Authentication
- --api-token TOKEN - permission key for protected servers
- Config: lmstudio.api_token in config/defaults.json
Usage
Basic usage (REST API mode)
# REST API with default settings
./run.py --use-rest-api --limit 1
# With API token
./run.py --use-rest-api --api-token "your-token-here" --limit 1
# With parallel requests (LM Studio 0.4.0+)
./run.py --use-rest-api --n-parallel 8 --unified-kv-cache --limit 1
Filter by capabilities
# Test only vision-capable models
./run.py --use-rest-api --only-vision --runs 2
# Test only tool-calling models
./run.py --use-rest-api --only-tools --runs 2
Config file (persistent)
config/defaults.json:
{
"lmstudio": {
"host": "localhost",
"ports": [1234, 1235],
"api_token": "your-token-here",
"use_rest_api": true
}
}
Then simply:
./run.py --limit 1 # will automatically use REST API from config
Comparison: SDK vs. REST API
| Feature | SDK/CLI Mode | REST API Mode |
|---|---|---|
| Model Loading | lms load CLI | POST /api/v1/models/load |
| Inference | lmstudio.llm() | POST /api/v1/chat |
| Stats | SDK stats object | Detailed response stats |
| Streaming | SDK stream | SSE stream (Server-Sent Events) |
| Parallel Requests | ❌ | ✅ (with --n-parallel) |
| Stateful Chats | ❌ | ✅ (response_id tracking) |
| Capabilities | Metadata parsing | Native API fields |
| Authentication | ❌ | ✅ (permission keys) |
API Response Format
Dashboard summary API (/api/dashboard/stats)
The web dashboard now exposes additional summary fields for quick visual analysis of benchmark history. The endpoint is consumed by the Home and Results views to render KPI cards and charts.
New response fields:
- speed_summary: min, p50, avg, p95, max tokens/s
- top_models_extended: Top 10 models by speed (model, quantization, speed, VRAM, architecture)
- quantization_distribution: count per quantization
- architecture_distribution: count per architecture
- efficiency_top: top models ranked by tokens_per_sec_per_gb
Example (excerpt):
{
"speed_summary": {
"min": 22.44,
"p50": 48.17,
"avg": 51.26,
"p95": 86.11,
"max": 93.88
},
"top_models_extended": [
{
"model_name": "qwen/qwen3-4b@q4_k_m",
"quantization": "q4_k_m",
"speed": 93.88,
"vram_mb": "6144",
"architecture": "qwen3"
}
],
"quantization_distribution": {
"q4_k_m": 22,
"q5_k_m": 13
}
}
/api/v1/chat stats
{
"text": "... generated text ...",
"stats": {
"tokens_in": 42,
"tokens_out": 128,
"time_to_first_token_ms": 234.5,
"total_time_ms": 1523.8,
"tokens_per_second": 84.02
}
}
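The tokens_per_second value can be cross-checked from the other fields; a small example using the numbers above (assuming it is computed over the total request time):

stats = {
    "tokens_in": 42,
    "tokens_out": 128,
    "time_to_first_token_ms": 234.5,
    "total_time_ms": 1523.8,
    "tokens_per_second": 84.02,
}

# Output tokens divided by total wall time in seconds
derived = stats["tokens_out"] / (stats["total_time_ms"] / 1000)
print(f"{derived:.2f} tok/s")  # ≈ 84.00, close to the reported tokens_per_second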
/api/v1/models capabilities
{
"models": [
{
"key": "llava-1.6-vicuna-7b-q4_k_m",
"capabilities": {
"vision": true,
"trained_for_tool_use": false
}
},
{
"key": "qwen-2.5-coder-14b-instruct-q5_k_m",
"capabilities": {
"vision": false,
"trained_for_tool_use": true
}
}
]
}
Implementation details
New files
- core/client.py: REST API client with wrapper functions
  - LMStudioRESTClient: main class
  - ModelInfo, ModelCapabilities, ChatStats: data classes
  - is_vision_model(), is_tool_model(): helpers
Modified files
- cli/benchmark.py:
  - _run_inference(): dispatcher (SDK vs REST)
  - _run_inference_rest(): REST-based inference
  - _run_inference_sdk(): SDK-based inference (renamed)
  - _load_model_rest(), _unload_model_rest(): REST model management
- config/defaults.json: added api_token, use_rest_api fields
- core/config.py: new config fields in BASE_DEFAULT_CONFIG
CLI flags
--use-rest-api Enable REST API mode
--api-token TOKEN API permission token
--n-parallel N Max parallel predictions (REST only)
--unified-kv-cache Unified KV cache (REST only)
Troubleshooting
Server unreachable
# Check whether LM Studio is running
curl http://localhost:1234/
# Healthcheck via CLI
lms server status
API token errors
# Generate token in Settings > Server
# Save it in config or pass via CLI
./run.py --use-rest-api --api-token "lms_..."
REST vs SDK performance
- REST: more precise stats, more features
- SDK: slightly faster (direct Python access)
- For benchmarking, REST is recommended (better metrics)
Additional REST Client Features
1. Download Progress Tracking
The REST client now supports real-time download progress monitoring:
from core.client import LMStudioRESTClient
client = LMStudioRESTClient()
def on_progress(status):
if status["state"] == "downloading":
print(f"Progress: {status['progress'] * 100:.1f}%")
# Wait for download to complete with progress updates
success = client.download_model(
model_key="qwen/qwen3-1.7b",
wait_for_completion=True,
progress_callback=on_progress
)
API: Polls /api/v1/models/download/status every 2 seconds until completion.
2. MCP Integration
Model Context Protocol (MCP) servers can now be attached to chat requests:
# LM Studio v1 API format
mcp_integrations = [
{
"type": "ephemeral_mcp",
"server_label": "filesystem",
"server_url": "http://localhost:3001/mcp"
}
]
result = client.chat_stream(
messages=[{"role": "user", "content": "List files in /tmp"}],
model="qwen/qwen3-4b",
mcp_integrations=mcp_integrations
)
Note: Requires MCP server running. Integrations are passed in the integrations array field.
3. Stateful Chat History
Enable multi-turn conversations with automatic response_id tracking:
client = LMStudioRESTClient()
# First message
result1 = client.chat_stream(
messages=[{"role": "user", "content": "What is 2+2?"}],
model="qwen/qwen3-4b",
use_stateful=True
)
# response_id stored automatically
# Second message - automatically includes previous_response_id
result2 = client.chat_stream(
messages=[{"role": "user", "content": "Add 3 to that."}],
model="qwen/qwen3-4b",
use_stateful=True
)
# Server can maintain conversation context
# Reset state when starting new conversation
client.reset_stateful_chat()
API: Extracts response_id from chat.end event, sends previous_response_id in subsequent requests.
4. Response Caching
Identical requests are cached in memory for instant responses:
client = LMStudioRESTClient(enable_cache=True)
# First request - hits API (slow)
result1 = client.chat_stream(
messages=[{"role": "user", "content": "Count to 5"}],
model="qwen/qwen3-4b",
temperature=0.5
)
# Time: ~0.5s
# Second identical request - hits cache (instant)
result2 = client.chat_stream(
messages=[{"role": "user", "content": "Count to 5"}],
model="qwen/qwen3-4b",
temperature=0.5
)
# Time: ~0.0s (10,000x faster!)
# Cache management
cache_size = len(client._RESPONSE_CACHE) # Check cache size
cleared = client.clear_cache() # Clear all cached responses
Cache Key: MD5 hash of (messages, model, temperature)
Bypassed: When using use_stateful=True or mcp_integrations (non-deterministic)
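A sketch of building such a cache key (MD5 over messages, model, and temperature); the client's exact serialization is an assumption:

import hashlib
import json

def response_cache_key(messages, model, temperature):
    # Deterministic key: identical requests map to the same cached response
    payload = json.dumps(
        {"messages": messages, "model": model, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

print(response_cache_key([{"role": "user", "content": "Count to 5"}], "qwen/qwen3-4b", 0.5))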
Documentation links
- LM Studio REST API Docs
- /api/v1/models endpoint
- /api/v1/chat endpoint
- Headless mode
- LM Studio 0.4.0 blog
Capability-Driven Benchmark Agent Integration
The new Capability-Driven Benchmark Agent functionality is fully integrated into the project and is now available via run.py.
3 Operating Modes
The system now supports 3 different operating modes:
1. Classic Benchmark (Default)
Measures token/s speed across all installed models:
./run.py --limit 5 # Test 5 models
./run.py --export-only # Generate reports from cache
./run.py --runs 1 # Fast-mode with 1 measurement
Metrics: Tokens/s, latency, VRAM usage
2. Capability-Driven Agent ⭐ NEW
Tests model capabilities with quality metrics:
./run.py --agent "model-id" # Automatically test all capabilities
# With specific capabilities
./run.py --agent "llama-13b" --capabilities general_text,reasoning
# With output format options
./run.py --agent "llama-13b" --output-dir ./results/ --formats json,html
# Verbose mode
./run.py --agent "llama-13b" --verbose
Detectable Capabilities:
- general_text - Basic language understanding (QA, summarization, classification)
- reasoning - Logical and mathematical reasoning
- vision - Multimodal understanding (image captioning, VQA, OCR)
- tooling - Tool calling and function execution
Metrics per Capability:
- Quality: ROUGE, F1, Exact Match, Accuracy, Function Call Accuracy
- Performance: Tokens/s, latency
- Reports: JSON + HTML with visualizations
- Storage: SQLite database for historical tracking and comparison
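Of the quality metrics listed above, Exact Match and token-level F1 are straightforward to compute; a hedged sketch (the agent's actual scorers, e.g. its ROUGE implementation, may differ):

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the normalized strings are identical, else 0.0
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    # Token-level F1 between prediction and reference
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum(min(pred.count(tok), ref.count(tok)) for tok in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                              # 1.0
print(token_f1("the capital is Paris", "Paris is the capital"))   # 1.0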
Runtime Resilience:
- Multi-model capability runs continue when a single model fails to load or execute; failed models are logged and skipped.
- Embedding models are retried automatically without offload_kv_cache_to_gpu if LM Studio rejects that load option.
Data Storage:
Results are automatically saved to:
- JSON Reports: ./output/benchmark_results_*.json
- HTML Reports: ./output/benchmark_results_*.html
- SQLite Cache: ~/.local/share/lm-studio-bench/results/benchmark_cache.db
The SQLite database stores individual test results and capability summaries, allowing you to:
- Track performance over time
- Compare results across models
- Query specific capability metrics
- Build custom dashboards from cached data
SQLite Metrics Matrix (Classic vs Capability)
The table below lists what is currently persisted in SQLite for both test types, so missing metrics are easy to spot.
| Metric Group | Classic Benchmark (benchmark_results) | Capability Benchmark (benchmark_results, source='compatibility') |
|---|---|---|
| Run identity | id, model_key, model_name, quantization, timestamp | id, model_name, model_key, capability, test_id, test_name, timestamp |
| Throughput/latency | avg_tokens_per_sec, avg_ttft, avg_gen_time, tokens_per_sec_p50, tokens_per_sec_p95, tokens_per_sec_std, ttft_p50, ttft_p95, ttft_std | latency_ms, throughput_tokens_per_sec (per test), avg_latency_ms, avg_throughput (summary) |
| Token volume | prompt_tokens, completion_tokens | prompt_tokens, tokens_generated |
| Quality metrics | Stored for parity columns but normally NULL for classic runs | quality_score, rouge_score, f1_score, exact_match_score, accuracy_score, function_call_accuracy, avg_quality_score, avg_rouge, avg_f1, avg_exact_match, avg_accuracy |
| Success/failure | success, error_message, error_count | success, error_message (per test), total_tests, successful_tests, failed_tests, success_rate, error_count |
| Hardware profiling | gpu_type, gpu_offload, vram_mb, temp_celsius_min/max/avg, power_watts_min/max/avg, vram_gb_min/max/avg, gtt_gb_min/max/avg, cpu_percent_min/max/avg, ram_gb_min/max/avg | Same run-level hardware fields are persisted on each capability test row |
| Inference/load params | context_length, temperature, top_k_sampling, top_p_sampling, min_p_sampling, repeat_penalty, max_tokens, n_gpu_layers, n_batch, n_threads, flash_attention, rope_freq_base, rope_freq_scale, use_mmap, use_mlock, kv_cache_quant | Same run-level inference/load fields are persisted on each capability test row |
| Environment/version | lmstudio_version, app_version, nvidia_driver_version, rocm_driver_version, intel_driver_version, os_name, os_version, cpu_model, python_version | Same environment/version fields are persisted on each capability test row |
| Derived/comparison | tokens_per_sec_per_gb, tokens_per_sec_per_billion_params, speed_delta_pct, prev_timestamp | Same derived/comparison fields are persisted on each capability test row |
| Raw text/reference | prompt (full input prompt), raw_output, reference_output | prompt, raw_output, reference_output |
Quick gap summary
- Missing in capability mode: TTFT distribution stats and classic-only aggregate throughput percentiles.
- Missing in classic mode: meaningful per-test quality metrics (ROUGE/F1/Exact/Accuracy) because classic benchmarks do not execute capability test cases.
Variant selection in REST mode
- Capability mode now forwards the exact requested model identifier, including any `@quantization` suffix, to the LM Studio REST API.
- This keeps `load`, `chat`, and `unload` aligned with the selected variant and avoids silently falling back to a server-side default quantization.
3. Web Dashboard
Modern web UI with live streaming and configuration:
./run.py --webapp # Starts on http://localhost:8080
./run.py -w # Short form
Agent Options
./run.py --agent MODEL_PATH [OPTIONS]
OPTIONS:
--capabilities CAPS Comma-separated capabilities
(general_text, reasoning, vision, tooling)
--output-dir DIR Output directory (default: output)
--config FILE YAML configuration file
--formats FORMATS Output formats: json,html (default: json,html)
--max-tests N Max tests per capability
--context-length N Model context length (default: 2048)
--gpu-offload RATIO GPU offload ratio 0.0-1.0 (default: 1.0)
--temperature TEMP Generation temperature (default: 0.1)
-v, --verbose Enable verbose logging
Test Data and Prompts
The following test files are available:
tests/
├── data/
│ ├── text/
│ │ ├── qa_samples.json # QA examples
│ │ ├── reasoning_samples.json # Reasoning examples
│ │ └── tooling_samples.json # Tool-calling examples
│ └── images/
│ └── README.md # Vision datasets
└── prompts/
├── general_text_qa.md
├── general_text_summarization.md
├── reasoning_logical.md
├── reasoning_math.md
├── tooling_function_call.md
├── vision_caption.md
└── vision_vqa.md
Example Executions
# All capabilities (auto-detected)
./run.py --agent "my-model" --output-dir results/
# Only General Text and Reasoning
./run.py --agent "my-model" --capabilities general_text,reasoning
# With custom config
./run.py --agent "my-model" --config config/bench.yaml
# Verbose with all details
./run.py --agent "my-model" --verbose --max-tests 20
# Classic benchmark still available
./run.py --limit 10 --runs 3
Code Structure
cli/
├── main.py # CLI entrypoint for agent
├── __main__.py # Makes cli package executable
├── benchmark.py # Classic benchmark runner
├── metrics.py # Metric implementations
├── reporting.py # JSON & HTML report generation
└── report_template.html.template
config/
└── bench.yaml # Default configuration
agents/
├── benchmark.py # Benchmark executor
├── runner.py # Test orchestration
└── capabilities.py # Capability detection
core/
├── config.py # Configuration loading
├── paths.py # XDG/user path handling
├── client.py # LM Studio REST API client
└── tray.py # Linux tray controller
Documentation
- README-bench.md - Detailed agent documentation
- ARCHITECTURE.md - System architecture
- CONFIGURATION.md - Configuration guide
Logging
Capability benchmark logs use automatic level icons in addition to benchmark-specific emoji markers:
- 🐛 Debug
- ℹ️ Info
- ⚠️ Warning
- ❌ Error
- 🔥 Critical
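A minimal way to reproduce that style in your own scripts is a `logging.Formatter` that prefixes the level icon. This is only a sketch of the idea, not the project's actual logging setup.

```python
import logging

LEVEL_ICONS = {"DEBUG": "🐛", "INFO": "ℹ️", "WARNING": "⚠️", "ERROR": "❌", "CRITICAL": "🔥"}

class IconFormatter(logging.Formatter):
    def format(self, record):
        icon = LEVEL_ICONS.get(record.levelname, "")
        return f"{icon} {super().format(record)}"

handler = logging.StreamHandler()
handler.setFormatter(IconFormatter("%(message)s"))
log = logging.getLogger("bench-demo")
log.addHandler(handler)
log.setLevel(logging.DEBUG)

log.info("model loaded")      # ℹ️ model loaded
log.warning("VRAM is tight")  # ⚠️ VRAM is tight
```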
Capability-Driven Benchmark Agent for LM Studio Bench
This benchmark agent implements capability-driven evaluation for language models and multimodal models. It detects model capabilities, runs targeted tests, computes quality metrics, and generates comprehensive reports.
Features
- Automatic capability detection (general text, reasoning, vision, tooling)
- Per-capability test suites with standardized prompts
- Quality metrics: ROUGE, F1, Exact Match, Accuracy, Function Call Accuracy
- Performance metrics: tokens/sec, latency
- Machine-readable JSON and human-friendly HTML reports
- CLI interface with extensive configuration options
- Docker support for containerized execution
- GitHub Actions integration for CI/CD benchmarking
Quick Start
Local Execution
Run a benchmark on a model:
python -m cli.main "path/to/model" --output-dir output
Run across installed models:
python -m cli.main --all-models --output-dir output
python -m cli.main --random-models 5 --output-dir output
With specific capabilities:
python -m cli.main "model-id" \
--capabilities general_text,reasoning \
--output-dir results
Using Docker
Build the Docker image:
docker build -f scripts/Dockerfile.bench -t lm-bench-agent .
Run benchmark in container:
docker run -v $(pwd)/output:/app/output \
lm-bench-agent "model-path" \
--output-dir /app/output
Capabilities
The agent supports four primary capabilities:
1. General Text
Tests basic language understanding and generation:
- Question answering
- Summarization
- Classification
Metrics: ROUGE-1, ROUGE-L, F1
2. Reasoning
Tests logical and mathematical reasoning:
- Logical reasoning (syllogisms)
- Math problem solving
- Chain-of-thought reasoning
Metrics: Exact Match, F1, Accuracy
3. Vision
Tests multimodal understanding (requires vision models):
- Image captioning
- Visual Question Answering (VQA)
- OCR and visual reasoning
Metrics: Accuracy, ROUGE-L
4. Tooling
Tests function calling and tool use:
- Function selection
- Parameter extraction
- API interaction patterns
Metrics: Function Call Accuracy, Parameter Accuracy
CLI Reference
Basic Usage
python -m cli.main MODEL_PATH [OPTIONS]
Arguments
MODEL_PATH: Path to model or model identifier (required)
Options
Model Configuration
- `--model-name NAME`: Override model name (default: derived from path)
- `--all-models`: Run the capability benchmark for all installed models
- `--random-models N`: Run the capability benchmark for `N` random installed models
- `--capabilities CAPS`: Comma-separated capabilities to test
  - Options: `general_text`, `reasoning`, `vision`, `tooling`
  - Default: Auto-detect from model metadata
Output Configuration
- `--output-dir DIR`: Output directory (default: `output`)
- `--formats FMTS`: Output formats: `json`, `html` (default: both)
Test Configuration
- `--max-tests N`: Maximum tests per capability (default: 10)
- `--config FILE`: Path to YAML configuration file
Model Parameters
- `--context-length N`: Model context length (default: 2048)
- `--gpu-offload RATIO`: GPU offload ratio 0.0-1.0 (default: 1.0)
- `--temperature T`: Generation temperature (default: 0.1)
Other
- `--verbose`, `-v`: Enable verbose logging
Examples
Benchmark with custom configuration:
python -m cli.main "mymodel" \
--config custom_config.yaml \
--max-tests 20 \
--verbose
Test only reasoning capability:
python -m cli.main "reasoning-model" \
--capabilities reasoning \
--temperature 0.0 \
--max-tests 50
Generate only JSON output:
python -m cli.main "model" \
--formats json \
--output-dir json_results
Run against random installed models:
python -m cli.main --random-models 3 --capabilities general_text,reasoning
Runtime Behavior
- When running across multiple installed models, a single model failure is logged and skipped so the benchmark can continue.
- For embedding models loaded through the LM Studio REST API, the loader automatically retries without `offload_kv_cache_to_gpu` if LM Studio rejects that option (see the sketch below).
- Log output includes automatic level icons such as ℹ️, ⚠️, and ❌ in addition to benchmark-specific emoji markers.
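The retry-without-`offload_kv_cache_to_gpu` behavior boils down to a simple fallback pattern. The sketch below illustrates it with a hypothetical `load_model(model_id, config)` callable and a generic exception check; the real loader's signature, option handling, and error inspection may differ.

```python
def load_with_fallback(load_model, model_id, load_config):
    # `load_model` is a hypothetical callable wrapping the REST load request;
    # the real loader inspects LM Studio's rejection message before retrying.
    try:
        return load_model(model_id, load_config)
    except RuntimeError:
        if "offload_kv_cache_to_gpu" not in load_config:
            raise
        # Drop the rejected option and retry the load exactly once.
        retry_config = dict(load_config)
        retry_config.pop("offload_kv_cache_to_gpu")
        return load_model(model_id, retry_config)
```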
Configuration File
The agent reads configuration from config/bench.yaml by default. Override with --config flag.
Configuration Schema
context_length: 2048
gpu_offload: 1.0
temperature: 0.1
max_tokens: 256
max_tests_per_capability: 10
use_rest_api: true
data_dir: tests/data
prompts_dir: tests/prompts
timeout_seconds: 300
metric_weights:
  general_text:
    rouge-1: 0.3
    rouge-l: 0.4
    f1: 0.3
  reasoning:
    exact_match: 0.5
    f1: 0.3
    accuracy: 0.2
  vision:
    accuracy: 0.6
    rouge-l: 0.4
  tooling:
    function_call_accuracy: 0.7
    accuracy: 0.3
composite_score_weights:
  quality: 0.6
  performance: 0.2
  efficiency: 0.2
lmstudio:
  host: localhost
  ports:
    - 1234
    - 1235
  api_token: null
Key Configuration Options
- `context_length`: Maximum context length for the model
- `gpu_offload`: GPU memory allocation (0.0 = CPU only, 1.0 = full GPU)
- `max_tests_per_capability`: Limit tests to prevent long runs
- `metric_weights`: Per-capability metric importance
- `composite_score_weights`: Overall score composition
Output Format
JSON Report
The JSON report follows this schema:
{
  "schema_version": "1.0",
  "generated_at": "2025-01-15T10:30:00",
  "report": {
    "model_name": "model-name",
    "model_path": "path/to/model",
    "capabilities": ["general_text", "reasoning"],
    "timestamp": "2025-01-15T10:30:00",
    "summary": {
      "total_tests": 20,
      "successful_tests": 19,
      "success_rate": 0.95,
      "avg_latency_ms": 245.6,
      "avg_quality_score": 0.823,
      "avg_throughput_tokens_per_sec": 42.3,
      "by_capability": {
        "general_text": {
          "test_count": 10,
          "avg_quality_score": 0.856,
          "success_rate": 1.0
        }
      }
    },
    "results": [
      {
        "test_id": "qa_001",
        "capability": "general_text",
        "latency_ms": 230.5,
        "tokens_generated": 12,
        "throughput": 52.1,
        "quality_score": 0.89,
        "metrics": [
          {
            "name": "rouge-1",
            "value": 0.85,
            "normalized": 0.85
          }
        ],
        "error": null
      }
    ],
    "config": {},
    "raw_outputs_dir": "output/raw"
  }
}
HTML Report
The HTML report provides:
- Summary statistics with visual indicators
- Per-test results table with status, latency, and quality scores
- Capability breakdown with aggregated metrics
- Color-coded quality scores (green/yellow/red)
Raw Outputs
Individual test outputs are saved in output/raw/:
{
"test_id": "qa_001",
"capability": "general_text",
"prompt": "What is the capital of France?",
"response": "Paris",
"latency_ms": 230.5,
"tokens_generated": 12,
"throughput": 52.1,
"timestamp": 1642244400.123,
"error": null
}
GitHub Actions Integration
The workflow .github/workflows/bench.yml enables CI benchmarking.
Triggering the Workflow
Manual Trigger
- Go to Actions tab in GitHub
- Select "Capability-Driven Benchmark"
- Click "Run workflow"
- Enter model path and capabilities
- Click "Run workflow"
Scheduled Trigger
Runs automatically every Sunday at midnight (UTC).
Push Trigger
Runs on push to main or dev branches.
Note: the benchmark step currently reads the model path only from
manual workflow_dispatch inputs. Push- and schedule-triggered
runs therefore skip the actual benchmark unless you adapt the
workflow to read the model path from another configuration source
(for example, a repository variable or secret).
Workflow Outputs
The workflow uploads three artifacts:
- benchmark-results-json: JSON reports (30-day retention)
- benchmark-results-html: HTML reports (30-day retention)
- benchmark-raw-outputs: Raw test outputs (7-day retention)
For pull requests, a summary comment is posted with key metrics.
Adding Test Data
General Text Tests
Add test cases to tests/data/text/qa_samples.json:
{
"id": "qa_004",
"prompt": "Your question here",
"reference": "Expected answer",
"category": "domain"
}
Reasoning Tests
Add to tests/data/text/reasoning_samples.json:
{
"id": "reasoning_004",
"prompt": "Problem statement",
"reference": "Answer",
"reasoning": "Explanation of solution",
"category": "math"
}
Vision Tests
Place images in tests/data/images/ and reference them in test cases.
Tooling Tests
Add to tests/data/text/tooling_samples.json:
{
"id": "tool_004",
"task": "Task description",
"expected_function": "function_name",
"expected_parameters": {"param": "value"},
"category": "function_calling"
}
Customizing Prompts
Prompt templates are in tests/prompts/:
- `general_text_qa.md`: Question answering
- `general_text_summarization.md`: Summarization
- `reasoning_logical.md`: Logical reasoning
- `reasoning_math.md`: Math problems
- `vision_caption.md`: Image captioning
- `vision_vqa.md`: Visual QA
- `tooling_function_call.md`: Function calling
Edit templates to adjust instruction format or add few-shot examples.
Troubleshooting
Model Loading Fails
Ensure LM Studio is running and the model is available:
lms status
lms models list
No Tests Execute
Check that test data files exist:
ls tests/data/text/
Verify capabilities are correctly specified:
python -m cli.main "model" --capabilities general_text --verbose
Metrics Are Zero
This usually means:
- Model output format doesn't match expected format
- Reference answers need normalization
- Wrong capability assigned to test
Check raw outputs in output/raw/ to inspect actual responses.
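When the cause is formatting rather than model quality, normalizing both prediction and reference before scoring usually fixes it. The helper below is a sketch of the kind of normalization that helps, not the agent's exact implementation:

```python
import re
import string

def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

print(normalize_answer("The capital is Paris."))                 # -> "capital is paris"
print(normalize_answer("paris") == normalize_answer("Paris!"))   # True
```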
Timeout Errors
Increase timeout in config:
timeout_seconds: 600
Or reduce test count:
python -m cli.main "model" --max-tests 5
API Integration
Using as a Library
from pathlib import Path
from agents.runner import BenchmarkRunner
from cli.reporting import generate_reports
config = {
"context_length": 2048,
"max_tests_per_capability": 5,
"use_rest_api": True
}
runner = BenchmarkRunner(
config=config,
output_dir=Path("output")
)
report = runner.run(
model_path="mymodel",
model_name="MyModel",
capabilities=["general_text"]
)
outputs = generate_reports(
report_data=report,
output_dir=Path("output"),
formats=["json", "html"]
)
print(f"JSON: {outputs['json']}")
print(f"HTML: {outputs['html']}")
Custom Model Adapter
Implement ModelAdapter interface:
from agents.benchmark import ModelAdapter, InferenceResult
class CustomAdapter(ModelAdapter):
    def load(self, model_path, **kwargs):
        # Load or connect to your model backend here
        pass

    def unload(self):
        # Release the model and free any resources
        pass

    def infer(self, prompt, image_path=None, **kwargs):
        # Run inference and wrap the outcome in an InferenceResult
        return InferenceResult(...)

    def is_loaded(self):
        return True
Use with runner:
adapter = CustomAdapter()
report = runner.run(
model_path="model",
adapter=adapter
)
Architecture
Components
- `agents/capabilities.py`: Capability detection logic
- `agents/benchmark.py`: Core benchmark agent and model adapters
- `agents/runner.py`: Test orchestration and loading
- `cli/metrics.py`: Metric implementations
- `cli/reporting.py`: Report generation (JSON, HTML)
- `cli/main.py`: Command-line interface
- `config/bench.yaml`: Default configuration
- `tests/data/`: Test datasets
- `tests/prompts/`: Prompt templates
Data Flow
- CLI parses arguments and loads configuration
- Runner detects capabilities from model metadata or flags
- Test loader creates test cases for detected capabilities
- Model adapter loads the model
- Agent runs each test case:
- Executes inference
- Saves raw output
- Computes metrics
- Reporter generates JSON and HTML from results
- Outputs are saved to disk
License
This benchmark agent is part of LM-Studio-Bench and follows the same license.
Contributing
Contributions are welcome:
- Add new capabilities
- Implement new metrics
- Expand test datasets
- Improve prompt templates
- Enhance reporting formats
Follow the coding standards in .github/instructions/code-standards.instructions.md.
SQLite Metric Parity Map
This table is intentionally compact: one metric per row.
Legend:
- `[x]` = metric is stored in both test modes
- `[ ]` = metric is missing in at least one mode
Notes:
- Capability rows normalize quantization to an uppercase label such as `Q4_K_M`; classic rows keep the classic benchmark format such as `q4_k_m`.
- Capability `lmstudio_version` stores a parsed version or `pkg_version (commit:<sha>)`, not the raw `lms version` banner output.
- Capability REST runs forward the exact model variant key, including the `@quantization` suffix, to LM Studio load/chat/unload requests.
- Classic rows intentionally leave capability-only fields such as `quality_score`, `raw_output`, `reference_output`, `capability`, and `test_id` empty.
- Historical rows created before recent schema/runtime fixes may still contain `NULL` values in parity columns. New rows should populate them.
| Metric | benchmark_results (classic) | benchmark_results (compatibility) | Stored in both tests |
|---|---|---|---|
| Row id | id | id | [x] |
| Model name | model_name | model_name | [x] |
| Timestamp | timestamp | timestamp | [x] |
| Model path/source | model_key | model_key | [x] |
| Capability label | capability | capability | [x] |
| Test case id | test_id | test_id | [x] |
| Test case name | test_name | test_name | [x] |
| Quantization | quantization | quantization | [x] |
| Inference params hash | inference_params_hash | inference_params_hash | [x] |
| Tokens per second | avg_tokens_per_sec | avg_tokens_per_sec | [x] |
| Latency | avg_gen_time | avg_gen_time | [x] |
| TTFT | avg_ttft | avg_ttft | [x] |
| Prompt token count | prompt_tokens | prompt_tokens | [x] |
| Completion/generated tokens | completion_tokens | tokens_generated | [x] |
| Primary quality score | quality_score | quality_score | [x] |
| ROUGE | rouge_score | rouge_score | [x] |
| F1 | f1_score | f1_score | [x] |
| Exact match | exact_match_score | exact_match_score | [x] |
| Accuracy | accuracy_score | accuracy_score | [x] |
| Function-call accuracy | function_call_accuracy | function_call_accuracy | [x] |
| Success flag | success | success | [x] |
| Error message | error_message | error_message | [x] |
| Error counter | error_count | error_count | [x] |
| Total tests per capability | - | aggregate COUNT(*) by capability | [ ] |
| Successful tests per capability | - | aggregate SUM(success = 1) | [ ] |
| Failed tests per capability | - | aggregate SUM(success != 1) | [ ] |
| Success rate per capability | - | derived aggregate (successful / total) | [ ] |
| GPU type | gpu_type | gpu_type | [x] |
| GPU offload ratio | gpu_offload | gpu_offload | [x] |
| VRAM (MB) | vram_mb | vram_mb | [x] |
| Temperature stats | temp_celsius_min/max/avg | temp_celsius_min/max/avg | [x] |
| Power stats | power_watts_min/max/avg | power_watts_min/max/avg | [x] |
| VRAM GB stats | vram_gb_min/max/avg | vram_gb_min/max/avg | [x] |
| GTT GB stats | gtt_gb_min/max/avg | gtt_gb_min/max/avg | [x] |
| CPU usage stats | cpu_percent_min/max/avg | cpu_percent_min/max/avg | [x] |
| RAM GB stats | ram_gb_min/max/avg | ram_gb_min/max/avg | [x] |
| Context length | context_length | context_length | [x] |
| Temperature sampling param | temperature | temperature | [x] |
| Top-K sampling param | top_k_sampling | top_k_sampling | [x] |
| Top-P sampling param | top_p_sampling | top_p_sampling | [x] |
| Min-P sampling param | min_p_sampling | min_p_sampling | [x] |
| Repeat penalty | repeat_penalty | repeat_penalty | [x] |
| Max tokens param | max_tokens | max_tokens | [x] |
| GPU layer setting | n_gpu_layers | n_gpu_layers | [x] |
| Batch setting | n_batch | n_batch | [x] |
| Thread setting | n_threads | n_threads | [x] |
| Flash attention setting | flash_attention | flash_attention | [x] |
| RoPE base setting | rope_freq_base | rope_freq_base | [x] |
| RoPE scale setting | rope_freq_scale | rope_freq_scale | [x] |
| mmap setting | use_mmap | use_mmap | [x] |
| mlock setting | use_mlock | use_mlock | [x] |
| KV cache quant setting | kv_cache_quant | kv_cache_quant | [x] |
| LM Studio version | lmstudio_version | lmstudio_version | [x] |
| App version | app_version | app_version | [x] |
| Driver versions | nvidia/rocm/intel_driver_version | nvidia/rocm/intel_driver_version | [x] |
| OS info | os_name, os_version | os_name, os_version | [x] |
| CPU model | cpu_model | cpu_model | [x] |
| Python version | python_version | python_version | [x] |
| Benchmark duration | benchmark_duration_seconds | benchmark_duration_seconds | [x] |
| Raw model output | raw_output | raw_output | [x] |
| Reference output | reference_output | reference_output | [x] |
| Efficiency per GB | tokens_per_sec_per_gb | tokens_per_sec_per_gb | [x] |
| Efficiency per B params | tokens_per_sec_per_billion_params | tokens_per_sec_per_billion_params | [x] |
| Speed delta vs previous | speed_delta_pct | speed_delta_pct | [x] |
| Previous timestamp link | prev_timestamp | prev_timestamp | [x] |
| Prompt hash | prompt_hash | prompt_hash | [x] |
| Full params hash | params_hash | params_hash | [x] |
| Prompt text | prompt | prompt | [x] |
Historical Validation Queries
Use these queries to find older rows that predate parity fixes.
-- Classic rows that still miss parity fields introduced later.
SELECT id, model_name, timestamp,
quantization, lmstudio_version, app_version, success
FROM benchmark_results
WHERE quantization IS NULL
OR lmstudio_version IS NULL
OR app_version IS NULL
OR success IS NULL
ORDER BY id DESC;
-- Compatibility rows that still miss core parity fields.
SELECT id, model_name, capability, test_id,
quantization, lmstudio_version, app_version,
prompt_hash, params_hash
FROM benchmark_results
WHERE source = 'compatibility'
AND (
quantization IS NULL
OR lmstudio_version IS NULL
OR app_version IS NULL
OR prompt_hash IS NULL
OR params_hash IS NULL
)
ORDER BY id DESC;
-- Compatibility summary directly from benchmark_results.
SELECT model_name,
capability,
COUNT(*) AS total_tests,
SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) AS successful_tests,
SUM(CASE WHEN success = 1 THEN 0 ELSE 1 END) AS failed_tests,
AVG(avg_gen_time) AS avg_latency_ms,
AVG(throughput_tokens_per_sec) AS avg_throughput,
AVG(quality_score) AS avg_quality_score,
AVG(rouge_score) AS avg_rouge,
AVG(f1_score) AS avg_f1,
AVG(exact_match_score) AS avg_exact_match,
AVG(accuracy_score) AS avg_accuracy
FROM benchmark_results
WHERE source = 'compatibility'
GROUP BY model_name, capability
ORDER BY MAX(id) DESC;
Architecture Documentation
Comprehensive architecture documentation with Mermaid diagrams showing how the Python modules interact and how CLI arguments and configuration files are processed.
Table of Contents
- Architecture Documentation
- Table of Contents
- System Architecture Overview
- Startup Flow
- Setup Flow (Installation & Configuration)
- Tray Control Flow (Linux)
- Tray Quit Sequence (Linux)
- Configuration Loading
- Configuration Priority
- Benchmark Execution Flow
- REST API vs SDK Mode
- Component Details
- Data Flow Summary
- Testing Architecture
- See Also
System Architecture Overview
graph TB
User([User]) --> RunPy[run.py<br/>Entry Point]
RunPy -->|--webapp/-w flag| WebApp[web/app.py<br/>FastAPI Server]
RunPy -->|benchmark mode| Benchmark[cli/benchmark.py<br/>Benchmark Engine]
Benchmark --> ConfigLoader[core/config.py<br/>Configuration Manager]
Benchmark --> PresetManager[core/presets.py<br/>Preset Manager]
Benchmark --> RestClient[core/client.py<br/>REST API Client]
ConfigLoader -->|reads| ProjectConfig[config/defaults.json<br/>Project Defaults]
ConfigLoader -->|reads| UserConfig[~/.config/lm-studio-bench/defaults.json<br/>User Overrides]
ConfigLoader -->|provides| DefaultConfig[(DEFAULT_CONFIG<br/>Merged)]
Benchmark -->|uses| LMStudio[LM Studio Server<br/>localhost:1234/1235]
RestClient -->|HTTP API v1| LMStudio
Benchmark -->|writes| ResultsDB[(~/.local/share/lm-studio-bench/results/<br/>benchmark_cache.db)]
Benchmark -->|exports| Reports[JSON/CSV/PDF/HTML<br/>Reports]
WebApp -->|launches| Benchmark
WebApp -->|reads| ResultsDB
WebApp -->|serves| Dashboard[Web Dashboard<br/>http://localhost:PORT]
RunPy -->|starts background process| Tray[core/tray.py<br/>Linux Tray Controller]
Tray -->|polls /api/status| WebApp
Tray -->|calls /api/benchmark/*| WebApp
Tray -->|Quit calls /api/system/shutdown| WebApp
style RunPy fill:#e1f5ff
style Benchmark fill:#ffe1e1
style ConfigLoader fill:#e1ffe1
style RestClient fill:#fff4e1
style ProjectConfig fill:#f0f0f0
style LMStudio fill:#e8deff
Key Components:
- run.py: Wrapper script that decides between web dashboard and CLI benchmark mode
- benchmark.py: Main benchmark engine with argparse, model discovery, and execution
- config_loader.py: Loads and merges configuration from JSON file with built-in defaults
- core/presets.py: Manages readonly/user presets and maps presets to CLI args
- tools/hardware_monitor.py: Shared `GPUMonitor` and `HardwareMonitor` implementation for classic and capability flows
- rest_client.py: REST API client for LM Studio v1 endpoints (optional mode)
- web/app.py: FastAPI web dashboard with live streaming and results browser
- tray.py: Linux AppIndicator tray controller for benchmark controls
Startup Flow
AppImage Entry Point
When the AppImage is executed, the bundled lmstudio-bench shell script runs
before run.py and splits on whether real arguments are present:
flowchart TD
AppImg([LM-Studio-Bench.AppImage args]) --> CheckArgs{Real args<br/>besides --debug/-d?}
CheckArgs -->|No args| TrayOnly[exec tray.py --url http://localhost:1234<br/>stays in system tray]
CheckArgs -->|Any other arg| RunPy[delegate to run.py + args]
style AppImg fill:#d0e8ff
style TrayOnly fill:#e1ffe1
style RunPy fill:#ffe1ff
`--debug`/`-d` is exempt: `./AppImage --debug` still enters tray-only mode with verbose logging.
run.py Flow
flowchart TD
Start([./run.py args]) --> CheckHelp{--help or -h?}
CheckHelp -->|Yes| ShowHelp[Show Extended Help<br/>+ benchmark.py --help]
CheckHelp -->|No| CheckWebFlag{--webapp or -w<br/>in args?}
CheckWebFlag -->|Yes| RemoveFlag[Remove --webapp/-w<br/>from args]
RemoveFlag --> ResolvePort[Extract or assign<br/>web port]
ResolvePort --> StartTrayWeb[start tray.py<br/>with --url dashboard]
StartTrayWeb --> FindWebApp{web/app.py<br/>exists?}
FindWebApp -->|Yes| StartWeb[subprocess.call<br/>python web/app.py + args]
FindWebApp -->|No| ErrorWeb[Error: app.py not found]
CheckWebFlag -->|No| StartTrayCLI[start tray.py<br/>with localhost:1234]
StartTrayCLI --> FindBenchmark{cli/benchmark.py<br/>exists?}
FindBenchmark -->|Yes| StartBenchmark[subprocess.call<br/>python cli/benchmark.py + args]
FindBenchmark -->|No| ErrorBench[Error: benchmark.py not found]
ShowHelp --> Exit1([exit 0])
StartWeb --> Exit2([exit with app.py status])
StartBenchmark --> Exit3([exit with benchmark.py status])
ErrorWeb --> Exit4([exit 1])
ErrorBench --> Exit5([exit 1])
style Start fill:#e1f5ff
style StartWeb fill:#ffe1ff
style StartBenchmark fill:#ffe1e1
Decision Logic (run.py):
- Help Mode (`--help`/`-h`): Displays extended help combining run.py explanation + benchmark.py CLI options
- Web Mode (`--webapp`/`-w`): Launches tray + FastAPI dashboard on a free local port
- Benchmark Mode (default): Launches tray + benchmark.py with all CLI arguments
AppImage vs. run.py — default behaviour difference:
| Invocation | No-argument default |
|---|---|
| `./LM-Studio-Bench.AppImage` | Tray-only (stays in panel, no benchmark) |
| `./run.py` | Tray + benchmark.py (runs full benchmark) |
Setup Flow (Installation & Configuration)
flowchart TD
Start([./setup.sh args]) --> ParseArgs{Parse Arguments}
ParseArgs -->|--help| ShowHelp["Show Usage Info<br/>+ Exit 0"]
ParseArgs -->|--dry-run| DryMode["Set DRY_RUN=1<br/>Set INTERACTIVE=0"]
ParseArgs -->|--yes| AutoMode["Set INTERACTIVE=0<br/>Auto-answer 'no'"]
ParseArgs -->|--interactive| InterMode["Set INTERACTIVE=1<br/>Force Interactive"]
DryMode --> LogSetup["Setup Logging<br/>logs/setup_YYYYMMDD_HHMMSS.log"]
AutoMode --> LogSetup
InterMode --> LogSetup
LogSetup --> CheckLinux{OS = Linux?}
CheckLinux -->|No| ErrorOS["❌ Error:<br/>Not Linux"]
CheckLinux -->|Yes| DetectPKG["✅ Detect Package Manager<br/>apt/dnf/pacman/zypper/apk"]
ErrorOS --> Exit1([Exit 1])
DetectPKG --> CoreDeps["🔧 Check Core Dependencies<br/>Python3, Git, curl, pkg-config"]
CoreDeps --> SysLibs["📦 Check System Libraries<br/>gobject-introspection, cairo, PyGObject"]
SysLibs --> CheckLMS["🔍 Check LM Studio Stack<br/>lms CLI / llmster-headless"]
CheckLMS -->|Found| LMSFound["✅ LM Studio/llmster<br/>detected"]
CheckLMS -->|Not Found| LMSMissing["⚠️ LM Studio missing<br/>Offer download link"]
LMSFound --> GPUDetect["🎮 Detect GPU<br/>lspci → NVIDIA/AMD/Intel"]
LMSMissing --> GPUDetect
GPUDetect --> GPUTools{GPU Found?}
GPUTools -->|NVIDIA| NVIDIACheck["Check nvidia-smi<br/>+ Install if needed"]
GPUTools -->|AMD| AMDCheck["Check rocm-smi<br/>+ AMD Driver Check"]
GPUTools -->|Intel| IntelCheck["Check intel_gpu_top<br/>+ Install if needed"]
GPUTools -->|None| NoGPU["⚠️ No GPU detected"]
NVIDIACheck --> CreateVenv["🐍 Create Python venv<br/>python3 -m venv .venv"]
AMDCheck --> AMDDriver["🔍 Check AMD Drivers<br/>amdgpu, libdrm, ROCm"]
IntelCheck --> CreateVenv
NoGPU --> CreateVenv
AMDDriver --> CreateVenv
CreateVenv -->|venv already exists| RecreatChoice{"Recreate .venv?"}
CreateVenv -->|New venv| VenvOK["✅ venv created<br/>.venv/"]
RecreatChoice -->|Yes| VenvOK
RecreatChoice -->|No| UseExisting["Use existing .venv"]
VenvOK --> InstallReqs["📥 Install Requirements<br/>pip install -r requirements.txt"]
UseExisting --> InstallReqs
InstallReqs --> CheckConflict["Check pip conflicts<br/>pip check"]
CheckConflict --> Summary["📋 Print Summary<br/>Next steps (activation, run, etc)"]
Summary --> LogExit["📄 Save log file<br/>logs/setup_latest.log → symlink"]
LogExit --> Exit0([Exit 0])
ShowHelp --> Exit0
style Start fill:#e1f5ff
style LogSetup fill:#fff4e1
style DetectPKG fill:#e1ffe1
style CoreDeps fill:#e1ffe1
style CreateVenv fill:#ffe1e1
style InstallReqs fill:#ffe1e1
style Summary fill:#f0e1ff
style ErrorOS fill:#ffcccc
style LMSMissing fill:#fff9e1
Setup Flow Summary:
- Parse Arguments: Handle `--help`, `--dry-run`, `--yes`, `--interactive` flags
- Logging Setup: Create timestamped log file in `logs/setup_YYYYMMDD_HHMMSS.log`
- Environment Checks:
  - Verify Linux OS
  - Detect package manager (apt/dnf/pacman/zypper/apk)
  - Check core dependencies (Python 3, Git, curl, pkg-config)
  - Verify system libraries (gobject-introspection, cairo, PyGObject for tray support)
- LM Studio Stack:
  - Check for `lms` CLI or `llmster` headless binary
  - Offer download link if missing
- GPU & Monitoring Tools:
  - Detect GPU type via `lspci` (NVIDIA, AMD, Intel)
  - Install/check GPU-specific tools (`nvidia-smi`, `rocm-smi`, `intel_gpu_top`)
  - For AMD: Check drivers, ROCm, libdrm, X.Org AMDGPU driver
- Python Environment:
  - Create virtual environment (`.venv/`)
  - Install Python dependencies from `requirements.txt`
  - Check for pip conflicts
- Summary:
  - Print next steps for the user:
    - Activate venv: `source .venv/bin/activate`
    - Run webapp: `python run.py --webapp`
    - Run CLI: `python run.py`
  - Log file symlink: `logs/setup_latest.log`
Modes:
| Mode | Behavior |
|---|---|
| `--help` | Show usage and exit |
| `--dry-run` | Preview all actions (no changes) |
| `--yes` | Non-interactive (auto-answer 'no' to optional prompts) |
| `--interactive` | Force interactive mode (default if TTY detected) |
Tray Control Flow (Linux)
flowchart TD
TrayStart([tray.py start]) --> Poll[Poll /api/status<br/>every 3 seconds]
Poll --> Reachable{API reachable?}
Reachable -->|No| IconRed[Set icon: red<br/>error/unreachable]
Reachable -->|Yes| ReadStatus[Read status field]
ReadStatus -->|idle| IconGray[Set icon: gray]
ReadStatus -->|running| IconGreen[Set icon: green]
ReadStatus -->|paused| IconYellow[Set icon: yellow]
ReadStatus --> BtnLogic[Update Start/Pause/Stop states]
BtnLogic --> UserAction{User action}
UserAction -->|Start| StartCall[POST /api/benchmark/start]
UserAction -->|Pause/Resume| PauseCall[POST /api/benchmark/pause or resume]
UserAction -->|Stop| StopCall[POST /api/benchmark/stop]
UserAction -->|Quit| QuitCall[POST /api/system/shutdown]
QuitCall --> ExitTray[GTK main loop exit]
Tray behavior summary:
- Dynamic status icons: gray (idle), green (running), yellow (paused), red (API error/unreachable)
- Smart controls: Start enabled in idle/error, Pause and Stop enabled only in running or paused state
- Quit path: Tray triggers graceful shutdown endpoint, then exits
Tray Quit Sequence (Linux)
sequenceDiagram
participant U as User
participant T as Tray (GTK/AppIndicator)
participant A as web/app.py (FastAPI)
participant B as Benchmark Manager
participant P as Process Signal Handler
U->>T: Click Quit
T->>A: POST /api/system/shutdown
A->>B: stop_benchmark()
B-->>A: benchmark stopped or no-op
A-->>T: 200 OK (shutdown accepted)
A->>P: Start delayed SIGTERM thread
T->>T: Stop polling + GTK main_quit()
P->>A: Send SIGTERM to process
A-->>A: Uvicorn graceful shutdown
Configuration Loading
flowchart TD
Start([config_loader.py<br/>import]) --> BaseConfig[BASE_DEFAULT_CONFIG<br/>Hard-coded Defaults]
BaseConfig --> LoadFunc[load_default_config]
LoadFunc --> ReadProject[Read config/defaults.json<br/>Project Defaults]
ReadProject --> CheckUser{~/.config/lm-studio-bench/<br/>defaults.json exists?}
CheckUser -->|Yes| ReadUser[Read User Config]
CheckUser -->|No| UseProject[Use Project Defaults Only]
ReadUser --> DeepMerge[_deep_merge<br/>Base + Project + User Config]
UseProject --> DeepMerge
DeepMerge --> NormalizePorts[_normalize_ports<br/>Ensure valid LM Studio ports]
NormalizePorts --> FinalConfig[(DEFAULT_CONFIG<br/>Global Singleton)]
FinalConfig --> BenchmarkImport[benchmark.py imports<br/>DEFAULT_CONFIG]
FinalConfig --> WebAppImport[web/app.py imports<br/>DEFAULT_CONFIG]
style BaseConfig fill:#f0f0f0
style FinalConfig fill:#e1ffe1
style DeepMerge fill:#fff4e1
Configuration Layers:
| Layer | Source | Priority |
|---|---|---|
| 1. Hard-coded | BASE_DEFAULT_CONFIG in config_loader.py | Lowest |
| 2. Project Config | config/defaults.json | Low |
| 3. User Config | ~/.config/lm-studio-bench/defaults.json | Medium |
| 4. CLI Arguments | argparse in benchmark.py | Highest |
Merge Strategy:
- `_deep_merge()` recursively merges nested dictionaries
- User config values override base config
- `None` values in user config are skipped (base value retained)
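The behavior described above can be pictured with a small recursive merge. This is a sketch of the idea, not a copy of `_deep_merge()` from `config_loader.py`.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`, skipping None overrides."""
    merged = dict(base)
    for key, value in override.items():
        if value is None:
            continue  # keep the base value, mirroring the rule above
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"num_runs": 3, "lmstudio": {"host": "localhost", "api_token": None}}
user = {"num_runs": None, "lmstudio": {"api_token": "lms_abc"}}
print(deep_merge(base, user))
# {'num_runs': 3, 'lmstudio': {'host': 'localhost', 'api_token': 'lms_abc'}}
```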
Configuration Priority
flowchart LR
CLI[CLI Arguments<br/>--runs 5<br/>--context 4096] -->|Highest Priority| Merge[Configuration<br/>Merge]
UserCfg[~/.config/.../defaults.json<br/>context_length: 4096] -->|High Priority| Merge
ProjCfg[config/defaults.json<br/>num_runs: 3<br/>context_length: 2048] -->|Medium Priority| Merge
Base[BASE_DEFAULT_CONFIG<br/>prompt: default<br/>temperature: 0.1] -->|Lowest Priority| Merge
Merge --> Final[Final Configuration<br/>runs=5<br/>context=4096<br/>temperature=0.1]
style CLI fill:#ffe1e1
style UserCfg fill:#fff4e1
style ProjCfg fill:#fff4e1
style Base fill:#f0f0f0
style Final fill:#e1ffe1
Example Priority Resolution:
# BASE_DEFAULT_CONFIG
{
"num_runs": 3,
"context_length": 2048,
"prompt": "Is the sky blue?"
}
# config/defaults.json
{
"num_runs": 5,
"prompt": "Explain machine learning"
}
# CLI: ./run.py --runs 1 --context 4096
# FINAL RESULT:
{
"num_runs": 1, # ← CLI override
"context_length": 4096, # ← CLI override
"prompt": "Explain..." # ← JSON override (no CLI arg)
}
Benchmark Execution Flow
flowchart TD
Start([benchmark.py main]) --> ParseArgs[Parse CLI Arguments<br/>argparse.ArgumentParser]
ParseArgs --> LoadConfig[Load DEFAULT_CONFIG<br/>from config_loader]
LoadConfig --> CheckFlags{Special Flags?}
CheckFlags -->|--list-cache| ListCache[Display Cache Entries<br/>exit]
CheckFlags -->|--export-cache| ExportCache[Export Cache to JSON<br/>exit]
CheckFlags -->|--export-only| ExportOnly[Generate Reports Only<br/>skip benchmark]
CheckFlags -->|Normal Mode| CreateBenchmark[Create LMStudioBenchmark<br/>instance]
CreateBenchmark --> MergeConfig[Merge Config Layers:<br/>CLI > JSON > Base]
MergeConfig --> InitComponents[Initialize Components:<br/>• GPUMonitor<br/>• BenchmarkCache<br/>• HardwareMonitor<br/>• REST Client optional]
InitComponents --> CheckServer{LM Studio<br/>Server Running?}
CheckServer -->|No| StartServer[Auto-start Server<br/>lms server start]
CheckServer -->|Yes| DiscoverModels[Discover Models<br/>lms ls --json]
StartServer --> DiscoverModels
DiscoverModels --> FilterModels[Apply Filters:<br/>--quants, --arch<br/>--only-vision, etc.]
FilterModels --> CheckCache{use_cache<br/>enabled?}
CheckCache -->|Yes| LoadCache[Load Cached Results<br/>SQLite lookup]
CheckCache -->|No| SkipCache[Skip Cache]
LoadCache --> RunBenchmarks[Run Benchmarks<br/>for Each Model]
SkipCache --> RunBenchmarks
RunBenchmarks --> TestModel[Test Model:<br/>1. Load Model<br/>2. Warmup Run<br/>3. N Measurement Runs<br/>4. Collect Stats]
TestModel --> Profiling{Profiling<br/>enabled?}
Profiling -->|Yes| MonitorHW[Monitor GPU/CPU/RAM<br/>Background Thread]
Profiling -->|No| SkipMonitor[Skip Monitoring]
MonitorHW --> SaveCache[Save Results to Cache<br/>SQLite INSERT]
SkipMonitor --> SaveCache
SaveCache --> NextModel{More Models?}
NextModel -->|Yes| RunBenchmarks
NextModel -->|No| Export[Export Reports:<br/>JSON, CSV, PDF, HTML]
Export --> End([Done])
ListCache --> End
ExportCache --> End
ExportOnly --> Export
style Start fill:#e1f5ff
style CreateBenchmark fill:#ffe1e1
style RunBenchmarks fill:#ffe1ff
style Export fill:#e1ffe1
Key Execution Steps:
- Argument Parsing: 49 CLI arguments processed by argparse
- Configuration Merge: CLI args override JSON file, JSON overrides base
- Component Initialization: GPU monitor, cache, profiler, REST client
- Model Discovery:
lms ls --jsonfetches all installed models - Filtering: Regex, quantization, architecture, capabilities filters
- Cache Lookup: Skip already-tested models (unless
--retest) - Benchmark Loop: For each model: load → warmup → measure (N runs) → unload
- Hardware Monitoring: Optional background thread for GPU/CPU/RAM stats
- Cache Storage: Save results to SQLite for future runs
- Report Generation: Export to JSON/CSV/PDF/HTML
REST API vs SDK Mode
flowchart TD
Start([Benchmark Init]) --> CheckMode{use_rest_api?<br/>CLI or config}
CheckMode -->|True| InitREST[Initialize REST Client<br/>LMStudioRESTClient]
CheckMode -->|False| InitSDK[Use Python SDK<br/>lmstudio package]
InitREST --> RESTURL[base_url from config:<br/>http://localhost:1234]
RESTURL --> RESTToken{api_token<br/>set?}
RESTToken -->|Yes| RESTAuth[Add Bearer Token<br/>to headers]
RESTToken -->|No| RESTNoAuth[No Authentication]
RESTAuth --> RESTReady[REST Client Ready]
RESTNoAuth --> RESTReady
RESTReady --> RESTFeatures[REST API Features:<br/>• Download Progress<br/>• MCP Integration<br/>• Stateful Chat<br/>• Response Caching<br/>• Parallel Inference<br/>• Unified KV Cache]
InitSDK --> SDKReady[SDK Ready]
SDKReady --> SDKFeatures[SDK Features:<br/>• Simple Python API<br/>• Model Loading<br/>• Inference<br/>• Basic Stats]
RESTFeatures --> Benchmark[Run Benchmarks]
SDKFeatures --> Benchmark
Benchmark --> RESTCall{Mode?}
RESTCall -->|REST| CallREST[HTTP POST /v1/chat/completions<br/>+ parse response stats]
RESTCall -->|SDK| CallSDK[client.llm.predict<br/>+ parse Model response]
CallREST --> Results[Collect Results:<br/>TTFT, tokens/s, VRAM]
CallSDK --> Results
style InitREST fill:#e1f5ff
style InitSDK fill:#ffe1e1
style RESTFeatures fill:#e1ffe1
style SDKFeatures fill:#fff4e1
Mode Comparison:
| Feature | REST API Mode | SDK/CLI Mode |
|---|---|---|
| Configuration | use_rest_api: true in config or --use-rest-api | Default mode |
| Endpoint | HTTP /v1/chat/completions | Python SDK client.llm.predict() |
| Stats | Detailed (TTFT, prompt/completion tokens, tok/s) | Basic (tokens/s only) |
| Authentication | Optional Bearer token | Not needed |
| Parallel Inference | ✅ --n-parallel (continuous batching) | ❌ Sequential only |
| Stateful Chats | ✅ response_id tracking | ❌ Stateless |
| MCP Integration | ✅ mcp_integrations parameter | ❌ Not available |
| Response Caching | ✅ MD5 hash caching (10,000x speedup) | ❌ No caching |
| Download Progress | ✅ Real-time model loading status | ❌ No progress |
Configuration Example:
{
  "lmstudio": {
    "host": "localhost",
    "ports": [1234, 1235],
    "use_rest_api": true,
    "api_token": "lms_your_token_here"
  }
}
Component Details
1. run.py (Entry Point)
Responsibilities:
- Parse `--webapp`/`-w` flag
- Route to web dashboard or benchmark
- Show extended help (`--help`)
Key Functions:
- Flag detection: `"--webapp" in sys.argv or "-w" in sys.argv`
- Subprocess launching: `subprocess.call([sys.executable, script] + args)`
2. config_loader.py (Configuration Manager)
Responsibilities:
- Load `config/defaults.json` (project) + `~/.config/lm-studio-bench/defaults.json` (user overrides)
- Merge with `BASE_DEFAULT_CONFIG`
- Provide `DEFAULT_CONFIG` singleton
Key Functions:
- `load_default_config()`: Loads and merges config
- `_deep_merge()`: Recursive dict merge
- `_normalize_ports()`: Validates LM Studio ports
Configuration Fields:
| Section | Fields |
|---|---|
| Root | prompt, context_length, num_runs |
| lmstudio | host, ports, api_token, use_rest_api |
| inference | temperature, top_k_sampling, top_p_sampling, min_p_sampling, repeat_penalty, max_tokens |
| load | n_gpu_layers, n_batch, n_threads, flash_attention, rope_freq_base, rope_freq_scale, use_mmap, use_mlock, kv_cache_quant |
3. benchmark.py (Main Engine)
Responsibilities:
- Parse 49 CLI arguments
- Manage benchmark lifecycle
- Model discovery and filtering
- Cache management (SQLite)
- Runtime-safe cache schema migration for optional columns
- Hardware monitoring
- Report generation
Key Classes:
- `LMStudioBenchmark`: Main orchestrator
- `BenchmarkCache`: SQLite caching
- `tools/hardware_monitor.py`: Shared GPU detection and live profiling (`GPUMonitor`, `HardwareMonitor`)
- `ModelDiscovery`: Model listing and metadata
Reliability Behaviors (2026-03):
- Runtime cache migration: Missing optional SQLite columns are added automatically at startup and, if needed, once again during insert error recovery.
- Inference retry guard: If LM Studio returns a server error containing `Model unloaded`, the benchmark reloads the model and retries inference once.
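The runtime cache migration mentioned above amounts to adding any missing optional columns with `ALTER TABLE`. A simplified sketch of that pattern follows; the column names and types here are an illustrative subset, not the exact schema.

```python
import sqlite3

OPTIONAL_COLUMNS = {          # illustrative subset of optional columns
    "prompt_hash": "TEXT",
    "params_hash": "TEXT",
    "speed_delta_pct": "REAL",
}

def migrate_cache(con: sqlite3.Connection) -> None:
    existing = {row[1] for row in con.execute("PRAGMA table_info(benchmark_results)")}
    for name, sql_type in OPTIONAL_COLUMNS.items():
        if name not in existing:
            # Adding a column is safe at runtime; old rows get NULL in the new column.
            con.execute(f"ALTER TABLE benchmark_results ADD COLUMN {name} {sql_type}")
    con.commit()
```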
CLI Arguments (49 total):
| Category | Arguments |
|---|---|
| Basic | --runs, --context, --prompt, --limit, --dev-mode |
| Presets | --list-presets, --preset |
| Filter | --only-vision, --only-tools, --quants, --arch, --params, --min-context, --max-size, --include-models, --exclude-models |
| Cache | --retest, --list-cache, --export-cache, --export-only |
| Profiling | --enable-profiling, --max-temp, --max-power, --disable-gtt |
| Inference | --temperature, --top-k, --top-p, --min-p, --repeat-penalty, --max-tokens |
| Load Config | --n-gpu-layers, --n-batch, --n-threads, --flash-attention, --rope-freq-base, --rope-freq-scale, --use-mmap, --use-mlock, --kv-cache-quant |
| REST API | --use-rest-api, --api-token, --n-parallel, --unified-kv-cache |
| Comparison | --compare-with, --rank-by |
4. rest_client.py (REST API Client)
Responsibilities:
- HTTP communication with LM Studio v1 API
- Model loading and unloading
- Chat completions with stats
- Download progress tracking
- MCP integration
- Stateful chat history
- Response caching
Key Classes:
- `LMStudioRESTClient`: Main REST client
- `ModelInfo`: Model metadata
- `ChatStats`: Response statistics (TTFT, tokens/s, etc.)
- `ModelCapabilities`: Vision, tools detection
New Features (✨ 2026-02-23):
- Download Progress Tracking
  - `wait_for_completion()` with progress callbacks
  - Real-time model loading status
- MCP Integration
  - `mcp_integrations` parameter in chat requests
  - Model Context Protocol support
- Stateful Chat History
  - `use_stateful=True` for conversation continuity
  - `last_response_id` tracking
- Response Caching
  - MD5 hash-based caching
  - 10,000x+ speedup for repeated prompts
  - `enable_cache` parameter
Example Usage:
client = LMStudioRESTClient(
base_url="http://localhost:1234",
api_token="lms_token"
)
# Load model with progress tracking
def on_progress(percent, status):
    print(f"Loading: {percent:.1f}% - {status}")
client.load_model("model@q4", wait_for_completion=True, progress_callback=on_progress)
# Chat with caching
response = client.chat(
model="model@q4",
messages=[{"role": "user", "content": "Hello"}],
enable_cache=True, # 10,000x speedup for repeated prompts
use_stateful=True # Conversation continuity
)
5. tray.py (Linux Tray Controller)
Responsibilities:
- Provide Linux AppIndicator tray UI with benchmark controls
- Poll benchmark status and update icon/button state
- Trigger benchmark actions via web API
- Trigger graceful full shutdown via
/api/system/shutdown
Key Behaviors:
- 3-second polling loop via GLib timeout
- Icon states: gray (idle), green (running), yellow (paused), red (error)
- Control state logic:
- Start enabled in idle and recovery/error state
- Pause/Stop enabled only while benchmark is active
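The 3-second polling loop described above can be sketched with GLib and requests as below. It assumes PyGObject and requests are installed and that the dashboard runs on the default port 8080; the real tray controller adds icon updates and menu-state handling on top of this skeleton.

```python
import requests
from gi.repository import GLib

STATUS_URL = "http://localhost:8080/api/status"  # dashboard URL passed via --url

def poll_status():
    try:
        status = requests.get(STATUS_URL, timeout=2).json().get("status", "idle")
    except requests.RequestException:
        status = "error"          # unreachable API -> red icon in the real tray
    print(f"benchmark status: {status}")
    return True                   # returning True keeps the GLib timer running

GLib.timeout_add_seconds(3, poll_status)
GLib.MainLoop().run()
```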
6. web/app.py + dashboard.html.jinja (Dashboard Analytics)
Responsibilities:
- Aggregate benchmark history for fast visual summaries
- Serve chart-ready payloads via `/api/dashboard/stats`
- Render Home/Results overview charts in the browser with Plotly
- Support quick navigation from ranking tables to model comparison
Home View (Executive Summary):
- KPI cards: cached models, avg speed, median (P50), P95, architectures, quantizations
- Top 10 bar chart (speed ranking)
- Quantization donut chart (distribution)
Results View (Exploration):
- Scatter: Speed vs VRAM
- Heatmap: Model x Quantization -> avg tokens/s
- Shared data source with table (`/api/results`), so table and charts stay consistent
Quick Compare Flow:
- Compare actions in Home and Results tables call `openComparisonForModel(modelName)`
- The function opens the Comparison view, selects the model, then loads full historical trends via `/api/comparison/{model_name}`
Dashboard Summary Fields (/api/dashboard/stats):
- `speed_summary` (`min`, `p50`, `avg`, `p95`, `max`)
- `top_models_extended` (Top 10 models)
- `quantization_distribution`
- `architecture_distribution`
- `efficiency_top`
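Because the payload is plain JSON, these summary fields can be pulled into scripts directly. A minimal example, assuming the dashboard runs on the default port and using defensive lookups since the exact payload shape may evolve:

```python
import requests

stats = requests.get("http://localhost:8080/api/dashboard/stats", timeout=5).json()

speed = stats.get("speed_summary", {})
print("median tok/s:", speed.get("p50"))
print("p95 tok/s:", speed.get("p95"))

for model in stats.get("top_models_extended", [])[:3]:
    print(model)  # inspect the structure of the top-10 entries
```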
Data Flow Summary
graph LR
User([User]) -->|./run.py --runs 5| CLI[CLI Arguments]
ProjJSON[config/defaults.json] --> Config[Configuration<br/>Merge]
UserJSON[~/.config/.../defaults.json] --> Config
CLI --> Config
Base[BASE_DEFAULT_CONFIG] --> Config
Config --> Benchmark[Benchmark<br/>Execution]
Benchmark -->|lms ls| Models[Model<br/>Discovery]
Models --> Filter[Model<br/>Filtering]
Filter --> Cache{Cache<br/>Hit?}
Cache -->|Yes| Skip[Skip Test]
Cache -->|No| Test[Run Test]
Test --> LMStudio[LM Studio<br/>Server]
LMStudio --> Results[Collect<br/>Results]
Results --> DB[(SQLite<br/>Cache)]
Results --> Reports[JSON/CSV<br/>PDF/HTML]
Skip --> Reports
style CLI fill:#ffe1e1
style Config fill:#e1ffe1
style Cache fill:#fff4e1
style Reports fill:#e1f5ff
Testing Architecture
LM-Studio-Bench includes a comprehensive test suite with 900+ tests and strong coverage to ensure reliability and maintainability.
Test Organization
graph TB
Tests[tests/] --> Fixtures[conftest.py<br/>Test Fixtures & Utilities]
Tests --> BenchmarkTests[test_benchmark.py<br/>55+ tests]
Tests --> HardwareTests[test_hardware_monitor.py<br/>57+ tests]
Tests --> AppTests[test_app.py<br/>23+ tests]
Tests --> APITests[test_api_endpoints.py<br/>32+ tests]
Tests --> RestTests[test_rest_client.py<br/>22+ tests]
Tests --> TrayTests[test_tray.py<br/>26+ tests]
Tests --> PresetTests[test_preset_manager.py<br/>19+ tests]
Tests --> ConfigTests[test_config_loader.py<br/>9+ tests]
Tests --> PathTests[test_user_paths.py<br/>4+ tests]
Tests --> VersionTests[test_version_checker.py<br/>7+ tests]
Tests --> MetadataTests[test_scrape_metadata.py<br/>24+ tests]
Tests --> RunTests[test_run.py<br/>10+ tests]
BenchmarkTests --> Benchmark[cli/benchmark.py]
HardwareTests --> HardwareMon[tools/hardware_monitor.py]
AppTests --> WebApp[web/app.py]
APITests --> WebApp
RestTests --> RestClient[core/client.py]
TrayTests --> Tray[core/tray.py]
PresetTests --> PresetMgr[core/presets.py]
ConfigTests --> ConfigLoader[core/config.py]
PathTests --> UserPaths[core/paths.py]
VersionTests --> VersionChecker[core/version.py]
MetadataTests --> Metadata[tools/scrape_metadata.py]
RunTests --> RunPy[run.py]
style Tests fill:#e1f5ff
style Fixtures fill:#fff4e1
style BenchmarkTests fill:#ffe1e1
style AppTests fill:#e1ffe1
Test Coverage by Component
| Component | Test Module | Test Count | Coverage |
|---|---|---|---|
| Benchmark Engine | test_benchmark.py | 55+ | High |
| Web Dashboard | test_app.py | 23+ | Medium |
| API Endpoints | test_api_endpoints.py | 32+ | High |
| REST Client | test_rest_client.py | 22+ | High |
| Linux Tray | test_tray.py | 26+ | Medium |
| Preset Manager | test_preset_manager.py | 19+ | High |
| Config Loader | test_config_loader.py | 9+ | High |
| User Paths | test_user_paths.py | 4+ | High |
| Version Checker | test_version_checker.py | 7+ | High |
| Metadata Scraping | test_scrape_metadata.py | 24+ | Medium |
| Entry Point | test_run.py | 10+ | Medium |
Testing Approach
Unit Testing:
- Mock external dependencies (LM Studio API, system commands, file I/O)
- Isolated test cases that can run in any order
- Fast execution (no real API calls or file system operations)
- Use pytest fixtures for common setup and teardown
Test Fixtures (conftest.py):
- Mock LM Studio client and server responses
- Temporary directories for file operations
- Mock system commands (nvidia-smi, rocm-smi, etc.)
- Sample configuration and model data
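As an illustration of the mocking style described above, a test can stub out a system command with pytest's monkeypatch so no real GPU tooling is needed. The helper and test names here are hypothetical, not copied from conftest.py.

```python
import subprocess

def read_gpu_name() -> str:
    """Hypothetical helper that shells out to nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def test_read_gpu_name_without_real_gpu(monkeypatch):
    def fake_run(*args, **kwargs):
        # Return a canned result instead of invoking the real binary.
        return subprocess.CompletedProcess(args, 0, stdout="NVIDIA GeForce RTX 4090\n", stderr="")
    monkeypatch.setattr(subprocess, "run", fake_run)
    assert read_gpu_name() == "NVIDIA GeForce RTX 4090"
```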
Continuous Integration:
- GitHub Actions runs full test suite on every PR
- Code quality checks (flake8, pylint)
- Security scans (Bandit, CodeQL, Snyk)
- Test results reported in PR status checks
Running Tests:
# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific module
pytest tests/test_benchmark.py
# Run with coverage report
pytest --cov=core --cov=cli --cov=agents --cov=web --cov=tools --cov=run --cov-report=html
# Run tests matching a pattern
pytest -k "test_gpu_detection"
See Also
- Configuration Reference - All CLI arguments and config file options
- REST API Features - REST API integration details
- Quickstart Guide - Get started in 5 minutes