LM Studio Benchmark Documentation
Welcome to the LM Studio Benchmark documentation! This tool helps you measure and compare token/s performance across all your locally installed LLM models and their quantizations.
What is this?
A Python benchmark tool for LM Studio with a modern web dashboard that:
- Automatically tests all local LLM models and quantizations
- Measures token/s speeds with warmup and multiple runs
- Exports results in JSON, CSV, PDF, and interactive HTML formats
- Detects GPU capabilities (NVIDIA, AMD, Intel) and monitors VRAM usage
- Provides a web dashboard with live charts and filtering options
- Includes Linux tray controls with live status icons and quick actions
Quick Links
- Quickstart Guide — Get started in 5 minutes
- Configuration Reference — All CLI arguments and config file options
- Architecture Documentation — System architecture with Mermaid diagrams, including testing architecture
- REST API Integration — Advanced features with LM Studio API v1
- Hardware Monitoring — GPU, CPU, RAM tracking
- LLM Metadata Guide — Model capabilities and metadata
- User Data & Configuration — XDG directory structure and config management
- Agent Integration — How to integrate with LM Studio Agents
Features at a Glance
✅ Multi-model benchmarking with intelligent GPU offload
✅ Vision & tool-calling model detection
✅ Progressive VRAM management (automatic fallback)
✅ Caching system (skip already-tested models)
✅ Filter by quantization, architecture, params, context length
✅ Live web dashboard with 27 themes
✅ Linux tray controller with dynamic benchmark status icons
✅ REST API mode with parallel inference support
✅ Download progress tracking, MCP integration, stateful chats
✅ Response caching with 10,000x+ speedup for repeated prompts
Getting Started
Check out the Quickstart Guide to begin benchmarking your models!
🚀 Quick Start Guide - LM Studio Benchmark Tool
Installation
cd ~/LM-Studio-Bench
# 1) Preview setup (no changes)
./setup.sh --dry-run
# 2) Prepare system + Python environment (recommended)
./setup.sh
# 3) Activate virtual environment
source .venv/bin/activate
If you skip setup.sh, use this manual fallback:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
🌐 Web Dashboard (Recommended)
Start Web UI
./run.py --webapp
✅ Opens browser automatically at http://localhost:8080
✅ Live streaming of benchmark output via WebSocket
✅ Browse all cached results with interactive tables
✅ System info (GPU model detection, LM Studio health, hardware details)
✅ Dark mode by default with 27 theme options
✅ All CLI parameters available as web form with tooltips
✅ Advanced filtering (quantization, architecture, size, context-length)
✅ Separate logs:
~/.local/share/lm-studio-bench/logs/webapp_*.log and
~/.local/share/lm-studio-bench/logs/benchmark_*.log
✅ Linux tray control with dynamic status icon and quick actions
Dashboard Features:
- Start Benchmark: Configure and run benchmarks from web interface
- Filter by quantization, architecture, parameter size
- Rank results by speed, efficiency, TTFT, or VRAM
- Set hardware limits (max GPU temp, max power draw)
- Tooltip help for all options
- System Info: OS, Kernel, CPU, GPU (with detailed model names)
- LM Studio Health: Live healthcheck status (HTTP API + CLI fallback)
- Live Output: Real-time streaming with colored logs and progress
- Results Browser: Filter and sort all cached benchmark results
- Export: Download JSON/CSV/PDF/HTML reports
- Network Access: Access from other devices on same network
Linux Tray Control
When GTK/AppIndicator dependencies are installed, a tray controller starts with the web app.
- Dynamic status icon:
- Gray: idle
- Green: running
- Yellow: paused
- Red: API unreachable/error
- Smart controls:
- Start enabled in idle/error states
- Pause/Stop enabled only in running/paused states
- Auto refresh: status and controls refresh every 3 seconds
- Quit behavior: tray Quit triggers a graceful full shutdown
Network Access
# Access dashboard from other devices
http://your-ip:8080
# Example:
http://192.168.1.100:8080
💻 Command Line (CLI)
Simple Benchmark (All Models)
./run.py
✅ Tests all installed models with 3 runs each (~1-2 hours)
✅ Automatically saves results to ~/.local/share/lm-studio-bench/results/
✅ Clean output with emoji icons and formatted model lists
✅ Detailed logs saved to
~/.local/share/lm-studio-bench/logs/benchmark_YYYYMMDD_HHMMSS.log
Monitor Logs in Real-Time
# Watch benchmark execution
tail -f ~/.local/share/lm-studio-bench/logs/benchmark_*.log
# Watch web dashboard
tail -f ~/.local/share/lm-studio-bench/logs/webapp_*.log
# Search for errors
grep ERROR ~/.local/share/lm-studio-bench/logs/benchmark_*.log
Quick Test (3 NEW Models)
./run.py --limit 3 --runs 1
✅ Fast test with 3 NEW untested models (~5-10 minutes)
✅ Already tested models automatically loaded from cache
✅ Limit applies ONLY to new models, all cached models included
Development Mode (Fastest)
./run.py --dev-mode
✅ Automatically selects smallest model
✅ Single run for quick validation (~30 seconds)
✅ Perfect for testing changes
Test Single Model
./run.py --limit 1 --runs 1
✅ Single model benchmark (~1-2 minutes)
Advanced Features
1️⃣ Hardware Profiling (6 Live Charts)
Enable Complete Hardware Monitoring:
./run.py --enable-profiling --runs 1 --limit 3
Monitored Metrics:
- 🌡️ GPU Temperature (°C)
- ⚡ GPU Power (W)
- 💾 GPU VRAM (GB)
- 🧠 GPU GTT (GB) - AMD only
- 🖥️ System CPU usage (%)
- 💾 System RAM usage (GB)
✅ All metrics are displayed live in the WebApp
✅ 6 interactive Plotly.js charts with Min/Max/Avg stats
✅ Moving average for stable RAM curves
✅ Each metric is measured every second
With Safety Limits:
./run.py --enable-profiling --max-temp 85 --max-power 350
✅ Interrupts benchmark when limits are exceeded
2️⃣ AMD GTT Support (Shared System RAM)
Enable GTT (Default):
./run.py --limit 3
✅ Automatically uses VRAM + GTT (e.g. 2GB VRAM + 46GB GTT = 48GB)
✅ Enables larger models on AMD APUs/iGPUs
✅ Shown in logs: "💾 Memory: 0.4GB VRAM + 44.7GB GTT = 45.1GB total"
Disable GTT (VRAM-only):
./run.py --disable-gtt --limit 3
✅ Only uses dedicated VRAM
✅ More conservative offload levels
✅ Useful for benchmarking VRAM-only performance
3️⃣ Filtering Models
By Quantization:
./run.py --quants q4,q5 --limit 5
By Architecture:
./run.py --arch llama,mistral --limit 5
By Parameter Size:
./run.py --params 7B,8B --limit 5
By Context Length:
./run.py --min-context 32000 --limit 3
By Model Size:
./run.py --max-size 10 --limit 5
Vision Models Only:
./run.py --only-vision --runs 1
Regex-based Filtering (Include):
# Only Qwen or Phi models
./run.py --include-models "qwen|phi" --runs 1
# Only Llama 7B models
./run.py --include-models "llama.*7b" --runs 1
# Only Q4 quantizations
./run.py --include-models ".*q4.*" --runs 1
Regex-based Filtering (Exclude):
# Exclude uncensored models
./run.py --exclude-models "uncensored" --runs 1
# Exclude Q2 and Q3 quantizations
./run.py --exclude-models "q2|q3" --runs 1
# Exclude all vision models
./run.py --exclude-models ".*vision.*" --runs 1
Combined Filters (AND logic):
# Include llama, exclude q2, only tools
./run.py --include-models "llama" --exclude-models "q2" --only-tools --runs 1
# Vision models, 7B params, max 12GB
./run.py --only-vision --params 7B --max-size 12 --runs 1
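Under the hood, the include/exclude patterns behave like standard regex filters combined with AND logic. A minimal sketch in Python of how such filtering can work (case-insensitive matching against the model key is an assumption here, not a confirmed implementation detail):

import re

def filter_models(model_keys, include=None, exclude=None):
    # Apply --include-models / --exclude-models style regex filters (AND logic)
    selected = []
    for key in model_keys:
        if include and not re.search(include, key, re.IGNORECASE):
            continue  # fails the include pattern
        if exclude and re.search(exclude, key, re.IGNORECASE):
            continue  # matches the exclude pattern
        selected.append(key)
    return selected

models = ["llama-2-7b@q4_k_m", "llama-2-7b@q2_k", "qwen3-8b@q4_k_m"]
print(filter_models(models, include="llama", exclude="q2"))  # ['llama-2-7b@q4_k_m']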
4️⃣ Ranking & Sorting
Sort by Efficiency (Default: Speed):
./run.py --limit 5 --rank-by efficiency
Sort by TTFT (Lower = Better):
./run.py --limit 5 --rank-by ttft
Sort by VRAM Usage (Lower = Better):
./run.py --limit 5 --rank-by vram
5️⃣ Cache Management
View Cached Results:
./run.py --list-cache
✅ Shows all cached models with performance metrics
Force Retest (Ignore Cache):
./run.py --retest --limit 3
✅ Re-runs benchmarks even if cached
Regenerate Reports from Database:
./run.py --export-only
✅ Generates JSON/CSV/PDF/HTML from cached results in <1s
✅ No benchmarking - instant report generation
✅ Supports all filters (--params, --quants, --arch, etc.)
Examples:
# All cached models
./run.py --export-only
# Only 7B models from cache
./run.py --export-only --params 7B
# Q4 quantizations with historical comparison
./run.py --export-only --quants q4 --compare-with latest
Export Cache as JSON:
./run.py --export-cache my_backup.json
✅ Exports entire cache database
Cache Behavior:
- First run: Tests all models (~2 hours for 20 models)
- Second run: Loads from cache (~1 second!)
- Automatic invalidation on parameter changes (prompt, context, temperature)
- Shows "X of Y models cached" before starting
6️⃣ Historical Comparison & Trends
Compare with Latest Benchmark:
./run.py --limit 3 --runs 1 --compare-with latest
📊 Shows performance delta (%) vs previous run
Compare with Specific Benchmark:
./run.py --limit 3 --runs 1 --compare-with benchmark_results_20260104_170000.json
7️⃣ Custom Configuration
Adjust Number of Runs:
./run.py --runs 5 --limit 2
Custom Context Length:
./run.py --context 4096 --limit 2 --runs 1
Custom Prompt:
./run.py -P "Your custom prompt here" --limit 2 --runs 1
8️⃣ Presets (Fast Setup)
Show available presets:
./run.py --list-presets
Load a built-in preset:
# Default presets (readonly)
./run.py --preset default_classic # Classic benchmark (default)
./run.py --preset default_compatibility_test # Capability-driven test
# Other presets
./run.py --preset quick_test
./run.py --preset high_quality
./run.py --preset resource_limited
Load preset and override values:
./run.py --preset quick_test --runs 2 --context 2048
./run.py --preset default_classic --runs 5 --context 4096
Backwards Compatibility:
./run.py --preset default # Automatically loads default_classic
Notes:
- Default presets include explicit values for all benchmark form fields, so preset comparisons do not show null values for missing keys.
- default_classic is optimized for full model benchmarking (3 runs).
- default_compatibility_test (alias: default_compatability_test) is optimized for focused capability testing (1 run).
- Capability-driven runs over many installed models continue when a single model fails to load; the failed model is logged and skipped.
- Embedding models are retried automatically without KV-cache offload if LM Studio rejects that load option.
- Legacy keys in imported/user presets are normalized automatically (context_length/top_k/top_p/min_p -> current key names).
📊 Output Formats
Each benchmark generates 4 files:
JSON Format
{
"model_name": "qwen/qwen3-8b",
"quantization": "q4_k_m",
"avg_tokens_per_sec": 8.15,
"tokens_per_sec_per_gb": 1.74,
"speed_delta_pct": -0.2,
...
}
✅ Structured data for analysis
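The exported JSON can be post-processed directly; this small example assumes the file contains a list of result objects with the fields shown above:

import json

with open("benchmark_results_20260104_170000.json") as f:
    results = json.load(f)  # assumed: a list of result dicts

# Rank by throughput, fastest first
for r in sorted(results, key=lambda r: r["avg_tokens_per_sec"], reverse=True)[:10]:
    print(f"{r['model_name']} ({r['quantization']}): {r['avg_tokens_per_sec']:.2f} tok/s")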
CSV Format
model_name,quantization,avg_tokens_per_sec,tokens_per_sec_per_gb,speed_delta_pct
qwen/qwen3-8b,q4_k_m,8.15,1.74,-0.2
✅ Excel/Sheets compatible
PDF Report
- Model rankings (sortable)
- Best-of-Quantization analysis
- Quantization comparison tables (Q4 vs Q5 vs Q6)
- Performance statistics & percentiles
- Delta display (Δ% column)
HTML Report (Interactive Plotly)
- Bar chart: Top 10 models
- Scatter plot: Size vs Performance
- Scatter plot: Efficiency analysis
- NEW: Trend chart showing performance over time
- Summary statistics with gradient backgrounds
📈 Feature Showcase
Example: Complete Analysis
./run.py \
--quants q4,q5,q6 \
--limit 5 \
--runs 1 \
--rank-by efficiency \
--compare-with latest
Output:
- ✅ Filters to 5 models with 3 quantizations each
- ✅ Ranks by efficiency (Tokens/s per GB)
- ✅ Shows delta vs previous benchmark
- ✅ Generates all 4 export formats
- ✅ Includes percentile statistics (P50, P95, P99)
- ✅ Shows quantization comparison
- ✅ Displays performance trends if history available
🎯 Key Metrics
| Metric | Description | Unit |
|---|---|---|
| Speed | Throughput | tokens/s |
| Efficiency | Speed per GB model size | tokens/s/GB |
| TTFT | Time to First Token | ms |
| Delta | Change vs previous | % |
| VRAM | Memory used | MB |
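As a worked example of the Efficiency metric, the values from the JSON sample above (8.15 tok/s at 1.74 tok/s/GB) imply a model size of roughly 4.7 GB:

avg_tokens_per_sec = 8.15      # Speed, from the JSON sample
tokens_per_sec_per_gb = 1.74   # Efficiency, from the JSON sample

model_size_gb = avg_tokens_per_sec / tokens_per_sec_per_gb
print(f"Implied model size: {model_size_gb:.2f} GB")  # ~4.68 GB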
📁 File Structure
results/
├── benchmark_results_20260104_170000.json
├── benchmark_results_20260104_170000.csv
├── benchmark_results_20260104_170000.pdf
└── benchmark_results_20260104_170000.html
🐛 Troubleshooting
No models found
- Ensure LM Studio is installed and running
- Check lms ls --json output
Server not responding
- Start LM Studio server manually
- Check ~/.lmstudio/server-logs/
Permission denied on results/
mkdir -p results/
chmod 755 results/
🔗 Related Files
- FEATURES.md - Complete feature list
- PLAN.md - Implementation roadmap
- requirements.txt - Python dependencies
- errors.log - Debug information
Version: 1.0 (Phases 1-4 Complete) | Updated: 2026-01-04
Configuration Reference
Complete documentation of all CLI arguments and configuration options for the LM Studio Benchmark Tool.
Table of Contents
Overview
The benchmark tool can be configured in three ways:
- Project Defaults: config/defaults.json (in Git)
- User Configuration: ~/.config/lm-studio-bench/defaults.json (optional overrides)
- CLI Arguments: Override all config values
Priority: CLI Arguments > User Config > Project Defaults > Hard-coded Defaults
Configuration Files
Project Configuration (config/defaults.json)
The project configuration file contains all default settings for the benchmark. This file is shipped with the project and tracked in Git.
Location: <project_root>/config/defaults.json
User Configuration (~/.config/lm-studio-bench/defaults.json)
Optional user-specific configuration overrides. Only specify fields you want to customize.
Location: ~/.config/lm-studio-bench/defaults.json
Example (minimal user config):
{
"num_runs": 5,
"lmstudio": {
"use_rest_api": true
}
}
This overrides only num_runs and use_rest_api, all other values come from project defaults.
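A minimal sketch of how such nested overrides can be merged on top of the project defaults (the tool's actual config loader may differ in detail):

def deep_merge(base: dict, override: dict) -> dict:
    # Recursively overlay user config values on top of project defaults
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

project = {"num_runs": 3, "lmstudio": {"use_rest_api": False, "host": "localhost"}}
user = {"num_runs": 5, "lmstudio": {"use_rest_api": True}}
print(deep_merge(project, user))
# {'num_runs': 5, 'lmstudio': {'use_rest_api': True, 'host': 'localhost'}}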
Complete Structure
{
"prompt": "Is the sky blue?",
"context_length": 2048,
"num_runs": 3,
"retest": false,
"enable_profiling": false,
"lmstudio": {
"host": "localhost",
"ports": [1234, 1235],
"api_token": null,
"use_rest_api": true
},
"inference": {
"temperature": 0.1,
"top_k_sampling": 40,
"top_p_sampling": 0.9,
"min_p_sampling": 0.05,
"repeat_penalty": 1.2,
"max_tokens": 256
},
"load": {
"n_gpu_layers": -1,
"n_batch": 512,
"n_threads": -1,
"flash_attention": true,
"rope_freq_base": 10000,
"rope_freq_scale": 1.0,
"use_mmap": true,
"use_mlock": false,
"kv_cache_quant": "f16"
}
}
Field Descriptions
Basic Settings
| Field | Type | Default | Description |
|---|---|---|---|
prompt | string | "Is the sky blue?" | Default test prompt for all benchmarks |
context_length | integer | 2048 | Context length in tokens |
num_runs | integer | 3 | Number of measurements per model/quantization |
retest | boolean | false | Ignore cache and benchmark all selected models again |
enable_profiling | boolean | false | Enable temperature/power monitoring |
LM Studio Server (lmstudio)
| Field | Type | Default | Description |
|---|---|---|---|
host | string | "localhost" | LM Studio server hostname |
ports | array | [1234, 1235] | Ports for server discovery (tries both) |
api_token | string/null | null | API permission token (REST API authentication) |
use_rest_api | boolean | true | Use REST API v1 instead of SDK/CLI |
Inference Parameters (inference)
| Field | Type | Default | Description |
|---|---|---|---|
temperature | float | 0.1 | Sampling temperature (0.0-2.0, low=deterministic) |
top_k_sampling | integer | 40 | Top-K sampling (limits choice to K most likely tokens) |
top_p_sampling | float | 0.9 | Top-P / Nucleus sampling (cumulative probability) |
min_p_sampling | float | 0.05 | Min-P sampling (minimum probability threshold) |
repeat_penalty | float | 1.2 | Repeat penalty (prevents repetitions, 1.0=off) |
max_tokens | integer | 256 | Maximum output tokens |
Load Config (load)
| Field | Type | Default | Description |
|---|---|---|---|
n_gpu_layers | integer | -1 | GPU layers (-1=auto/all, 0=CPU only, >0=specific) |
n_batch | integer | 512 | Batch size for prompt processing |
n_threads | integer | -1 | CPU threads (-1=auto/all) |
flash_attention | boolean | true | Flash attention (faster computation) |
rope_freq_base | float | 10000 | RoPE frequency base |
rope_freq_scale | float | 1.0 | RoPE frequency scaling |
use_mmap | boolean | true | Memory mapping (faster model load) |
use_mlock | boolean | false | Memory locking (prevents swapping) |
kv_cache_quant | string | "f16" | KV cache quantization (f32/f16/q8_0/q4_0/etc.) |
Preset Defaults and Compatibility
The tool includes two readonly default presets:
default_classic - Classic Benchmark Mode
Default preset for standard model benchmarking. Contains explicit values for all benchmark
fields to avoid null values in preset comparisons.
- benchmark_mode: classic
- preset_mode: classic
- runs: 3
- context: 2048
- Capability fields (agent_model, agent_capabilities, agent_max_tests): null
Backwards Compatibility: Loading --preset default automatically loads default_classic.
default_compatibility_test - Capability-Driven Test Mode
Default preset for focused capability testing of a single model.
Alias: The legacy name default_compatability_test is accepted as an alias
for this preset for backward compatibility.
- benchmark_mode: capability
- preset_mode: capability
- runs: 1
- context: 2048
- agent_model: qwen2.5-7b-instruct
- agent_capabilities: general_text,reasoning
- agent_max_tests: 10
- No null values - all fields have explicit defaults
Compatibility mapping is applied automatically when loading and comparing presets with legacy keys:
- context_length -> context
- num_runs -> runs
- top_k -> top_k_sampling
- top_p -> top_p_sampling
- min_p -> min_p_sampling
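A sketch of this normalization, assuming a simple flat key rename (the actual preset loader may handle nested structures differently):

LEGACY_KEY_MAP = {
    "context_length": "context",
    "num_runs": "runs",
    "top_k": "top_k_sampling",
    "top_p": "top_p_sampling",
    "min_p": "min_p_sampling",
}

def normalize_preset(preset: dict) -> dict:
    # Rename legacy preset keys to their current names, leaving others untouched
    return {LEGACY_KEY_MAP.get(key, key): value for key, value in preset.items()}

print(normalize_preset({"num_runs": 2, "context_length": 2048, "top_k": 40}))
# {'runs': 2, 'context': 2048, 'top_k_sampling': 40}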
CLI Arguments
All CLI arguments override the corresponding values from both config files.
Basic Options
--runs, -r (integer)
Number of measurements per model/quantization.
./run.py --runs 1 # Fast: only 1 measurement
./run.py --runs 5 # Accurate: 5 measurements (average)
Default: 3
--context, -c (integer)
Context length in tokens.
./run.py --context 4096 # 4K context
./run.py --context 32768 # 32K context
Default: 2048
--list-presets
List all available presets (readonly + user presets) and exit.
./run.py --list-presets
--preset, -p (string)
Load a preset before parsing all remaining CLI arguments.
If omitted, default_classic is used. The legacy alias default still
loads default_classic automatically.
./run.py --preset quick_test
./run.py --preset high_quality --runs 3
./run.py --preset default_classic
./run.py --preset default_compatability_test
Built-in readonly presets:
- default_classic
- default_compatability_test
- default (alias for default_classic)
- quick_test
- high_quality
- resource_limited
Readonly preset names cannot be saved, deleted, or imported as user presets.
This restriction also applies to the legacy alias default.
For capability-driven runs across many models, individual model load failures are logged and skipped so the benchmark can continue with the remaining models.
--prompt, -P (string)
Default test prompt.
./run.py --prompt "Explain machine learning"
./run.py -P "Explain machine learning"
Default: "Is the sky blue?"
--limit, -l (integer)
Maximum number of models to test.
./run.py --limit 1 # Only 1 model (usually smallest)
./run.py --limit 5 # First 5 models
Default: None (all models)
--dev-mode
Development mode: Automatically tests the smallest model with 1 run.
./run.py --dev-mode # Equivalent to --limit 1 --runs 1
Default: false
Filter Options
--only-vision
Test only models with vision capability (multimodal).
./run.py --only-vision --runs 2
Default: false
--only-tools
Test only models with tool-calling support.
./run.py --only-tools --runs 2
Default: false
--quants (string)
Test only specific quantizations (comma-separated).
./run.py --quants "q4,q5,q6" # Only Q4/Q5/Q6
./run.py --quants "q8" # Only Q8
Default: None (all quants)
--arch (string)
Test only specific architectures (comma-separated).
./run.py --arch "llama,mistral" # Only Llama and Mistral
./run.py --arch "qwen" # Only Qwen
Default: None (all architectures)
--params (string)
Test only specific parameter sizes (comma-separated).
./run.py --params "3B,7B,8B" # 3B, 7B and 8B models
./run.py --params "1B" # Only 1B models
Default: None (all sizes)
--min-context (integer)
Minimum context length in tokens.
./run.py --min-context 32000 # Only models with ≥32K context
Default: None (no minimum)
--max-size (float)
Maximum model size in GB.
./run.py --max-size 10.0 # Only models ≤10GB
./run.py --max-size 5.0 # Only models ≤5GB
Default: None (no limit)
--include-models (string)
Only test models matching the regex pattern.
./run.py --include-models "llama.*7b" # All 7B Llama models
./run.py --include-models "qwen|phi" # Qwen OR Phi
Default: None (all models)
--exclude-models (string)
Exclude models matching the regex pattern.
./run.py --exclude-models ".*uncensored.*" # No uncensored models
./run.py --exclude-models "test|exp" # No test/experimental
Default: None (no exclusions)
--compare-with (string)
Compare with previous results.
./run.py --compare-with "20260104_172200.json"
./run.py --compare-with "latest" # Latest result
Default: None (no comparison)
--rank-by (choice)
Sort results by metric.
Options: speed, efficiency, ttft, vram
./run.py --rank-by speed # By tokens/s
./run.py --rank-by efficiency # By tokens/s per GB VRAM
./run.py --rank-by ttft # By Time to First Token
./run.py --rank-by vram # By VRAM usage (low→high)
Default: speed
Cache Management
--retest
Ignore cache and retest all models.
./run.py --retest # Overwrites old results
Default: false (uses cache if available)
--list-cache
Display all cached models and exit.
./run.py --list-cache
Output: Table with all cache entries
--export-cache (string)
Export cache contents as JSON.
./run.py --export-cache "cache_export.json"
Exits the program after export.
--export-only
Generate reports from cache without new tests.
./run.py --export-only # Creates JSON/CSV/PDF/HTML
Default: false
Hardware Profiling
--enable-profiling
Enable hardware profiling (GPU temp & power).
./run.py --enable-profiling
Default: false
--max-temp (float)
Maximum GPU temperature in °C (warning).
./run.py --enable-profiling --max-temp 80.0
Default: None (no warning)
--max-power (float)
Maximum GPU power draw in Watts (warning).
./run.py --enable-profiling --max-power 400.0
Default: None (no warning)
--disable-gtt
Disable GTT (Shared System RAM) for AMD GPUs.
./run.py --disable-gtt # Only dedicated VRAM
Default: false (GTT enabled)
Note: Only relevant for AMD iGPUs (e.g., Radeon 890M).
Inference Parameters
All override values from config files:
--temperature (float)
./run.py --temperature 0.7 # More creative responses
./run.py --temperature 0.0 # Deterministic
--top-k, --top-k-sampling (integer)
./run.py --top-k 50
--top-p, --top-p-sampling (float)
./run.py --top-p 0.95
--min-p, --min-p-sampling (float)
./run.py --min-p 0.05
--repeat-penalty (float)
./run.py --repeat-penalty 1.3
--max-tokens (integer)
./run.py --max-tokens 512
Load Config (Performance Tuning)
All override values from config files:
--n-gpu-layers (integer)
./run.py --n-gpu-layers -1 # All layers on GPU (default)
./run.py --n-gpu-layers 0 # CPU only
./run.py --n-gpu-layers 20 # First 20 layers on GPU
--n-batch (integer)
./run.py --n-batch 1024 # Larger batches (faster)
./run.py --n-batch 128 # Smaller batches (less VRAM)
--n-threads (integer)
./run.py --n-threads -1 # Auto (default)
./run.py --n-threads 8 # 8 CPU threads
--flash-attention / --no-flash-attention
./run.py --flash-attention # Enabled (default)
./run.py --no-flash-attention # Disabled
--rope-freq-base (float)
./run.py --rope-freq-base 10000.0
--rope-freq-scale (float)
./run.py --rope-freq-scale 1.0
--use-mmap / --no-mmap
./run.py --use-mmap # Enabled (default)
./run.py --no-mmap # Disabled
--use-mlock
./run.py --use-mlock # Enabled (prevents swapping)
--kv-cache-quant (choice)
Options: f32, f16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1
./run.py --kv-cache-quant q8_0 # 8-bit quantization (saves VRAM)
./run.py --kv-cache-quant f16 # Half-precision (balanced)
Default: null (model default)
REST API Mode
Uses LM Studio REST API v1 instead of Python SDK/CLI.
--use-rest-api
./run.py --use-rest-api --limit 1
Benefits:
- More detailed stats (TTFT, tok/s)
- Stateful chats (response_id tracking)
- Parallel requests (continuous batching)
- MCP integration
- Response caching
Default: false (uses SDK/CLI)
--api-token (string)
API permission token for REST API authentication.
./run.py --use-rest-api --api-token "lms_your_token_here"
Default: null (no token, server must be open)
Create: LM Studio → Settings → Server → Generate Token
--n-parallel (integer)
Max parallel predictions per model (REST API only).
./run.py --use-rest-api --n-parallel 8
Default: 4
Requirement: LM Studio 0.4.0+, continuous batching support
--unified-kv-cache
Enable unified KV cache (REST API only).
./run.py --use-rest-api --unified-kv-cache --n-parallel 8
Benefit: Optimizes VRAM for parallel requests
Default: false
Examples
Quick Test of One Model
./run.py --limit 1 --runs 1
# Or shorter:
./run.py --dev-mode
All 7B Llama Models with Q4/Q5/Q6 Quants
./run.py --include-models "llama.*7b" --quants "q4,q5,q6" --runs 2
Vision Models Only with Hardware Profiling
./run.py --only-vision --enable-profiling --max-temp 80.0 --max-power 400.0
REST API with Parallel Requests
./run.py --use-rest-api --n-parallel 8 --unified-kv-cache --limit 5
Export Without New Tests
./run.py --export-only
Custom Inference Parameters
./run.py --temperature 0.7 --top-p 0.95 --max-tokens 512 --limit 3
Preset Workflow
./run.py --list-presets
./run.py --preset quick_test
./run.py --preset resource_limited --max-size 10 --runs 2
Performance Tuning (VRAM-optimized)
./run.py --n-batch 128 --kv-cache-quant q8_0 --limit 5
Manage Cache
./run.py --list-cache # Display cache contents
./run.py --export-cache "backup.json" # Export cache
./run.py --retest --limit 1 # Ignore cache
Configuration Priority
- CLI Arguments (highest priority)
- User Config (~/.config/lm-studio-bench/defaults.json)
- Project Config (config/defaults.json)
- Hard-coded Defaults (in code)
Example:
# User config has "num_runs": 5
# Project config has "num_runs": 3
./run.py --runs 1 # → uses 1 (CLI overrides)
./run.py # → uses 5 (from user config)
Tips & Best Practices
1. Persistent REST API Config
If you mainly use REST API:
config/defaults.json:
{
"lmstudio": {
"use_rest_api": true,
"api_token": "lms_your_token"
}
}
Then simply:
./run.py --limit 1 # automatically uses REST API
2. VRAM Optimization
When VRAM is limited:
./run.py --kv-cache-quant q8_0 --n-batch 128 --max-size 10.0
3. Fast Development
./run.py --dev-mode # Tests only smallest model with 1 run
4. Reproducible Benchmarks
./run.py --temperature 0.0 --runs 5 --retest
5. Hardware Monitoring
./run.py --enable-profiling --max-temp 80.0 --max-power 400.0
Logging Configuration
The benchmark tool generates timestamped log files for debugging and monitoring.
Log File Locations
logs/
├── benchmark_YYYYMMDD_HHMMSS.log # Benchmark execution logs
└── webapp_YYYYMMDD_HHMMSS.log # Web dashboard logs
Log Format
Each log entry follows this format:
YYYY-MM-DD HH:MM:SS,mmm - LEVEL - LEVEL_ICON message
2026-03-22 13:35:32,445 - INFO - ℹ️ Starting benchmark...
Log Levels
The tool uses standard Python logging levels:
| Level | Usage | Examples |
|---|---|---|
INFO | General information and progress | Model loading, benchmark completion, hardware metrics |
WARNING | Non-fatal issues and fallbacks | GPU tool missing, using CLI fallback, skipped models |
ERROR | Runtime errors requiring attention | Model load failure, API unavailable, VRAM exceeded |
Level Icons
Each log level also gets an automatic icon prefix:
| Level | Icon |
|---|---|
DEBUG | 🐛 |
INFO | ℹ️ |
WARNING | ⚠️ |
ERROR | ❌ |
CRITICAL | 🔥 |
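A minimal sketch of adding such icon prefixes with Python's standard logging module (the tool's actual formatter may differ):

import logging

LEVEL_ICONS = {"DEBUG": "🐛", "INFO": "ℹ️", "WARNING": "⚠️", "ERROR": "❌", "CRITICAL": "🔥"}

class IconFormatter(logging.Formatter):
    # Prefix each message with the icon matching its log level
    def format(self, record: logging.LogRecord) -> str:
        record.msg = f"{LEVEL_ICONS.get(record.levelname, '')} {record.getMessage()}"
        record.args = None  # args are already rendered into msg
        return super().format(record)

handler = logging.StreamHandler()
handler.setFormatter(IconFormatter("%(asctime)s - %(levelname)s - %(message)s"))
logging.basicConfig(level=logging.INFO, handlers=[handler])
logging.getLogger(__name__).info("Starting benchmark...")
# 2026-03-22 13:35:32,445 - INFO - ℹ️ Starting benchmark...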
Hardware Metrics in Logs
When hardware profiling is enabled (--enable-profiling), metrics appear with emoji indicators:
🌡️ GPU Temp: 42°C
⚡ GPU Power: 125W
💾 GPU VRAM: 8.2GB
🧠 GPU GTT: 0.0GB
🖥️ CPU: 35.2%
💾 RAM: 18.5GB
Third-Party Library Logging
The following libraries have suppressed debug output for cleaner logs:
| Library | Level | Reason |
|---|---|---|
httpx | WARNING | HTTP client noise |
lmstudio | WARNING | SDK debug output |
urllib3 | WARNING | HTTP library noise |
websockets | WARNING | WebSocket protocol noise |
Viewing Logs
Real-time monitoring:
# Watch benchmark execution
tail -f ~/.local/share/lm-studio-bench/logs/benchmark_*.log
# Watch web dashboard
tail -f ~/.local/share/lm-studio-bench/logs/webapp_*.log
Search and filter:
# Find errors
grep ERROR ~/.local/share/lm-studio-bench/logs/benchmark_*.log
# Find warnings
grep WARNING ~/.local/share/lm-studio-bench/logs/benchmark_*.log
# Find specific model errors
grep "model_name_pattern" \
~/.local/share/lm-studio-bench/logs/benchmark_*.log
# Count log entries by level
grep -c INFO ~/.local/share/lm-studio-bench/logs/benchmark_*.log
grep -c ERROR ~/.local/share/lm-studio-bench/logs/benchmark_*.log
See Also
- QUICKSTART.md - Quick start guide
- REST_API_FEATURES.md - REST API details
- HARDWARE_MONITORING_GUIDE.md - Hardware profiling
- LLM_METADATA_GUIDE.md - Metadata & capabilities
Hardware Monitoring Live Charts - Guide
✅ Status: Fully Implemented with GPU Detection
Hardware monitoring is now fully functional with stable live charts for all metrics and improved GPU model detection.
Monitoring logic is shared in tools/hardware_monitor.py and used by both
classic benchmark flows and capability-driven agent flows.
📊 Implemented Metrics
GPU Detection and Model Info
The system automatically detects all installed GPUs:
- NVIDIA GPUs
  - Detection: nvidia-smi --query-gpu=name
  - VRAM: nvidia-smi --query-gpu=memory.total
  - Temperature: nvidia-smi --query-gpu=temperature.gpu
  - Power: nvidia-smi --query-gpu=power.draw
- AMD GPUs
  - rocm-smi detection: rocm-smi --showproductname
  - Device ID mapping: lspci -d 1002:{device_id}
  - Example: 1002:150e → "Radeon Graphics (Ryzen 9 7950X3D)"
  - rocm-smi search path: /usr/bin, /usr/local/bin, /opt/rocm-*/bin/
  - VRAM: rocm-smi --showmeminfo vram
  - GTT: rocm-smi --showmeminfo gtt
  - Temperature: rocm-smi --showtemp
- iGPU detection
  - Extract from CPU string: regex r'Radeon\s+(\d+[A-Za-z]*)'
  - Shows integrated Radeon graphics separately
  - Prevents redundancy with dedicated GPUs
GPU Metrics
- 🌡️ GPU Temperature (°C) - Red
  - NVIDIA: nvidia-smi --query-gpu=temperature.gpu
  - AMD: rocm-smi --showtemp
  - Intel: intel-gpu-top (if available)
- ⚡ GPU Power (W) - Blue
  - NVIDIA: nvidia-smi --query-gpu=power.draw
  - AMD: rocm-smi (Current Socket Graphics Package Power)
  - Intel: alternative measurement methods
- 💾 GPU VRAM Usage (GB) - Green
  - NVIDIA: nvidia-smi --query-gpu=memory.used
  - AMD: rocm-smi --showmeminfo vram (in bytes)
- 🧠 GPU GTT Usage (GB) - Purple
  - AMD only: rocm-smi --showmeminfo gtt
  - System RAM that is used as VRAM
  - Example: 2GB VRAM + 46GB GTT = 48GB effective
System Metrics (with --enable-profiling)
- 🖥️ CPU Usage (%) - Orange
  - psutil.cpu_percent(interval=0.1)
  - 0-100% range
  - System-wide, not per process
- 💾 System RAM Usage (GB) - Cyan
  - psutil.virtual_memory().used
  - Smoothing: moving average over 3 samples
  - Prevents spikes from cache/buffer fluctuations
  - Very stable curves
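A minimal sketch of the CPU/RAM sampling described above, including the 3-sample moving average used to smooth the RAM curve (assumes psutil is installed and one sample per second):

import time
from collections import deque
import psutil

ram_window = deque(maxlen=3)  # moving average over the last 3 samples

for _ in range(5):  # the real monitor loops until the benchmark finishes
    cpu_percent = psutil.cpu_percent(interval=0.1)             # system-wide CPU usage
    ram_window.append(psutil.virtual_memory().used / 1024**3)  # RAM used, in GB
    ram_smoothed = sum(ram_window) / len(ram_window)
    print(f"🖥️ CPU: {cpu_percent:.1f}%")
    print(f"💾 RAM: {ram_smoothed:.1f}GB")
    time.sleep(1)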
🔧 Activation
Hardware monitoring is automatically enabled with:
# WebApp with hardware monitoring
./run.py --webapp
# CLI with hardware monitoring
./run.py --enable-profiling
# Only with specific models
./run.py --limit 2 --enable-profiling
📝 Logger Output
When --enable-profiling is active, the benchmark prints metrics every second:
🌡️ GPU Temp: 45.3°C
⚡ GPU Power: 125.5W
💾 GPU VRAM: 8.2GB
🧠 GPU GTT: 0.0GB
🖥️ CPU: 35.2%
💾 RAM: 18.5GB
These outputs are:
- ✅ Saved in ~/.local/share/lm-studio-bench/logs/benchmark_YYYYMMDD_HHMMSS.log
- ✅ Visualized as charts
🎯 Data Flow
Backend (cli/benchmark.py / agents/benchmark.py)
↓
Shared Module (tools/hardware_monitor.py)
↓
HardwareMonitor._monitor_loop()
├─ _get_temperature()
├─ _get_power_draw()
├─ _get_vram_usage()
├─ _get_gtt_usage()
├─ _get_cpu_usage()
└─ _get_ram_usage()
↓
logger.info() → stdout + log file
↓
WebApp Backend (app.py)
├─ _consume_output() Task (blocking readline)
├─ parse_hardware_metrics() (Regex patterns)
└─ hardware_history dict
↓
WebSocket
└─ Sends every 2 seconds (last 60 entries)
↓
Frontend (dashboard.html.jinja)
└─ 6 Plotly.js charts with live updates
Before each profiling run, HardwareMonitor.start() calls
_reset_measurements(). This clears prior temperature, power, VRAM, GTT,
CPU and RAM samples, so chart data and exported min/max/avg values only
reflect the current run.
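The parse_hardware_metrics() step in the flow above can be pictured as a set of regexes applied to each streamed log line; a hedged sketch (the actual patterns in app.py may differ):

import re

METRIC_PATTERNS = {
    "gpu_temp_c": re.compile(r"GPU Temp:\s*([\d.]+)°C"),
    "gpu_power_w": re.compile(r"GPU Power:\s*([\d.]+)W"),
    "gpu_vram_gb": re.compile(r"GPU VRAM:\s*([\d.]+)GB"),
    "cpu_percent": re.compile(r"CPU:\s*([\d.]+)%"),
    "ram_gb": re.compile(r"(?<!V)RAM:\s*([\d.]+)GB"),  # lookbehind avoids matching "VRAM"
}

def parse_metric_line(line: str) -> dict:
    # Extract any hardware metrics present in a single log line
    return {
        name: float(match.group(1))
        for name, pattern in METRIC_PATTERNS.items()
        if (match := pattern.search(line))
    }

print(parse_metric_line("🌡️ GPU Temp: 45.3°C"))  # {'gpu_temp_c': 45.3}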
🐛 Fixes and Optimizations
Fix 1: rocm-smi 7.0.1 Format Change
Problem: rocm-smi changed its output format
Solution: a regex parser extracts the last number from the line
match = re.search(r'[\d.]+\s*$', line.strip())
Fix 2: Logger Routing
Problem: hardware data did not appear in log files
Solution: print() → logger.info() for stdout + file
All hardware metrics are logged using Python's standard logging module:
logger.info(f"🌡️ GPU Temp: {temp:.1f}°C")
logger.info(f"💾 Memory: {vram_mb:.1f}MB VRAM + {gtt_mb:.1f}MB GTT")
This ensures metrics appear in all of:
- stdout - real-time display in the terminal
- log files - ~/.local/share/lm-studio-bench/logs/benchmark_YYYYMMDD_HHMMSS.log for a permanent record
- WebApp - streamed via WebSocket to the dashboard
Fix 3: WebApp Output Streaming
Problem: WebApp showed only 10% of the hardware data
Solution: asyncio.wait_for() → blocking readline() in executor
Fix 4: RAM Monitoring Spikes
Problem: RAM chart jumped between 1.8GB and 28.3GB
Solution: moving average over 3 samples → very stable curve
Fix 5: Runtime Counter Does Not Stop
Problem: runtime counter continued after benchmark end
Solution: clearInterval(uptimeInterval) on completion
Fix 6: WebApp Initialization Race Conditions
Problem: links were not interactive, light mode on startup
Solution: 3x DOMContentLoaded events → 1x consolidated event
📊 Chart Properties
All charts update every 2 seconds with:
- Min/Max/Avg statistics - real-time calculation
- Last 60 data points - about 2 minutes of history
- Responsive design - adapts to window size
- Dark mode - default for all charts
- Hover tooltips - show exact values on hover
LM Studio CLI - Available LLM Metadata with GPU Analysis
📋 Quick Reference
Main metadata query commands
lms ls --json # All downloaded models with metadata
lms ps --json # Currently loaded models
lms status # Server status + model size
lms version # LM Studio version
🎯 GPU Support and Hardware Requirements
Automatic GPU detection in the benchmark
The benchmark system automatically detects all your GPUs and specs:
NVIDIA GPUs:
- Automatic detection via
nvidia-smi - VRAM size recorded for offload optimization
- Temperature and power are monitored
AMD GPUs (rocm-smi):
- Detailed device ID mapping for GPU model names
- VRAM and GTT memory are tracked separately
- rocm-smi search paths: /usr/bin, /usr/local/bin, /opt/rocm-*/bin/
iGPU detection:
- Radeon iGPUs are extracted from the CPU string
- Regex pattern: Radeon\s+(\d+[A-Za-z]*)
- Shows, for example, "Radeon 890M (Ryzen 9 7950X3D)" separately
📊 Full Metadata Fields (15 fields per model)
Category 1: Model identification (5 fields)
| Field | Type | Example | Description |
|---|---|---|---|
type | string | "llm" | Model type (llm, embedding) |
modelKey | string | "mistralai/ministral-3-3b" | Unique model ID |
displayName | string | "Ministral 3 3B" | Display name |
publisher | string | "mistralai" | Model publisher/developer |
path | string | "mistralai/ministral-3-3b" | Local storage path |
Category 2: Technical specifications (4 fields)
| Field | Type | Example | Description |
|---|---|---|---|
architecture | string | "mistral3", "gemma3", "llama" | Model architecture |
format | string | "gguf" | File format (GGUF, etc.) |
paramsString | string | "3B", "7B", "13B" | Parameter size |
sizeBytes | number | 2986817071 | Size in bytes |
Category 3: Model capabilities (3 fields)
| Field | Type | Example | Description |
|---|---|---|---|
vision | boolean | true / false | Can process images? |
trainedForToolUse | boolean | true / false | Supports tool calling? |
maxContextLength | number | 131072, 262144 | Maximum context length in tokens |
Category 4: Quantization and variants (4 fields)
| Field | Type | Example | Description |
|---|---|---|---|
quantization.name | string | "Q4_K_M", "Q8_0", "F16" | Quantization method |
quantization.bits | number | 4, 8, 16 | Bits per weight |
variants | array | [@q4_k_m, @q8_0] | All available quantizations |
selectedVariant | string | "mistralai/ministral-3-3b@q4_k_m" | Current selection |
🔍 Practical Examples with Your Models
Example 1: List vision models
lms ls --json | jq '.[] | select(.vision == true) | {displayName, paramsString, maxContextLength}'
Output:
• Gemma 3 4B (4B) - 131072 tokens
• Ministral 3 3B (3B) - 262144 tokens
• Qwen3 Vl 8B (8B) - 262144 tokens
The command uses the jq filter shown above.
Example 2: Tool-calling models only
lms ls --json | jq '.[] | select(.trainedForToolUse == true) | .displayName'
Example 3: Sort models by size
lms ls --json | jq 'sort_by(.sizeBytes) | .[] | {displayName, sizeGB: (.sizeBytes/1024/1024/1024*100|round/100)}'
Example 4: Models with large context length (≥128k tokens)
lms ls --json | jq '.[] | select(.maxContextLength >= 131072) | {modelKey, maxContextLength}'
Example 5: Model architecture distribution
lms ls --json | jq -r '.[] | .architecture' | sort | uniq -c
🐍 Python SDK Access
SDK methods for metadata queries
import lmstudio
# 1. Fetch all downloaded models
models = lmstudio.list_downloaded_models()
for model in models:
print(f"Model: {model.model_key}")
print(f" Size: {model.info.sizeBytes / 1024**3:.2f} GB")
print(f" Vision: {model.info.vision}")
print(f" Maximum context length: {model.info.maxContextLength} tokens")
print(f" Architecture: {model.info.architecture}")
print()
# 2. Currently loaded models
loaded_models = lmstudio.list_loaded_models()
for llm in loaded_models:
print(f"Loaded: {llm.identifier}")
# 3. Filter models
vision_models = [m for m in models if m.info.vision]
print(f"Vision models: {len(vision_models)}")
# 4. Sort by size
large_models = sorted(models, key=lambda m: m.info.sizeBytes, reverse=True)[:3]
for model in large_models:
print(f"{model.info.displayName}: {model.info.sizeBytes / 1024**3:.2f} GB")
💡 Common Use Cases
Use case 1: Quick performance tests
Filter only small models < 1GB for fast benchmarks:
lms ls --json | jq '.[] | select(.sizeBytes < 1000000000) | .modelKey'
Use case 2: Long-form processing
Models with large context for document analysis:
lms ls --json | jq '.[] | select(.maxContextLength >= 100000) | .displayName'
Use case 3: Image processing
Multi-modal models for vision tasks:
lms ls --json | jq '.[] | select(.vision == true) | .modelKey'
Use case 4: Tool integration
Models with function calling for agent systems:
lms ls --json | jq '.[] | select(.trainedForToolUse == true) | .displayName'
Use case 5: Quantization comparison
All available quantizations for a model:
lms ls "google/gemma-3-1b" --json | jq '.variants[]'
🎯 Benchmarking with Metadata
Integration into benchmark scripts:
import subprocess
import json
# Load model metadata
result = subprocess.run(
['lms', 'ls', '--json'],
capture_output=True,
text=True,
check=False
)
models = json.loads(result.stdout)
# Filter for benchmarking
benchmark_candidates = [
m for m in models
if m['sizeBytes'] < 5e9 # < 5GB
and m['vision'] is False # Text only
]
print(f"Benchmark candidates: {len(benchmark_candidates)}")
for model in benchmark_candidates:
print(f" - {model['displayName']} ({model['paramsString']})")
📝 Tips and Tricks
Convert size
# Bytes to GB
python3 -c "print(f'{2986817071/1024**3:.2f} GB')" # Output: 2.78 GB
JSON pretty print
lms ls --json | jq '.' | less
Quick statistics
# Average model size
lms ls --json | jq '[.[].sizeBytes] | add / length / 1024 / 1024 / 1024'
# Largest model
lms ls --json | jq 'max_by(.sizeBytes) | .displayName'
# Models per architecture
lms ls --json | jq 'group_by(.architecture) | map({architecture: .[0].architecture, count: length})'
🔗 Related Commands
lms status # Server status (shows loaded models too)
lms version # LM Studio version
lms load <model> # Load a model
lms unload --all # Unload all models
Troubleshooting
No output for lms ls --json
- Ensure the LM Studio server is running: lms server start
- Check for port conflicts
jq not installed
- Install: sudo apt install jq (Linux) or brew install jq (macOS)
- Alternative: use Python parsing
Unlimited output
- Use | head -n 5 to limit output
- Or pipe to less for paging: | less
User Data & Configuration Locations
This project follows the XDG Base Directory Specification for storing user data and configuration.
Directory Structure
Project Directory
The project directory contains read-only defaults and optional compatibility locations:
<project>/
├── config/
│ └── defaults.json # Project defaults (in Git)
├── results/ # Optional: legacy/manual compatibility location
└── logs/ # Optional: legacy/manual debug location
User Directories (XDG Standard)
User-specific data is stored in standard XDG locations:
~/.config/lm-studio-bench/
├── defaults.json # User configuration overrides (optional)
└── presets/
├── my_fast_test.json # User preset example
└── my_quality.json # User preset example
~/.local/share/lm-studio-bench/results/
├── benchmark_results_<timestamp>.json
├── benchmark_results_<timestamp>.csv
├── benchmark_results_<timestamp>.pdf
├── benchmark_results_<timestamp>.html
├── benchmark_cache.db # SQLite benchmark cache
├── model_metadata.db # Model metadata cache
└── metadata/
└── <model_id>/
└── metadata.json # Optional per-model metadata fallback
~/.local/share/lm-studio-bench/logs/
├── benchmark_<timestamp>.log
├── benchmark_latest.log # Symlink to newest benchmark log
├── webapp_<timestamp>.log
├── webapp_latest.log # Symlink to newest webapp log
├── runapp_<timestamp>.log
├── runapp_latest.log # Symlink to newest launcher log
├── trayapp_<timestamp>.log
└── trayapp_latest.log # Symlink to newest tray log
Configuration Loading
Configuration is loaded with the following priority:
- CLI Arguments (highest priority)
- User Config (~/.config/lm-studio-bench/defaults.json)
- Project Config (config/defaults.json)
- Hard-coded Defaults (in code)
Example
Project (config/defaults.json):
{
"num_runs": 3,
"context_length": 2048,
"lmstudio": {
"use_rest_api": false
}
}
User (~/.config/lm-studio-bench/defaults.json):
{
"num_runs": 5,
"lmstudio": {
"use_rest_api": true
}
}
Result (merged configuration):
{
"num_runs": 5, // User override
"context_length": 2048, // Project default
"lmstudio": {
"use_rest_api": true // User override
}
}
With CLI:
./run.py --runs 10 --context 4096
Final configuration:
- num_runs: 10 (CLI)
- context_length: 4096 (CLI)
- use_rest_api: true (User config)
Creating User Configuration
Step 1: Create Config Directory
mkdir -p ~/.config/lm-studio-bench
Step 2: Create User Config File
nano ~/.config/lm-studio-bench/defaults.json
Step 3: Add Your Overrides
Only include fields you want to override:
{
"num_runs": 5,
"context_length": 4096,
"inference": {
"temperature": 0.7
}
}
Important: You only need to specify fields you want to change. All other values will use project defaults.
Directory Initialization
On first run, the tool automatically:
- Creates user data directories (~/.config/... and ~/.local/share/...)
- Places new results in ~/.local/share/lm-studio-bench/results/
- Places runtime logs in ~/.local/share/lm-studio-bench/logs/
Note: Legacy files in project-local results/ are not automatically
moved. If you still use that location, move them manually to the XDG path.
Benefits of XDG Structure
For Users
- ✅ Persistent User Settings: Configuration survives project updates
- ✅ Cleaner Project Directory: User data separated from code
- ✅ Standard Locations: Follows Linux conventions
- ✅ Easy Backups: Backup ~/.local/share/lm-studio-bench/ and ~/.config/lm-studio-bench/
- ✅ Multi-User Support: Each user has their own data
For Developers
- ✅ No Git Conflicts: User data not in version control
- ✅ Clean Updates: git pull doesn't affect user data
- ✅ Portable: Project directory can be moved/deleted without losing user data
Environment Variables
You can override paths with environment variables:
# Override config directory
export XDG_CONFIG_HOME="$HOME/my-configs"
# Override data directory
export XDG_DATA_HOME="$HOME/my-data"
# Now config is in: $HOME/my-configs/lm-studio-bench/defaults.json
# Now results are in: $HOME/my-data/lm-studio-bench/results/
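A minimal sketch of resolving these locations in Python, honoring the XDG variables above with their standard fallbacks:

import os
from pathlib import Path

def xdg_config_home() -> Path:
    return Path(os.environ.get("XDG_CONFIG_HOME", Path.home() / ".config"))

def xdg_data_home() -> Path:
    return Path(os.environ.get("XDG_DATA_HOME", Path.home() / ".local" / "share"))

config_file = xdg_config_home() / "lm-studio-bench" / "defaults.json"
results_dir = xdg_data_home() / "lm-studio-bench" / "results"
print(config_file)   # e.g. ~/.config/lm-studio-bench/defaults.json
print(results_dir)   # e.g. ~/.local/share/lm-studio-bench/results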
FAQ
Q: Where are my benchmark results stored?
A: ~/.local/share/lm-studio-bench/results/
If you pass --output-dir, report files (JSON/CSV/HTML/PDF) are written there.
The SQLite cache databases still live in the user results directory.
Q: Where are the SQLite databases stored?
A:
~/.local/share/lm-studio-bench/results/benchmark_cache.db~/.local/share/lm-studio-bench/results/model_metadata.db
Q: Where do I put custom configuration?
A: ~/.config/lm-studio-bench/defaults.json
Only include fields you want to override from project defaults.
Q: Where are user presets stored?
A: ~/.config/lm-studio-bench/presets/
Built-in readonly presets (default_classic,
default_compatibility_test, default as a legacy alias,
quick_test, high_quality, resource_limited) are not stored as
files.
Readonly preset names cannot be overwritten or deleted by user presets,
including the alias default.
Q: What happens to my old results?
A: They are not auto-migrated from legacy project-local folders.
Move them manually to ~/.local/share/lm-studio-bench/results/.
Q: Can I use the old config/defaults.json?
A: Yes! It's still used as project defaults. User config in ~/.config/ overrides it.
Q: How do I reset to project defaults?
A: Delete your user config:
rm ~/.config/lm-studio-bench/defaults.json
Q: How do I backup my data?
A: Backup these directories:
# Configuration
tar -czf lms-bench-config.tar.gz ~/.config/lm-studio-bench/
# Results and cache
tar -czf lms-bench-data.tar.gz ~/.local/share/lm-studio-bench/
Q: What about logs?
A: Logs are stored in:
~/.local/share/lm-studio-bench/logs/
This includes benchmark, web app, tray, and launcher logs.
See Also
- Configuration Reference - All configuration options
- Architecture Documentation - System design
- XDG Base Directory Spec - Standard specification
LM Studio REST API v1 Integration
Overview
The benchmark tool now supports LM Studio's native REST API v1 (/api/v1/*)
in addition to the existing Python SDK/CLI mode. This enables advanced
features such as stateful chats, parallel requests, and more precise metrics.
New Features
1. REST API Mode (--use-rest-api)
- Uses /api/v1/chat for inference instead of the Python SDK
- Stateful chat management (response_id tracking)
- Detailed stats in the response (TTFT, tokens/s, tokens in/out)
- Streaming events for more accurate measurement
2. Model Management via API
- GET /api/v1/models - list with capabilities (vision, tool-use)
- POST /api/v1/models/load - explicit load with configuration
- POST /api/v1/models/unload - explicit unload
- POST /api/v1/models/download - download model via API
3. Improved Capabilities Detection
- Vision support: capabilities.vision flag from the API
- Tool calling: capabilities.trained_for_tool_use flag
- Use the --only-vision or --only-tools filters
4. Parallel Inference (LM Studio 0.4.0+)
- --n-parallel N - max concurrent predictions (default: 4)
- --unified-kv-cache - optimizes VRAM usage for parallel requests
- Continuous batching support (llama.cpp 2.0+)
5. API Authentication
- --api-token TOKEN - permission key for protected servers
- Config: lmstudio.api_token in config/defaults.json
Usage
Basic usage (REST API mode)
# REST API with default settings
./run.py --use-rest-api --limit 1
# With API token
./run.py --use-rest-api --api-token "your-token-here" --limit 1
# With parallel requests (LM Studio 0.4.0+)
./run.py --use-rest-api --n-parallel 8 --unified-kv-cache --limit 1
Filter by capabilities
# Test only vision-capable models
./run.py --use-rest-api --only-vision --runs 2
# Test only tool-calling models
./run.py --use-rest-api --only-tools --runs 2
Config file (persistent)
config/defaults.json:
{
"lmstudio": {
"host": "localhost",
"ports": [1234, 1235],
"api_token": "your-token-here",
"use_rest_api": true
}
}
Then simply:
./run.py --limit 1 # will automatically use REST API from config
Comparison: SDK vs. REST API
| Feature | SDK/CLI Mode | REST API Mode |
|---|---|---|
| Model Loading | lms load CLI | POST /api/v1/models/load |
| Inference | lmstudio.llm() | POST /api/v1/chat |
| Stats | SDK stats object | Detailed response stats |
| Streaming | SDK stream | SSE stream (Server-Sent Events) |
| Parallel Requests | ❌ | ✅ (with --n-parallel) |
| Stateful Chats | ❌ | ✅ (response_id tracking) |
| Capabilities | Metadata parsing | Native API fields |
| Authentication | ❌ | ✅ (permission keys) |
API Response Format
Dashboard summary API (/api/dashboard/stats)
The web dashboard now exposes additional summary fields for quick visual analysis of benchmark history. The endpoint is consumed by the Home and Results views to render KPI cards and charts.
New response fields:
- speed_summary: min, p50, avg, p95, max tokens/s
- top_models_extended: Top 10 models by speed (model, quantization, speed, VRAM, architecture)
- quantization_distribution: count per quantization
- architecture_distribution: count per architecture
- efficiency_top: top models ranked by tokens_per_sec_per_gb
Example (excerpt):
{
"speed_summary": {
"min": 22.44,
"p50": 48.17,
"avg": 51.26,
"p95": 86.11,
"max": 93.88
},
"top_models_extended": [
{
"model_name": "qwen/qwen3-4b@q4_k_m",
"quantization": "q4_k_m",
"speed": 93.88,
"vram_mb": "6144",
"architecture": "qwen3"
}
],
"quantization_distribution": {
"q4_k_m": 22,
"q5_k_m": 13
}
}
/api/v1/chat stats
{
"text": "... generated text ...",
"stats": {
"tokens_in": 42,
"tokens_out": 128,
"time_to_first_token_ms": 234.5,
"total_time_ms": 1523.8,
"tokens_per_second": 84.02
}
}
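The tokens_per_second value can be cross-checked from the other fields; a small example using the numbers above (assuming it is computed over the total request time):

stats = {
    "tokens_in": 42,
    "tokens_out": 128,
    "time_to_first_token_ms": 234.5,
    "total_time_ms": 1523.8,
    "tokens_per_second": 84.02,
}

# Output tokens divided by total wall time in seconds
derived = stats["tokens_out"] / (stats["total_time_ms"] / 1000)
print(f"{derived:.2f} tok/s")  # ≈ 84.00, close to the reported tokens_per_second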
/api/v1/models capabilities
{
"models": [
{
"key": "llava-1.6-vicuna-7b-q4_k_m",
"capabilities": {
"vision": true,
"trained_for_tool_use": false
}
},
{
"key": "qwen-2.5-coder-14b-instruct-q5_k_m",
"capabilities": {
"vision": false,
"trained_for_tool_use": true
}
}
]
}
Implementation details
New files
- core/client.py: REST API client with wrapper functions
  - LMStudioRESTClient: main class
  - ModelInfo, ModelCapabilities, ChatStats: data classes
  - is_vision_model(), is_tool_model(): helpers
Modified files
- cli/benchmark.py:
  - _run_inference(): dispatcher (SDK vs REST)
  - _run_inference_rest(): REST-based inference
  - _run_inference_sdk(): SDK-based inference (renamed)
  - _load_model_rest(), _unload_model_rest(): REST model management
- config/defaults.json: added api_token, use_rest_api fields
- core/config.py: new config fields in BASE_DEFAULT_CONFIG
CLI flags
--use-rest-api Enable REST API mode
--api-token TOKEN API permission token
--n-parallel N Max parallel predictions (REST only)
--unified-kv-cache Unified KV cache (REST only)
Troubleshooting
Server unreachable
# Check whether LM Studio is running
curl http://localhost:1234/
# Healthcheck via CLI
lms server status
API token errors
# Generate token in Settings > Server
# Save it in config or pass via CLI
./run.py --use-rest-api --api-token "lms_..."
REST vs SDK performance
- REST: more precise stats, more features
- SDK: slightly faster (direct Python access)
- For benchmarking, REST is recommended (better metrics)
Additional REST Client Features
1. Download Progress Tracking
The REST client now supports real-time download progress monitoring:
from core.client import LMStudioRESTClient
client = LMStudioRESTClient()
def on_progress(status):
if status["state"] == "downloading":
print(f"Progress: {status['progress'] * 100:.1f}%")
# Wait for download to complete with progress updates
success = client.download_model(
model_key="qwen/qwen3-1.7b",
wait_for_completion=True,
progress_callback=on_progress
)
API: Polls /api/v1/models/download/status every 2 seconds until completion.
2. MCP Integration
Model Context Protocol (MCP) servers can now be attached to chat requests:
# LM Studio v1 API format
mcp_integrations = [
{
"type": "ephemeral_mcp",
"server_label": "filesystem",
"server_url": "http://localhost:3001/mcp"
}
]
result = client.chat_stream(
messages=[{"role": "user", "content": "List files in /tmp"}],
model="qwen/qwen3-4b",
mcp_integrations=mcp_integrations
)
Note: Requires MCP server running. Integrations are passed in the integrations array field.
3. Stateful Chat History
Enable multi-turn conversations with automatic response_id tracking:
client = LMStudioRESTClient()
# First message
result1 = client.chat_stream(
messages=[{"role": "user", "content": "What is 2+2?"}],
model="qwen/qwen3-4b",
use_stateful=True
)
# response_id stored automatically
# Second message - automatically includes previous_response_id
result2 = client.chat_stream(
messages=[{"role": "user", "content": "Add 3 to that."}],
model="qwen/qwen3-4b",
use_stateful=True
)
# Server can maintain conversation context
# Reset state when starting new conversation
client.reset_stateful_chat()
API: Extracts response_id from chat.end event, sends previous_response_id in subsequent requests.
4. Response Caching
Identical requests are cached in memory for instant responses:
client = LMStudioRESTClient(enable_cache=True)
# First request - hits API (slow)
result1 = client.chat_stream(
messages=[{"role": "user", "content": "Count to 5"}],
model="qwen/qwen3-4b",
temperature=0.5
)
# Time: ~0.5s
# Second identical request - hits cache (instant)
result2 = client.chat_stream(
messages=[{"role": "user", "content": "Count to 5"}],
model="qwen/qwen3-4b",
temperature=0.5
)
# Time: ~0.0s (10,000x faster!)
# Cache management
cache_size = len(client._RESPONSE_CACHE) # Check cache size
cleared = client.clear_cache() # Clear all cached responses
Cache Key: MD5 hash of (messages, model, temperature)
Bypassed: When using use_stateful=True or mcp_integrations (non-deterministic)
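A sketch of building such a cache key (MD5 over messages, model, and temperature); the client's exact serialization is an assumption:

import hashlib
import json

def response_cache_key(messages, model, temperature):
    # Deterministic key: identical requests map to the same cached response
    payload = json.dumps(
        {"messages": messages, "model": model, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

print(response_cache_key([{"role": "user", "content": "Count to 5"}], "qwen/qwen3-4b", 0.5))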
Documentation links
- LM Studio REST API Docs
- /api/v1/models endpoint
- /api/v1/chat endpoint
- Headless mode
- LM Studio 0.4.0 blog
Capability-Driven Benchmark Agent Integration
The new Capability-Driven Benchmark Agent functionality is fully integrated into the project and is now available via run.py.
3 Operating Modes
The system now supports 3 different operating modes:
1. Classic Benchmark (Default)
Measures token/s speed across all installed models:
./run.py --limit 5 # Test 5 models
./run.py --export-only # Generate reports from cache
./run.py --runs 1 # Fast-mode with 1 measurement
Metrics: Tokens/s, latency, VRAM usage
2. Capability-Driven Agent ⭐ NEW
Tests model capabilities with quality metrics:
./run.py --agent "model-id" # Automatically test all capabilities
# With specific capabilities
./run.py --agent "llama-13b" --capabilities general_text,reasoning
# With output format options
./run.py --agent "llama-13b" --output-dir ./results/ --formats json,html
# Verbose mode
./run.py --agent "llama-13b" --verbose
Detectable Capabilities:
- general_text - Basic language understanding (QA, summarization, classification)
- reasoning - Logical and mathematical reasoning
- vision - Multimodal understanding (image captioning, VQA, OCR)
- tooling - Tool calling and function execution
Metrics per Capability:
- Quality: ROUGE, F1, Exact Match, Accuracy, Function Call Accuracy
- Performance: Tokens/s, latency
- Reports: JSON + HTML with visualizations
- Storage: SQLite database for historical tracking and comparison
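Of the quality metrics listed above, Exact Match and token-level F1 are straightforward to compute; a hedged sketch (the agent's actual scorers, e.g. its ROUGE implementation, may differ):

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the normalized strings are identical, else 0.0
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    # Token-level F1 between prediction and reference
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum(min(pred.count(tok), ref.count(tok)) for tok in set(pred))
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                              # 1.0
print(token_f1("the capital is Paris", "Paris is the capital"))   # 1.0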
Runtime Resilience:
- Multi-model capability runs continue when a single model fails to load or execute; failed models are logged and skipped.
- Embedding models are retried automatically without offload_kv_cache_to_gpu if LM Studio rejects that load option.
Data Storage:
Results are automatically saved to:
- JSON Reports: ./output/benchmark_results_*.json
- HTML Reports: ./output/benchmark_results_*.html
- SQLite Cache: ~/.local/share/lm-studio-bench/results/benchmark_cache.db
The SQLite database stores individual test results and capability summaries, allowing you to:
- Track performance over time
- Compare results across models
- Query specific capability metrics
- Build custom dashboards from cached data
SQLite Metrics Matrix (Classic vs Capability)
The table below lists what is currently persisted in SQLite for both test types, so missing metrics are easy to spot.
| Metric Group | Classic Benchmark (benchmark_results) | Capability Benchmark (benchmark_results, source='compatibility') |
|---|---|---|
| Run identity | id, model_key, model_name, quantization, timestamp | id, model_name, model_key, capability, test_id, test_name, timestamp |
| Throughput/latency | avg_tokens_per_sec, avg_ttft, avg_gen_time, tokens_per_sec_p50, tokens_per_sec_p95, tokens_per_sec_std, ttft_p50, ttft_p95, ttft_std | latency_ms, throughput_tokens_per_sec (per test), avg_latency_ms, avg_throughput (summary) |
| Token volume | prompt_tokens, completion_tokens | prompt_tokens, tokens_generated |
| Quality metrics | Stored for parity columns but normally NULL for classic runs | quality_score, rouge_score, f1_score, exact_match_score, accuracy_score, function_call_accuracy, avg_quality_score, avg_rouge, avg_f1, avg_exact_match, avg_accuracy |
| Success/failure | success, error_message, error_count | success, error_message (per test), total_tests, successful_tests, failed_tests, success_rate, error_count |
| Hardware profiling | gpu_type, gpu_offload, vram_mb, temp_celsius_min/max/avg, power_watts_min/max/avg, vram_gb_min/max/avg, gtt_gb_min/max/avg, cpu_percent_min/max/avg, ram_gb_min/max/avg | Same run-level hardware fields are persisted on each capability test row |
| Inference/load params | context_length, temperature, top_k_sampling, top_p_sampling, min_p_sampling, repeat_penalty, max_tokens, n_gpu_layers, n_batch, n_threads, flash_attention, rope_freq_base, rope_freq_scale, use_mmap, use_mlock, kv_cache_quant | Same run-level inference/load fields are persisted on each capability test row |
| Environment/version | lmstudio_version, app_version, nvidia_driver_version, rocm_driver_version, intel_driver_version, os_name, os_version, cpu_model, python_version | Same environment/version fields are persisted on each capability test row |
| Derived/comparison | tokens_per_sec_per_gb, tokens_per_sec_per_billion_params, speed_delta_pct, prev_timestamp | Same derived/comparison fields are persisted on each capability test row |
| Raw text/reference | prompt (full input prompt), raw_output, reference_output | prompt, raw_output, reference_output |
Quick gap summary
- Missing in capability mode: TTFT distribution stats and classic-only aggregate throughput percentiles.
- Missing in classic mode: meaningful per-test quality metrics (ROUGE/F1/Exact/Accuracy) because classic benchmarks do not execute capability test cases.
Variant selection in REST mode
- Capability mode now forwards the exact requested model identifier, including any `@quantization` suffix, to the LM Studio REST API.
- This keeps `load`, `chat`, and `unload` aligned with the selected variant and avoids silently falling back to a server-side default quantization.
3. Web Dashboard
Modern web UI with live streaming and configuration:
./run.py --webapp # Starts on http://localhost:8080
./run.py -w # Short form
Agent Options
./run.py --agent MODEL_PATH [OPTIONS]
OPTIONS:
--capabilities CAPS Comma-separated capabilities
(general_text, reasoning, vision, tooling)
--output-dir DIR Output directory (default: output)
--config FILE YAML configuration file
--formats FORMATS Output formats: json,html (default: json,html)
--max-tests N Max tests per capability
--context-length N Model context length (default: 2048)
--gpu-offload RATIO GPU offload ratio 0.0-1.0 (default: 1.0)
--temperature TEMP Generation temperature (default: 0.1)
-v, --verbose Enable verbose logging
Test Data and Prompts
The following test files are available:
tests/
├── data/
│ ├── text/
│ │ ├── qa_samples.json # QA examples
│ │ ├── reasoning_samples.json # Reasoning examples
│ │ └── tooling_samples.json # Tool-calling examples
│ └── images/
│ └── README.md # Vision datasets
└── prompts/
├── general_text_qa.md
├── general_text_summarization.md
├── reasoning_logical.md
├── reasoning_math.md
├── tooling_function_call.md
├── vision_caption.md
└── vision_vqa.md
Example Executions
# All capabilities (auto-detected)
./run.py --agent "my-model" --output-dir results/
# Only General Text and Reasoning
./run.py --agent "my-model" --capabilities general_text,reasoning
# With custom config
./run.py --agent "my-model" --config config/bench.yaml
# Verbose with all details
./run.py --agent "my-model" --verbose --max-tests 20
# Classic benchmark still available
./run.py --limit 10 --runs 3
Code Structure
cli/
├── main.py # CLI entrypoint for agent
├── __main__.py # Makes cli package executable
├── benchmark.py # Classic benchmark runner
├── metrics.py # Metric implementations
├── reporting.py # JSON & HTML report generation
└── report_template.html.template
config/
└── bench.yaml # Default configuration
agents/
├── benchmark.py # Benchmark executor
├── runner.py # Test orchestration
└── capabilities.py # Capability detection
core/
├── config.py # Configuration loading
├── paths.py # XDG/user path handling
├── client.py # LM Studio REST API client
└── tray.py # Linux tray controller
Documentation
- README-bench.md - Detailed agent documentation
- ARCHITECTURE.md - System architecture
- CONFIGURATION.md - Configuration guide
Logging
Capability benchmark logs use automatic level icons in addition to benchmark-specific emoji markers:
- 🐛 Debug
- ℹ️ Info
- ⚠️ Warning
- ❌ Error
- 🔥 Critical
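A minimal way to reproduce that style in your own scripts is a `logging.Formatter` that prefixes the level icon. This is only a sketch of the idea, not the project's actual logging setup.

```python
import logging

LEVEL_ICONS = {"DEBUG": "🐛", "INFO": "ℹ️", "WARNING": "⚠️", "ERROR": "❌", "CRITICAL": "🔥"}

class IconFormatter(logging.Formatter):
    def format(self, record):
        icon = LEVEL_ICONS.get(record.levelname, "")
        return f"{icon} {super().format(record)}"

handler = logging.StreamHandler()
handler.setFormatter(IconFormatter("%(message)s"))
log = logging.getLogger("bench-demo")
log.addHandler(handler)
log.setLevel(logging.DEBUG)

log.info("model loaded")      # ℹ️ model loaded
log.warning("VRAM is tight")  # ⚠️ VRAM is tight
```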
Capability-Driven Benchmark Agent for LM Studio Bench
This benchmark agent implements capability-driven evaluation for language models and multimodal models. It detects model capabilities, runs targeted tests, computes quality metrics, and generates comprehensive reports.
Features
- Automatic capability detection (general text, reasoning, vision, tooling)
- Per-capability test suites with standardized prompts
- Quality metrics: ROUGE, F1, Exact Match, Accuracy, Function Call Accuracy
- Performance metrics: tokens/sec, latency
- Machine-readable JSON and human-friendly HTML reports
- CLI interface with extensive configuration options
- Docker support for containerized execution
- GitHub Actions integration for CI/CD benchmarking
Quick Start
Local Execution
Run a benchmark on a model:
python -m cli.main "path/to/model" --output-dir output
Run across installed models:
python -m cli.main --all-models --output-dir output
python -m cli.main --random-models 5 --output-dir output
With specific capabilities:
python -m cli.main "model-id" \
--capabilities general_text,reasoning \
--output-dir results
Using Docker
Build the Docker image:
docker build -f scripts/Dockerfile.bench -t lm-bench-agent .
Run benchmark in container:
docker run -v $(pwd)/output:/app/output \
lm-bench-agent "model-path" \
--output-dir /app/output
Capabilities
The agent supports four primary capabilities:
1. General Text
Tests basic language understanding and generation:
- Question answering
- Summarization
- Classification
Metrics: ROUGE-1, ROUGE-L, F1
2. Reasoning
Tests logical and mathematical reasoning:
- Logical reasoning (syllogisms)
- Math problem solving
- Chain-of-thought reasoning
Metrics: Exact Match, F1, Accuracy
3. Vision
Tests multimodal understanding (requires vision models):
- Image captioning
- Visual Question Answering (VQA)
- OCR and visual reasoning
Metrics: Accuracy, ROUGE-L
4. Tooling
Tests function calling and tool use:
- Function selection
- Parameter extraction
- API interaction patterns
Metrics: Function Call Accuracy, Parameter Accuracy
CLI Reference
Basic Usage
python -m cli.main MODEL_PATH [OPTIONS]
Arguments
MODEL_PATH: Path to model or model identifier (required)
Options
Model Configuration
- `--model-name NAME`: Override model name (default: derived from path)
- `--all-models`: Run the capability benchmark for all installed models
- `--random-models N`: Run the capability benchmark for `N` random installed models
- `--capabilities CAPS`: Comma-separated capabilities to test
  - Options: `general_text`, `reasoning`, `vision`, `tooling`
  - Default: Auto-detect from model metadata
Output Configuration
- `--output-dir DIR`: Output directory (default: `output`)
- `--formats FMTS`: Output formats: `json`, `html` (default: both)
Test Configuration
- `--max-tests N`: Maximum tests per capability (default: 10)
- `--config FILE`: Path to YAML configuration file
Model Parameters
- `--context-length N`: Model context length (default: 2048)
- `--gpu-offload RATIO`: GPU offload ratio 0.0-1.0 (default: 1.0)
- `--temperature T`: Generation temperature (default: 0.1)
Other
- `--verbose`, `-v`: Enable verbose logging
Examples
Benchmark with custom configuration:
python -m cli.main "mymodel" \
--config custom_config.yaml \
--max-tests 20 \
--verbose
Test only reasoning capability:
python -m cli.main "reasoning-model" \
--capabilities reasoning \
--temperature 0.0 \
--max-tests 50
Generate only JSON output:
python -m cli.main "model" \
--formats json \
--output-dir json_results
Run against random installed models:
python -m cli.main --random-models 3 --capabilities general_text,reasoning
Runtime Behavior
- When running across multiple installed models, a single model failure is logged and skipped so the benchmark can continue.
- For embedding models loaded through the LM Studio REST API, the loader automatically retries without `offload_kv_cache_to_gpu` if LM Studio rejects that option (see the sketch below).
- Log output includes automatic level icons such as ℹ️, ⚠️, and ❌ in addition to benchmark-specific emoji markers.
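The retry-without-`offload_kv_cache_to_gpu` behavior boils down to a simple fallback pattern. The sketch below illustrates it with a hypothetical `load_model(model_id, config)` callable and a generic exception check; the real loader's signature, option handling, and error inspection may differ.

```python
def load_with_fallback(load_model, model_id, load_config):
    # `load_model` is a hypothetical callable wrapping the REST load request;
    # the real loader inspects LM Studio's rejection message before retrying.
    try:
        return load_model(model_id, load_config)
    except RuntimeError:
        if "offload_kv_cache_to_gpu" not in load_config:
            raise
        # Drop the rejected option and retry the load exactly once.
        retry_config = dict(load_config)
        retry_config.pop("offload_kv_cache_to_gpu")
        return load_model(model_id, retry_config)
```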
Configuration File
The agent reads configuration from config/bench.yaml by default. Override with --config flag.
Configuration Schema
context_length: 2048
gpu_offload: 1.0
temperature: 0.1
max_tokens: 256
max_tests_per_capability: 10
use_rest_api: true
data_dir: tests/data
prompts_dir: tests/prompts
timeout_seconds: 300
metric_weights:
  general_text:
    rouge-1: 0.3
    rouge-l: 0.4
    f1: 0.3
  reasoning:
    exact_match: 0.5
    f1: 0.3
    accuracy: 0.2
  vision:
    accuracy: 0.6
    rouge-l: 0.4
  tooling:
    function_call_accuracy: 0.7
    accuracy: 0.3
composite_score_weights:
  quality: 0.6
  performance: 0.2
  efficiency: 0.2
lmstudio:
  host: localhost
  ports:
    - 1234
    - 1235
  api_token: null
Key Configuration Options
- `context_length`: Maximum context length for the model
- `gpu_offload`: GPU memory allocation (0.0 = CPU only, 1.0 = full GPU)
- `max_tests_per_capability`: Limit tests to prevent long runs
- `metric_weights`: Per-capability metric importance
- `composite_score_weights`: Overall score composition
Output Format
JSON Report
The JSON report follows this schema:
{
  "schema_version": "1.0",
  "generated_at": "2025-01-15T10:30:00",
  "report": {
    "model_name": "model-name",
    "model_path": "path/to/model",
    "capabilities": ["general_text", "reasoning"],
    "timestamp": "2025-01-15T10:30:00",
    "summary": {
      "total_tests": 20,
      "successful_tests": 19,
      "success_rate": 0.95,
      "avg_latency_ms": 245.6,
      "avg_quality_score": 0.823,
      "avg_throughput_tokens_per_sec": 42.3,
      "by_capability": {
        "general_text": {
          "test_count": 10,
          "avg_quality_score": 0.856,
          "success_rate": 1.0
        }
      }
    },
    "results": [
      {
        "test_id": "qa_001",
        "capability": "general_text",
        "latency_ms": 230.5,
        "tokens_generated": 12,
        "throughput": 52.1,
        "quality_score": 0.89,
        "metrics": [
          {
            "name": "rouge-1",
            "value": 0.85,
            "normalized": 0.85
          }
        ],
        "error": null
      }
    ],
    "config": {},
    "raw_outputs_dir": "output/raw"
  }
}
HTML Report
The HTML report provides:
- Summary statistics with visual indicators
- Per-test results table with status, latency, and quality scores
- Capability breakdown with aggregated metrics
- Color-coded quality scores (green/yellow/red)
Raw Outputs
Individual test outputs are saved in output/raw/:
{
"test_id": "qa_001",
"capability": "general_text",
"prompt": "What is the capital of France?",
"response": "Paris",
"latency_ms": 230.5,
"tokens_generated": 12,
"throughput": 52.1,
"timestamp": 1642244400.123,
"error": null
}
GitHub Actions Integration
The workflow .github/workflows/bench.yml enables CI benchmarking.
Triggering the Workflow
Manual Trigger
- Go to Actions tab in GitHub
- Select "Capability-Driven Benchmark"
- Click "Run workflow"
- Enter model path and capabilities
- Click "Run workflow"
Scheduled Trigger
Runs automatically every Sunday at midnight (UTC).
Push Trigger
Runs on push to main or dev branches.
Note: the benchmark step currently reads the model path only from
manual workflow_dispatch inputs. Push- and schedule-triggered
runs therefore skip the actual benchmark unless you adapt the
workflow to read the model path from another configuration source
(for example, a repository variable or secret).
Workflow Outputs
The workflow uploads three artifacts:
- benchmark-results-json: JSON reports (30-day retention)
- benchmark-results-html: HTML reports (30-day retention)
- benchmark-raw-outputs: Raw test outputs (7-day retention)
For pull requests, a summary comment is posted with key metrics.
Adding Test Data
General Text Tests
Add test cases to tests/data/text/qa_samples.json:
{
"id": "qa_004",
"prompt": "Your question here",
"reference": "Expected answer",
"category": "domain"
}
Reasoning Tests
Add to tests/data/text/reasoning_samples.json:
{
"id": "reasoning_004",
"prompt": "Problem statement",
"reference": "Answer",
"reasoning": "Explanation of solution",
"category": "math"
}
Vision Tests
Place images in tests/data/images/ and reference them in test cases.
Tooling Tests
Add to tests/data/text/tooling_samples.json:
{
"id": "tool_004",
"task": "Task description",
"expected_function": "function_name",
"expected_parameters": {"param": "value"},
"category": "function_calling"
}
Customizing Prompts
Prompt templates are in tests/prompts/:
- `general_text_qa.md`: Question answering
- `general_text_summarization.md`: Summarization
- `reasoning_logical.md`: Logical reasoning
- `reasoning_math.md`: Math problems
- `vision_caption.md`: Image captioning
- `vision_vqa.md`: Visual QA
- `tooling_function_call.md`: Function calling
Edit templates to adjust instruction format or add few-shot examples.
Troubleshooting
Model Loading Fails
Ensure LM Studio is running and the model is available:
lms status
lms models list
No Tests Execute
Check that test data files exist:
ls tests/data/text/
Verify capabilities are correctly specified:
python -m cli.main "model" --capabilities general_text --verbose
Metrics Are Zero
This usually means:
- Model output format doesn't match expected format
- Reference answers need normalization
- Wrong capability assigned to test
Check raw outputs in output/raw/ to inspect actual responses.
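When the cause is formatting rather than model quality, normalizing both prediction and reference before scoring usually fixes it. The helper below is a sketch of the kind of normalization that helps, not the agent's exact implementation:

```python
import re
import string

def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

print(normalize_answer("The capital is Paris."))                 # -> "capital is paris"
print(normalize_answer("paris") == normalize_answer("Paris!"))   # True
```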
Timeout Errors
Increase timeout in config:
timeout_seconds: 600
Or reduce test count:
python -m cli.main "model" --max-tests 5
API Integration
Using as a Library
from pathlib import Path
from agents.runner import BenchmarkRunner
from cli.reporting import generate_reports
config = {
"context_length": 2048,
"max_tests_per_capability": 5,
"use_rest_api": True
}
runner = BenchmarkRunner(
config=config,
output_dir=Path("output")
)
report = runner.run(
model_path="mymodel",
model_name="MyModel",
capabilities=["general_text"]
)
outputs = generate_reports(
report_data=report,
output_dir=Path("output"),
formats=["json", "html"]
)
print(f"JSON: {outputs['json']}")
print(f"HTML: {outputs['html']}")
Custom Model Adapter
Implement ModelAdapter interface:
from agents.benchmark import ModelAdapter, InferenceResult
class CustomAdapter(ModelAdapter):
    def load(self, model_path, **kwargs):
        # Load or connect to your model backend here
        pass

    def unload(self):
        # Release the model and free any resources
        pass

    def infer(self, prompt, image_path=None, **kwargs):
        # Run inference and wrap the outcome in an InferenceResult
        return InferenceResult(...)

    def is_loaded(self):
        return True
Use with runner:
adapter = CustomAdapter()
report = runner.run(
model_path="model",
adapter=adapter
)
Architecture
Components
- `agents/capabilities.py`: Capability detection logic
- `agents/benchmark.py`: Core benchmark agent and model adapters
- `agents/runner.py`: Test orchestration and loading
- `cli/metrics.py`: Metric implementations
- `cli/reporting.py`: Report generation (JSON, HTML)
- `cli/main.py`: Command-line interface
- `config/bench.yaml`: Default configuration
- `tests/data/`: Test datasets
- `tests/prompts/`: Prompt templates
Data Flow
- CLI parses arguments and loads configuration
- Runner detects capabilities from model metadata or flags
- Test loader creates test cases for detected capabilities
- Model adapter loads the model
- Agent runs each test case:
- Executes inference
- Saves raw output
- Computes metrics
- Reporter generates JSON and HTML from results
- Outputs are saved to disk
License
This benchmark agent is part of LM-Studio-Bench and follows the same license.
Contributing
Contributions are welcome:
- Add new capabilities
- Implement new metrics
- Expand test datasets
- Improve prompt templates
- Enhance reporting formats
Follow the coding standards in .github/instructions/code-standards.instructions.md.
SQLite Metric Parity Map
This table is intentionally compact: one metric per row.
Legend:
- `[x]` = metric is stored in both test modes
- `[ ]` = metric is missing in at least one mode
Notes:
- Capability rows normalize quantization to an uppercase label such as `Q4_K_M`; classic rows keep the classic benchmark format such as `q4_k_m`.
- Capability `lmstudio_version` stores a parsed version or `pkg_version (commit:<sha>)`, not the raw `lms version` banner output.
- Capability REST runs forward the exact model variant key, including the `@quantization` suffix, to LM Studio load/chat/unload requests.
- Classic rows intentionally leave capability-only fields such as `quality_score`, `raw_output`, `reference_output`, `capability`, and `test_id` empty.
- Historical rows created before recent schema/runtime fixes may still contain `NULL` values in parity columns. New rows should populate them.
| Metric | benchmark_results (classic) | benchmark_results (compatibility) | Stored in both tests |
|---|---|---|---|
| Row id | id | id | [x] |
| Model name | model_name | model_name | [x] |
| Timestamp | timestamp | timestamp | [x] |
| Model path/source | model_key | model_key | [x] |
| Capability label | capability | capability | [x] |
| Test case id | test_id | test_id | [x] |
| Test case name | test_name | test_name | [x] |
| Quantization | quantization | quantization | [x] |
| Inference params hash | inference_params_hash | inference_params_hash | [x] |
| Tokens per second | avg_tokens_per_sec | avg_tokens_per_sec | [x] |
| Latency | avg_gen_time | avg_gen_time | [x] |
| TTFT | avg_ttft | avg_ttft | [x] |
| Prompt token count | prompt_tokens | prompt_tokens | [x] |
| Completion/generated tokens | completion_tokens | tokens_generated | [x] |
| Primary quality score | quality_score | quality_score | [x] |
| ROUGE | rouge_score | rouge_score | [x] |
| F1 | f1_score | f1_score | [x] |
| Exact match | exact_match_score | exact_match_score | [x] |
| Accuracy | accuracy_score | accuracy_score | [x] |
| Function-call accuracy | function_call_accuracy | function_call_accuracy | [x] |
| Success flag | success | success | [x] |
| Error message | error_message | error_message | [x] |
| Error counter | error_count | error_count | [x] |
| Total tests per capability | - | aggregate COUNT(*) by capability | [ ] |
| Successful tests per capability | - | aggregate SUM(success = 1) | [ ] |
| Failed tests per capability | - | aggregate SUM(success != 1) | [ ] |
| Success rate per capability | - | derived aggregate (successful / total) | [ ] |
| GPU type | gpu_type | gpu_type | [x] |
| GPU offload ratio | gpu_offload | gpu_offload | [x] |
| VRAM (MB) | vram_mb | vram_mb | [x] |
| Temperature stats | temp_celsius_min/max/avg | temp_celsius_min/max/avg | [x] |
| Power stats | power_watts_min/max/avg | power_watts_min/max/avg | [x] |
| VRAM GB stats | vram_gb_min/max/avg | vram_gb_min/max/avg | [x] |
| GTT GB stats | gtt_gb_min/max/avg | gtt_gb_min/max/avg | [x] |
| CPU usage stats | cpu_percent_min/max/avg | cpu_percent_min/max/avg | [x] |
| RAM GB stats | ram_gb_min/max/avg | ram_gb_min/max/avg | [x] |
| Context length | context_length | context_length | [x] |
| Temperature sampling param | temperature | temperature | [x] |
| Top-K sampling param | top_k_sampling | top_k_sampling | [x] |
| Top-P sampling param | top_p_sampling | top_p_sampling | [x] |
| Min-P sampling param | min_p_sampling | min_p_sampling | [x] |
| Repeat penalty | repeat_penalty | repeat_penalty | [x] |
| Max tokens param | max_tokens | max_tokens | [x] |
| GPU layer setting | n_gpu_layers | n_gpu_layers | [x] |
| Batch setting | n_batch | n_batch | [x] |
| Thread setting | n_threads | n_threads | [x] |
| Flash attention setting | flash_attention | flash_attention | [x] |
| RoPE base setting | rope_freq_base | rope_freq_base | [x] |
| RoPE scale setting | rope_freq_scale | rope_freq_scale | [x] |
| mmap setting | use_mmap | use_mmap | [x] |
| mlock setting | use_mlock | use_mlock | [x] |
| KV cache quant setting | kv_cache_quant | kv_cache_quant | [x] |
| LM Studio version | lmstudio_version | lmstudio_version | [x] |
| App version | app_version | app_version | [x] |
| Driver versions | nvidia/rocm/intel_driver_version | nvidia/rocm/intel_driver_version | [x] |
| OS info | os_name, os_version | os_name, os_version | [x] |
| CPU model | cpu_model | cpu_model | [x] |
| Python version | python_version | python_version | [x] |
| Benchmark duration | benchmark_duration_seconds | benchmark_duration_seconds | [x] |
| Raw model output | raw_output | raw_output | [x] |
| Reference output | reference_output | reference_output | [x] |
| Efficiency per GB | tokens_per_sec_per_gb | tokens_per_sec_per_gb | [x] |
| Efficiency per B params | tokens_per_sec_per_billion_params | tokens_per_sec_per_billion_params | [x] |
| Speed delta vs previous | speed_delta_pct | speed_delta_pct | [x] |
| Previous timestamp link | prev_timestamp | prev_timestamp | [x] |
| Prompt hash | prompt_hash | prompt_hash | [x] |
| Full params hash | params_hash | params_hash | [x] |
| Prompt text | prompt | prompt | [x] |
Historical Validation Queries
Use these queries to find older rows that predate parity fixes.
-- Classic rows that still miss parity fields introduced later.
SELECT id, model_name, timestamp,
quantization, lmstudio_version, app_version, success
FROM benchmark_results
WHERE quantization IS NULL
OR lmstudio_version IS NULL
OR app_version IS NULL
OR success IS NULL
ORDER BY id DESC;
-- Compatibility rows that still miss core parity fields.
SELECT id, model_name, capability, test_id,
quantization, lmstudio_version, app_version,
prompt_hash, params_hash
FROM benchmark_results
WHERE source = 'compatibility'
AND (
quantization IS NULL
OR lmstudio_version IS NULL
OR app_version IS NULL
OR prompt_hash IS NULL
OR params_hash IS NULL
)
ORDER BY id DESC;
-- Compatibility summary directly from benchmark_results.
SELECT model_name,
capability,
COUNT(*) AS total_tests,
SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) AS successful_tests,
SUM(CASE WHEN success = 1 THEN 0 ELSE 1 END) AS failed_tests,
AVG(avg_gen_time) AS avg_latency_ms,
AVG(throughput_tokens_per_sec) AS avg_throughput,
AVG(quality_score) AS avg_quality_score,
AVG(rouge_score) AS avg_rouge,
AVG(f1_score) AS avg_f1,
AVG(exact_match_score) AS avg_exact_match,
AVG(accuracy_score) AS avg_accuracy
FROM benchmark_results
WHERE source = 'compatibility'
GROUP BY model_name, capability
ORDER BY MAX(id) DESC;
Architecture Documentation
Comprehensive architecture documentation with Mermaid diagrams showing how the Python modules interact and how CLI arguments and configuration files are processed.
Table of Contents
- Architecture Documentation
- Table of Contents
- System Architecture Overview
- Startup Flow
- Setup Flow (Installation & Configuration)
- Tray Control Flow (Linux)
- Tray Quit Sequence (Linux)
- Configuration Loading
- Configuration Priority
- Benchmark Execution Flow
- REST API vs SDK Mode
- Component Details
- Data Flow Summary
- Testing Architecture
- See Also
System Architecture Overview
graph TB
User([User]) --> RunPy[run.py<br/>Entry Point]
RunPy -->|--webapp/-w flag| WebApp[web/app.py<br/>FastAPI Server]
RunPy -->|benchmark mode| Benchmark[cli/benchmark.py<br/>Benchmark Engine]
Benchmark --> ConfigLoader[core/config.py<br/>Configuration Manager]
Benchmark --> PresetManager[core/presets.py<br/>Preset Manager]
Benchmark --> RestClient[core/client.py<br/>REST API Client]
ConfigLoader -->|reads| ProjectConfig[config/defaults.json<br/>Project Defaults]
ConfigLoader -->|reads| UserConfig[~/.config/lm-studio-bench/defaults.json<br/>User Overrides]
ConfigLoader -->|provides| DefaultConfig[(DEFAULT_CONFIG<br/>Merged)]
Benchmark -->|uses| LMStudio[LM Studio Server<br/>localhost:1234/1235]
RestClient -->|HTTP API v1| LMStudio
Benchmark -->|writes| ResultsDB[(~/.local/share/lm-studio-bench/results/<br/>benchmark_cache.db)]
Benchmark -->|exports| Reports[JSON/CSV/PDF/HTML<br/>Reports]
WebApp -->|launches| Benchmark
WebApp -->|reads| ResultsDB
WebApp -->|serves| Dashboard[Web Dashboard<br/>http://localhost:PORT]
RunPy -->|starts background process| Tray[core/tray.py<br/>Linux Tray Controller]
Tray -->|polls /api/status| WebApp
Tray -->|calls /api/benchmark/*| WebApp
Tray -->|Quit calls /api/system/shutdown| WebApp
style RunPy fill:#e1f5ff
style Benchmark fill:#ffe1e1
style ConfigLoader fill:#e1ffe1
style RestClient fill:#fff4e1
style ProjectConfig fill:#f0f0f0
style LMStudio fill:#e8deff
Key Components:
- run.py: Wrapper script that decides between web dashboard and CLI benchmark mode
- benchmark.py: Main benchmark engine with argparse, model discovery, and execution
- config_loader.py: Loads and merges configuration from JSON file with built-in defaults
- core/presets.py: Manages readonly/user presets and maps presets to CLI args
- tools/hardware_monitor.py: Shared `GPUMonitor` and `HardwareMonitor` implementation for classic and capability flows
- rest_client.py: REST API client for LM Studio v1 endpoints (optional mode)
- web/app.py: FastAPI web dashboard with live streaming and results browser
- tray.py: Linux AppIndicator tray controller for benchmark controls
Startup Flow
AppImage Entry Point
When the AppImage is executed, the bundled lmstudio-bench shell script runs
before run.py and splits on whether real arguments are present:
flowchart TD
AppImg([LM-Studio-Bench.AppImage args]) --> CheckArgs{Real args<br/>besides --debug/-d?}
CheckArgs -->|No args| TrayOnly[exec tray.py --url http://localhost:1234<br/>stays in system tray]
CheckArgs -->|Any other arg| RunPy[delegate to run.py + args]
style AppImg fill:#d0e8ff
style TrayOnly fill:#e1ffe1
style RunPy fill:#ffe1ff
`--debug`/`-d` is exempt: `./AppImage --debug` still enters tray-only mode with verbose logging.
run.py Flow
flowchart TD
Start([./run.py args]) --> CheckHelp{--help or -h?}
CheckHelp -->|Yes| ShowHelp[Show Extended Help<br/>+ benchmark.py --help]
CheckHelp -->|No| CheckWebFlag{--webapp or -w<br/>in args?}
CheckWebFlag -->|Yes| RemoveFlag[Remove --webapp/-w<br/>from args]
RemoveFlag --> ResolvePort[Extract or assign<br/>web port]
ResolvePort --> StartTrayWeb[start tray.py<br/>with --url dashboard]
StartTrayWeb --> FindWebApp{web/app.py<br/>exists?}
FindWebApp -->|Yes| StartWeb[subprocess.call<br/>python web/app.py + args]
FindWebApp -->|No| ErrorWeb[Error: app.py not found]
CheckWebFlag -->|No| StartTrayCLI[start tray.py<br/>with localhost:1234]
StartTrayCLI --> FindBenchmark{cli/benchmark.py<br/>exists?}
FindBenchmark -->|Yes| StartBenchmark[subprocess.call<br/>python cli/benchmark.py + args]
FindBenchmark -->|No| ErrorBench[Error: benchmark.py not found]
ShowHelp --> Exit1([exit 0])
StartWeb --> Exit2([exit with app.py status])
StartBenchmark --> Exit3([exit with benchmark.py status])
ErrorWeb --> Exit4([exit 1])
ErrorBench --> Exit5([exit 1])
style Start fill:#e1f5ff
style StartWeb fill:#ffe1ff
style StartBenchmark fill:#ffe1e1
Decision Logic (run.py):
- Help Mode (`--help`/`-h`): Displays extended help combining run.py explanation + benchmark.py CLI options
- Web Mode (`--webapp`/`-w`): Launches tray + FastAPI dashboard on a free local port
- Benchmark Mode (default): Launches tray + benchmark.py with all CLI arguments
AppImage vs. run.py — default behaviour difference:
| Invocation | No-argument default |
|---|---|
| `./LM-Studio-Bench.AppImage` | Tray-only (stays in panel, no benchmark) |
| `./run.py` | Tray + benchmark.py (runs full benchmark) |
Setup Flow (Installation & Configuration)
flowchart TD
Start([./setup.sh args]) --> ParseArgs{Parse Arguments}
ParseArgs -->|--help| ShowHelp["Show Usage Info<br/>+ Exit 0"]
ParseArgs -->|--dry-run| DryMode["Set DRY_RUN=1<br/>Set INTERACTIVE=0"]
ParseArgs -->|--yes| AutoMode["Set INTERACTIVE=0<br/>Auto-answer 'no'"]
ParseArgs -->|--interactive| InterMode["Set INTERACTIVE=1<br/>Force Interactive"]
DryMode --> LogSetup["Setup Logging<br/>logs/setup_YYYYMMDD_HHMMSS.log"]
AutoMode --> LogSetup
InterMode --> LogSetup
LogSetup --> CheckLinux{OS = Linux?}
CheckLinux -->|No| ErrorOS["❌ Error:<br/>Not Linux"]
CheckLinux -->|Yes| DetectPKG["✅ Detect Package Manager<br/>apt/dnf/pacman/zypper/apk"]
ErrorOS --> Exit1([Exit 1])
DetectPKG --> CoreDeps["🔧 Check Core Dependencies<br/>Python3, Git, curl, pkg-config"]
CoreDeps --> SysLibs["📦 Check System Libraries<br/>gobject-introspection, cairo, PyGObject"]
SysLibs --> CheckLMS["🔍 Check LM Studio Stack<br/>lms CLI / llmster-headless"]
CheckLMS -->|Found| LMSFound["✅ LM Studio/llmster<br/>detected"]
CheckLMS -->|Not Found| LMSMissing["⚠️ LM Studio missing<br/>Offer download link"]
LMSFound --> GPUDetect["🎮 Detect GPU<br/>lspci → NVIDIA/AMD/Intel"]
LMSMissing --> GPUDetect
GPUDetect --> GPUTools{GPU Found?}
GPUTools -->|NVIDIA| NVIDIACheck["Check nvidia-smi<br/>+ Install if needed"]
GPUTools -->|AMD| AMDCheck["Check rocm-smi<br/>+ AMD Driver Check"]
GPUTools -->|Intel| IntelCheck["Check intel_gpu_top<br/>+ Install if needed"]
GPUTools -->|None| NoGPU["⚠️ No GPU detected"]
NVIDIACheck --> CreateVenv["🐍 Create Python venv<br/>python3 -m venv .venv"]
AMDCheck --> AMDDriver["🔍 Check AMD Drivers<br/>amdgpu, libdrm, ROCm"]
IntelCheck --> CreateVenv
NoGPU --> CreateVenv
AMDDriver --> CreateVenv
CreateVenv -->|venv already exists| RecreatChoice{"Recreate .venv?"}
CreateVenv -->|New venv| VenvOK["✅ venv created<br/>.venv/"]
RecreatChoice -->|Yes| VenvOK
RecreatChoice -->|No| UseExisting["Use existing .venv"]
VenvOK --> InstallReqs["📥 Install Requirements<br/>pip install -r requirements.txt"]
UseExisting --> InstallReqs
InstallReqs --> CheckConflict["Check pip conflicts<br/>pip check"]
CheckConflict --> Summary["📋 Print Summary<br/>Next steps (activation, run, etc)"]
Summary --> LogExit["📄 Save log file<br/>logs/setup_latest.log → symlink"]
LogExit --> Exit0([Exit 0])
ShowHelp --> Exit0
style Start fill:#e1f5ff
style LogSetup fill:#fff4e1
style DetectPKG fill:#e1ffe1
style CoreDeps fill:#e1ffe1
style CreateVenv fill:#ffe1e1
style InstallReqs fill:#ffe1e1
style Summary fill:#f0e1ff
style ErrorOS fill:#ffcccc
style LMSMissing fill:#fff9e1
Setup Flow Summary:
- Parse Arguments: Handle `--help`, `--dry-run`, `--yes`, `--interactive` flags
- Logging Setup: Create timestamped log file in `logs/setup_YYYYMMDD_HHMMSS.log`
- Environment Checks:
  - Verify Linux OS
  - Detect package manager (apt/dnf/pacman/zypper/apk)
  - Check core dependencies (Python 3, Git, curl, pkg-config)
  - Verify system libraries (gobject-introspection, cairo, PyGObject for tray support)
- LM Studio Stack:
  - Check for `lms` CLI or `llmster` headless binary
  - Offer download link if missing
- GPU & Monitoring Tools:
  - Detect GPU type via `lspci` (NVIDIA, AMD, Intel)
  - Install/check GPU-specific tools (`nvidia-smi`, `rocm-smi`, `intel_gpu_top`)
  - For AMD: Check drivers, ROCm, libdrm, X.Org AMDGPU driver
- Python Environment:
  - Create virtual environment (`.venv/`)
  - Install Python dependencies from `requirements.txt`
  - Check for pip conflicts
- Summary:
  - Print next steps for the user:
    - Activate venv: `source .venv/bin/activate`
    - Run webapp: `python run.py --webapp`
    - Run CLI: `python run.py`
  - Log file symlink: `logs/setup_latest.log`
Modes:
| Mode | Behavior |
|---|---|
| `--help` | Show usage and exit |
| `--dry-run` | Preview all actions (no changes) |
| `--yes` | Non-interactive (auto-answer 'no' to optional prompts) |
| `--interactive` | Force interactive mode (default if TTY detected) |
Tray Control Flow (Linux)
flowchart TD
TrayStart([tray.py start]) --> Poll[Poll /api/status<br/>every 3 seconds]
Poll --> Reachable{API reachable?}
Reachable -->|No| IconRed[Set icon: red<br/>error/unreachable]
Reachable -->|Yes| ReadStatus[Read status field]
ReadStatus -->|idle| IconGray[Set icon: gray]
ReadStatus -->|running| IconGreen[Set icon: green]
ReadStatus -->|paused| IconYellow[Set icon: yellow]
ReadStatus --> BtnLogic[Update Start/Pause/Stop states]
BtnLogic --> UserAction{User action}
UserAction -->|Start| StartCall[POST /api/benchmark/start]
UserAction -->|Pause/Resume| PauseCall[POST /api/benchmark/pause or resume]
UserAction -->|Stop| StopCall[POST /api/benchmark/stop]
UserAction -->|Quit| QuitCall[POST /api/system/shutdown]
QuitCall --> ExitTray[GTK main loop exit]
Tray behavior summary:
- Dynamic status icons: gray (idle), green (running), yellow (paused), red (API error/unreachable)
- Smart controls: Start enabled in idle/error, Pause and Stop enabled only in running or paused state
- Quit path: Tray triggers graceful shutdown endpoint, then exits
Tray Quit Sequence (Linux)
sequenceDiagram
participant U as User
participant T as Tray (GTK/AppIndicator)
participant A as web/app.py (FastAPI)
participant B as Benchmark Manager
participant P as Process Signal Handler
U->>T: Click Quit
T->>A: POST /api/system/shutdown
A->>B: stop_benchmark()
B-->>A: benchmark stopped or no-op
A-->>T: 200 OK (shutdown accepted)
A->>P: Start delayed SIGTERM thread
T->>T: Stop polling + GTK main_quit()
P->>A: Send SIGTERM to process
A-->>A: Uvicorn graceful shutdown
Configuration Loading
flowchart TD
Start([config_loader.py<br/>import]) --> BaseConfig[BASE_DEFAULT_CONFIG<br/>Hard-coded Defaults]
BaseConfig --> LoadFunc[load_default_config]
LoadFunc --> ReadProject[Read config/defaults.json<br/>Project Defaults]
ReadProject --> CheckUser{~/.config/lm-studio-bench/<br/>defaults.json exists?}
CheckUser -->|Yes| ReadUser[Read User Config]
CheckUser -->|No| UseProject[Use Project Defaults Only]
ReadUser --> DeepMerge[_deep_merge<br/>Base + Project + User Config]
UseProject --> DeepMerge
DeepMerge --> NormalizePorts[_normalize_ports<br/>Ensure valid LM Studio ports]
NormalizePorts --> FinalConfig[(DEFAULT_CONFIG<br/>Global Singleton)]
FinalConfig --> BenchmarkImport[benchmark.py imports<br/>DEFAULT_CONFIG]
FinalConfig --> WebAppImport[web/app.py imports<br/>DEFAULT_CONFIG]
style BaseConfig fill:#f0f0f0
style FinalConfig fill:#e1ffe1
style DeepMerge fill:#fff4e1
Configuration Layers:
| Layer | Source | Priority |
|---|---|---|
| 1. Hard-coded | BASE_DEFAULT_CONFIG in config_loader.py | Lowest |
| 2. Project Config | config/defaults.json | Low |
| 3. User Config | ~/.config/lm-studio-bench/defaults.json | Medium |
| 4. CLI Arguments | argparse in benchmark.py | Highest |
Merge Strategy:
- `_deep_merge()` recursively merges nested dictionaries
- User config values override base config
- `None` values in user config are skipped (base value retained)
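The behavior described above can be pictured with a small recursive merge. This is a sketch of the idea, not a copy of `_deep_merge()` from `config_loader.py`.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`, skipping None overrides."""
    merged = dict(base)
    for key, value in override.items():
        if value is None:
            continue  # keep the base value, mirroring the rule above
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

base = {"num_runs": 3, "lmstudio": {"host": "localhost", "api_token": None}}
user = {"num_runs": None, "lmstudio": {"api_token": "lms_abc"}}
print(deep_merge(base, user))
# {'num_runs': 3, 'lmstudio': {'host': 'localhost', 'api_token': 'lms_abc'}}
```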
Configuration Priority
flowchart LR
CLI[CLI Arguments<br/>--runs 5<br/>--context 4096] -->|Highest Priority| Merge[Configuration<br/>Merge]
UserCfg[~/.config/.../defaults.json<br/>context_length: 4096] -->|High Priority| Merge
ProjCfg[config/defaults.json<br/>num_runs: 3<br/>context_length: 2048] -->|Medium Priority| Merge
Base[BASE_DEFAULT_CONFIG<br/>prompt: default<br/>temperature: 0.1] -->|Lowest Priority| Merge
Merge --> Final[Final Configuration<br/>runs=5<br/>context=4096<br/>temperature=0.1]
style CLI fill:#ffe1e1
style UserCfg fill:#fff4e1
style ProjCfg fill:#fff4e1
style Base fill:#f0f0f0
style Final fill:#e1ffe1
Example Priority Resolution:
# BASE_DEFAULT_CONFIG
{
"num_runs": 3,
"context_length": 2048,
"prompt": "Is the sky blue?"
}
# config/defaults.json
{
"num_runs": 5,
"prompt": "Explain machine learning"
}
# CLI: ./run.py --runs 1 --context 4096
# FINAL RESULT:
{
"num_runs": 1, # ← CLI override
"context_length": 4096, # ← CLI override
"prompt": "Explain..." # ← JSON override (no CLI arg)
}
Benchmark Execution Flow
flowchart TD
Start([benchmark.py main]) --> ParseArgs[Parse CLI Arguments<br/>argparse.ArgumentParser]
ParseArgs --> LoadConfig[Load DEFAULT_CONFIG<br/>from config_loader]
LoadConfig --> CheckFlags{Special Flags?}
CheckFlags -->|--list-cache| ListCache[Display Cache Entries<br/>exit]
CheckFlags -->|--export-cache| ExportCache[Export Cache to JSON<br/>exit]
CheckFlags -->|--export-only| ExportOnly[Generate Reports Only<br/>skip benchmark]
CheckFlags -->|Normal Mode| CreateBenchmark[Create LMStudioBenchmark<br/>instance]
CreateBenchmark --> MergeConfig[Merge Config Layers:<br/>CLI > JSON > Base]
MergeConfig --> InitComponents[Initialize Components:<br/>• GPUMonitor<br/>• BenchmarkCache<br/>• HardwareMonitor<br/>• REST Client optional]
InitComponents --> CheckServer{LM Studio<br/>Server Running?}
CheckServer -->|No| StartServer[Auto-start Server<br/>lms server start]
CheckServer -->|Yes| DiscoverModels[Discover Models<br/>lms ls --json]
StartServer --> DiscoverModels
DiscoverModels --> FilterModels[Apply Filters:<br/>--quants, --arch<br/>--only-vision, etc.]
FilterModels --> CheckCache{use_cache<br/>enabled?}
CheckCache -->|Yes| LoadCache[Load Cached Results<br/>SQLite lookup]
CheckCache -->|No| SkipCache[Skip Cache]
LoadCache --> RunBenchmarks[Run Benchmarks<br/>for Each Model]
SkipCache --> RunBenchmarks
RunBenchmarks --> TestModel[Test Model:<br/>1. Load Model<br/>2. Warmup Run<br/>3. N Measurement Runs<br/>4. Collect Stats]
TestModel --> Profiling{Profiling<br/>enabled?}
Profiling -->|Yes| MonitorHW[Monitor GPU/CPU/RAM<br/>Background Thread]
Profiling -->|No| SkipMonitor[Skip Monitoring]
MonitorHW --> SaveCache[Save Results to Cache<br/>SQLite INSERT]
SkipMonitor --> SaveCache
SaveCache --> NextModel{More Models?}
NextModel -->|Yes| RunBenchmarks
NextModel -->|No| Export[Export Reports:<br/>JSON, CSV, PDF, HTML]
Export --> End([Done])
ListCache --> End
ExportCache --> End
ExportOnly --> Export
style Start fill:#e1f5ff
style CreateBenchmark fill:#ffe1e1
style RunBenchmarks fill:#ffe1ff
style Export fill:#e1ffe1
Key Execution Steps:
- Argument Parsing: 49 CLI arguments processed by argparse
- Configuration Merge: CLI args override JSON file, JSON overrides base
- Component Initialization: GPU monitor, cache, profiler, REST client
- Model Discovery:
lms ls --jsonfetches all installed models - Filtering: Regex, quantization, architecture, capabilities filters
- Cache Lookup: Skip already-tested models (unless
--retest) - Benchmark Loop: For each model: load → warmup → measure (N runs) → unload
- Hardware Monitoring: Optional background thread for GPU/CPU/RAM stats
- Cache Storage: Save results to SQLite for future runs
- Report Generation: Export to JSON/CSV/PDF/HTML
REST API vs SDK Mode
flowchart TD
Start([Benchmark Init]) --> CheckMode{use_rest_api?<br/>CLI or config}
CheckMode -->|True| InitREST[Initialize REST Client<br/>LMStudioRESTClient]
CheckMode -->|False| InitSDK[Use Python SDK<br/>lmstudio package]
InitREST --> RESTURL[base_url from config:<br/>http://localhost:1234]
RESTURL --> RESTToken{api_token<br/>set?}
RESTToken -->|Yes| RESTAuth[Add Bearer Token<br/>to headers]
RESTToken -->|No| RESTNoAuth[No Authentication]
RESTAuth --> RESTReady[REST Client Ready]
RESTNoAuth --> RESTReady
RESTReady --> RESTFeatures[REST API Features:<br/>• Download Progress<br/>• MCP Integration<br/>• Stateful Chat<br/>• Response Caching<br/>• Parallel Inference<br/>• Unified KV Cache]
InitSDK --> SDKReady[SDK Ready]
SDKReady --> SDKFeatures[SDK Features:<br/>• Simple Python API<br/>• Model Loading<br/>• Inference<br/>• Basic Stats]
RESTFeatures --> Benchmark[Run Benchmarks]
SDKFeatures --> Benchmark
Benchmark --> RESTCall{Mode?}
RESTCall -->|REST| CallREST[HTTP POST /v1/chat/completions<br/>+ parse response stats]
RESTCall -->|SDK| CallSDK[client.llm.predict<br/>+ parse Model response]
CallREST --> Results[Collect Results:<br/>TTFT, tokens/s, VRAM]
CallSDK --> Results
style InitREST fill:#e1f5ff
style InitSDK fill:#ffe1e1
style RESTFeatures fill:#e1ffe1
style SDKFeatures fill:#fff4e1
Mode Comparison:
| Feature | REST API Mode | SDK/CLI Mode |
|---|---|---|
| Configuration | use_rest_api: true in config or --use-rest-api | Default mode |
| Endpoint | HTTP /v1/chat/completions | Python SDK client.llm.predict() |
| Stats | Detailed (TTFT, prompt/completion tokens, tok/s) | Basic (tokens/s only) |
| Authentication | Optional Bearer token | Not needed |
| Parallel Inference | ✅ --n-parallel (continuous batching) | ❌ Sequential only |
| Stateful Chats | ✅ response_id tracking | ❌ Stateless |
| MCP Integration | ✅ mcp_integrations parameter | ❌ Not available |
| Response Caching | ✅ MD5 hash caching (10,000x speedup) | ❌ No caching |
| Download Progress | ✅ Real-time model loading status | ❌ No progress |
Configuration Example:
{
  "lmstudio": {
    "host": "localhost",
    "ports": [1234, 1235],
    "use_rest_api": true,
    "api_token": "lms_your_token_here"
  }
}
Component Details
1. run.py (Entry Point)
Responsibilities:
- Parse `--webapp`/`-w` flag
- Route to web dashboard or benchmark
- Show extended help (`--help`)
Key Functions:
- Flag detection: `"--webapp" in sys.argv or "-w" in sys.argv`
- Subprocess launching: `subprocess.call([sys.executable, script] + args)`
2. config_loader.py (Configuration Manager)
Responsibilities:
- Load `config/defaults.json` (project) + `~/.config/lm-studio-bench/defaults.json` (user overrides)
- Merge with `BASE_DEFAULT_CONFIG`
- Provide `DEFAULT_CONFIG` singleton
Key Functions:
- `load_default_config()`: Loads and merges config
- `_deep_merge()`: Recursive dict merge
- `_normalize_ports()`: Validates LM Studio ports
Configuration Fields:
| Section | Fields |
|---|---|
| Root | prompt, context_length, num_runs |
| lmstudio | host, ports, api_token, use_rest_api |
| inference | temperature, top_k_sampling, top_p_sampling, min_p_sampling, repeat_penalty, max_tokens |
| load | n_gpu_layers, n_batch, n_threads, flash_attention, rope_freq_base, rope_freq_scale, use_mmap, use_mlock, kv_cache_quant |
3. benchmark.py (Main Engine)
Responsibilities:
- Parse 49 CLI arguments
- Manage benchmark lifecycle
- Model discovery and filtering
- Cache management (SQLite)
- Runtime-safe cache schema migration for optional columns
- Hardware monitoring
- Report generation
Key Classes:
- `LMStudioBenchmark`: Main orchestrator
- `BenchmarkCache`: SQLite caching
- `tools/hardware_monitor.py`: Shared GPU detection and live profiling (`GPUMonitor`, `HardwareMonitor`)
- `ModelDiscovery`: Model listing and metadata
Reliability Behaviors (2026-03):
- Runtime cache migration: Missing optional SQLite columns are added automatically at startup and, if needed, once again during insert error recovery.
- Inference retry guard: If LM Studio returns a server error containing `Model unloaded`, the benchmark reloads the model and retries inference once.
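The runtime cache migration mentioned above amounts to adding any missing optional columns with `ALTER TABLE`. A simplified sketch of that pattern follows; the column names and types here are an illustrative subset, not the exact schema.

```python
import sqlite3

OPTIONAL_COLUMNS = {          # illustrative subset of optional columns
    "prompt_hash": "TEXT",
    "params_hash": "TEXT",
    "speed_delta_pct": "REAL",
}

def migrate_cache(con: sqlite3.Connection) -> None:
    existing = {row[1] for row in con.execute("PRAGMA table_info(benchmark_results)")}
    for name, sql_type in OPTIONAL_COLUMNS.items():
        if name not in existing:
            # Adding a column is safe at runtime; old rows get NULL in the new column.
            con.execute(f"ALTER TABLE benchmark_results ADD COLUMN {name} {sql_type}")
    con.commit()
```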
CLI Arguments (49 total):
| Category | Arguments |
|---|---|
| Basic | --runs, --context, --prompt, --limit, --dev-mode |
| Presets | --list-presets, --preset |
| Filter | --only-vision, --only-tools, --quants, --arch, --params, --min-context, --max-size, --include-models, --exclude-models |
| Cache | --retest, --list-cache, --export-cache, --export-only |
| Profiling | --enable-profiling, --max-temp, --max-power, --disable-gtt |
| Inference | --temperature, --top-k, --top-p, --min-p, --repeat-penalty, --max-tokens |
| Load Config | --n-gpu-layers, --n-batch, --n-threads, --flash-attention, --rope-freq-base, --rope-freq-scale, --use-mmap, --use-mlock, --kv-cache-quant |
| REST API | --use-rest-api, --api-token, --n-parallel, --unified-kv-cache |
| Comparison | --compare-with, --rank-by |
4. rest_client.py (REST API Client)
Responsibilities:
- HTTP communication with LM Studio v1 API
- Model loading and unloading
- Chat completions with stats
- Download progress tracking
- MCP integration
- Stateful chat history
- Response caching
Key Classes:
- `LMStudioRESTClient`: Main REST client
- `ModelInfo`: Model metadata
- `ChatStats`: Response statistics (TTFT, tokens/s, etc.)
- `ModelCapabilities`: Vision, tools detection
New Features (✨ 2026-02-23):
- Download Progress Tracking
  - `wait_for_completion()` with progress callbacks
  - Real-time model loading status
- MCP Integration
  - `mcp_integrations` parameter in chat requests
  - Model Context Protocol support
- Stateful Chat History
  - `use_stateful=True` for conversation continuity
  - `last_response_id` tracking
- Response Caching
  - MD5 hash-based caching
  - 10,000x+ speedup for repeated prompts
  - `enable_cache` parameter
Example Usage:
client = LMStudioRESTClient(
base_url="http://localhost:1234",
api_token="lms_token"
)
# Load model with progress tracking
def on_progress(percent, status):
    print(f"Loading: {percent:.1f}% - {status}")
client.load_model("model@q4", wait_for_completion=True, progress_callback=on_progress)
# Chat with caching
response = client.chat(
model="model@q4",
messages=[{"role": "user", "content": "Hello"}],
enable_cache=True, # 10,000x speedup for repeated prompts
use_stateful=True # Conversation continuity
)
5. tray.py (Linux Tray Controller)
Responsibilities:
- Provide Linux AppIndicator tray UI with benchmark controls
- Poll benchmark status and update icon/button state
- Trigger benchmark actions via web API
- Trigger graceful full shutdown via
/api/system/shutdown
Key Behaviors:
- 3-second polling loop via GLib timeout
- Icon states: gray (idle), green (running), yellow (paused), red (error)
- Control state logic:
- Start enabled in idle and recovery/error state
- Pause/Stop enabled only while benchmark is active
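The 3-second polling loop described above can be sketched with GLib and requests as below. It assumes PyGObject and requests are installed and that the dashboard runs on the default port 8080; the real tray controller adds icon updates and menu-state handling on top of this skeleton.

```python
import requests
from gi.repository import GLib

STATUS_URL = "http://localhost:8080/api/status"  # dashboard URL passed via --url

def poll_status():
    try:
        status = requests.get(STATUS_URL, timeout=2).json().get("status", "idle")
    except requests.RequestException:
        status = "error"          # unreachable API -> red icon in the real tray
    print(f"benchmark status: {status}")
    return True                   # returning True keeps the GLib timer running

GLib.timeout_add_seconds(3, poll_status)
GLib.MainLoop().run()
```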
6. web/app.py + dashboard.html.jinja (Dashboard Analytics)
Responsibilities:
- Aggregate benchmark history for fast visual summaries
- Serve chart-ready payloads via `/api/dashboard/stats`
- Render Home/Results overview charts in the browser with Plotly
- Support quick navigation from ranking tables to model comparison
Home View (Executive Summary):
- KPI cards: cached models, avg speed, median (P50), P95, architectures, quantizations
- Top 10 bar chart (speed ranking)
- Quantization donut chart (distribution)
Results View (Exploration):
- Scatter: Speed vs VRAM
- Heatmap: Model x Quantization -> avg tokens/s
- Shared data source with table (`/api/results`), so table and charts stay consistent
Quick Compare Flow:
- Compare actions in Home and Results tables call `openComparisonForModel(modelName)`
- The function opens the Comparison view, selects the model, then loads full historical trends via `/api/comparison/{model_name}`
Dashboard Summary Fields (/api/dashboard/stats):
- `speed_summary` (`min`, `p50`, `avg`, `p95`, `max`)
- `top_models_extended` (Top 10 models)
- `quantization_distribution`
- `architecture_distribution`
- `efficiency_top`
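Because the payload is plain JSON, these summary fields can be pulled into scripts directly. A minimal example, assuming the dashboard runs on the default port and using defensive lookups since the exact payload shape may evolve:

```python
import requests

stats = requests.get("http://localhost:8080/api/dashboard/stats", timeout=5).json()

speed = stats.get("speed_summary", {})
print("median tok/s:", speed.get("p50"))
print("p95 tok/s:", speed.get("p95"))

for model in stats.get("top_models_extended", [])[:3]:
    print(model)  # inspect the structure of the top-10 entries
```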
Data Flow Summary
graph LR
User([User]) -->|./run.py --runs 5| CLI[CLI Arguments]
ProjJSON[config/defaults.json] --> Config[Configuration<br/>Merge]
UserJSON[~/.config/.../defaults.json] --> Config
CLI --> Config
Base[BASE_DEFAULT_CONFIG] --> Config
Config --> Benchmark[Benchmark<br/>Execution]
Benchmark -->|lms ls| Models[Model<br/>Discovery]
Models --> Filter[Model<br/>Filtering]
Filter --> Cache{Cache<br/>Hit?}
Cache -->|Yes| Skip[Skip Test]
Cache -->|No| Test[Run Test]
Test --> LMStudio[LM Studio<br/>Server]
LMStudio --> Results[Collect<br/>Results]
Results --> DB[(SQLite<br/>Cache)]
Results --> Reports[JSON/CSV<br/>PDF/HTML]
Skip --> Reports
style CLI fill:#ffe1e1
style Config fill:#e1ffe1
style Cache fill:#fff4e1
style Reports fill:#e1f5ff
Testing Architecture
LM-Studio-Bench includes a comprehensive test suite with 900+ tests and strong coverage to ensure reliability and maintainability.
Test Organization
graph TB
Tests[tests/] --> Fixtures[conftest.py<br/>Test Fixtures & Utilities]
Tests --> BenchmarkTests[test_benchmark.py<br/>55+ tests]
Tests --> HardwareTests[test_hardware_monitor.py<br/>57+ tests]
Tests --> AppTests[test_app.py<br/>23+ tests]
Tests --> APITests[test_api_endpoints.py<br/>32+ tests]
Tests --> RestTests[test_rest_client.py<br/>22+ tests]
Tests --> TrayTests[test_tray.py<br/>26+ tests]
Tests --> PresetTests[test_preset_manager.py<br/>19+ tests]
Tests --> ConfigTests[test_config_loader.py<br/>9+ tests]
Tests --> PathTests[test_user_paths.py<br/>4+ tests]
Tests --> VersionTests[test_version_checker.py<br/>7+ tests]
Tests --> MetadataTests[test_scrape_metadata.py<br/>24+ tests]
Tests --> RunTests[test_run.py<br/>10+ tests]
BenchmarkTests --> Benchmark[cli/benchmark.py]
HardwareTests --> HardwareMon[tools/hardware_monitor.py]
AppTests --> WebApp[web/app.py]
APITests --> WebApp
RestTests --> RestClient[core/client.py]
TrayTests --> Tray[core/tray.py]
PresetTests --> PresetMgr[core/presets.py]
ConfigTests --> ConfigLoader[core/config.py]
PathTests --> UserPaths[core/paths.py]
VersionTests --> VersionChecker[core/version.py]
MetadataTests --> Metadata[tools/scrape_metadata.py]
RunTests --> RunPy[run.py]
style Tests fill:#e1f5ff
style Fixtures fill:#fff4e1
style BenchmarkTests fill:#ffe1e1
style AppTests fill:#e1ffe1
Test Coverage by Component
| Component | Test Module | Test Count | Coverage |
|---|---|---|---|
| Benchmark Engine | test_benchmark.py | 55+ | High |
| Web Dashboard | test_app.py | 23+ | Medium |
| API Endpoints | test_api_endpoints.py | 32+ | High |
| REST Client | test_rest_client.py | 22+ | High |
| Linux Tray | test_tray.py | 26+ | Medium |
| Preset Manager | test_preset_manager.py | 19+ | High |
| Config Loader | test_config_loader.py | 9+ | High |
| User Paths | test_user_paths.py | 4+ | High |
| Version Checker | test_version_checker.py | 7+ | High |
| Metadata Scraping | test_scrape_metadata.py | 24+ | Medium |
| Entry Point | test_run.py | 10+ | Medium |
Testing Approach
Unit Testing:
- Mock external dependencies (LM Studio API, system commands, file I/O)
- Isolated test cases that can run in any order
- Fast execution (no real API calls or file system operations)
- Use pytest fixtures for common setup and teardown
Test Fixtures (conftest.py):
- Mock LM Studio client and server responses
- Temporary directories for file operations
- Mock system commands (nvidia-smi, rocm-smi, etc.)
- Sample configuration and model data
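As an illustration of the mocking style described above, a test can stub out a system command with pytest's monkeypatch so no real GPU tooling is needed. The helper and test names here are hypothetical, not copied from conftest.py.

```python
import subprocess

def read_gpu_name() -> str:
    """Hypothetical helper that shells out to nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def test_read_gpu_name_without_real_gpu(monkeypatch):
    def fake_run(*args, **kwargs):
        # Return a canned result instead of invoking the real binary.
        return subprocess.CompletedProcess(args, 0, stdout="NVIDIA GeForce RTX 4090\n", stderr="")
    monkeypatch.setattr(subprocess, "run", fake_run)
    assert read_gpu_name() == "NVIDIA GeForce RTX 4090"
```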
Continuous Integration:
- GitHub Actions runs full test suite on every PR
- Code quality checks (flake8, pylint)
- Security scans (Bandit, CodeQL, Snyk)
- Test results reported in PR status checks
Running Tests:
# Run all tests
pytest
# Run with verbose output
pytest -v
# Run specific module
pytest tests/test_benchmark.py
# Run with coverage report
pytest --cov=core --cov=cli --cov=agents --cov=web --cov=tools --cov=run --cov-report=html
# Run tests matching a pattern
pytest -k "test_gpu_detection"
See Also
- Configuration Reference - All CLI arguments and config file options
- REST API Features - REST API integration details
- Quickstart Guide - Get started in 5 minutes