LM Studio Benchmark Documentation

Welcome to the LM Studio Benchmark documentation! This tool helps you measure and compare tokens/s performance across all your locally installed LLM models and their quantizations.

What is this?

A Python benchmark tool for LM Studio with a modern web dashboard that:

  • Automatically tests all local LLM models and quantizations
  • Measures tokens/s with warmup and multiple runs
  • Exports results in JSON, CSV, PDF, and interactive HTML formats
  • Detects GPU capabilities (NVIDIA, AMD, Intel) and monitors VRAM usage
  • Provides a web dashboard with live charts and filtering options
  • Includes Linux tray controls with live status icons and quick actions

Features at a Glance

✅ Multi-model benchmarking with intelligent GPU offload
✅ Vision & tool-calling model detection
✅ Progressive VRAM management (automatic fallback)
✅ Caching system (skip already-tested models)
✅ Filter by quantization, architecture, params, context length
✅ Live web dashboard with 27 themes
✅ Linux tray controller with dynamic benchmark status icons
✅ REST API mode with parallel inference support
✅ Download progress tracking, MCP integration, stateful chats
✅ Response caching with 10,000x+ speedup for repeated prompts

Getting Started

Check out the Quickstart Guide to begin benchmarking your models!

🚀 Quick Start Guide - LM Studio Benchmark Tool

Installation

cd ~/LM-Studio-Bench

# 1) Preview setup (no changes)
./setup.sh --dry-run

# 2) Prepare system + Python environment (recommended)
./setup.sh

# 3) Activate virtual environment
source .venv/bin/activate

If you skip setup.sh, use this manual fallback:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Start Web UI

./run.py --webapp

✅ Opens browser automatically at http://localhost:8080
✅ Live streaming of benchmark output via WebSocket
✅ Browse all cached results with interactive tables
✅ System info (GPU model detection, LM Studio health, hardware details)
✅ Dark mode by default with 27 theme options
✅ All CLI parameters available as a web form with tooltips
✅ Advanced filtering (quantization, architecture, size, context length)
✅ Separate logs: ~/.local/share/lm-studio-bench/logs/webapp_*.log and ~/.local/share/lm-studio-bench/logs/benchmark_*.log
✅ Linux tray control with dynamic status icon and quick actions

Dashboard Features:

  • Start Benchmark: Configure and run benchmarks from web interface
    • Filter by quantization, architecture, parameter size
    • Rank results by speed, efficiency, TTFT, or VRAM
    • Set hardware limits (max GPU temp, max power draw)
    • Tooltip help for all options
  • System Info: OS, Kernel, CPU, GPU (with detailed model names)
  • LM Studio Health: Live healthcheck status (HTTP API + CLI fallback)
  • Live Output: Real-time streaming with colored logs and progress
  • Results Browser: Filter and sort all cached benchmark results
  • Export: Download JSON/CSV/PDF/HTML reports
  • Network Access: Access from other devices on same network

Linux Tray Control

When GTK/AppIndicator dependencies are installed, a tray controller starts with the web app.

  • Dynamic status icon:
    • Gray: idle
    • Green: running
    • Yellow: paused
    • Red: API unreachable/error
  • Smart controls:
    • Start enabled in idle/error states
    • Pause/Stop enabled only in running/paused states
  • Auto refresh: status and controls refresh every 3 seconds
  • Quit behavior: tray Quit triggers graceful full shutdown

Network Access

# Access dashboard from other devices
http://your-ip:8080

# Example:
http://192.168.1.100:8080

💻 Command Line (CLI)

Simple Benchmark (All Models)

./run.py

✅ Tests all installed models with 3 runs each (~1-2 hours)
✅ Automatically saves results to ~/.local/share/lm-studio-bench/results/
✅ Clean output with emoji icons and formatted model lists
✅ Detailed logs saved to ~/.local/share/lm-studio-bench/logs/benchmark_YYYYMMDD_HHMMSS.log

Monitor Logs in Real-Time

# Watch benchmark execution
tail -f ~/.local/share/lm-studio-bench/logs/benchmark_*.log

# Watch web dashboard
tail -f ~/.local/share/lm-studio-bench/logs/webapp_*.log

# Search for errors
grep ERROR ~/.local/share/lm-studio-bench/logs/benchmark_*.log

Quick Test (3 NEW Models)

./run.py --limit 3 --runs 1

✅ Fast test with 3 NEW untested models (~5-10 minutes)
✅ Already-tested models are loaded automatically from the cache
✅ The limit applies ONLY to new models; all cached models are still included

Development Mode (Fastest)

./run.py --dev-mode

✅ Automatically selects the smallest model
✅ Single run for quick validation (~30 seconds)
✅ Perfect for testing changes

Test Single Model

./run.py --limit 1 --runs 1

✅ Single model benchmark (~1-2 minutes)

Advanced Features

1️⃣ Hardware Profiling (6 Live Charts)

Enable Complete Hardware Monitoring:

./run.py --enable-profiling --runs 1 --limit 3

Monitored Metrics:

  • 🌡️ GPU Temperature (°C)
  • ⚡ GPU Power (W)
  • 💾 GPU VRAM (GB)
  • 🧠 GPU GTT (GB) - AMD only
  • 🖥️ System CPU usage (%)
  • 💾 System RAM usage (GB)

✅ All metrics are displayed live in the WebApp
✅ 6 interactive Plotly.js charts with Min/Max/Avg stats
✅ Moving average for stable RAM curves
✅ Each metric is sampled every second

With Safety Limits:

./run.py --enable-profiling --max-temp 85 --max-power 350

✅ Interrupts benchmark when limits are exceeded

2️⃣ AMD GTT Support (Shared System RAM)

Enable GTT (Default):

./run.py --limit 3

✅ Automatically uses VRAM + GTT (e.g. 2GB VRAM + 46GB GTT = 48GB)
✅ Enables larger models on AMD APUs/iGPUs
✅ Shown in logs: "💾 Memory: 0.4GB VRAM + 44.7GB GTT = 45.1GB total"

Disable GTT (VRAM-only):

./run.py --disable-gtt --limit 3

✅ Uses only dedicated VRAM
✅ More conservative offload levels
✅ Useful for benchmarking VRAM-only performance

3️⃣ Filtering Models

By Quantization:

./run.py --quants q4,q5 --limit 5

By Architecture:

./run.py --arch llama,mistral --limit 5

By Parameter Size:

./run.py --params 7B,8B --limit 5

By Context Length:

./run.py --min-context 32000 --limit 3

By Model Size:

./run.py --max-size 10 --limit 5

Vision Models Only:

./run.py --only-vision --runs 1

Regex-based Filtering (Include):

# Only Qwen or Phi models
./run.py --include-models "qwen|phi" --runs 1

# Only Llama 7B models
./run.py --include-models "llama.*7b" --runs 1

# Only Q4 quantizations
./run.py --include-models ".*q4.*" --runs 1

Regex-based Filtering (Exclude):

# Exclude uncensored models
./run.py --exclude-models "uncensored" --runs 1

# Exclude Q2 and Q3 quantizations
./run.py --exclude-models "q2|q3" --runs 1

# Exclude all vision models
./run.py --exclude-models ".*vision.*" --runs 1

Combined Filters (AND logic):

# Include llama, exclude q2, only tools
./run.py --include-models "llama" --exclude-models "q2" --only-tools --runs 1

# Vision models, 7B params, max 12GB
./run.py --only-vision --params 7B --max-size 12 --runs 1

4️⃣ Ranking & Sorting

Sort by Efficiency (Default: Speed):

./run.py --limit 5 --rank-by efficiency

Sort by TTFT (Lower = Better):

./run.py --limit 5 --rank-by ttft

Sort by VRAM Usage (Lower = Better):

./run.py --limit 5 --rank-by vram

5️⃣ Cache Management

View Cached Results:

./run.py --list-cache

✅ Shows all cached models with performance metrics

Force Retest (Ignore Cache):

./run.py --retest --limit 3

✅ Re-runs benchmarks even if cached

Regenerate Reports from Database:

./run.py --export-only

✅ Generates JSON/CSV/PDF/HTML from cached results in <1s
✅ No benchmarking: instant report generation
✅ Supports all filters (--params, --quants, --arch, etc.)

Examples:

# All cached models
./run.py --export-only

# Only 7B models from cache
./run.py --export-only --params 7B

# Q4 quantizations with historical comparison
./run.py --export-only --quants q4 --compare-with latest

Export Cache as JSON:

./run.py --export-cache my_backup.json

✅ Exports entire cache database

Cache Behavior:

  • First run: Tests all models (~2 hours for 20 models)
  • Second run: Loads from cache (~1 second!)
  • Automatic invalidation on parameter changes (prompt, context, temperature)
  • Shows "X of Y models cached" before starting

Compare with Latest Benchmark:

./run.py --limit 3 --runs 1 --compare-with latest

📊 Shows performance delta (%) vs previous run

Compare with Specific Benchmark:

./run.py --limit 3 --runs 1 --compare-with benchmark_results_20260104_170000.json

6️⃣ Custom Configuration

Adjust Number of Runs:

./run.py --runs 5 --limit 2

Custom Context Length:

./run.py --context 4096 --limit 2 --runs 1

Custom Prompt:

./run.py -P "Your custom prompt here" --limit 2 --runs 1

7️⃣ Presets (Fast Setup)

Show available presets:

./run.py --list-presets

Load a built-in preset:

# Default presets (readonly)
./run.py --preset default_classic              # Classic benchmark (default)
./run.py --preset default_compatibility_test   # Capability-driven test

# Other presets
./run.py --preset quick_test
./run.py --preset high_quality
./run.py --preset resource_limited

Load preset and override values:

./run.py --preset quick_test --runs 2 --context 2048
./run.py --preset default_classic --runs 5 --context 4096

Backwards Compatibility:

./run.py --preset default      # Automatically loads default_classic

Notes:

  • Default presets include explicit values for all benchmark form fields, so preset comparisons do not show null values for missing keys.
  • default_classic is optimized for full model benchmarking (3 runs)
  • default_compatibility_test (alias: default_compatability_test) is optimized for focused capability testing (1 run)
  • Capability-driven runs over many installed models continue when a single model fails to load; the failed model is logged and skipped.
  • Embedding models are retried automatically without KV-cache offload if LM Studio rejects that load option.
  • Legacy keys in imported/user presets are normalized automatically (context_length/top_k/top_p/min_p -> current key names).

📊 Output Formats

Each benchmark generates 4 files:

JSON Format

{
  "model_name": "qwen/qwen3-8b",
  "quantization": "q4_k_m",
  "avg_tokens_per_sec": 8.15,
  "tokens_per_sec_per_gb": 1.74,
  "speed_delta_pct": -0.2,
  ...
}

✅ Structured data for analysis

CSV Format

model_name,quantization,avg_tokens_per_sec,tokens_per_sec_per_gb,speed_delta_pct
qwen/qwen3-8b,q4_k_m,8.15,1.74,-0.2

✅ Excel/Sheets compatible

PDF Report

  • Model rankings (sortable)
  • Best-of-Quantization analysis
  • Quantization comparison tables (Q4 vs Q5 vs Q6)
  • Performance statistics & percentiles
  • Delta display (Δ% column)

HTML Report (Interactive Plotly)

  • Bar chart: Top 10 models
  • Scatter plot: Size vs Performance
  • Scatter plot: Efficiency analysis
  • NEW: Trend chart showing performance over time
  • Summary statistics with gradient backgrounds

📈 Feature Showcase

Example: Complete Analysis

./run.py \
  --quants q4,q5,q6 \
  --limit 5 \
  --runs 1 \
  --rank-by efficiency \
  --compare-with latest

Output:

  • ✅ Filters to 5 models with 3 quantizations each
  • ✅ Ranks by efficiency (Tokens/s per GB)
  • ✅ Shows delta vs previous benchmark
  • ✅ Generates all 4 export formats
  • ✅ Includes percentile statistics (P50, P95, P99; see the sketch below)
  • ✅ Shows quantization comparison
  • ✅ Displays performance trends if history available
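
The percentile statistics can be computed from the per-model tokens/s samples with the standard library alone; a small sketch (assuming at least two samples):

import statistics

def speed_percentiles(samples: list[float]) -> dict[str, float]:
    # statistics.quantiles with n=100 returns 99 cut points;
    # indices 49/94/98 correspond to P50/P95/P99.
    cuts = statistics.quantiles(samples, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

print(speed_percentiles([22.4, 35.1, 48.2, 51.3, 63.0, 86.1, 93.9]))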

🎯 Key Metrics

| Metric | Description | Unit |
|---|---|---|
| Speed | Throughput | tokens/s |
| Efficiency | Speed per GB of model size | tokens/s/GB |
| TTFT | Time to First Token | ms |
| Delta | Change vs. previous run | % |
| VRAM | Memory used | MB |

📁 File Structure

results/
├── benchmark_results_20260104_170000.json
├── benchmark_results_20260104_170000.csv
├── benchmark_results_20260104_170000.pdf
└── benchmark_results_20260104_170000.html

🐛 Troubleshooting

No models found

  • Ensure LM Studio is installed and running
  • Check lms ls --json output

Server not responding

  • Start LM Studio server manually
  • Check ~/.lmstudio/server-logs/

Permission denied on results/

mkdir -p results/
chmod 755 results/

📚 Related Files

  • FEATURES.md - Complete feature list
  • PLAN.md - Implementation roadmap
  • requirements.txt - Python dependencies
  • errors.log - Debug information

Version: 1.0 (Phases 1-4 Complete) | Updated: 2026-01-04

Configuration Reference

Complete documentation of all CLI arguments and configuration options for the LM Studio Benchmark Tool.


Table of Contents

  1. Overview
  2. Configuration Files
  3. CLI Arguments
  4. Examples

Overview

The benchmark tool can be configured in three ways:

  1. Project Defaults: config/defaults.json (in Git)
  2. User Configuration: ~/.config/lm-studio-bench/defaults.json (optional overrides)
  3. CLI Arguments: Override all config values

Priority: CLI Arguments > User Config > Project Defaults > Hard-coded Defaults

Configuration Files

Project Configuration (config/defaults.json)

The project configuration file contains all default settings for the benchmark. This file is shipped with the project and tracked in Git.

Location: <project_root>/config/defaults.json

User Configuration (~/.config/lm-studio-bench/defaults.json)

Optional user-specific configuration overrides. Only specify fields you want to customize.

Location: ~/.config/lm-studio-bench/defaults.json

Example (minimal user config):

{
  "num_runs": 5,
  "lmstudio": {
    "use_rest_api": true
  }
}

This overrides only num_runs and use_rest_api, all other values come from project defaults.
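
A recursive overlay along these lines (a sketch, not the tool's actual loader) produces that behavior:

def merge_config(base: dict, override: dict) -> dict:
    # Nested dicts are merged key by key; scalar values in `override` win.
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

project = {"num_runs": 3, "lmstudio": {"host": "localhost", "use_rest_api": False}}
user = {"num_runs": 5, "lmstudio": {"use_rest_api": True}}
print(merge_config(project, user))
# {'num_runs': 5, 'lmstudio': {'host': 'localhost', 'use_rest_api': True}}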

Complete Structure

{
  "prompt": "Is the sky blue?",
  "context_length": 2048,
  "num_runs": 3,
  "retest": false,
  "enable_profiling": false,
  "lmstudio": {
    "host": "localhost",
    "ports": [1234, 1235],
    "api_token": null,
    "use_rest_api": true
  },
  "inference": {
    "temperature": 0.1,
    "top_k_sampling": 40,
    "top_p_sampling": 0.9,
    "min_p_sampling": 0.05,
    "repeat_penalty": 1.2,
    "max_tokens": 256
  },
  "load": {
    "n_gpu_layers": -1,
    "n_batch": 512,
    "n_threads": -1,
    "flash_attention": true,
    "rope_freq_base": 10000,
    "rope_freq_scale": 1.0,
    "use_mmap": true,
    "use_mlock": false,
    "kv_cache_quant": "f16"
  }
}

Field Descriptions

Basic Settings

| Field | Type | Default | Description |
|---|---|---|---|
| prompt | string | "Is the sky blue?" | Default test prompt for all benchmarks |
| context_length | integer | 2048 | Context length in tokens |
| num_runs | integer | 3 | Number of measurements per model/quantization |
| retest | boolean | false | Ignore cache and benchmark all selected models again |
| enable_profiling | boolean | false | Enable temperature/power monitoring |

LM Studio Server (lmstudio)

| Field | Type | Default | Description |
|---|---|---|---|
| host | string | "localhost" | LM Studio server hostname |
| ports | array | [1234, 1235] | Ports for server discovery (tries both) |
| api_token | string/null | null | API permission token (REST API authentication) |
| use_rest_api | boolean | true | Use REST API v1 instead of SDK/CLI |

Inference Parameters (inference)

| Field | Type | Default | Description |
|---|---|---|---|
| temperature | float | 0.1 | Sampling temperature (0.0-2.0, low = deterministic) |
| top_k_sampling | integer | 40 | Top-K sampling (limits choice to the K most likely tokens) |
| top_p_sampling | float | 0.9 | Top-P / nucleus sampling (cumulative probability) |
| min_p_sampling | float | 0.05 | Min-P sampling (minimum probability threshold) |
| repeat_penalty | float | 1.2 | Repeat penalty (prevents repetition, 1.0 = off) |
| max_tokens | integer | 256 | Maximum output tokens |

Load Config (load)

| Field | Type | Default | Description |
|---|---|---|---|
| n_gpu_layers | integer | -1 | GPU layers (-1 = auto/all, 0 = CPU only, >0 = specific count) |
| n_batch | integer | 512 | Batch size for prompt processing |
| n_threads | integer | -1 | CPU threads (-1 = auto/all) |
| flash_attention | boolean | true | Flash attention (faster computation) |
| rope_freq_base | float | 10000 | RoPE frequency base |
| rope_freq_scale | float | 1.0 | RoPE frequency scaling |
| use_mmap | boolean | true | Memory mapping (faster model load) |
| use_mlock | boolean | false | Memory locking (prevents swapping) |
| kv_cache_quant | string | "f16" | KV cache quantization (f32/f16/q8_0/q4_0/etc.) |

Preset Defaults and Compatibility

The tool includes two readonly default presets:

default_classic - Classic Benchmark Mode

Default preset for standard model benchmarking. Contains explicit values for all benchmark fields to avoid null values in preset comparisons.

  • benchmark_mode: classic
  • preset_mode: classic
  • runs: 3
  • context: 2048
  • Capability fields (agent_model, agent_capabilities, agent_max_tests): null

Backwards Compatibility: Loading --preset default automatically loads default_classic.

default_compatibility_test - Capability-Driven Test Mode

Default preset for focused capability testing of a single model.

Alias: The legacy name default_compatability_test is accepted as an alias for this preset for backward compatibility.

  • benchmark_mode: capability
  • preset_mode: capability
  • runs: 1
  • context: 2048
  • agent_model: qwen2.5-7b-instruct
  • agent_capabilities: general_text,reasoning
  • agent_max_tests: 10
  • No null values - all fields have explicit defaults

Compatibility mapping is applied automatically when loading and comparing presets with legacy keys:

  • context_length -> context
  • num_runs -> runs
  • top_k -> top_k_sampling
  • top_p -> top_p_sampling
  • min_p -> min_p_sampling
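
Applied as a simple rename table, the mapping above could look like this (illustrative sketch, not the tool's code):

LEGACY_KEY_MAP = {
    "context_length": "context",
    "num_runs": "runs",
    "top_k": "top_k_sampling",
    "top_p": "top_p_sampling",
    "min_p": "min_p_sampling",
}

def normalize_preset(preset: dict) -> dict:
    # Rename legacy keys to their current names; if both the legacy and the
    # current key are present, the first one encountered is kept.
    normalized = {}
    for key, value in preset.items():
        normalized.setdefault(LEGACY_KEY_MAP.get(key, key), value)
    return normalized

print(normalize_preset({"num_runs": 1, "context_length": 2048, "top_k": 40}))
# {'runs': 1, 'context': 2048, 'top_k_sampling': 40}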

CLI Arguments

All CLI arguments override the corresponding values from both config files.

Basic Options

--runs, -r (integer)

Number of measurements per model/quantization.

./run.py --runs 1              # Fast: only 1 measurement
./run.py --runs 5              # Accurate: 5 measurements (average)

Default: 3


--context, -c (integer)

Context length in tokens.

./run.py --context 4096        # 4K context
./run.py --context 32768       # 32K context

Default: 2048


--list-presets

List all available presets (readonly + user presets) and exit.

./run.py --list-presets

--preset, -p (string)

Load a preset before parsing all remaining CLI arguments. If omitted, default_classic is used. The legacy alias default still loads default_classic automatically.

./run.py --preset quick_test
./run.py --preset high_quality --runs 3
./run.py --preset default_classic
./run.py --preset default_compatability_test

Built-in readonly presets:

  • default_classic
  • default_compatability_test
  • default (alias for default_classic)
  • quick_test
  • high_quality
  • resource_limited

Readonly preset names cannot be saved, deleted, or imported as user presets. This restriction also applies to the legacy alias default.

For capability-driven runs across many models, individual model load failures are logged and skipped so the benchmark can continue with the remaining models.


--prompt, -P (string)

Default test prompt.

./run.py --prompt "Explain machine learning"
./run.py -P "Explain machine learning"

Default: "Is the sky blue?"


--limit, -l (integer)

Maximum number of models to test.

./run.py --limit 1             # Only 1 model (usually smallest)
./run.py --limit 5             # First 5 models

Default: None (all models)


--dev-mode

Development mode: Automatically tests the smallest model with 1 run.

./run.py --dev-mode            # Equivalent to --limit 1 --runs 1

Default: false


Filter Options

--only-vision

Test only models with vision capability (multimodal).

./run.py --only-vision --runs 2

Default: false


--only-tools

Test only models with tool-calling support.

./run.py --only-tools --runs 2

Default: false


--quants (string)

Test only specific quantizations (comma-separated).

./run.py --quants "q4,q5,q6"     # Only Q4/Q5/Q6
./run.py --quants "q8"           # Only Q8

Default: None (all quants)


--arch (string)

Test only specific architectures (comma-separated).

./run.py --arch "llama,mistral"  # Only Llama and Mistral
./run.py --arch "qwen"           # Only Qwen

Default: None (all architectures)


--params (string)

Test only specific parameter sizes (comma-separated).

./run.py --params "3B,7B,8B"     # 3B, 7B and 8B models
./run.py --params "1B"           # Only 1B models

Default: None (all sizes)


--min-context (integer)

Minimum context length in tokens.

./run.py --min-context 32000     # Only models with ≥32K context

Default: None (no minimum)


--max-size (float)

Maximum model size in GB.

./run.py --max-size 10.0         # Only models ≤10GB
./run.py --max-size 5.0          # Only models ≤5GB

Default: None (no limit)


--include-models (string)

Only test models matching the regex pattern.

./run.py --include-models "llama.*7b"      # All 7B Llama models
./run.py --include-models "qwen|phi"       # Qwen OR Phi

Default: None (all models)


--exclude-models (string)

Exclude models matching the regex pattern.

./run.py --exclude-models ".*uncensored.*" # No uncensored models
./run.py --exclude-models "test|exp"       # No test/experimental

Default: None (no exclusions)


--compare-with (string)

Compare with previous results.

./run.py --compare-with "20260104_172200.json"
./run.py --compare-with "latest"           # Latest result

Default: None (no comparison)


--rank-by (choice)

Sort results by metric.

Options: speed, efficiency, ttft, vram

./run.py --rank-by speed         # By tokens/s
./run.py --rank-by efficiency    # By tokens/s per GB VRAM
./run.py --rank-by ttft          # By Time to First Token
./run.py --rank-by vram          # By VRAM usage (low→high)

Default: speed



Cache Management

--retest

Ignore cache and retest all models.

./run.py --retest                # Overwrites old results

Default: false (uses cache if available)


--list-cache

Display all cached models and exit.

./run.py --list-cache

Output: Table with all cache entries


--export-cache (string)

Export cache contents as JSON.

./run.py --export-cache "cache_export.json"

Exits the program after export.


--export-only

Generate reports from cache without new tests.

./run.py --export-only           # Creates JSON/CSV/PDF/HTML

Default: false


Hardware Profiling

--enable-profiling

Enable hardware profiling (GPU temp & power).

./run.py --enable-profiling

Default: false


--max-temp (float)

Maximum GPU temperature in °C (warning).

./run.py --enable-profiling --max-temp 80.0

Default: None (no warning)


--max-power (float)

Maximum GPU power draw in Watts (warning).

./run.py --enable-profiling --max-power 400.0

Default: None (no warning)


--disable-gtt

Disable GTT (Shared System RAM) for AMD GPUs.

./run.py --disable-gtt           # Only dedicated VRAM

Default: false (GTT enabled)

Note: Only relevant for AMD iGPUs (e.g., Radeon 890M).


Inference Parameters

All override values from config files:

--temperature (float)

./run.py --temperature 0.7       # More creative responses
./run.py --temperature 0.0       # Deterministic

--top-k, --top-k-sampling (integer)

./run.py --top-k 50

--top-p, --top-p-sampling (float)

./run.py --top-p 0.95

--min-p, --min-p-sampling (float)

./run.py --min-p 0.05

--repeat-penalty (float)

./run.py --repeat-penalty 1.3

--max-tokens (integer)

./run.py --max-tokens 512

Load Config (Performance Tuning)

All override values from config files:

--n-gpu-layers (integer)

./run.py --n-gpu-layers -1       # All layers on GPU (default)
./run.py --n-gpu-layers 0        # CPU only
./run.py --n-gpu-layers 20       # First 20 layers on GPU

--n-batch (integer)

./run.py --n-batch 1024          # Larger batches (faster)
./run.py --n-batch 128           # Smaller batches (less VRAM)

--n-threads (integer)

./run.py --n-threads -1          # Auto (default)
./run.py --n-threads 8           # 8 CPU threads

--flash-attention / --no-flash-attention

./run.py --flash-attention       # Enabled (default)
./run.py --no-flash-attention    # Disabled

--rope-freq-base (float)

./run.py --rope-freq-base 10000.0

--rope-freq-scale (float)

./run.py --rope-freq-scale 1.0

--use-mmap / --no-mmap

./run.py --use-mmap              # Enabled (default)
./run.py --no-mmap               # Disabled

--use-mlock

./run.py --use-mlock             # Enabled (prevents swapping)

--kv-cache-quant (choice)

Options: f32, f16, q8_0, q4_0, q4_1, iq4_nl, q5_0, q5_1

./run.py --kv-cache-quant q8_0   # 8-bit quantization (saves VRAM)
./run.py --kv-cache-quant f16    # Half-precision (balanced)

Default: null (model default)


REST API Mode

Uses LM Studio REST API v1 instead of Python SDK/CLI.

--use-rest-api

./run.py --use-rest-api --limit 1

Benefits:

  • More detailed stats (TTFT, tok/s)
  • Stateful chats (response_id tracking)
  • Parallel requests (continuous batching)
  • MCP integration
  • Response caching

Default: false (uses SDK/CLI)


--api-token (string)

API permission token for REST API authentication.

./run.py --use-rest-api --api-token "lms_your_token_here"

Default: null (no token, server must be open)

Create: LM Studio → Settings → Server → Generate Token


--n-parallel (integer)

Max parallel predictions per model (REST API only).

./run.py --use-rest-api --n-parallel 8

Default: 4

Requirement: LM Studio 0.4.0+, continuous batching support


--unified-kv-cache

Enable unified KV cache (REST API only).

./run.py --use-rest-api --unified-kv-cache --n-parallel 8

Benefit: Optimizes VRAM for parallel requests

Default: false


Examples

Quick Test of One Model

./run.py --limit 1 --runs 1
# Or shorter:
./run.py --dev-mode

All 7B Llama Models with Q4/Q5/Q6 Quants

./run.py --include-models "llama.*7b" --quants "q4,q5,q6" --runs 2

Vision Models Only with Hardware Profiling

./run.py --only-vision --enable-profiling --max-temp 80.0 --max-power 400.0

REST API with Parallel Requests

./run.py --use-rest-api --n-parallel 8 --unified-kv-cache --limit 5

Export Without New Tests

./run.py --export-only

Custom Inference Parameters

./run.py --temperature 0.7 --top-p 0.95 --max-tokens 512 --limit 3

Preset Workflow

./run.py --list-presets
./run.py --preset quick_test
./run.py --preset resource_limited --max-size 10 --runs 2

Performance Tuning (VRAM-optimized)

./run.py --n-batch 128 --kv-cache-quant q8_0 --limit 5

Manage Cache

./run.py --list-cache                     # Display cache contents
./run.py --export-cache "backup.json"     # Export cache
./run.py --retest --limit 1               # Ignore cache

Configuration Priority

  1. CLI Arguments (highest priority)
  2. User Config (~/.config/lm-studio-bench/defaults.json)
  3. Project Config (config/defaults.json)
  4. Hard-coded Defaults (in code)

Example:

# User config has "num_runs": 5
# Project config has "num_runs": 3
./run.py --runs 1     # → uses 1 (CLI overrides)
./run.py              # → uses 5 (from user config)

Tips & Best Practices

1. Persistent REST API Config

If you mainly use REST API:

config/defaults.json:

{
  "lmstudio": {
    "use_rest_api": true,
    "api_token": "lms_your_token"
  }
}

Then simply:

./run.py --limit 1   # automatically uses REST API

2. VRAM Optimization

When VRAM is limited:

./run.py --kv-cache-quant q8_0 --n-batch 128 --max-size 10.0

3. Fast Development

./run.py --dev-mode   # Tests only smallest model with 1 run

4. Reproducible Benchmarks

./run.py --temperature 0.0 --runs 5 --retest

5. Hardware Monitoring

./run.py --enable-profiling --max-temp 80.0 --max-power 400.0

Logging Configuration

The benchmark tool generates timestamped log files for debugging and monitoring.

Log File Locations

~/.local/share/lm-studio-bench/logs/
├── benchmark_YYYYMMDD_HHMMSS.log    # Benchmark execution logs
└── webapp_YYYYMMDD_HHMMSS.log       # Web dashboard logs

Log Format

Each log entry follows this format:

YYYY-MM-DD HH:MM:SS,mmm - LEVEL - LEVEL_ICON message
2026-03-22 13:35:32,445 - INFO - ℹ️ Starting benchmark...
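
A formatter producing this layout can be sketched with Python's standard logging module (the icon mapping mirrors the Level Icons table below; this is illustrative, not the tool's actual code):

import logging

ICONS = {"DEBUG": "🐛", "INFO": "ℹ️", "WARNING": "⚠️", "ERROR": "❌", "CRITICAL": "🔥"}

class IconFormatter(logging.Formatter):
    # Inject the level icon so the format string can reference %(icon)s.
    def format(self, record: logging.LogRecord) -> str:
        record.icon = ICONS.get(record.levelname, "")
        return super().format(record)

handler = logging.StreamHandler()
handler.setFormatter(IconFormatter("%(asctime)s - %(levelname)s - %(icon)s %(message)s"))
logger = logging.getLogger("bench")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Starting benchmark...")
# 2026-03-22 13:35:32,445 - INFO - ℹ️ Starting benchmark...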

Log Levels

The tool uses standard Python logging levels:

| Level | Usage | Examples |
|---|---|---|
| INFO | General information and progress | Model loading, benchmark completion, hardware metrics |
| WARNING | Non-fatal issues and fallbacks | GPU tool missing, CLI fallback in use, skipped models |
| ERROR | Runtime errors requiring attention | Model load failure, API unavailable, VRAM exceeded |

Level Icons

Each log level also gets an automatic icon prefix:

| Level | Icon |
|---|---|
| DEBUG | 🐛 |
| INFO | ℹ️ |
| WARNING | ⚠️ |
| ERROR | ❌ |
| CRITICAL | 🔥 |

Hardware Metrics in Logs

When hardware profiling is enabled (--enable-profiling), metrics appear with emoji indicators:

🌡️ GPU Temp: 42°C
⚡ GPU Power: 125W
💾 GPU VRAM: 8.2GB
🧠 GPU GTT: 0.0GB
🖥️ CPU: 35.2%
💾 RAM: 18.5GB

Third-Party Library Logging

The following libraries have suppressed debug output for cleaner logs:

| Library | Level | Reason |
|---|---|---|
| httpx | WARNING | HTTP client noise |
| lmstudio | WARNING | SDK debug output |
| urllib3 | WARNING | HTTP library noise |
| websockets | WARNING | WebSocket protocol noise |

Viewing Logs

Real-time monitoring:

# Watch benchmark execution
tail -f ~/.local/share/lm-studio-bench/logs/benchmark_*.log

# Watch web dashboard
tail -f ~/.local/share/lm-studio-bench/logs/webapp_*.log

Search and filter:

# Find errors
grep ERROR ~/.local/share/lm-studio-bench/logs/benchmark_*.log

# Find warnings
grep WARNING ~/.local/share/lm-studio-bench/logs/benchmark_*.log

# Find specific model errors
grep "model_name_pattern" \
  ~/.local/share/lm-studio-bench/logs/benchmark_*.log

# Count log entries by level
grep -c INFO ~/.local/share/lm-studio-bench/logs/benchmark_*.log
grep -c ERROR ~/.local/share/lm-studio-bench/logs/benchmark_*.log

Hardware Monitoring Live Charts - Guide

✅ Status: Fully Implemented with GPU Detection

Hardware monitoring is now fully functional with stable live charts for all metrics and improved GPU model detection.

Monitoring logic is shared in tools/hardware_monitor.py and used by both classic benchmark flows and capability-driven agent flows.

📊 Implemented Metrics

GPU Detection and Model Info

The system automatically detects all installed GPUs:

  1. NVIDIA GPUs

    • Detection: nvidia-smi --query-gpu=name
    • VRAM: nvidia-smi --query-gpu=memory.total
    • Temperature: nvidia-smi --query-gpu=temperature.gpu
    • Power: nvidia-smi --query-gpu=power.draw
  2. AMD GPUs

    • rocm-smi detection: rocm-smi --showproductname
    • Device ID mapping: lspci -d 1002:{device_id}
    • Example: 1002:150e → "Radeon Graphics (Ryzen 9 7950X3D)"
    • rocm-smi search path: /usr/bin, /usr/local/bin, /opt/rocm-*/bin/
    • VRAM: rocm-smi --showmeminfo vram
    • GTT: rocm-smi --showmeminfo gtt
    • Temperature: rocm-smi --showtemp
  3. iGPU detection

    • Extract from CPU string: regex r'Radeon\s+(\d+[A-Za-z]*)'
    • Shows integrated Radeon graphics separately
    • Prevents redundancy with dedicated GPUs

GPU Metrics

  1. 🌡️ GPU Temperature (°C) - Red

    • NVIDIA: nvidia-smi --query-gpu=temperature.gpu
    • AMD: rocm-smi --showtemp
    • Intel: intel-gpu-top (if available)
  2. ⚡ GPU Power (W) - Blue

    • NVIDIA: nvidia-smi --query-gpu=power.draw
    • AMD: rocm-smi (Current Socket Graphics Package Power)
    • Intel: alternative measurement methods
  3. 💾 GPU VRAM Usage (GB) - Green

    • NVIDIA: nvidia-smi --query-gpu=memory.used
    • AMD: rocm-smi --showmeminfo vram (in bytes)
  4. 🧠 GPU GTT Usage (GB) - Purple

    • AMD only: rocm-smi --showmeminfo gtt
    • System RAM that is used as VRAM
    • Example: 2GB VRAM + 46GB GTT = 48GB effective

System Metrics (with --enable-profiling)

  1. 🖥️ CPU Usage (%) - Orange

    • psutil.cpu_percent(interval=0.1)
    • 0-100% range
    • System-wide, not per process
  2. 💾 System RAM Usage (GB) - Cyan

    • psutil.virtual_memory().used
    • Smoothing: moving average over 3 samples
    • Prevents spikes from cache/buffer fluctuations
    • Very stable curves
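
The RAM smoothing mentioned above is a plain moving average; a sketch with a window of 3 samples:

from collections import deque

class MovingAverage:
    # Average over the last `window` samples; a single spike is diluted
    # instead of making the curve jump.
    def __init__(self, window: int = 3):
        self.samples: deque = deque(maxlen=window)

    def add(self, value: float) -> float:
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)

ram = MovingAverage(window=3)
for gb in [18.2, 28.3, 18.5, 18.4]:   # one cache/buffer spike
    print(f"{ram.add(gb):.1f} GB")    # smoothed: 18.2, 23.2, 21.7, 21.7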

🔧 Activation

Hardware monitoring is automatically enabled with:

# WebApp with hardware monitoring
./run.py --webapp

# CLI with hardware monitoring
./run.py --enable-profiling

# Only with specific models
./run.py --limit 2 --enable-profiling

📝 Logger Output

When --enable-profiling is active, the benchmark prints metrics every second:

🌡️ GPU Temp: 45.3°C
⚡ GPU Power: 125.5W
💾 GPU VRAM: 8.2GB
🧠 GPU GTT: 0.0GB
🖥️ CPU: 35.2%
💾 RAM: 18.5GB

These outputs are:

  • ✅ Saved in ~/.local/share/lm-studio-bench/logs/benchmark_YYYYMMDD_HHMMSS.log
  • ✅ Shown in the WebApp terminal
  • ✅ Visualized as charts

🎯 Data Flow

Backend (cli/benchmark.py / agents/benchmark.py)
   ↓
Shared Module (tools/hardware_monitor.py)
  ↓
HardwareMonitor._monitor_loop()
  ├─ _get_temperature()
  ├─ _get_power_draw()
  ├─ _get_vram_usage()
  ├─ _get_gtt_usage()
  ├─ _get_cpu_usage()
  └─ _get_ram_usage()
       ↓
logger.info() → stdout + log file
       ↓
WebApp Backend (app.py)
  ├─ _consume_output() Task (blocking readline)
  ├─ parse_hardware_metrics() (Regex patterns)
  └─ hardware_history dict
       ↓
WebSocket
  └─ Sends every 2 seconds (last 60 entries)
       ↓
Frontend (dashboard.html.jinja)
  └─ 6 Plotly.js charts with live updates

Before each profiling run, HardwareMonitor.start() calls _reset_measurements(). This clears prior temperature, power, VRAM, GTT, CPU and RAM samples, so chart data and exported min/max/avg values only reflect the current run.
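
The parse_hardware_metrics() step in the WebApp backend boils down to regex extraction from the emoji log lines shown above; a minimal sketch (pattern names are illustrative, not the tool's actual code):

import re

METRIC_PATTERNS = {
    "gpu_temp_c": re.compile(r"GPU Temp:\s*([\d.]+)"),
    "gpu_power_w": re.compile(r"GPU Power:\s*([\d.]+)"),
    "gpu_vram_gb": re.compile(r"GPU VRAM:\s*([\d.]+)"),
    "cpu_percent": re.compile(r"\bCPU:\s*([\d.]+)"),
    "ram_gb": re.compile(r"\bRAM:\s*([\d.]+)"),   # \b avoids matching "VRAM:"
}

def parse_hardware_line(line: str) -> dict:
    # Return every metric found in a single log line.
    return {
        name: float(m.group(1))
        for name, pattern in METRIC_PATTERNS.items()
        if (m := pattern.search(line))
    }

print(parse_hardware_line("🌡️ GPU Temp: 45.3°C"))   # {'gpu_temp_c': 45.3}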

🐛 Fixes and Optimizations

Fix 1: rocm-smi 7.0.1 Format Change

Problem: rocm-smi changed its output format.
Solution: a regex parser extracts the last number from the line:

match = re.search(r'[\d.]+\s*$', line.strip())

Fix 2: Logger Routing

Problem: hardware data did not appear in log files.
Solution: print() calls were replaced with logger.info() so output goes to both stdout and the log file.

All hardware metrics are logged using Python's standard logging module:

logger.info(f"🌡️ GPU Temp: {temp:.1f}°C")
logger.info(f"💾 Memory: {vram_mb:.1f}MB VRAM + {gtt_mb:.1f}MB GTT")

This ensures metrics appear in all three destinations:

  • stdout - Real-time display in terminal
  • log files - ~/.local/share/lm-studio-bench/logs/benchmark_YYYYMMDD_HHMMSS.log for permanent record
  • WebApp - Streamed via WebSocket to dashboard

Fix 3: WebApp Output Streaming

Problem: the WebApp showed only 10% of the hardware data.
Solution: asyncio.wait_for() was replaced with a blocking readline() running in an executor.

Fix 4: RAM Monitoring Spikes

Problem: the RAM chart jumped between 1.8GB and 28.3GB.
Solution: a moving average over 3 samples produces a very stable curve.

Fix 5: Runtime Counter Does Not Stop

Problem: the runtime counter kept running after the benchmark ended.
Solution: clearInterval(uptimeInterval) is called on completion.

Fix 6: WebApp Initialization Race Conditions

Problem: links were not interactive, and light mode appeared on startup.
Solution: three separate DOMContentLoaded handlers were consolidated into one.

📊 Chart Properties

All charts update every 2 seconds with:

  • Min/Max/Avg statistics - real-time calculation
  • Last 60 data points - about 2 minutes of history
  • Responsive design - adapts to window size
  • Dark mode - default for all charts
  • Hover tooltips - show exact values on hover

LM Studio CLI - Available LLM Metadata with GPU Analysis

📋 Quick Reference

Main metadata query commands

lms ls --json           # All downloaded models with metadata
lms ps --json           # Currently loaded models
lms status              # Server status + model size
lms version             # LM Studio version

🎯 GPU Support and Hardware Requirements

Automatic GPU detection in the benchmark

The benchmark system automatically detects all your GPUs and specs:

NVIDIA GPUs:

  • Automatic detection via nvidia-smi
  • VRAM size recorded for offload optimization
  • Temperature and power are monitored

AMD GPUs (rocm-smi):

  • Detailed device ID mapping for GPU model names
  • VRAM and GTT memory are tracked separately
  • rocm-smi search paths: /usr/bin, /usr/local/bin, /opt/rocm-*/bin/

iGPU detection:

  • Radeon iGPUs are extracted from the CPU string
  • Regex pattern: Radeon\s+(\d+[A-Za-z]*)
  • Shows, for example, "Radeon 890M (Ryzen 9 7950X3D)" separately
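
Applying that pattern is straightforward (illustrative sketch):

import re

def extract_igpu(cpu_string: str):
    # Pull the integrated Radeon model out of the CPU descriptor string.
    match = re.search(r"Radeon\s+(\d+[A-Za-z]*)", cpu_string)
    return f"Radeon {match.group(1)}" if match else None

print(extract_igpu("AMD Ryzen AI 9 HX 370 w/ Radeon 890M"))  # Radeon 890M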

📊 Full Metadata Fields (16 fields per model)

Category 1: Model identification (5 fields)

| Field | Type | Example | Description |
|---|---|---|---|
| type | string | "llm" | Model type (llm, embedding) |
| modelKey | string | "mistralai/ministral-3-3b" | Unique model ID |
| displayName | string | "Ministral 3 3B" | Display name |
| publisher | string | "mistralai" | Model publisher/developer |
| path | string | "mistralai/ministral-3-3b" | Local storage path |

Category 2: Technical specifications (4 fields)

| Field | Type | Example | Description |
|---|---|---|---|
| architecture | string | "mistral3", "gemma3", "llama" | Model architecture |
| format | string | "gguf" | File format (GGUF, etc.) |
| paramsString | string | "3B", "7B", "13B" | Parameter size |
| sizeBytes | number | 2986817071 | Size in bytes |

Category 3: Model capabilities (3 fields)

| Field | Type | Example | Description |
|---|---|---|---|
| vision | boolean | true / false | Can process images? |
| trainedForToolUse | boolean | true / false | Supports tool calling? |
| maxContextLength | number | 131072, 262144 | Maximum context length in tokens |

Category 4: Quantization and variants (4 fields)

| Field | Type | Example | Description |
|---|---|---|---|
| quantization.name | string | "Q4_K_M", "Q8_0", "F16" | Quantization method |
| quantization.bits | number | 4, 8, 16 | Bits per weight |
| variants | array | [@q4_k_m, @q8_0] | All available quantizations |
| selectedVariant | string | "mistralai/ministral-3-3b@q4_k_m" | Currently selected variant |

🔍 Practical Examples with Your Models

Example 1: List vision models

lms ls --json | jq '.[] | select(.vision == true) | {displayName, paramsString, maxContextLength}'

Output:

  • Gemma 3 4B (4B) - 131072 tokens
  • Ministral 3 3B (3B) - 262144 tokens
  • Qwen3 Vl 8B (8B) - 262144 tokens

Example 2: Tool-calling models only

lms ls --json | jq '.[] | select(.trainedForToolUse == true) | .displayName'

Example 3: Sort models by size

lms ls --json | jq 'sort_by(.sizeBytes) | .[] | {displayName, sizeGB: (.sizeBytes/1024/1024/1024*100|round/100)}'

Example 4: Models with large context length (≥128k tokens)

lms ls --json | jq '.[] | select(.maxContextLength >= 131072) | {modelKey, maxContextLength}'

Example 5: Model architecture distribution

lms ls --json | jq -r '.[] | .architecture' | sort | uniq -c

🐍 Python SDK Access

SDK methods for metadata queries

import lmstudio

# 1. Fetch all downloaded models
models = lmstudio.list_downloaded_models()
for model in models:
    print(f"Model: {model.model_key}")
    print(f"  Size: {model.info.sizeBytes / 1024**3:.2f} GB")
    print(f"  Vision: {model.info.vision}")
    print(f"  Maximum context length: {model.info.maxContextLength} tokens")
    print(f"  Architecture: {model.info.architecture}")
    print()

# 2. Currently loaded models
loaded_models = lmstudio.list_loaded_models()
for llm in loaded_models:
    print(f"Loaded: {llm.identifier}")

# 3. Filter models
vision_models = [m for m in models if m.info.vision]
print(f"Vision models: {len(vision_models)}")

# 4. Sort by size
large_models = sorted(models, key=lambda m: m.info.sizeBytes, reverse=True)[:3]
for model in large_models:
    print(f"{model.info.displayName}: {model.info.sizeBytes / 1024**3:.2f} GB")

💡 Common Use Cases

Use case 1: Quick performance tests

Filter only small models < 1GB for fast benchmarks:

lms ls --json | jq '.[] | select(.sizeBytes < 1000000000) | .modelKey'

Use case 2: Long-form processing

Models with large context for document analysis:

lms ls --json | jq '.[] | select(.maxContextLength >= 100000) | .displayName'

Use case 3: Image processing

Multi-modal models for vision tasks:

lms ls --json | jq '.[] | select(.vision == true) | .modelKey'

Use case 4: Tool integration

Models with function calling for agent systems:

lms ls --json | jq '.[] | select(.trainedForToolUse == true) | .displayName'

Use case 5: Quantization comparison

All available quantizations for a model:

lms ls "google/gemma-3-1b" --json | jq '.variants[]'

🎯 Benchmarking with Metadata

Integration into benchmark scripts:

import subprocess
import json

# Load model metadata
result = subprocess.run(
    ['lms', 'ls', '--json'],
    capture_output=True,
    text=True,
    check=False
)
models = json.loads(result.stdout)

# Filter for benchmarking
benchmark_candidates = [
    m for m in models
    if m['sizeBytes'] < 5e9  # < 5GB
    and m['vision'] is False  # Text only
]

print(f"Benchmark candidates: {len(benchmark_candidates)}")
for model in benchmark_candidates:
    print(f"  - {model['displayName']} ({model['paramsString']})")

📝 Tips and Tricks

Convert size

# Bytes to GB
python3 -c "print(f'{2986817071/1024**3:.2f} GB')"  # Output: 2.78 GB

JSON pretty print

lms ls --json | jq '.' | less

Quick statistics

# Average model size
lms ls --json | jq '[.[].sizeBytes] | add / length / 1024 / 1024 / 1024'

# Largest model
lms ls --json | jq 'max_by(.sizeBytes) | .displayName'

# Models per architecture
lms ls --json | jq 'group_by(.architecture) | map({architecture: .[0].architecture, count: length})'

Other useful commands

lms status              # Server status (shows loaded models too)
lms version             # LM Studio version
lms load <model>        # Load a model
lms unload --all        # Unload all models

Troubleshooting

No output for lms ls --json

  • Ensure the LM Studio server is running: lms server start
  • Check for port conflicts

jq not installed

  • Install: sudo apt install jq (Linux) or brew install jq (macOS)
  • Alternative: use Python parsing

Output too long

  • Use | head -n 5 to limit
  • Or pipe to less for paging: | less

User Data & Configuration Locations

This project follows the XDG Base Directory Specification for storing user data and configuration.


Directory Structure

Project Directory

The project directory contains read-only defaults and optional compatibility locations:

<project>/
├── config/
│   └── defaults.json       # Project defaults (in Git)
├── results/                # Optional: legacy/manual compatibility location
└── logs/                   # Optional: legacy/manual debug location

User Directories (XDG Standard)

User-specific data is stored in standard XDG locations:

~/.config/lm-studio-bench/
├── defaults.json           # User configuration overrides (optional)
└── presets/
    ├── my_fast_test.json   # User preset example
    └── my_quality.json     # User preset example

~/.local/share/lm-studio-bench/results/
├── benchmark_results_<timestamp>.json
├── benchmark_results_<timestamp>.csv
├── benchmark_results_<timestamp>.pdf
├── benchmark_results_<timestamp>.html
├── benchmark_cache.db      # SQLite benchmark cache
├── model_metadata.db       # Model metadata cache
└── metadata/
    └── <model_id>/
        └── metadata.json   # Optional per-model metadata fallback

~/.local/share/lm-studio-bench/logs/
├── benchmark_<timestamp>.log
├── benchmark_latest.log    # Symlink to newest benchmark log
├── webapp_<timestamp>.log
├── webapp_latest.log       # Symlink to newest webapp log
├── runapp_<timestamp>.log
├── runapp_latest.log       # Symlink to newest launcher log
├── trayapp_<timestamp>.log
└── trayapp_latest.log      # Symlink to newest tray log

Configuration Loading

Configuration is loaded with the following priority:

  1. CLI Arguments (highest priority)
  2. User Config (~/.config/lm-studio-bench/defaults.json)
  3. Project Config (config/defaults.json)
  4. Hard-coded Defaults (in code)

Example

Project (config/defaults.json):

{
  "num_runs": 3,
  "context_length": 2048,
  "lmstudio": {
    "use_rest_api": false
  }
}

User (~/.config/lm-studio-bench/defaults.json):

{
  "num_runs": 5,
  "lmstudio": {
    "use_rest_api": true
  }
}

Result (merged configuration):

{
  "num_runs": 5,              // User override
  "context_length": 2048,     // Project default
  "lmstudio": {
    "use_rest_api": true      // User override
  }
}

With CLI:

./run.py --runs 10 --context 4096

Final configuration:

  • num_runs: 10 (CLI)
  • context_length: 4096 (CLI)
  • use_rest_api: true (User config)

Creating User Configuration

Step 1: Create Config Directory

mkdir -p ~/.config/lm-studio-bench

Step 2: Create User Config File

nano ~/.config/lm-studio-bench/defaults.json

Step 3: Add Your Overrides

Only include fields you want to override:

{
  "num_runs": 5,
  "context_length": 4096,
  "inference": {
    "temperature": 0.7
  }
}

Important: You only need to specify fields you want to change. All other values will use project defaults.


Directory Initialization

On first run, the tool automatically:

  1. Creates user data directories (~/.config/... and ~/.local/share/...)
  2. Places new results in ~/.local/share/lm-studio-bench/results/
  3. Places runtime logs in ~/.local/share/lm-studio-bench/logs/

Note: Legacy files in project-local results/ are not automatically moved. If you still use that location, move them manually to the XDG path.


Benefits of XDG Structure

For Users

  • Persistent User Settings: Configuration survives project updates
  • Cleaner Project Directory: User data separated from code
  • Standard Locations: Follows Linux conventions
  • Easy Backups: Backup ~/.local/share/lm-studio-bench/ and ~/.config/lm-studio-bench/
  • Multi-User Support: Each user has their own data

For Developers

  • No Git Conflicts: User data not in version control
  • Clean Updates: git pull doesn't affect user data
  • Portable: Project directory can be moved/deleted without losing user data

Environment Variables

You can override paths with environment variables:

# Override config directory
export XDG_CONFIG_HOME="$HOME/my-configs"

# Override data directory
export XDG_DATA_HOME="$HOME/my-data"

# Now config is in: $HOME/my-configs/lm-studio-bench/defaults.json
# Now results are in: $HOME/my-data/lm-studio-bench/results/

FAQ

Q: Where are my benchmark results stored?

A: ~/.local/share/lm-studio-bench/results/

If you pass --output-dir, report files (JSON/CSV/HTML/PDF) are written there. The SQLite cache databases still live in the user results directory.

Q: Where are the SQLite databases stored?

A:

  • ~/.local/share/lm-studio-bench/results/benchmark_cache.db
  • ~/.local/share/lm-studio-bench/results/model_metadata.db

Q: Where do I put custom configuration?

A: ~/.config/lm-studio-bench/defaults.json

Only include fields you want to override from project defaults.

Q: Where are user presets stored?

A: ~/.config/lm-studio-bench/presets/

Built-in readonly presets (default_classic, default_compatibility_test, default as a legacy alias, quick_test, high_quality, resource_limited) are not stored as files.

Readonly preset names cannot be overwritten or deleted by user presets, including the alias default.

Q: What happens to my old results?

A: They are not auto-migrated from legacy project-local folders. Move them manually to ~/.local/share/lm-studio-bench/results/.

Q: Can I use the old config/defaults.json?

A: Yes! It's still used as project defaults. User config in ~/.config/ overrides it.

Q: How do I reset to project defaults?

A: Delete your user config:

rm ~/.config/lm-studio-bench/defaults.json

Q: How do I backup my data?

A: Backup these directories:

# Configuration
tar -czf lms-bench-config.tar.gz ~/.config/lm-studio-bench/

# Results and cache
tar -czf lms-bench-data.tar.gz ~/.local/share/lm-studio-bench/

Q: What about logs?

A: Logs are stored in:

~/.local/share/lm-studio-bench/logs/

This includes benchmark, web app, tray, and launcher logs.


LM Studio REST API v1 Integration

Overview

The benchmark tool now supports LM Studio's native REST API v1 (/api/v1/*) in addition to the existing Python SDK/CLI mode. This enables advanced features such as stateful chats, parallel requests, and more precise metrics.

New Features

1. REST API Mode (--use-rest-api)

  • Uses /api/v1/chat for inference instead of the Python SDK
  • Stateful chat management (response_id tracking)
  • Detailed stats in the response (TTFT, tokens/s, tokens in/out)
  • Streaming events for more accurate measurement

2. Model Management via API

  • GET /api/v1/models — list with capabilities (vision, tool-use)
  • POST /api/v1/models/load — explicit load with configuration
  • POST /api/v1/models/unload — explicit unload
  • POST /api/v1/models/download — download model via API

3. Improved Capabilities Detection

  • Vision support: capabilities.vision flag from the API
  • Tool calling: capabilities.trained_for_tool_use flag
  • Use the --only-vision or --only-tools filters

4. Parallel Inference (LM Studio 0.4.0+)

  • --n-parallel N — max concurrent predictions (default: 4)
  • --unified-kv-cache — optimizes VRAM usage for parallel requests
  • Continuous batching support (llama.cpp 2.0+)

5. API Authentication

  • --api-token TOKEN — permission key for protected servers
  • Config: lmstudio.api_token in config/defaults.json

Usage

Basic usage (REST API mode)

# REST API with default settings
./run.py --use-rest-api --limit 1

# With API token
./run.py --use-rest-api --api-token "your-token-here" --limit 1

# With parallel requests (LM Studio 0.4.0+)
./run.py --use-rest-api --n-parallel 8 --unified-kv-cache --limit 1

Filter by capabilities

# Test only vision-capable models
./run.py --use-rest-api --only-vision --runs 2

# Test only tool-calling models
./run.py --use-rest-api --only-tools --runs 2

Config file (persistent)

config/defaults.json:

{
  "lmstudio": {
    "host": "localhost",
    "ports": [1234, 1235],
    "api_token": "your-token-here",
    "use_rest_api": true
  }
}

Then simply:

./run.py --limit 1  # will automatically use REST API from config

Comparison: SDK vs. REST API

| Feature | SDK/CLI Mode | REST API Mode |
|---|---|---|
| Model loading | lms load CLI | POST /api/v1/models/load |
| Inference | lmstudio.llm() | POST /api/v1/chat |
| Stats | SDK stats object | Detailed response stats |
| Streaming | SDK stream | SSE stream (Server-Sent Events) |
| Parallel requests | ❌ | ✅ (with --n-parallel) |
| Stateful chats | ❌ | ✅ (response_id tracking) |
| Capabilities | Metadata parsing | Native API fields |
| Authentication | ❌ | ✅ (permission keys) |

API Response Format

Dashboard summary API (/api/dashboard/stats)

The web dashboard now exposes additional summary fields for quick visual analysis of benchmark history. The endpoint is consumed by the Home and Results views to render KPI cards and charts.

New response fields:

  • speed_summary: min, p50, avg, p95, max tokens/s
  • top_models_extended: Top 10 models by speed (model, quantization, speed, VRAM, architecture)
  • quantization_distribution: count per quantization
  • architecture_distribution: count per architecture
  • efficiency_top: top models ranked by tokens_per_sec_per_gb

Example (excerpt):

{
  "speed_summary": {
    "min": 22.44,
    "p50": 48.17,
    "avg": 51.26,
    "p95": 86.11,
    "max": 93.88
  },
  "top_models_extended": [
    {
      "model_name": "qwen/qwen3-4b@q4_k_m",
      "quantization": "q4_k_m",
      "speed": 93.88,
      "vram_mb": "6144",
      "architecture": "qwen3"
    }
  ],
  "quantization_distribution": {
    "q4_k_m": 22,
    "q5_k_m": 13
  }
}

/api/v1/chat stats

{
  "text": "... generated text ...",
  "stats": {
    "tokens_in": 42,
    "tokens_out": 128,
    "time_to_first_token_ms": 234.5,
    "total_time_ms": 1523.8,
    "tokens_per_second": 84.02
  }
}

/api/v1/models capabilities

{
  "models": [
    {
      "key": "llava-1.6-vicuna-7b-q4_k_m",
      "capabilities": {
        "vision": true,
        "trained_for_tool_use": false
      }
    },
    {
      "key": "qwen-2.5-coder-14b-instruct-q5_k_m",
      "capabilities": {
        "vision": false,
        "trained_for_tool_use": true
      }
    }
  ]
}

Implementation details

New files

  • core/client.py: REST API client with wrapper functions
    • LMStudioRESTClient: main class
    • ModelInfo, ModelCapabilities, ChatStats: data classes
    • is_vision_model(), is_tool_model(): helpers

Modified files

  • cli/benchmark.py:

    • _run_inference(): dispatcher (SDK vs REST)
    • _run_inference_rest(): REST-based inference
    • _run_inference_sdk(): SDK-based inference (renamed)
    • _load_model_rest(), _unload_model_rest(): REST model management
  • config/defaults.json: added api_token, use_rest_api fields

  • core/config.py: new config fields in BASE_DEFAULT_CONFIG

CLI flags

--use-rest-api              Enable REST API mode
--api-token TOKEN           API permission token
--n-parallel N              Max parallel predictions (REST only)
--unified-kv-cache          Unified KV cache (REST only)

Troubleshooting

Server unreachable

# Check whether LM Studio is running
curl http://localhost:1234/

# Healthcheck via CLI
lms server status

API token errors

# Generate token in Settings > Server
# Save it in config or pass via CLI
./run.py --use-rest-api --api-token "lms_..."

REST vs SDK performance

  • REST: more precise stats, more features
  • SDK: slightly faster (direct Python access)
  • For benchmarking, REST is recommended (better metrics)

Additional REST Client Features

1. Download Progress Tracking

The REST client now supports real-time download progress monitoring:

from core.client import LMStudioRESTClient

client = LMStudioRESTClient()

def on_progress(status):
    if status["state"] == "downloading":
        print(f"Progress: {status['progress'] * 100:.1f}%")

# Wait for download to complete with progress updates
success = client.download_model(
    model_key="qwen/qwen3-1.7b",
    wait_for_completion=True,
    progress_callback=on_progress
)

API: Polls /api/v1/models/download/status every 2 seconds until completion.

2. MCP Integration

Model Context Protocol (MCP) servers can now be attached to chat requests:

# LM Studio v1 API format
mcp_integrations = [
    {
        "type": "ephemeral_mcp",
        "server_label": "filesystem",
        "server_url": "http://localhost:3001/mcp"
    }
]

result = client.chat_stream(
    messages=[{"role": "user", "content": "List files in /tmp"}],
    model="qwen/qwen3-4b",
    mcp_integrations=mcp_integrations
)

Note: Requires MCP server running. Integrations are passed in the integrations array field.

3. Stateful Chat History

Enable multi-turn conversations with automatic response_id tracking:

client = LMStudioRESTClient()

# First message
result1 = client.chat_stream(
    messages=[{"role": "user", "content": "What is 2+2?"}],
    model="qwen/qwen3-4b",
    use_stateful=True
)
# response_id stored automatically

# Second message - automatically includes previous_response_id
result2 = client.chat_stream(
    messages=[{"role": "user", "content": "Add 3 to that."}],
    model="qwen/qwen3-4b",
    use_stateful=True
)
# Server can maintain conversation context

# Reset state when starting new conversation
client.reset_stateful_chat()

API: Extracts response_id from chat.end event, sends previous_response_id in subsequent requests.

4. Response Caching

Identical requests are cached in memory for instant responses:

client = LMStudioRESTClient(enable_cache=True)

# First request - hits API (slow)
result1 = client.chat_stream(
    messages=[{"role": "user", "content": "Count to 5"}],
    model="qwen/qwen3-4b",
    temperature=0.5
)
# Time: ~0.5s

# Second identical request - hits cache (instant)
result2 = client.chat_stream(
    messages=[{"role": "user", "content": "Count to 5"}],
    model="qwen/qwen3-4b",
    temperature=0.5
)
# Time: ~0.0s (10,000x faster!)

# Cache management
cache_size = len(client._RESPONSE_CACHE)  # Check cache size
cleared = client.clear_cache()             # Clear all cached responses

Cache Key: MD5 hash of (messages, model, temperature)
Bypassed: When using use_stateful=True or mcp_integrations (non-deterministic)
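
A sketch of such a key (illustrative; the client's internal implementation may differ):

import hashlib
import json

def response_cache_key(messages: list, model: str, temperature: float) -> str:
    # MD5 over the request-identifying fields; identical requests
    # hash to the same key and hit the in-memory cache.
    payload = json.dumps(
        {"messages": messages, "model": model, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.md5(payload.encode()).hexdigest()

print(response_cache_key([{"role": "user", "content": "Count to 5"}],
                         "qwen/qwen3-4b", 0.5))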

Capability-Driven Benchmark Agent Integration

The new Capability-Driven Benchmark Agent functionality is fully integrated into the project and is now available via run.py.

3 Operating Modes

The system now supports 3 different operating modes:

1. Classic Benchmark (Default)

Measures tokens/s across all installed models:

./run.py --limit 5              # Test 5 models
./run.py --export-only          # Generate reports from cache
./run.py --runs 1               # Fast-mode with 1 measurement

Metrics: Tokens/s, latency, VRAM usage

2. Capability-Driven Agent ⭐ NEW

Tests model capabilities with quality metrics:

./run.py --agent "model-id"     # Automatically test all capabilities

# With specific capabilities
./run.py --agent "llama-13b" --capabilities general_text,reasoning

# With output format options
./run.py --agent "llama-13b" --output-dir ./results/ --formats json,html

# Verbose mode
./run.py --agent "llama-13b" --verbose

Detectable Capabilities:

  • general_text - Basic language understanding (QA, summarization, classification)
  • reasoning - Logical and mathematical reasoning
  • vision - Multimodal understanding (image captioning, VQA, OCR)
  • tooling - Tool calling and function execution

Metrics per Capability:

  • Quality: ROUGE, F1, Exact Match, Accuracy, Function Call Accuracy
  • Performance: Tokens/s, latency
  • Reports: JSON + HTML with visualizations
  • Storage: SQLite database for historical tracking and comparison

Runtime Resilience:

  • Multi-model capability runs continue when a single model fails to load or execute; failed models are logged and skipped.
  • Embedding models are retried automatically without offload_kv_cache_to_gpu if LM Studio rejects that load option.

Data Storage:

Results are automatically saved to:

  • JSON Reports: ./output/benchmark_results_*.json
  • HTML Reports: ./output/benchmark_results_*.html
  • SQLite Cache: ~/.local/share/lm-studio-bench/results/benchmark_cache.db

The SQLite database stores individual test results and capability summaries, allowing you to:

  • Track performance over time
  • Compare results across models
  • Query specific capability metrics
  • Build custom dashboards from cached data (see the query sketch below)
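For example, a minimal query against the cache (table and column names as documented in the metrics matrix below) might look like this:

import sqlite3
from pathlib import Path

# Average quality per model and capability, read straight from the cache DB.
db = Path.home() / ".local/share/lm-studio-bench/results/benchmark_cache.db"
con = sqlite3.connect(db)
rows = con.execute(
    """
    SELECT model_name, capability, AVG(quality_score)
    FROM benchmark_results
    WHERE source = 'compatibility'
    GROUP BY model_name, capability
    """
).fetchall()
for model, capability, quality in rows:
    print(model, capability, quality)
con.close()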

SQLite Metrics Matrix (Classic vs Capability)

The table below lists what is currently persisted in SQLite for both test types, so missing metrics are easy to spot.

| Metric Group | Classic Benchmark (benchmark_results) | Capability Benchmark (benchmark_results, source='compatibility') |
|---|---|---|
| Run identity | id, model_key, model_name, quantization, timestamp | id, model_name, model_key, capability, test_id, test_name, timestamp |
| Throughput/latency | avg_tokens_per_sec, avg_ttft, avg_gen_time, tokens_per_sec_p50, tokens_per_sec_p95, tokens_per_sec_std, ttft_p50, ttft_p95, ttft_std | latency_ms, throughput_tokens_per_sec (per test), avg_latency_ms, avg_throughput (summary) |
| Token volume | prompt_tokens, completion_tokens | prompt_tokens, tokens_generated |
| Quality metrics | Stored for parity columns but normally NULL for classic runs | quality_score, rouge_score, f1_score, exact_match_score, accuracy_score, function_call_accuracy, avg_quality_score, avg_rouge, avg_f1, avg_exact_match, avg_accuracy |
| Success/failure | success, error_message, error_count | success, error_message (per test), total_tests, successful_tests, failed_tests, success_rate, error_count |
| Hardware profiling | gpu_type, gpu_offload, vram_mb, temp_celsius_min/max/avg, power_watts_min/max/avg, vram_gb_min/max/avg, gtt_gb_min/max/avg, cpu_percent_min/max/avg, ram_gb_min/max/avg | Same run-level hardware fields are persisted on each capability test row |
| Inference/load params | context_length, temperature, top_k_sampling, top_p_sampling, min_p_sampling, repeat_penalty, max_tokens, n_gpu_layers, n_batch, n_threads, flash_attention, rope_freq_base, rope_freq_scale, use_mmap, use_mlock, kv_cache_quant | Same run-level inference/load fields are persisted on each capability test row |
| Environment/version | lmstudio_version, app_version, nvidia_driver_version, rocm_driver_version, intel_driver_version, os_name, os_version, cpu_model, python_version | Same environment/version fields are persisted on each capability test row |
| Derived/comparison | tokens_per_sec_per_gb, tokens_per_sec_per_billion_params, speed_delta_pct, prev_timestamp | Same derived/comparison fields are persisted on each capability test row |
| Raw text/reference | prompt (full input prompt), raw_output, reference_output | prompt, raw_output, reference_output |

Quick gap summary

  • Missing in capability mode: TTFT distribution stats and classic-only aggregate throughput percentiles.
  • Missing in classic mode: meaningful per-test quality metrics (ROUGE/F1/Exact/Accuracy) because classic benchmarks do not execute capability test cases.

Variant selection in REST mode

  • Capability mode now forwards the exact requested model identifier, including any @quantization suffix, to the LM Studio REST API.
  • This keeps load, chat, and unload aligned with the selected variant and avoids silently falling back to a server-side default quantization (see the sketch below).
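For instance, with the REST client this means the variant key is passed through verbatim (hypothetical model id):

# Assumes a client as constructed earlier in this guide.
client = LMStudioRESTClient()
client.load_model("qwen/qwen3-4b@q4_k_m")  # '@' suffix pins the quantization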

3. Web Dashboard

Modern web UI with live streaming and configuration:

./run.py --webapp               # Starts on http://localhost:8080
./run.py -w                     # Short form

Agent Options

./run.py --agent MODEL_PATH [OPTIONS]

OPTIONS:
  --capabilities CAPS        Comma-separated capabilities
                            (general_text, reasoning, vision, tooling)
  --output-dir DIR          Output directory (default: output)
  --config FILE             YAML configuration file
  --formats FORMATS         Output formats: json,html (default: json,html)
  --max-tests N             Max tests per capability
  --context-length N        Model context length (default: 2048)
  --gpu-offload RATIO       GPU offload ratio 0.0-1.0 (default: 1.0)
  --temperature TEMP        Generation temperature (default: 0.1)
  -v, --verbose             Enable verbose logging

Test Data and Prompts

The following test files are available:

tests/
├── data/
│   ├── text/
│   │   ├── qa_samples.json              # QA examples
│   │   ├── reasoning_samples.json       # Reasoning examples
│   │   └── tooling_samples.json         # Tool-calling examples
│   └── images/
│       └── README.md                    # Vision datasets
└── prompts/
    ├── general_text_qa.md
    ├── general_text_summarization.md
    ├── reasoning_logical.md
    ├── reasoning_math.md
    ├── tooling_function_call.md
    ├── vision_caption.md
    └── vision_vqa.md

Example Executions

# All capabilities (auto-detected)
./run.py --agent "my-model" --output-dir results/

# Only General Text and Reasoning
./run.py --agent "my-model" --capabilities general_text,reasoning

# With custom config
./run.py --agent "my-model" --config config/bench.yaml

# Verbose with all details
./run.py --agent "my-model" --verbose --max-tests 20

# Classic benchmark still available
./run.py --limit 10 --runs 3

Code Structure

cli/
├── main.py                  # CLI entrypoint for agent
├── __main__.py              # Makes cli package executable
├── benchmark.py             # Classic benchmark runner
├── metrics.py               # Metric implementations
├── reporting.py             # JSON & HTML report generation
└── report_template.html.template

config/
└── bench.yaml               # Default configuration

agents/
├── benchmark.py             # Benchmark executor
├── runner.py                # Test orchestration
└── capabilities.py          # Capability detection

core/
├── config.py                # Configuration loading
├── paths.py                 # XDG/user path handling
├── client.py                # LM Studio REST API client
└── tray.py                  # Linux tray controller

Documentation

Logging

Capability benchmark logs use automatic level icons in addition to benchmark-specific emoji markers:

  • 🐛 Debug
  • ℹ️ Info
  • ⚠️ Warning
  • ❌ Error
  • 🔥 Critical

Capability-Driven Benchmark Agent for LM Studio Bench

This benchmark agent implements capability-driven evaluation for language models and multimodal models. It detects model capabilities, runs targeted tests, computes quality metrics, and generates comprehensive reports.

Features

  • Automatic capability detection (general text, reasoning, vision, tooling)
  • Per-capability test suites with standardized prompts
  • Quality metrics: ROUGE, F1, Exact Match, Accuracy, Function Call Accuracy
  • Performance metrics: tokens/sec, latency
  • Machine-readable JSON and human-friendly HTML reports
  • CLI interface with extensive configuration options
  • Docker support for containerized execution
  • GitHub Actions integration for CI/CD benchmarking

Quick Start

Local Execution

Run a benchmark on a model:

python -m cli.main "path/to/model" --output-dir output

Run across installed models:

python -m cli.main --all-models --output-dir output
python -m cli.main --random-models 5 --output-dir output

With specific capabilities:

python -m cli.main "model-id" \
  --capabilities general_text,reasoning \
  --output-dir results

Using Docker

Build the Docker image:

docker build -f scripts/Dockerfile.bench -t lm-bench-agent .

Run benchmark in container:

docker run -v $(pwd)/output:/app/output \
  lm-bench-agent "model-path" \
  --output-dir /app/output

Capabilities

The agent supports four primary capabilities:

1. General Text

Tests basic language understanding and generation:

  • Question answering
  • Summarization
  • Classification

Metrics: ROUGE-1, ROUGE-L, F1

2. Reasoning

Tests logical and mathematical reasoning:

  • Logical reasoning (syllogisms)
  • Math problem solving
  • Chain-of-thought reasoning

Metrics: Exact Match, F1, Accuracy

3. Vision

Tests multimodal understanding (requires vision models):

  • Image captioning
  • Visual Question Answering (VQA)
  • OCR and visual reasoning

Metrics: Accuracy, ROUGE-L

4. Tooling

Tests function calling and tool use:

  • Function selection
  • Parameter extraction
  • API interaction patterns

Metrics: Function Call Accuracy, Parameter Accuracy

CLI Reference

Basic Usage

python -m cli.main MODEL_PATH [OPTIONS]

Arguments

  • MODEL_PATH: Path to model or model identifier (required)

Options

Model Configuration

  • --model-name NAME: Override model name (default: derived from path)
  • --all-models: Run the capability benchmark for all installed models
  • --random-models N: Run the capability benchmark for N random installed models
  • --capabilities CAPS: Comma-separated capabilities to test
    • Options: general_text,reasoning,vision,tooling
    • Default: Auto-detect from model metadata

Output Configuration

  • --output-dir DIR: Output directory (default: output)
  • --formats FMTS: Output formats: json,html (default: both)

Test Configuration

  • --max-tests N: Maximum tests per capability (default: 10)
  • --config FILE: Path to YAML configuration file

Model Parameters

  • --context-length N: Model context length (default: 2048)
  • --gpu-offload RATIO: GPU offload ratio 0.0-1.0 (default: 1.0)
  • --temperature T: Generation temperature (default: 0.1)

Other

  • --verbose, -v: Enable verbose logging

Examples

Benchmark with custom configuration:

python -m cli.main "mymodel" \
  --config custom_config.yaml \
  --max-tests 20 \
  --verbose

Test only reasoning capability:

python -m cli.main "reasoning-model" \
  --capabilities reasoning \
  --temperature 0.0 \
  --max-tests 50

Generate only JSON output:

python -m cli.main "model" \
  --formats json \
  --output-dir json_results

Run against random installed models:

python -m cli.main --random-models 3 --capabilities general_text,reasoning

Runtime Behavior

  • When running across multiple installed models, a single model failure is logged and skipped so the benchmark can continue.
  • For embedding models loaded through the LM Studio REST API, the loader automatically retries without offload_kv_cache_to_gpu if LM Studio rejects that option.
  • Log output includes automatic level icons such as ℹ️, ⚠️, and ❌ in addition to benchmark-specific emoji markers.

Configuration File

The agent reads configuration from config/bench.yaml by default. Override with --config flag.

Configuration Schema

context_length: 2048
gpu_offload: 1.0
temperature: 0.1
max_tokens: 256
max_tests_per_capability: 10
use_rest_api: true

data_dir: tests/data
prompts_dir: tests/prompts

timeout_seconds: 300

metric_weights:
  general_text:
    rouge-1: 0.3
    rouge-l: 0.4
    f1: 0.3
  reasoning:
    exact_match: 0.5
    f1: 0.3
    accuracy: 0.2
  vision:
    accuracy: 0.6
    rouge-l: 0.4
  tooling:
    function_call_accuracy: 0.7
    accuracy: 0.3

composite_score_weights:
  quality: 0.6
  performance: 0.2
  efficiency: 0.2

lmstudio:
  host: localhost
  ports:
    - 1234
    - 1235
  api_token: null

Key Configuration Options

  • context_length: Maximum context length for model
  • gpu_offload: GPU memory allocation (0.0 = CPU only, 1.0 = full GPU)
  • max_tests_per_capability: Limit tests to prevent long runs
  • metric_weights: Per-capability metric importance (see the worked example below)
  • composite_score_weights: Overall score composition
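As a worked example of how such weights combine (illustrative arithmetic only, not the shipped scoring code):

# general_text weights from the YAML above, applied to sample metric values.
metric_weights = {"rouge-1": 0.3, "rouge-l": 0.4, "f1": 0.3}
metrics = {"rouge-1": 0.82, "rouge-l": 0.78, "f1": 0.85}

quality = sum(w * metrics[name] for name, w in metric_weights.items())
print(f"general_text quality score: {quality:.3f}")  # 0.813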

Output Format

JSON Report

The JSON report follows this schema:

{
  "schema_version": "1.0",
  "generated_at": "2025-01-15T10:30:00",
  "report": {
    "model_name": "model-name",
    "model_path": "path/to/model",
    "capabilities": ["general_text", "reasoning"],
    "timestamp": "2025-01-15T10:30:00",
    "summary": {
      "total_tests": 20,
      "successful_tests": 19,
      "success_rate": 0.95,
      "avg_latency_ms": 245.6,
      "avg_quality_score": 0.823,
      "avg_throughput_tokens_per_sec": 42.3,
      "by_capability": {
        "general_text": {
          "test_count": 10,
          "avg_quality_score": 0.856,
          "success_rate": 1.0
        }
      }
    },
    "results": [
      {
        "test_id": "qa_001",
        "capability": "general_text",
        "latency_ms": 230.5,
        "tokens_generated": 12,
        "throughput": 52.1,
        "quality_score": 0.89,
        "metrics": [
          {
            "name": "rouge-1",
            "value": 0.85,
            "normalized": 0.85
          }
        ],
        "error": null
      }
    ],
    "config": {},
    "raw_outputs_dir": "output/raw"
  }
}

HTML Report

The HTML report provides:

  • Summary statistics with visual indicators
  • Per-test results table with status, latency, and quality scores
  • Capability breakdown with aggregated metrics
  • Color-coded quality scores (green/yellow/red)

Raw Outputs

Individual test outputs are saved in output/raw/:

{
  "test_id": "qa_001",
  "capability": "general_text",
  "prompt": "What is the capital of France?",
  "response": "Paris",
  "latency_ms": 230.5,
  "tokens_generated": 12,
  "throughput": 52.1,
  "timestamp": 1642244400.123,
  "error": null
}

GitHub Actions Integration

The workflow .github/workflows/bench.yml enables CI benchmarking.

Triggering the Workflow

Manual Trigger

  1. Go to Actions tab in GitHub
  2. Select "Capability-Driven Benchmark"
  3. Click "Run workflow"
  4. Enter model path and capabilities
  5. Click "Run workflow"

Scheduled Trigger

Runs automatically every Sunday at midnight (UTC).

Push Trigger

Runs on push to main or dev branches.

Note: the benchmark step currently reads the model path only from manual workflow_dispatch inputs. Push- and schedule-triggered runs therefore skip the actual benchmark unless you adapt the workflow to read the model path from another configuration source (for example, a repository variable or secret).

Workflow Outputs

The workflow uploads three artifacts:

  1. benchmark-results-json: JSON reports (30-day retention)
  2. benchmark-results-html: HTML reports (30-day retention)
  3. benchmark-raw-outputs: Raw test outputs (7-day retention)

For pull requests, a summary comment is posted with key metrics.

Adding Test Data

General Text Tests

Add test cases to tests/data/text/qa_samples.json:

{
  "id": "qa_004",
  "prompt": "Your question here",
  "reference": "Expected answer",
  "category": "domain"
}

Reasoning Tests

Add to tests/data/text/reasoning_samples.json:

{
  "id": "reasoning_004",
  "prompt": "Problem statement",
  "reference": "Answer",
  "reasoning": "Explanation of solution",
  "category": "math"
}

Vision Tests

Place images in tests/data/images/ and reference them in test cases.

Tooling Tests

Add to tests/data/text/tooling_samples.json:

{
  "id": "tool_004",
  "task": "Task description",
  "expected_function": "function_name",
  "expected_parameters": {"param": "value"},
  "category": "function_calling"
}
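A minimal scorer for samples in this format could compare a predicted call against expected_function and expected_parameters (hypothetical helper, not the shipped cli/metrics.py implementation):

def score_tool_call(predicted, sample):
    # predicted: {"name": ..., "parameters": {...}} parsed from the model output.
    function_ok = predicted.get("name") == sample["expected_function"]
    expected = sample["expected_parameters"]
    got = predicted.get("parameters", {})
    matched = sum(1 for k, v in expected.items() if got.get(k) == v)
    return {
        "function_call_accuracy": 1.0 if function_ok else 0.0,
        "parameter_accuracy": matched / len(expected) if expected else 1.0,
    }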

Customizing Prompts

Prompt templates are in tests/prompts/:

  • general_text_qa.md: Question answering
  • general_text_summarization.md: Summarization
  • reasoning_logical.md: Logical reasoning
  • reasoning_math.md: Math problems
  • vision_caption.md: Image captioning
  • vision_vqa.md: Visual QA
  • tooling_function_call.md: Function calling

Edit templates to adjust instruction format or add few-shot examples.

Troubleshooting

Model Loading Fails

Ensure LM Studio is running and the model is available:

lms status
lms ls

No Tests Execute

Check that test data files exist:

ls tests/data/text/

Verify capabilities are correctly specified:

python -m cli.main "model" --capabilities general_text --verbose

Metrics Are Zero

This usually means:

  • Model output format doesn't match expected format
  • Reference answers need normalization
  • Wrong capability assigned to test

Check raw outputs in output/raw/ to inspect actual responses.

Timeout Errors

Increase timeout in config:

timeout_seconds: 600

Or reduce test count:

python -m cli.main "model" --max-tests 5

API Integration

Using as a Library

from pathlib import Path
from agents.runner import BenchmarkRunner
from cli.reporting import generate_reports

config = {
    "context_length": 2048,
    "max_tests_per_capability": 5,
    "use_rest_api": True
}

runner = BenchmarkRunner(
    config=config,
    output_dir=Path("output")
)

report = runner.run(
    model_path="mymodel",
    model_name="MyModel",
    capabilities=["general_text"]
)

outputs = generate_reports(
    report_data=report,
    output_dir=Path("output"),
    formats=["json", "html"]
)

print(f"JSON: {outputs['json']}")
print(f"HTML: {outputs['html']}")

Custom Model Adapter

Implement ModelAdapter interface:

from agents.benchmark import ModelAdapter, InferenceResult

class CustomAdapter(ModelAdapter):
    def load(self, model_path, **kwargs):
        # Acquire the model (open a connection, load weights, ...)
        pass

    def unload(self):
        # Release whatever load() acquired
        pass

    def infer(self, prompt, image_path=None, **kwargs):
        # Run inference and return latency/token stats
        return InferenceResult(...)

    def is_loaded(self):
        return True

Use with runner:

adapter = CustomAdapter()
report = runner.run(
    model_path="model",
    adapter=adapter
)

Architecture

Components

  • agents/capabilities.py: Capability detection logic
  • agents/benchmark.py: Core benchmark agent and model adapters
  • agents/runner.py: Test orchestration and loading
  • cli/metrics.py: Metric implementations
  • cli/reporting.py: Report generation (JSON, HTML)
  • cli/main.py: Command-line interface
  • config/bench.yaml: Default configuration
  • tests/data/: Test datasets
  • tests/prompts/: Prompt templates

Data Flow

  1. CLI parses arguments and loads configuration
  2. Runner detects capabilities from model metadata or flags
  3. Test loader creates test cases for detected capabilities
  4. Model adapter loads the model
  5. Agent runs each test case:
    • Executes inference
    • Saves raw output
    • Computes metrics
  6. Reporter generates JSON and HTML from results
  7. Outputs are saved to disk

License

This benchmark agent is part of LM-Studio-Bench and follows the same license.

Contributing

Contributions are welcome:

  • Add new capabilities
  • Implement new metrics
  • Expand test datasets
  • Improve prompt templates
  • Enhance reporting formats

Follow the coding standards in .github/instructions/code-standards.instructions.md.

SQLite Metric Parity Map

This table is intentionally compact: one metric per row.

Legend:

  • [x] = metric is stored in both test modes
  • [ ] = metric is missing in at least one mode

Notes:

  • Capability rows normalize quantization to an uppercase label such as Q4_K_M; classic rows keep the classic benchmark format such as q4_k_m.

  • Capability lmstudio_version stores a parsed version or pkg_version (commit:<sha>), not the raw lms version banner output.

  • Capability REST runs forward the exact model variant key, including the @quantization suffix, to LM Studio load/chat/unload requests.

  • Classic rows intentionally leave capability-only fields such as quality_score, raw_output, reference_output, capability, and test_id empty.

  • Historical rows created before recent schema/runtime fixes may still contain NULL values in parity columns. New rows should populate them.

| Metric | benchmark_results (classic) | benchmark_results (compatibility) | Stored in both tests |
|---|---|---|---|
| Row id | id | id | [x] |
| Model name | model_name | model_name | [x] |
| Timestamp | timestamp | timestamp | [x] |
| Model path/source | model_key | model_key | [x] |
| Capability label | capability | capability | [x] |
| Test case id | test_id | test_id | [x] |
| Test case name | test_name | test_name | [x] |
| Quantization | quantization | quantization | [x] |
| Inference params hash | inference_params_hash | inference_params_hash | [x] |
| Tokens per second | avg_tokens_per_sec | avg_tokens_per_sec | [x] |
| Latency | avg_gen_time | avg_gen_time | [x] |
| TTFT | avg_ttft | avg_ttft | [x] |
| Prompt token count | prompt_tokens | prompt_tokens | [x] |
| Completion/generated tokens | completion_tokens | tokens_generated | [x] |
| Primary quality score | quality_score | quality_score | [x] |
| ROUGE | rouge_score | rouge_score | [x] |
| F1 | f1_score | f1_score | [x] |
| Exact match | exact_match_score | exact_match_score | [x] |
| Accuracy | accuracy_score | accuracy_score | [x] |
| Function-call accuracy | function_call_accuracy | function_call_accuracy | [x] |
| Success flag | success | success | [x] |
| Error message | error_message | error_message | [x] |
| Error counter | error_count | error_count | [x] |
| Total tests per capability | - | aggregate COUNT(*) by capability | [ ] |
| Successful tests per capability | - | aggregate SUM(success = 1) | [ ] |
| Failed tests per capability | - | aggregate SUM(success != 1) | [ ] |
| Success rate per capability | - | derived aggregate (successful / total) | [ ] |
| GPU type | gpu_type | gpu_type | [x] |
| GPU offload ratio | gpu_offload | gpu_offload | [x] |
| VRAM (MB) | vram_mb | vram_mb | [x] |
| Temperature stats | temp_celsius_min/max/avg | temp_celsius_min/max/avg | [x] |
| Power stats | power_watts_min/max/avg | power_watts_min/max/avg | [x] |
| VRAM GB stats | vram_gb_min/max/avg | vram_gb_min/max/avg | [x] |
| GTT GB stats | gtt_gb_min/max/avg | gtt_gb_min/max/avg | [x] |
| CPU usage stats | cpu_percent_min/max/avg | cpu_percent_min/max/avg | [x] |
| RAM GB stats | ram_gb_min/max/avg | ram_gb_min/max/avg | [x] |
| Context length | context_length | context_length | [x] |
| Temperature sampling param | temperature | temperature | [x] |
| Top-K sampling param | top_k_sampling | top_k_sampling | [x] |
| Top-P sampling param | top_p_sampling | top_p_sampling | [x] |
| Min-P sampling param | min_p_sampling | min_p_sampling | [x] |
| Repeat penalty | repeat_penalty | repeat_penalty | [x] |
| Max tokens param | max_tokens | max_tokens | [x] |
| GPU layer setting | n_gpu_layers | n_gpu_layers | [x] |
| Batch setting | n_batch | n_batch | [x] |
| Thread setting | n_threads | n_threads | [x] |
| Flash attention setting | flash_attention | flash_attention | [x] |
| RoPE base setting | rope_freq_base | rope_freq_base | [x] |
| RoPE scale setting | rope_freq_scale | rope_freq_scale | [x] |
| mmap setting | use_mmap | use_mmap | [x] |
| mlock setting | use_mlock | use_mlock | [x] |
| KV cache quant setting | kv_cache_quant | kv_cache_quant | [x] |
| LM Studio version | lmstudio_version | lmstudio_version | [x] |
| App version | app_version | app_version | [x] |
| Driver versions | nvidia/rocm/intel_driver_version | nvidia/rocm/intel_driver_version | [x] |
| OS info | os_name, os_version | os_name, os_version | [x] |
| CPU model | cpu_model | cpu_model | [x] |
| Python version | python_version | python_version | [x] |
| Benchmark duration | benchmark_duration_seconds | benchmark_duration_seconds | [x] |
| Raw model output | raw_output | raw_output | [x] |
| Reference output | reference_output | reference_output | [x] |
| Efficiency per GB | tokens_per_sec_per_gb | tokens_per_sec_per_gb | [x] |
| Efficiency per B params | tokens_per_sec_per_billion_params | tokens_per_sec_per_billion_params | [x] |
| Speed delta vs previous | speed_delta_pct | speed_delta_pct | [x] |
| Previous timestamp link | prev_timestamp | prev_timestamp | [x] |
| Prompt hash | prompt_hash | prompt_hash | [x] |
| Full params hash | params_hash | params_hash | [x] |
| Prompt text | prompt | prompt | [x] |

Historical Validation Queries

Use these queries to find older rows that predate parity fixes.

-- Classic rows that still miss parity fields introduced later.
SELECT id, model_name, timestamp,
       quantization, lmstudio_version, app_version, success
FROM benchmark_results
WHERE quantization IS NULL
   OR lmstudio_version IS NULL
   OR app_version IS NULL
   OR success IS NULL
ORDER BY id DESC;

-- Compatibility rows that still miss core parity fields.
SELECT id, model_name, capability, test_id,
       quantization, lmstudio_version, app_version,
       prompt_hash, params_hash
FROM benchmark_results
WHERE source = 'compatibility'
  AND (
       quantization IS NULL
    OR lmstudio_version IS NULL
    OR app_version IS NULL
    OR prompt_hash IS NULL
    OR params_hash IS NULL
  )
ORDER BY id DESC;

-- Compatibility summary directly from benchmark_results.
SELECT model_name,
       capability,
       COUNT(*) AS total_tests,
       SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) AS successful_tests,
       SUM(CASE WHEN success = 1 THEN 0 ELSE 1 END) AS failed_tests,
       AVG(avg_gen_time) AS avg_latency_ms,
       AVG(throughput_tokens_per_sec) AS avg_throughput,
       AVG(quality_score) AS avg_quality_score,
       AVG(rouge_score) AS avg_rouge,
       AVG(f1_score) AS avg_f1,
       AVG(exact_match_score) AS avg_exact_match,
       AVG(accuracy_score) AS avg_accuracy
FROM benchmark_results
WHERE source = 'compatibility'
GROUP BY model_name, capability
ORDER BY MAX(id) DESC;

Architecture Documentation

Comprehensive architecture documentation with Mermaid diagrams showing how the Python modules interact and how CLI arguments and configuration files are processed.


System Architecture Overview

graph TB
    User([User]) --> RunPy[run.py<br/>Entry Point]

    RunPy -->|--webapp/-w flag| WebApp[web/app.py<br/>FastAPI Server]
    RunPy -->|benchmark mode| Benchmark[cli/benchmark.py<br/>Benchmark Engine]
    
    Benchmark --> ConfigLoader[core/config.py<br/>Configuration Manager]
    Benchmark --> PresetManager[core/presets.py<br/>Preset Manager]
    Benchmark --> RestClient[core/client.py<br/>REST API Client]
    
    ConfigLoader -->|reads| ProjectConfig[config/defaults.json<br/>Project Defaults]
    ConfigLoader -->|reads| UserConfig[~/.config/lm-studio-bench/defaults.json<br/>User Overrides]
    ConfigLoader -->|provides| DefaultConfig[(DEFAULT_CONFIG<br/>Merged)]
    
    Benchmark -->|uses| LMStudio[LM Studio Server<br/>localhost:1234/1235]
    RestClient -->|HTTP API v1| LMStudio
    
    Benchmark -->|writes| ResultsDB[(~/.local/share/lm-studio-bench/results/<br/>benchmark_cache.db)]
    Benchmark -->|exports| Reports[JSON/CSV/PDF/HTML<br/>Reports]
    
    WebApp -->|launches| Benchmark
    WebApp -->|reads| ResultsDB
    WebApp -->|serves| Dashboard[Web Dashboard<br/>http://localhost:PORT]
    RunPy -->|starts background process| Tray[core/tray.py<br/>Linux Tray Controller]
    Tray -->|polls /api/status| WebApp
    Tray -->|calls /api/benchmark/*| WebApp
    Tray -->|Quit calls /api/system/shutdown| WebApp
    
    style RunPy fill:#e1f5ff
    style Benchmark fill:#ffe1e1
    style ConfigLoader fill:#e1ffe1
    style RestClient fill:#fff4e1
    style DefaultConfig fill:#f0f0f0
    style LMStudio fill:#e8deff

Key Components:

  • run.py: Wrapper script that decides between web dashboard and CLI benchmark mode
  • benchmark.py: Main benchmark engine with argparse, model discovery, and execution
  • config_loader.py: Loads and merges configuration from JSON file with built-in defaults
  • core/presets.py: Manages readonly/user presets and maps presets to CLI args
  • tools/hardware_monitor.py: Shared GPUMonitor and HardwareMonitor implementation for classic and capability flows
  • rest_client.py: REST API client for LM Studio v1 endpoints (optional mode)
  • web/app.py: FastAPI web dashboard with live streaming and results browser
  • tray.py: Linux AppIndicator tray controller for benchmark controls

Startup Flow

AppImage Entry Point

When the AppImage is executed, the bundled lmstudio-bench shell script runs before run.py and branches on whether real arguments are present:

flowchart TD
    AppImg([LM-Studio-Bench.AppImage args]) --> CheckArgs{Real args<br/>besides --debug/-d?}
    CheckArgs -->|No args| TrayOnly[exec tray.py --url http://localhost:1234<br/>stays in system tray]
    CheckArgs -->|Any other arg| RunPy[delegate to run.py + args]

    style AppImg fill:#d0e8ff
    style TrayOnly fill:#e1ffe1
    style RunPy fill:#ffe1ff

--debug / -d is exempt: ./AppImage --debug still enters tray-only mode with verbose logging.

run.py Flow

flowchart TD
    Start([./run.py args]) --> CheckHelp{--help or -h?}
    CheckHelp -->|Yes| ShowHelp[Show Extended Help<br/>+ benchmark.py --help]
    CheckHelp -->|No| CheckWebFlag{--webapp or -w<br/>in args?}

    CheckWebFlag -->|Yes| RemoveFlag[Remove --webapp/-w<br/>from args]
    RemoveFlag --> ResolvePort[Extract or assign<br/>web port]
    ResolvePort --> StartTrayWeb[start tray.py<br/>with --url dashboard]
    StartTrayWeb --> FindWebApp{web/app.py<br/>exists?}
    FindWebApp -->|Yes| StartWeb[subprocess.call<br/>python web/app.py + args]
    FindWebApp -->|No| ErrorWeb[Error: app.py not found]

    CheckWebFlag -->|No| StartTrayCLI[start tray.py<br/>with localhost:1234]
    StartTrayCLI --> FindBenchmark{cli/benchmark.py<br/>exists?}
    FindBenchmark -->|Yes| StartBenchmark[subprocess.call<br/>python cli/benchmark.py + args]
    FindBenchmark -->|No| ErrorBench[Error: benchmark.py not found]

    ShowHelp --> Exit1([exit 0])
    StartWeb --> Exit2([exit with app.py status])
    StartBenchmark --> Exit3([exit with benchmark.py status])
    ErrorWeb --> Exit4([exit 1])
    ErrorBench --> Exit5([exit 1])

    style Start fill:#e1f5ff
    style StartWeb fill:#ffe1ff
    style StartBenchmark fill:#ffe1e1

Decision Logic (run.py):

  1. Help Mode (--help/-h): Displays extended help combining run.py explanation + benchmark.py CLI options
  2. Web Mode (--webapp/-w): Launches tray + FastAPI dashboard on a free local port
  3. Benchmark Mode (default): Launches tray + benchmark.py with all CLI arguments

AppImage vs. run.py default behavior:

| Invocation | No-argument default |
|---|---|
| ./LM-Studio-Bench.AppImage | Tray-only (stays in panel, no benchmark) |
| ./run.py | Tray + benchmark.py (runs full benchmark) |

Setup Flow (Installation & Configuration)

flowchart TD
    Start([./setup.sh args]) --> ParseArgs{Parse Arguments}
    
    ParseArgs -->|--help| ShowHelp["Show Usage Info<br/>+ Exit 0"]
    ParseArgs -->|--dry-run| DryMode["Set DRY_RUN=1<br/>Set INTERACTIVE=0"]
    ParseArgs -->|--yes| AutoMode["Set INTERACTIVE=0<br/>Auto-answer 'no'"]
    ParseArgs -->|--interactive| InterMode["Set INTERACTIVE=1<br/>Force Interactive"]
    
    DryMode --> LogSetup["Setup Logging<br/>logs/setup_YYYYMMDD_HHMMSS.log"]
    AutoMode --> LogSetup
    InterMode --> LogSetup
    
    LogSetup --> CheckLinux{OS = Linux?}
    CheckLinux -->|No| ErrorOS["❌ Error:<br/>Not Linux"]
    CheckLinux -->|Yes| DetectPKG["✅ Detect Package Manager<br/>apt/dnf/pacman/zypper/apk"]
    
    ErrorOS --> Exit1([Exit 1])
    
    DetectPKG --> CoreDeps["🔧 Check Core Dependencies<br/>Python3, Git, curl, pkg-config"]
    CoreDeps --> SysLibs["📦 Check System Libraries<br/>gobject-introspection, cairo, PyGObject"]
    
    SysLibs --> CheckLMS["🔍 Check LM Studio Stack<br/>lms CLI / llmster-headless"]
    CheckLMS -->|Found| LMSFound["✅ LM Studio/llmster<br/>detected"]
    CheckLMS -->|Not Found| LMSMissing["⚠️ LM Studio missing<br/>Offer download link"]
    
    LMSFound --> GPUDetect["🎮 Detect GPU<br/>lspci → NVIDIA/AMD/Intel"]
    LMSMissing --> GPUDetect
    
    GPUDetect --> GPUTools{GPU Found?}
    GPUTools -->|NVIDIA| NVIDIACheck["Check nvidia-smi<br/>+ Install if needed"]
    GPUTools -->|AMD| AMDCheck["Check rocm-smi<br/>+ AMD Driver Check"]
    GPUTools -->|Intel| IntelCheck["Check intel_gpu_top<br/>+ Install if needed"]
    GPUTools -->|None| NoGPU["⚠️ No GPU detected"]
    
    NVIDIACheck --> CreateVenv["🐍 Create Python venv<br/>python3 -m venv .venv"]
    AMDCheck --> AMDDriver["🔍 Check AMD Drivers<br/>amdgpu, libdrm, ROCm"]
    IntelCheck --> CreateVenv
    NoGPU --> CreateVenv
    AMDDriver --> CreateVenv
    
    CreateVenv -->|venv already exists| RecreateChoice{"Recreate .venv?"}
    CreateVenv -->|New venv| VenvOK["✅ venv created<br/>.venv/"]
    
    RecreateChoice -->|Yes| VenvOK
    RecreateChoice -->|No| UseExisting["Use existing .venv"]
    
    VenvOK --> InstallReqs["📥 Install Requirements<br/>pip install -r requirements.txt"]
    UseExisting --> InstallReqs
    
    InstallReqs --> CheckConflict["Check pip conflicts<br/>pip check"]
    CheckConflict --> Summary["📋 Print Summary<br/>Next steps (activation, run, etc)"]
    
    Summary --> LogExit["📄 Save log file<br/>logs/setup_latest.log → symlink"]
    LogExit --> Exit0([Exit 0])
    
    ShowHelp --> Exit0
    
    style Start fill:#e1f5ff
    style LogSetup fill:#fff4e1
    style DetectPKG fill:#e1ffe1
    style CoreDeps fill:#e1ffe1
    style CreateVenv fill:#ffe1e1
    style InstallReqs fill:#ffe1e1
    style Summary fill:#f0e1ff
    style ErrorOS fill:#ffcccc
    style LMSMissing fill:#fff9e1

Setup Flow Summary:

  1. Parse Arguments: Handle --help, --dry-run, --yes, --interactive flags

  2. Logging Setup: Create timestamped log file in logs/setup_YYYYMMDD_HHMMSS.log

  3. Environment Checks:

    • Verify Linux OS
    • Detect package manager (apt/dnf/pacman/zypper/apk)
    • Check core dependencies (Python 3, Git, curl, pkg-config)
    • Verify system libraries (gobject-introspection, cairo, PyGObject for tray support)
  4. LM Studio Stack:

    • Check for lms CLI or llmster headless binary
    • Offer download link if missing
  5. GPU & Monitoring Tools:

    • Detect GPU type via lspci (NVIDIA, AMD, Intel)
    • Install/check GPU-specific tools (nvidia-smi, rocm-smi, intel_gpu_top)
    • For AMD: Check drivers, ROCm, libdrm, X.Org AMDGPU driver
  6. Python Environment:

    • Create virtual environment (.venv/)
    • Install Python dependencies from requirements.txt
    • Check for pip conflicts
  7. Summary:

    • Print next steps for user:
      • Activate venv: source .venv/bin/activate
      • Run webapp: python run.py --webapp
      • Run CLI: python run.py
    • Log file symlink: logs/setup_latest.log

Modes:

| Mode | Behavior |
|---|---|
| --help | Show usage and exit |
| --dry-run | Preview all actions (no changes) |
| --yes | Non-interactive (auto-answer 'no' to optional prompts) |
| --interactive | Force interactive mode (default if TTY detected) |

Tray Control Flow (Linux)

flowchart TD
    TrayStart([tray.py start]) --> Poll[Poll /api/status<br/>every 3 seconds]
    Poll --> Reachable{API reachable?}

    Reachable -->|No| IconRed[Set icon: red<br/>error/unreachable]
    Reachable -->|Yes| ReadStatus[Read status field]

    ReadStatus -->|idle| IconGray[Set icon: gray]
    ReadStatus -->|running| IconGreen[Set icon: green]
    ReadStatus -->|paused| IconYellow[Set icon: yellow]

    ReadStatus --> BtnLogic[Update Start/Pause/Stop states]
    BtnLogic --> UserAction{User action}

    UserAction -->|Start| StartCall[POST /api/benchmark/start]
    UserAction -->|Pause/Resume| PauseCall[POST /api/benchmark/pause or resume]
    UserAction -->|Stop| StopCall[POST /api/benchmark/stop]
    UserAction -->|Quit| QuitCall[POST /api/system/shutdown]

    QuitCall --> ExitTray[GTK main loop exit]

Tray behavior summary:

  • Dynamic status icons: gray (idle), green (running), yellow (paused), red (API error/unreachable)
  • Smart controls: Start enabled in idle/error, Pause and Stop enabled only in running or paused state
  • Quit path: Tray triggers graceful shutdown endpoint, then exits

Tray Quit Sequence (Linux)

sequenceDiagram
    participant U as User
    participant T as Tray (GTK/AppIndicator)
    participant A as web/app.py (FastAPI)
    participant B as Benchmark Manager
    participant P as Process Signal Handler

    U->>T: Click Quit
    T->>A: POST /api/system/shutdown
    A->>B: stop_benchmark()
    B-->>A: benchmark stopped or no-op
    A-->>T: 200 OK (shutdown accepted)
    A->>P: Start delayed SIGTERM thread
    T->>T: Stop polling + GTK main_quit()
    P->>A: Send SIGTERM to process
    A-->>A: Uvicorn graceful shutdown
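The "delayed SIGTERM thread" step could be implemented roughly like this (a sketch, assuming the FastAPI process signals itself after the response has gone out):

import os
import signal
import threading
import time

def delayed_sigterm(delay=1.0):
    # Give the HTTP response time to reach the tray before terminating.
    def _fire():
        time.sleep(delay)
        os.kill(os.getpid(), signal.SIGTERM)  # Uvicorn shuts down gracefully
    threading.Thread(target=_fire, daemon=True).start()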

Configuration Loading

flowchart TD
    Start([config_loader.py<br/>import]) --> BaseConfig[BASE_DEFAULT_CONFIG<br/>Hard-coded Defaults]

    BaseConfig --> LoadFunc[load_default_config]
    LoadFunc --> ReadProject[Read config/defaults.json<br/>Project Defaults]
    
    ReadProject --> CheckUser{~/.config/lm-studio-bench/<br/>defaults.json exists?}
    
    CheckUser -->|Yes| ReadUser[Read User Config<br/>Parse JSON]
    CheckUser -->|No| UseProject[Use Project Defaults Only]
    
    ReadUser --> DeepMerge[_deep_merge<br/>Base + Project + User Config]
    UseProject --> DeepMerge
    
    DeepMerge --> NormalizePorts[_normalize_ports<br/>Ensure valid LM Studio ports]
    
    NormalizePorts --> FinalConfig[(DEFAULT_CONFIG<br/>Global Singleton)]
    
    FinalConfig --> BenchmarkImport[benchmark.py imports<br/>DEFAULT_CONFIG]
    FinalConfig --> WebAppImport[web/app.py imports<br/>DEFAULT_CONFIG]
    
    style BaseConfig fill:#f0f0f0
    style FinalConfig fill:#e1ffe1
    style DeepMerge fill:#fff4e1

Configuration Layers:

| Layer | Source | Priority |
|---|---|---|
| 1. Hard-coded | BASE_DEFAULT_CONFIG in config_loader.py | Lowest |
| 2. Project Config | config/defaults.json | Low |
| 3. User Config | ~/.config/lm-studio-bench/defaults.json | Medium |
| 4. CLI Arguments | argparse in benchmark.py | Highest |

Merge Strategy:

  • _deep_merge() recursively merges nested dictionaries (see the sketch below)
  • User config values override base config
  • None values in user config are skipped (base value retained)
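A minimal sketch of that merge behavior (a hypothetical stand-in for the real _deep_merge() in config_loader.py):

def deep_merge(base, override):
    # Recursively merge override into base; None values keep the base value.
    merged = dict(base)
    for key, value in override.items():
        if value is None:
            continue
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged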

Configuration Priority

flowchart LR
    CLI[CLI Arguments<br/>--runs 5<br/>--context 4096] -->|Highest Priority| Merge[Configuration<br/>Merge]

    UserCfg[~/.config/.../defaults.json<br/>context_length: 4096] -->|High Priority| Merge
    ProjCfg[config/defaults.json<br/>num_runs: 3<br/>context_length: 2048] -->|Medium Priority| Merge
    
    Base[BASE_DEFAULT_CONFIG<br/>prompt: default<br/>temperature: 0.1] -->|Lowest Priority| Merge
    
    Merge --> Final[Final Configuration<br/>runs=5<br/>context=4096<br/>temperature=0.1]
    
    style CLI fill:#ffe1e1
    style UserCfg fill:#fff4e1
    style ProjCfg fill:#fff4e1
    style Base fill:#f0f0f0
    style Final fill:#e1ffe1

Example Priority Resolution:

# BASE_DEFAULT_CONFIG
{
  "num_runs": 3,
  "context_length": 2048,
  "prompt": "Is the sky blue?"
}

# config/defaults.json
{
  "num_runs": 5,
  "prompt": "Explain machine learning"
}

# CLI: ./run.py --runs 1 --context 4096

# FINAL RESULT:
{
  "num_runs": 1,           # ← CLI override
  "context_length": 4096,  # ← CLI override
  "prompt": "Explain..."   # ← JSON override (no CLI arg)
}

Benchmark Execution Flow

flowchart TD
    Start([benchmark.py main]) --> ParseArgs[Parse CLI Arguments<br/>argparse.ArgumentParser]

    ParseArgs --> LoadConfig[Load DEFAULT_CONFIG<br/>from config_loader]
    
    LoadConfig --> CheckFlags{Special Flags?}
    
    CheckFlags -->|--list-cache| ListCache[Display Cache Entries<br/>exit]
    CheckFlags -->|--export-cache| ExportCache[Export Cache to JSON<br/>exit]
    CheckFlags -->|--export-only| ExportOnly[Generate Reports Only<br/>skip benchmark]
    CheckFlags -->|Normal Mode| CreateBenchmark[Create LMStudioBenchmark<br/>instance]
    
    CreateBenchmark --> MergeConfig[Merge Config Layers:<br/>CLI > JSON > Base]
    
    MergeConfig --> InitComponents[Initialize Components:<br/>• GPUMonitor<br/>• BenchmarkCache<br/>• HardwareMonitor<br/>• REST Client optional]
    
    InitComponents --> CheckServer{LM Studio<br/>Server Running?}
    
    CheckServer -->|No| StartServer[Auto-start Server<br/>lms server start]
    CheckServer -->|Yes| DiscoverModels[Discover Models<br/>lms ls --json]
    StartServer --> DiscoverModels
    
    DiscoverModels --> FilterModels[Apply Filters:<br/>--quants, --arch<br/>--only-vision, etc.]
    
    FilterModels --> CheckCache{use_cache<br/>enabled?}
    
    CheckCache -->|Yes| LoadCache[Load Cached Results<br/>SQLite lookup]
    CheckCache -->|No| SkipCache[Skip Cache]
    
    LoadCache --> RunBenchmarks[Run Benchmarks<br/>for Each Model]
    SkipCache --> RunBenchmarks
    
    RunBenchmarks --> TestModel[Test Model:<br/>1. Load Model<br/>2. Warmup Run<br/>3. N Measurement Runs<br/>4. Collect Stats]
    
    TestModel --> Profiling{Profiling<br/>enabled?}
    
    Profiling -->|Yes| MonitorHW[Monitor GPU/CPU/RAM<br/>Background Thread]
    Profiling -->|No| SkipMonitor[Skip Monitoring]
    
    MonitorHW --> SaveCache[Save Results to Cache<br/>SQLite INSERT]
    SkipMonitor --> SaveCache
    
    SaveCache --> NextModel{More Models?}
    
    NextModel -->|Yes| RunBenchmarks
    NextModel -->|No| Export[Export Reports:<br/>JSON, CSV, PDF, HTML]
    
    Export --> End([Done])
    
    ListCache --> End
    ExportCache --> End
    ExportOnly --> Export
    
    style Start fill:#e1f5ff
    style CreateBenchmark fill:#ffe1e1
    style RunBenchmarks fill:#ffe1ff
    style Export fill:#e1ffe1

Key Execution Steps:

  1. Argument Parsing: 49 CLI arguments processed by argparse
  2. Configuration Merge: CLI args override JSON file, JSON overrides base
  3. Component Initialization: GPU monitor, cache, profiler, REST client
  4. Model Discovery: lms ls --json fetches all installed models
  5. Filtering: Regex, quantization, architecture, capabilities filters
  6. Cache Lookup: Skip already-tested models (unless --retest)
  7. Benchmark Loop: For each model: load → warmup → measure (N runs) → unload
  8. Hardware Monitoring: Optional background thread for GPU/CPU/RAM stats
  9. Cache Storage: Save results to SQLite for future runs
  10. Report Generation: Export to JSON/CSV/PDF/HTML

REST API vs SDK Mode

flowchart TD
    Start([Benchmark Init]) --> CheckMode{use_rest_api?<br/>CLI or config}

    CheckMode -->|True| InitREST[Initialize REST Client<br/>LMStudioRESTClient]
    CheckMode -->|False| InitSDK[Use Python SDK<br/>lmstudio package]
    
    InitREST --> RESTURL[base_url from config:<br/>http://localhost:1234]
    RESTURL --> RESTToken{api_token<br/>set?}
    
    RESTToken -->|Yes| RESTAuth[Add Bearer Token<br/>to headers]
    RESTToken -->|No| RESTNoAuth[No Authentication]
    
    RESTAuth --> RESTReady[REST Client Ready]
    RESTNoAuth --> RESTReady
    
    RESTReady --> RESTFeatures[REST API Features:<br/>• Download Progress<br/>• MCP Integration<br/>• Stateful Chat<br/>• Response Caching<br/>• Parallel Inference<br/>• Unified KV Cache]
    
    InitSDK --> SDKReady[SDK Ready]
    SDKReady --> SDKFeatures[SDK Features:<br/>• Simple Python API<br/>• Model Loading<br/>• Inference<br/>• Basic Stats]
    
    RESTFeatures --> Benchmark[Run Benchmarks]
    SDKFeatures --> Benchmark
    
    Benchmark --> RESTCall{Mode?}
    
    RESTCall -->|REST| CallREST[HTTP POST /v1/chat/completions<br/>+ parse response stats]
    RESTCall -->|SDK| CallSDK[client.llm.predict<br/>+ parse Model response]
    
    CallREST --> Results[Collect Results:<br/>TTFT, tokens/s, VRAM]
    CallSDK --> Results
    
    style InitREST fill:#e1f5ff
    style InitSDK fill:#ffe1e1
    style RESTFeatures fill:#e1ffe1
    style SDKFeatures fill:#fff4e1

Mode Comparison:

| Feature | REST API Mode | SDK/CLI Mode |
|---|---|---|
| Configuration | use_rest_api: true in config or --use-rest-api | Default mode |
| Endpoint | HTTP /v1/chat/completions | Python SDK client.llm.predict() |
| Stats | Detailed (TTFT, prompt/completion tokens, tok/s) | Basic (tokens/s only) |
| Authentication | Optional Bearer token | Not needed |
| Parallel Inference | --n-parallel (continuous batching) | ❌ Sequential only |
| Stateful Chats | ✅ response_id tracking | ❌ Stateless |
| MCP Integration | mcp_integrations parameter | ❌ Not available |
| Response Caching | ✅ MD5 hash caching (10,000x speedup) | ❌ No caching |
| Download Progress | ✅ Real-time model loading status | ❌ No progress |

Configuration Example:

{
  "lmstudio": {
    "host": "localhost",
    "ports": [1234, 1235],
    "use_rest_api": true,
    "api_token": "lms_your_token_here"
  }
}

Component Details

1. run.py (Entry Point)

Responsibilities:

  • Parse --webapp/-w flag
  • Route to web dashboard or benchmark
  • Show extended help (--help)

Key Functions:

  • Flag detection: "--webapp" in sys.argv or "-w" in sys.argv
  • Subprocess launching: subprocess.call([sys.executable, script] + args)
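Put together, the dispatch boils down to something like this sketch (not the actual run.py source):

import subprocess
import sys

args = sys.argv[1:]
if "--webapp" in args or "-w" in args:
    # Web mode: strip the flag and hand everything else to the dashboard.
    args = [a for a in args if a not in ("--webapp", "-w")]
    script = "web/app.py"
else:
    # Default: forward all arguments to the classic benchmark engine.
    script = "cli/benchmark.py"

sys.exit(subprocess.call([sys.executable, script] + args))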

2. config_loader.py (Configuration Manager)

Responsibilities:

  • Load config/defaults.json (project) + ~/.config/lm-studio-bench/defaults.json (user overrides)
  • Merge with BASE_DEFAULT_CONFIG
  • Provide DEFAULT_CONFIG singleton

Key Functions:

  • load_default_config(): Loads and merges config
  • _deep_merge(): Recursive dict merge
  • _normalize_ports(): Validates LM Studio ports

Configuration Fields:

| Section | Fields |
|---|---|
| Root | prompt, context_length, num_runs |
| lmstudio | host, ports, api_token, use_rest_api |
| inference | temperature, top_k_sampling, top_p_sampling, min_p_sampling, repeat_penalty, max_tokens |
| load | n_gpu_layers, n_batch, n_threads, flash_attention, rope_freq_base, rope_freq_scale, use_mmap, use_mlock, kv_cache_quant |

3. benchmark.py (Main Engine)

Responsibilities:

  • Parse 49 CLI arguments
  • Manage benchmark lifecycle
  • Model discovery and filtering
  • Cache management (SQLite)
  • Runtime-safe cache schema migration for optional columns
  • Hardware monitoring
  • Report generation

Key Classes:

  • LMStudioBenchmark: Main orchestrator
  • BenchmarkCache: SQLite caching
  • tools/hardware_monitor.py: Shared GPU detection and live profiling (GPUMonitor, HardwareMonitor)
  • ModelDiscovery: Model listing and metadata

Reliability Behaviors (2026-03):

  • Runtime cache migration: Missing optional SQLite columns are added automatically at startup and, if needed, once again during insert error recovery.
  • Inference retry guard: If LM Studio returns a server error containing Model unloaded, the benchmark reloads the model and retries inference once (sketched below).
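A sketch of that retry guard (the exception type and method names here are assumptions, not the shipped code):

def infer_with_reload(client, model_key, payload):
    try:
        return client.chat(model=model_key, **payload)
    except RuntimeError as exc:  # assuming server errors surface as exceptions
        if "Model unloaded" not in str(exc):
            raise
        client.load_model(model_key)  # reload the evicted model
        return client.chat(model=model_key, **payload)  # retry exactly once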

CLI Arguments (49 total):

| Category | Arguments |
|---|---|
| Basic | --runs, --context, --prompt, --limit, --dev-mode |
| Presets | --list-presets, --preset |
| Filter | --only-vision, --only-tools, --quants, --arch, --params, --min-context, --max-size, --include-models, --exclude-models |
| Cache | --retest, --list-cache, --export-cache, --export-only |
| Profiling | --enable-profiling, --max-temp, --max-power, --disable-gtt |
| Inference | --temperature, --top-k, --top-p, --min-p, --repeat-penalty, --max-tokens |
| Load Config | --n-gpu-layers, --n-batch, --n-threads, --flash-attention, --rope-freq-base, --rope-freq-scale, --use-mmap, --use-mlock, --kv-cache-quant |
| REST API | --use-rest-api, --api-token, --n-parallel, --unified-kv-cache |
| Comparison | --compare-with, --rank-by |

4. rest_client.py (REST API Client)

Responsibilities:

  • HTTP communication with LM Studio v1 API
  • Model loading and unloading
  • Chat completions with stats
  • Download progress tracking
  • MCP integration
  • Stateful chat history
  • Response caching

Key Classes:

  • LMStudioRESTClient: Main REST client
  • ModelInfo: Model metadata
  • ChatStats: Response statistics (TTFT, tokens/s, etc.)
  • ModelCapabilities: Vision, tools detection

New Features (✨ 2026-02-23):

  1. Download Progress Tracking

    • wait_for_completion() with progress callbacks
    • Real-time model loading status
  2. MCP Integration

    • mcp_integrations parameter in chat requests
    • Model Context Protocol support
  3. Stateful Chat History

    • use_stateful=True for conversation continuity
    • last_response_id tracking
  4. Response Caching

    • MD5 hash-based caching
    • 10,000x+ speedup for repeated prompts
    • enable_cache parameter

Example Usage:

client = LMStudioRESTClient(
    base_url="http://localhost:1234",
    api_token="lms_token"
)

# Load model with progress tracking
def on_progress(percent, status):
    print(f"Loading: {percent:.1f}% - {status}")

client.load_model("model@q4", wait_for_completion=True, progress_callback=on_progress)

# Chat with caching
response = client.chat(
    model="model@q4",
    messages=[{"role": "user", "content": "Hello"}],
    enable_cache=True,  # 10,000x speedup for repeated prompts
    use_stateful=True   # Conversation continuity
)

5. tray.py (Linux Tray Controller)

Responsibilities:

  • Provide Linux AppIndicator tray UI with benchmark controls
  • Poll benchmark status and update icon/button state
  • Trigger benchmark actions via web API
  • Trigger graceful full shutdown via /api/system/shutdown

Key Behaviors:

  • 3-second polling loop via GLib timeout
  • Icon states: gray (idle), green (running), yellow (paused), red (error)
  • Control state logic:
    • Start enabled in idle and recovery/error state
    • Pause/Stop enabled only while benchmark is active

6. web/app.py + dashboard.html.jinja (Dashboard Analytics)

Responsibilities:

  • Aggregate benchmark history for fast visual summaries
  • Serve chart-ready payloads via /api/dashboard/stats
  • Render Home/Results overview charts in the browser with Plotly
  • Support quick navigation from ranking tables to model comparison

Home View (Executive Summary):

  • KPI cards: cached models, avg speed, median (P50), P95, architectures, quantizations
  • Top 10 bar chart (speed ranking)
  • Quantization donut chart (distribution)

Results View (Exploration):

  • Scatter: Speed vs VRAM
  • Heatmap: Model x Quantization -> avg tokens/s
  • Shared data source with table (/api/results), so table and charts stay consistent

Quick Compare Flow:

  • Compare actions in Home and Results tables call openComparisonForModel(modelName)
  • Function opens Comparison view, selects the model, then loads full historical trends via /api/comparison/{model_name}

Dashboard Summary Fields (/api/dashboard/stats):

  • speed_summary (min, p50, avg, p95, max)
  • top_models_extended (Top 10 models)
  • quantization_distribution
  • architecture_distribution
  • efficiency_top
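For example, the endpoint can be queried directly (assuming the dashboard runs on the default local port):

import requests

stats = requests.get("http://localhost:8080/api/dashboard/stats", timeout=10).json()
print(stats["speed_summary"])             # min / p50 / avg / p95 / max
print(stats["quantization_distribution"])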

Data Flow Summary

graph LR
    User([User]) -->|./run.py --runs 5| CLI[CLI Arguments]

    ProjJSON[config/defaults.json] --> Config[Configuration<br/>Merge]
    UserJSON[~/.config/.../defaults.json] --> Config
    CLI --> Config
    Base[BASE_DEFAULT_CONFIG] --> Config
    
    Config --> Benchmark[Benchmark<br/>Execution]
    
    Benchmark -->|lms ls| Models[Model<br/>Discovery]
    Models --> Filter[Model<br/>Filtering]
    
    Filter --> Cache{Cache<br/>Hit?}
    Cache -->|Yes| Skip[Skip Test]
    Cache -->|No| Test[Run Test]
    
    Test --> LMStudio[LM Studio<br/>Server]
    LMStudio --> Results[Collect<br/>Results]
    
    Results --> DB[(SQLite<br/>Cache)]
    Results --> Reports[JSON/CSV<br/>PDF/HTML]
    
    Skip --> Reports
    
    style CLI fill:#ffe1e1
    style Config fill:#e1ffe1
    style Cache fill:#fff4e1
    style Reports fill:#e1f5ff

Testing Architecture

LM-Studio-Bench includes a comprehensive test suite with 900+ tests and strong coverage to ensure reliability and maintainability.

Test Organization

graph TB
    Tests[tests/] --> Fixtures[conftest.py<br/>Test Fixtures & Utilities]

    Tests --> BenchmarkTests[test_benchmark.py<br/>55+ tests]
    Tests --> HardwareTests[test_hardware_monitor.py<br/>57+ tests]
    Tests --> AppTests[test_app.py<br/>23+ tests]
    Tests --> APITests[test_api_endpoints.py<br/>32+ tests]
    Tests --> RestTests[test_rest_client.py<br/>22+ tests]
    Tests --> TrayTests[test_tray.py<br/>26+ tests]
    Tests --> PresetTests[test_preset_manager.py<br/>19+ tests]
    Tests --> ConfigTests[test_config_loader.py<br/>9+ tests]
    Tests --> PathTests[test_user_paths.py<br/>4+ tests]
    Tests --> VersionTests[test_version_checker.py<br/>7+ tests]
    Tests --> MetadataTests[test_scrape_metadata.py<br/>24+ tests]
    Tests --> RunTests[test_run.py<br/>10+ tests]

    BenchmarkTests --> Benchmark[cli/benchmark.py]
    HardwareTests --> HardwareMon[tools/hardware_monitor.py]
    AppTests --> WebApp[web/app.py]
    APITests --> WebApp
    RestTests --> RestClient[core/client.py]
    TrayTests --> Tray[core/tray.py]
    PresetTests --> PresetMgr[core/presets.py]
    ConfigTests --> ConfigLoader[core/config.py]
    PathTests --> UserPaths[core/paths.py]
    VersionTests --> VersionChecker[core/version.py]
    MetadataTests --> Metadata[tools/scrape_metadata.py]
    RunTests --> RunPy[run.py]

    style Tests fill:#e1f5ff
    style Fixtures fill:#fff4e1
    style BenchmarkTests fill:#ffe1e1
    style AppTests fill:#e1ffe1

Test Coverage by Component

| Component | Test Module | Test Count | Coverage |
|---|---|---|---|
| Benchmark Engine | test_benchmark.py | 55+ | High |
| Web Dashboard | test_app.py | 23+ | Medium |
| API Endpoints | test_api_endpoints.py | 32+ | High |
| REST Client | test_rest_client.py | 22+ | High |
| Linux Tray | test_tray.py | 26+ | Medium |
| Preset Manager | test_preset_manager.py | 19+ | High |
| Config Loader | test_config_loader.py | 9+ | High |
| User Paths | test_user_paths.py | 4+ | High |
| Version Checker | test_version_checker.py | 7+ | High |
| Metadata Scraping | test_scrape_metadata.py | 24+ | Medium |
| Entry Point | test_run.py | 10+ | Medium |

Testing Approach

Unit Testing:

  • Mock external dependencies (LM Studio API, system commands, file I/O)
  • Isolated test cases that can run in any order
  • Fast execution (no real API calls or file system operations)
  • Use pytest fixtures for common setup and teardown

Test Fixtures (conftest.py):

  • Mock LM Studio client and server responses
  • Temporary directories for file operations
  • Mock system commands (nvidia-smi, rocm-smi, etc.)
  • Sample configuration and model data

Continuous Integration:

  • GitHub Actions runs full test suite on every PR
  • Code quality checks (flake8, pylint)
  • Security scans (Bandit, CodeQL, Snyk)
  • Test results reported in PR status checks

Running Tests:

# Run all tests
pytest

# Run with verbose output
pytest -v

# Run specific module
pytest tests/test_benchmark.py

# Run with coverage report
pytest --cov=core --cov=cli --cov=agents --cov=web --cov=tools --cov=run --cov-report=html

# Run tests matching a pattern
pytest -k "test_gpu_detection"

See Also