Capability-Driven Benchmark Agent for LM Studio Bench
This benchmark agent implements capability-driven evaluation for language models and multimodal models. It detects model capabilities, runs targeted tests, computes quality metrics, and generates comprehensive reports.
Features
- Automatic capability detection (general text, reasoning, vision, tooling)
- Per-capability test suites with standardized prompts
- Quality metrics: ROUGE, F1, Exact Match, Accuracy, Function Call Accuracy
- Performance metrics: tokens/sec, latency
- Machine-readable JSON and human-friendly HTML reports
- CLI interface with extensive configuration options
- Docker support for containerized execution
- GitHub Actions integration for CI/CD benchmarking
Quick Start
Local Execution
Run a benchmark on a model:
python -m cli.main "path/to/model" --output-dir output
Run across installed models:
python -m cli.main --all-models --output-dir output
python -m cli.main --random-models 5 --output-dir output
With specific capabilities:
python -m cli.main "model-id" \
--capabilities general_text,reasoning \
--output-dir results
Using Docker
Build the Docker image:
docker build -f scripts/Dockerfile.bench -t lm-bench-agent .
Run benchmark in container:
docker run -v $(pwd)/output:/app/output \
lm-bench-agent "model-path" \
--output-dir /app/output
Capabilities
The agent supports four primary capabilities:
1. General Text
Tests basic language understanding and generation:
- Question answering
- Summarization
- Classification
Metrics: ROUGE-1, ROUGE-L, F1
2. Reasoning
Tests logical and mathematical reasoning:
- Logical reasoning (syllogisms)
- Math problem solving
- Chain-of-thought reasoning
Metrics: Exact Match, F1, Accuracy
3. Vision
Tests multimodal understanding (requires vision models):
- Image captioning
- Visual Question Answering (VQA)
- OCR and visual reasoning
Metrics: Accuracy, ROUGE-L
4. Tooling
Tests function calling and tool use:
- Function selection
- Parameter extraction
- API interaction patterns
Metrics: Function Call Accuracy, Parameter Accuracy
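The concrete metric definitions live in cli/metrics.py. As an illustration only, Exact Match and token-level F1 are commonly computed along the following lines; this is a sketch of the standard formulas, not necessarily the agent's actual implementation:

import re
from collections import Counter

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace before comparing
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> float:
    # 1.0 if the normalized strings are identical, else 0.0
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    # Harmonic mean of token precision and recall against the reference
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)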
CLI Reference
Basic Usage
python -m cli.main MODEL_PATH [OPTIONS]
Arguments
MODEL_PATH: Path to model or model identifier (required)
Options
Model Configuration
- --model-name NAME: Override model name (default: derived from path)
- --all-models: Run the capability benchmark for all installed models
- --random-models N: Run the capability benchmark for N random installed models
- --capabilities CAPS: Comma-separated capabilities to test
  - Options: general_text, reasoning, vision, tooling
  - Default: Auto-detect from model metadata
Output Configuration
- --output-dir DIR: Output directory (default: output)
- --formats FMTS: Output formats: json, html (default: both)
Test Configuration
- --max-tests N: Maximum tests per capability (default: 10)
- --config FILE: Path to YAML configuration file
Model Parameters
- --context-length N: Model context length (default: 2048)
- --gpu-offload RATIO: GPU offload ratio 0.0-1.0 (default: 1.0)
- --temperature T: Generation temperature (default: 0.1)
Other
- --verbose, -v: Enable verbose logging
Examples
Benchmark with custom configuration:
python -m cli.main "mymodel" \
--config custom_config.yaml \
--max-tests 20 \
--verbose
Test only reasoning capability:
python -m cli.main "reasoning-model" \
--capabilities reasoning \
--temperature 0.0 \
--max-tests 50
Generate only JSON output:
python -m cli.main "model" \
--formats json \
--output-dir json_results
Run against random installed models:
python -m cli.main --random-models 3 --capabilities general_text,reasoning
Runtime Behavior
- When running across multiple installed models, a single model failure is logged and skipped so the benchmark can continue (see the sketch after this list).
- For embedding models loaded through the LM Studio REST API, the loader automatically retries without offload_kv_cache_to_gpu if LM Studio rejects that option.
- Log output includes automatic level icons such as ℹ️, ⚠️, and ❌ in addition to benchmark-specific emoji markers.
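The skip-on-failure behavior can be reproduced when driving the runner yourself. The sketch below assumes you already have a list of installed model identifiers (discovery is not shown) and that BenchmarkRunner.run() can default model_name and capabilities when they are omitted:

import logging
from pathlib import Path
from agents.runner import BenchmarkRunner

logger = logging.getLogger(__name__)

# Hypothetical list of installed models; obtain it from your own discovery step.
installed_models = ["model-a", "model-b", "model-c"]

runner = BenchmarkRunner(config={"max_tests_per_capability": 5}, output_dir=Path("output"))

for model_path in installed_models:
    try:
        runner.run(model_path=model_path)
    except Exception as exc:
        # A single model failure is logged and the sweep continues.
        logger.error("Benchmark failed for %s: %s", model_path, exc)
        continue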
Configuration File
The agent reads configuration from config/bench.yaml by default. Override it with the --config flag.
Configuration Schema
context_length: 2048
gpu_offload: 1.0
temperature: 0.1
max_tokens: 256
max_tests_per_capability: 10
use_rest_api: true
data_dir: tests/data
prompts_dir: tests/prompts
timeout_seconds: 300
metric_weights:
general_text:
rouge-1: 0.3
rouge-l: 0.4
f1: 0.3
reasoning:
exact_match: 0.5
f1: 0.3
accuracy: 0.2
vision:
accuracy: 0.6
rouge-l: 0.4
tooling:
function_call_accuracy: 0.7
accuracy: 0.3
composite_score_weights:
quality: 0.6
performance: 0.2
efficiency: 0.2
lmstudio:
host: localhost
ports:
- 1234
- 1235
api_token: null
Key Configuration Options
- context_length: Maximum context length for the model
- gpu_offload: GPU memory allocation (0.0 = CPU only, 1.0 = full GPU)
- max_tests_per_capability: Limit tests to prevent long runs
- metric_weights: Per-capability metric importance
- composite_score_weights: Overall score composition
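The per-capability score combines the normalized metrics according to metric_weights. The exact formula is defined in the reporting code; a plain weighted average, sketched below with the default general_text weights, conveys the idea (illustrative only):

def weighted_quality(metrics: dict, weights: dict) -> float:
    # Weighted average of normalized metric values
    total_weight = sum(weights.get(name, 0.0) for name in metrics)
    if total_weight == 0:
        return 0.0
    return sum(value * weights.get(name, 0.0) for name, value in metrics.items()) / total_weight

# Using the default general_text weights from config/bench.yaml:
# 0.3 * 0.85 + 0.4 * 0.80 + 0.3 * 0.90 = 0.845
score = weighted_quality(
    {"rouge-1": 0.85, "rouge-l": 0.80, "f1": 0.90},
    {"rouge-1": 0.3, "rouge-l": 0.4, "f1": 0.3},
)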
Output Format
JSON Report
The JSON report follows this schema:
{
"schema_version": "1.0",
"generated_at": "2025-01-15T10:30:00",
"report": {
"model_name": "model-name",
"model_path": "path/to/model",
"capabilities": ["general_text", "reasoning"],
"timestamp": "2025-01-15T10:30:00",
"summary": {
"total_tests": 20,
"successful_tests": 19,
"success_rate": 0.95,
"avg_latency_ms": 245.6,
"avg_quality_score": 0.823,
"avg_throughput_tokens_per_sec": 42.3,
"by_capability": {
"general_text": {
"test_count": 10,
"avg_quality_score": 0.856,
"success_rate": 1.0
}
}
},
"results": [
{
"test_id": "qa_001",
"capability": "general_text",
"latency_ms": 230.5,
"tokens_generated": 12,
"throughput": 52.1,
"quality_score": 0.89,
"metrics": [
{
"name": "rouge-1",
"value": 0.85,
"normalized": 0.85
}
],
"error": null
}
],
"config": {},
"raw_outputs_dir": "output/raw"
}
}
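Because the schema is versioned (schema_version), downstream tooling can consume the report directly. A minimal sketch, assuming the report was written to output/benchmark_report.json (the actual file name depends on your run):

import json
from pathlib import Path

report_path = Path("output") / "benchmark_report.json"  # assumed file name
summary = json.loads(report_path.read_text())["report"]["summary"]

print(f"Success rate: {summary['success_rate']:.0%}")
print(f"Avg quality:  {summary['avg_quality_score']:.3f}")
for capability, stats in summary["by_capability"].items():
    print(f"{capability}: quality={stats['avg_quality_score']:.3f}, "
          f"success={stats['success_rate']:.0%}")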
HTML Report
The HTML report provides:
- Summary statistics with visual indicators
- Per-test results table with status, latency, and quality scores
- Capability breakdown with aggregated metrics
- Color-coded quality scores (green/yellow/red)
Raw Outputs
Individual test outputs are saved in output/raw/:
{
"test_id": "qa_001",
"capability": "general_text",
"prompt": "What is the capital of France?",
"response": "Paris",
"latency_ms": 230.5,
"tokens_generated": 12,
"throughput": 52.1,
"timestamp": 1642244400.123,
"error": null
}
GitHub Actions Integration
The workflow .github/workflows/bench.yml enables CI benchmarking.
Triggering the Workflow
Manual Trigger
- Go to Actions tab in GitHub
- Select "Capability-Driven Benchmark"
- Click "Run workflow"
- Enter model path and capabilities
- Click "Run workflow"
Scheduled Trigger
Runs automatically every Sunday at midnight (UTC).
Push Trigger
Runs on push to main or dev branches.
Note: the benchmark step currently reads the model path only from
manual workflow_dispatch inputs. Push- and schedule-triggered
runs therefore skip the actual benchmark unless you adapt the
workflow to read the model path from another configuration source
(for example, a repository variable or secret).
Workflow Outputs
The workflow uploads three artifacts:
- benchmark-results-json: JSON reports (30-day retention)
- benchmark-results-html: HTML reports (30-day retention)
- benchmark-raw-outputs: Raw test outputs (7-day retention)
For pull requests, a summary comment is posted with key metrics.
Adding Test Data
General Text Tests
Add test cases to tests/data/text/qa_samples.json:
{
"id": "qa_004",
"prompt": "Your question here",
"reference": "Expected answer",
"category": "domain"
}
Reasoning Tests
Add to tests/data/text/reasoning_samples.json:
{
"id": "reasoning_004",
"prompt": "Problem statement",
"reference": "Answer",
"reasoning": "Explanation of solution",
"category": "math"
}
Vision Tests
Place images in tests/data/images/ and reference them in test cases.
Tooling Tests
Add to tests/data/text/tooling_samples.json:
{
"id": "tool_004",
"task": "Task description",
"expected_function": "function_name",
"expected_parameters": {"param": "value"},
"category": "function_calling"
}
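Before a full run, a quick sanity check that new samples carry the required fields can save time. The sketch below derives the required keys from the examples above and assumes each data file contains a JSON array of sample objects:

import json
from pathlib import Path

REQUIRED_FIELDS = {
    "qa_samples.json": {"id", "prompt", "reference", "category"},
    "reasoning_samples.json": {"id", "prompt", "reference", "reasoning", "category"},
    "tooling_samples.json": {"id", "task", "expected_function", "expected_parameters", "category"},
}

for filename, required in REQUIRED_FIELDS.items():
    for sample in json.loads((Path("tests/data/text") / filename).read_text()):
        missing = required - sample.keys()
        if missing:
            print(f"{filename}: sample {sample.get('id', '?')} missing {sorted(missing)}")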
Customizing Prompts
Prompt templates are in tests/prompts/:
- general_text_qa.md: Question answering
- general_text_summarization.md: Summarization
- reasoning_logical.md: Logical reasoning
- reasoning_math.md: Math problems
- vision_caption.md: Image captioning
- vision_vqa.md: Visual QA
- tooling_function_call.md: Function calling
Edit templates to adjust instruction format or add few-shot examples.
Troubleshooting
Model Loading Fails
Ensure LM Studio is running and the model is available:
lms status
lms models list
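If the lms CLI is unavailable, you can also confirm that the LM Studio server answers on the configured port. The sketch below uses only the standard library and assumes LM Studio's OpenAI-compatible /v1/models endpoint on the default port from config/bench.yaml:

import json
from urllib.request import urlopen

# Host and port should match the lmstudio section of config/bench.yaml.
with urlopen("http://localhost:1234/v1/models", timeout=5) as response:
    models = json.load(response)

for entry in models.get("data", []):
    print(entry.get("id"))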
No Tests Execute
Check that test data files exist:
ls tests/data/text/
Verify capabilities are correctly specified:
python -m cli.main "model" --capabilities general_text --verbose
Metrics Are Zero
This usually means:
- Model output format doesn't match expected format
- Reference answers need normalization
- Wrong capability assigned to test
Check raw outputs in output/raw/ to inspect actual responses.
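A small loop makes that inspection easier; this sketch assumes each raw output is stored as its own .json file in output/raw/:

import json
from pathlib import Path

for raw_file in sorted(Path("output/raw").glob("*.json")):
    record = json.loads(raw_file.read_text())
    status = record.get("error") or "ok"
    print(f"{record['test_id']} [{record['capability']}] {status}")
    print(f"  prompt:   {record['prompt'][:80]}")
    print(f"  response: {str(record['response'])[:80]}")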
Timeout Errors
Increase timeout in config:
timeout_seconds: 600
Or reduce test count:
python -m cli.main "model" --max-tests 5
API Integration
Using as a Library
from pathlib import Path
from agents.runner import BenchmarkRunner
from cli.reporting import generate_reports
config = {
"context_length": 2048,
"max_tests_per_capability": 5,
"use_rest_api": True
}
runner = BenchmarkRunner(
config=config,
output_dir=Path("output")
)
report = runner.run(
model_path="mymodel",
model_name="MyModel",
capabilities=["general_text"]
)
outputs = generate_reports(
report_data=report,
output_dir=Path("output"),
formats=["json", "html"]
)
print(f"JSON: {outputs['json']}")
print(f"HTML: {outputs['html']}")
Custom Model Adapter
Implement ModelAdapter interface:
from agents.benchmark import ModelAdapter, InferenceResult
class CustomAdapter(ModelAdapter):
    def load(self, model_path, **kwargs):
        # Acquire the model (load weights, open a connection, etc.)
        pass

    def unload(self):
        # Release any resources held by load()
        pass

    def infer(self, prompt, image_path=None, **kwargs):
        # Run one generation and wrap the output in an InferenceResult
        return InferenceResult(...)

    def is_loaded(self):
        return True
Use with runner:
adapter = CustomAdapter()
report = runner.run(
model_path="model",
adapter=adapter
)
Architecture
Components
- agents/capabilities.py: Capability detection logic
- agents/benchmark.py: Core benchmark agent and model adapters
- agents/runner.py: Test orchestration and loading
- cli/metrics.py: Metric implementations
- cli/reporting.py: Report generation (JSON, HTML)
- cli/main.py: Command-line interface
- config/bench.yaml: Default configuration
- tests/data/: Test datasets
- tests/prompts/: Prompt templates
Data Flow
- CLI parses arguments and loads configuration
- Runner detects capabilities from model metadata or flags
- Test loader creates test cases for detected capabilities
- Model adapter loads the model
- Agent runs each test case (sketched after this list):
  - Executes inference
  - Saves raw output
  - Computes metrics
- Reporter generates JSON and HTML from results
- Outputs are saved to disk
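Conceptually, the "runs each test case" step reduces to a loop like the one below. This is a simplified sketch: the helper names save_raw_output and compute_metrics are hypothetical, and the real agent in agents/benchmark.py additionally handles timeouts, token accounting, and error capture.

def run_tests(adapter, test_cases, compute_metrics, save_raw_output):
    # Illustrative per-test loop: infer, persist the raw output, score the response.
    results = []
    for test in test_cases:
        inference = adapter.infer(test.prompt, image_path=getattr(test, "image_path", None))
        save_raw_output(test, inference)
        metrics = compute_metrics(test, inference)
        results.append({"test_id": test.id, "metrics": metrics})
    return results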
License
This benchmark agent is part of LM-Studio-Bench and follows the same license.
Contributing
Contributions are welcome:
- Add new capabilities
- Implement new metrics
- Expand test datasets
- Improve prompt templates
- Enhance reporting formats
Follow the coding standards in .github/instructions/code-standards.instructions.md.