SQLite Metric Parity Map

This table is intentionally compact: one metric per row.

Legend:

  • [x] = metric is stored in both test modes
  • [ ] = metric is missing in at least one mode

Notes:

  • Capability rows normalize quantization to an uppercase label such as Q4_K_M; classic rows keep the lower-case benchmark format such as q4_k_m.

  • Capability lmstudio_version stores a parsed version or a pkg_version (commit:<sha>), not the raw banner printed by lms version.

  • Capability REST runs forward the exact model variant key, including the @quantization suffix, to LM Studio load/chat/unload requests.

  • Classic rows intentionally leave capability-only fields such as quality_score, raw_output, reference_output, capability, and test_id empty.

  • Historical rows created before recent schema/runtime fixes may still contain NULL values in parity columns. New rows should populate them.
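
The quantization normalization described in the first note can be sketched as a small helper. This is illustrative only: the function name and the mode strings are assumptions, not the project's actual API.

```python
def normalize_quantization(label: str, mode: str) -> str:
    """Normalize a quantization label per test mode (hypothetical helper).

    Capability rows store an upper-case label such as Q4_K_M;
    classic rows keep the lower-case benchmark form such as q4_k_m.
    """
    return label.upper() if mode == "capability" else label.lower()
```

For example, `normalize_quantization("q4_k_m", "capability")` yields `"Q4_K_M"`, while the classic mode leaves `"q4_k_m"` untouched.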

| Metric | benchmark_results (classic) | benchmark_results (compatibility) | Stored in both tests |
| --- | --- | --- | --- |
| Row id | id | id | [x] |
| Model name | model_name | model_name | [x] |
| Timestamp | timestamp | timestamp | [x] |
| Model path/source | model_key | model_key | [x] |
| Capability label | capability | capability | [x] |
| Test case id | test_id | test_id | [x] |
| Test case name | test_name | test_name | [x] |
| Quantization | quantization | quantization | [x] |
| Inference params hash | inference_params_hash | inference_params_hash | [x] |
| Tokens per second | avg_tokens_per_sec | avg_tokens_per_sec | [x] |
| Latency | avg_gen_time | avg_gen_time | [x] |
| TTFT | avg_ttft | avg_ttft | [x] |
| Prompt token count | prompt_tokens | prompt_tokens | [x] |
| Completion/generated tokens | completion_tokens | tokens_generated | [x] |
| Primary quality score | quality_score | quality_score | [x] |
| ROUGE | rouge_score | rouge_score | [x] |
| F1 | f1_score | f1_score | [x] |
| Exact match | exact_match_score | exact_match_score | [x] |
| Accuracy | accuracy_score | accuracy_score | [x] |
| Function-call accuracy | function_call_accuracy | function_call_accuracy | [x] |
| Success flag | success | success | [x] |
| Error message | error_message | error_message | [x] |
| Error counter | error_count | error_count | [x] |
| Total tests per capability | - | aggregate COUNT(*) by capability | [ ] |
| Successful tests per capability | - | aggregate SUM(success = 1) | [ ] |
| Failed tests per capability | - | aggregate SUM(success != 1) | [ ] |
| Success rate per capability | - | derived aggregate (successful / total) | [ ] |
| GPU type | gpu_type | gpu_type | [x] |
| GPU offload ratio | gpu_offload | gpu_offload | [x] |
| VRAM (MB) | vram_mb | vram_mb | [x] |
| Temperature stats | temp_celsius_min/max/avg | temp_celsius_min/max/avg | [x] |
| Power stats | power_watts_min/max/avg | power_watts_min/max/avg | [x] |
| VRAM GB stats | vram_gb_min/max/avg | vram_gb_min/max/avg | [x] |
| GTT GB stats | gtt_gb_min/max/avg | gtt_gb_min/max/avg | [x] |
| CPU usage stats | cpu_percent_min/max/avg | cpu_percent_min/max/avg | [x] |
| RAM GB stats | ram_gb_min/max/avg | ram_gb_min/max/avg | [x] |
| Context length | context_length | context_length | [x] |
| Temperature sampling param | temperature | temperature | [x] |
| Top-K sampling param | top_k_sampling | top_k_sampling | [x] |
| Top-P sampling param | top_p_sampling | top_p_sampling | [x] |
| Min-P sampling param | min_p_sampling | min_p_sampling | [x] |
| Repeat penalty | repeat_penalty | repeat_penalty | [x] |
| Max tokens param | max_tokens | max_tokens | [x] |
| GPU layer setting | n_gpu_layers | n_gpu_layers | [x] |
| Batch setting | n_batch | n_batch | [x] |
| Thread setting | n_threads | n_threads | [x] |
| Flash attention setting | flash_attention | flash_attention | [x] |
| RoPE base setting | rope_freq_base | rope_freq_base | [x] |
| RoPE scale setting | rope_freq_scale | rope_freq_scale | [x] |
| mmap setting | use_mmap | use_mmap | [x] |
| mlock setting | use_mlock | use_mlock | [x] |
| KV cache quant setting | kv_cache_quant | kv_cache_quant | [x] |
| LM Studio version | lmstudio_version | lmstudio_version | [x] |
| App version | app_version | app_version | [x] |
| Driver versions | nvidia/rocm/intel_driver_version | nvidia/rocm/intel_driver_version | [x] |
| OS info | os_name, os_version | os_name, os_version | [x] |
| CPU model | cpu_model | cpu_model | [x] |
| Python version | python_version | python_version | [x] |
| Benchmark duration | benchmark_duration_seconds | benchmark_duration_seconds | [x] |
| Raw model output | raw_output | raw_output | [x] |
| Reference output | reference_output | reference_output | [x] |
| Efficiency per GB | tokens_per_sec_per_gb | tokens_per_sec_per_gb | [x] |
| Efficiency per B params | tokens_per_sec_per_billion_params | tokens_per_sec_per_billion_params | [x] |
| Speed delta vs previous | speed_delta_pct | speed_delta_pct | [x] |
| Previous timestamp link | prev_timestamp | prev_timestamp | [x] |
| Prompt hash | prompt_hash | prompt_hash | [x] |
| Full params hash | params_hash | params_hash | [x] |
| Prompt text | prompt | prompt | [x] |
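
The four [ ] rows above are never stored per-row; on the compatibility side they are derived from success and capability at query time. A minimal sketch with Python's sqlite3 against a toy in-memory database (table and column names follow the map above; the schema is reduced and the data is made up):

```python
import sqlite3

# Toy in-memory database with a reduced benchmark_results schema.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE benchmark_results ("
    " id INTEGER PRIMARY KEY, model_name TEXT, capability TEXT,"
    " success INTEGER, source TEXT)"
)
con.executemany(
    "INSERT INTO benchmark_results (model_name, capability, success, source)"
    " VALUES (?, ?, ?, ?)",
    [
        ("toy-model", "rag", 1, "compatibility"),
        ("toy-model", "rag", 0, "compatibility"),
        ("toy-model", "rag", 1, "compatibility"),
    ],
)

# Derive the per-capability aggregates marked [ ] in the parity table.
row = con.execute("""
    SELECT capability,
           COUNT(*) AS total_tests,
           SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) AS successful_tests,
           1.0 * SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) / COUNT(*)
               AS success_rate
    FROM benchmark_results
    WHERE source = 'compatibility'
    GROUP BY capability
""").fetchone()
print(row)
```

With the three sample rows this reports 3 total tests, 2 successful, and a success rate of 2/3; the `1.0 *` factor forces floating-point division, since SQLite would otherwise truncate the integer ratio to 0.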

Historical Validation Queries

Use these queries to find older rows that predate parity fixes.

-- Classic rows that still miss parity fields introduced later.
SELECT id, model_name, timestamp,
       quantization, lmstudio_version, app_version, success
FROM benchmark_results
WHERE quantization IS NULL
   OR lmstudio_version IS NULL
   OR app_version IS NULL
   OR success IS NULL
ORDER BY id DESC;
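
The query above can be run from Python with the standard-library sqlite3 module. The sketch below seeds a toy in-memory database so it is self-contained; in practice you would connect to the real results database instead (the schema here is reduced to the columns the query touches):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # real usage: sqlite3.connect(<db path>)
con.execute(
    "CREATE TABLE benchmark_results ("
    " id INTEGER PRIMARY KEY, model_name TEXT, timestamp TEXT,"
    " quantization TEXT, lmstudio_version TEXT,"
    " app_version TEXT, success INTEGER)"
)
con.executemany(
    "INSERT INTO benchmark_results"
    " (model_name, timestamp, quantization, lmstudio_version, app_version, success)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("old-model", "2024-01-01", None, None, None, 1),        # pre-fix row
        ("new-model", "2025-01-01", "Q4_K_M", "0.3.5", "1.0", 1),  # populated row
    ],
)

# Historical rows that predate the parity fixes show up with NULLs.
stale = con.execute("""
    SELECT id, model_name
    FROM benchmark_results
    WHERE quantization IS NULL
       OR lmstudio_version IS NULL
       OR app_version IS NULL
       OR success IS NULL
    ORDER BY id DESC
""").fetchall()
print(stale)
```

Only the pre-fix row is flagged; the fully populated row passes the NULL checks.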

-- Compatibility rows that still miss core parity fields.
SELECT id, model_name, capability, test_id,
       quantization, lmstudio_version, app_version,
       prompt_hash, params_hash
FROM benchmark_results
WHERE source = 'compatibility'
  AND (
        quantization IS NULL
        OR lmstudio_version IS NULL
        OR app_version IS NULL
        OR prompt_hash IS NULL
        OR params_hash IS NULL
      )
ORDER BY id DESC;

-- Compatibility summary directly from benchmark_results.
SELECT model_name,
       capability,
       COUNT(*) AS total_tests,
       SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END) AS successful_tests,
       SUM(CASE WHEN success = 1 THEN 0 ELSE 1 END) AS failed_tests,
       AVG(avg_gen_time) AS avg_latency_ms,
       AVG(avg_tokens_per_sec) AS avg_throughput,
       AVG(quality_score) AS avg_quality_score,
       AVG(rouge_score) AS avg_rouge,
       AVG(f1_score) AS avg_f1,
       AVG(exact_match_score) AS avg_exact_match,
       AVG(accuracy_score) AS avg_accuracy
FROM benchmark_results
WHERE source = 'compatibility'
GROUP BY model_name, capability
ORDER BY MAX(id) DESC;