Example 11: Benchmarking with IdentiBench¶
IdentiBench provides standardized benchmarks for comparing system identification methods. This example shows how to run your TSFast models on IdentiBench benchmarks for fair, reproducible comparison with other methods.
Setup¶
```python
import identibench as idb
from tsfast.tsdata.benchmark import create_dls_from_spec
from tsfast.models.rnn import RNNLearner
from tsfast.inference import InferenceWrapper
from tsfast.training import fun_rmse
```
What is IdentiBench?¶
IdentiBench is a benchmarking framework that provides standardized datasets, evaluation protocols, and metrics for system identification. Each benchmark defines:
- A dataset with specified train/validation/test splits
- Input and output column names (e.g., voltage in, displacement out)
- Evaluation metrics (typically NRMSE -- normalized root mean square error)
- A standard API that all methods must follow, ensuring fair comparison
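The example does not spell out the NRMSE formula, but a common form normalizes the RMSE by the standard deviation of the true output. A minimal sketch of that convention (the exact normalization IdentiBench uses may differ):

```python
import numpy as np

def nrmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean square error normalized by the std of the true signal."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.std(y_true)

y_true = np.array([0.0, 1.0, 2.0, 3.0])
print(nrmse(y_true, y_true))  # a perfect prediction scores 0.0
```

Because the error is scaled by the output's spread, scores are comparable across benchmarks with very different signal magnitudes.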
The `workshop_benchmarks` dictionary contains the benchmarks used in the IdentiBench workshop -- a curated set covering different system types and difficulties.
The Build Model Function¶
IdentiBench requires a `build_model` function that takes a `TrainingContext` and returns a callable model for evaluation. The context provides:
- `context.spec` -- the benchmark specification (dataset path, column names, window sizes, metric function)
- `context.hyperparameters` -- your model's hyperparameters, passed through from the benchmark runner
The returned model must accept numpy arrays: model(u_test, y_init) for
simulation benchmarks, where u_test is the full input signal and
y_init is the initial output window.
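To make that call convention concrete, here is a dummy stand-in obeying the same contract; the exact array shapes are assumptions based on the description above (time steps along the first axis, channels along the second):

```python
import numpy as np

def dummy_model(u_test: np.ndarray, y_init: np.ndarray) -> np.ndarray:
    """Toy stand-in for a trained model: predicts zeros of the right shape.

    u_test: full input signal, assumed shape (n_samples, n_inputs)
    y_init: initial output window, assumed shape (init_window, n_outputs)
    """
    n_samples = u_test.shape[0]
    n_outputs = y_init.shape[1]
    return np.zeros((n_samples, n_outputs))

u_test = np.random.randn(500, 1)   # 500 steps, 1 input channel
y_init = np.random.randn(50, 1)    # 50-step initialization window
y_pred = dummy_model(u_test, y_init)
print(y_pred.shape)  # (500, 1)
```

Any callable with this numpy-in, numpy-out signature can be evaluated by the harness; in this example `InferenceWrapper` provides it for a trained learner.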
```python
def build_model(context: idb.TrainingContext):
    """Build and train a TSFast model for an IdentiBench benchmark."""
    dls = create_dls_from_spec(context.spec)
    lrn = RNNLearner(
        dls,
        rnn_type=context.hyperparameters.get('model_type', 'lstm'),
        num_layers=context.hyperparameters.get('num_layers', 1),
        hidden_size=context.hyperparameters.get('hidden_size', 40),
        n_skip=context.spec.init_window,
        metrics=[fun_rmse],
    )
    lrn.fit_flat_cos(n_epoch=10, lr=3e-3)
    return InferenceWrapper(lrn)
```
Key details:
- `create_dls_from_spec` automatically extracts column names, window sizes, and prediction settings from the benchmark spec. It also applies benchmark-specific DataLoader defaults (e.g., batch size, step size) from TSFast's `BENCHMARK_DL_KWARGS` table.
- `n_skip=context.spec.init_window` uses the benchmark-defined initialization window to skip the initial transient in the loss. This matches IdentiBench's evaluation protocol, which discards the first `init_window` timesteps.
- `InferenceWrapper` wraps the trained learner into a numpy-in, numpy-out callable that IdentiBench's evaluation harness can call directly.
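The effect of discarding the initialization window can be sketched in plain numpy (the `init_window` value here is illustrative, not taken from any benchmark):

```python
import numpy as np

def rmse_after_init(y_true: np.ndarray, y_pred: np.ndarray, init_window: int) -> float:
    """RMSE computed only on timesteps after the initialization window."""
    return float(np.sqrt(np.mean((y_true[init_window:] - y_pred[init_window:]) ** 2)))

y_true = np.ones(100)
y_pred = np.ones(100)
y_pred[:20] = 0.0  # large transient error confined to the first 20 steps
print(rmse_after_init(y_true, y_pred, init_window=20))  # 0.0 -- the transient is ignored
```

Skipping the same window in the training loss (via `n_skip`) keeps the objective aligned with how the model is ultimately scored.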
Configure and Run Benchmarks¶
We define a hyperparameter dictionary and pass it along with the
benchmarks to idb.run_benchmarks. The runner:
- Downloads each dataset (on first use)
- Calls `build_model` with the spec and hyperparameters
- Evaluates the returned model on the held-out test set
- Collects metrics into a pandas DataFrame
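Conceptually, the runner loop can be sketched as follows. This is a simplified illustration, not IdentiBench's actual implementation; the dict-based spec and `evaluate` field are hypothetical stand-ins:

```python
def run_benchmarks_sketch(benchmarks, build_model, hyperparameters):
    """Simplified picture of a benchmark runner: train, evaluate, collect."""
    rows = []
    for spec in benchmarks:
        context = {'spec': spec, 'hyperparameters': hyperparameters}
        model = build_model(context)          # train a model for this benchmark
        score = spec['evaluate'](model)       # score it on the held-out test set
        rows.append({'benchmark': spec['name'], 'score': score})
    return rows

# Toy benchmark: "evaluation" just calls the model on a fixed input.
toy_spec = {'name': 'Toy', 'evaluate': lambda m: m(3)}
toy_results = run_benchmarks_sketch([toy_spec], lambda ctx: (lambda u: u * 2), {})
print(toy_results)  # [{'benchmark': 'Toy', 'score': 6}]
```

The real runner additionally handles dataset downloads, repetitions, and error reporting, and returns the collected rows as a pandas DataFrame.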
```python
model_config = {
    'model_type': 'lstm',
    'num_layers': 1,
    'hidden_size': 40,
}

benchmarks = list(idb.workshop_benchmarks.values())
results = idb.run_benchmarks(benchmarks, build_model, model_config)
```
```
--- Starting benchmark run for 4 specifications, repeating each 1 times ---
-- Repetition 1/1 --
[1/4] Running: BenchmarkWH_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.013542 | 0.010710 | 0.014256 | 00:02 |
| 1 | 0.008203 | 0.007862 | 0.010135 | 00:02 |
| 2 | 0.007334 | 0.005773 | 0.007681 | 00:02 |
| 3 | 0.007540 | 0.007693 | 0.009679 | 00:02 |
| 4 | 0.005602 | 0.003809 | 0.005187 | 00:02 |
| 5 | 0.006091 | 0.006287 | 0.007872 | 00:02 |
| 6 | 0.006572 | 0.008107 | 0.010937 | 00:02 |
| 7 | 0.005290 | 0.005069 | 0.006561 | 00:02 |
| 8 | 0.002490 | 0.002136 | 0.003210 | 00:02 |
| 9 | 0.001850 | 0.001917 | 0.002910 | 00:02 |
```
-> ERROR running benchmark 'BenchmarkWH_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
[2/4] Running: BenchmarkSilverbox_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.005729 | 0.003897 | 0.005963 | 00:02 |
| 1 | 0.003457 | 0.003796 | 0.005148 | 00:02 |
| 2 | 0.003146 | 0.003109 | 0.004407 | 00:02 |
| 3 | 0.002784 | 0.003424 | 0.004669 | 00:02 |
| 4 | 0.002708 | 0.002582 | 0.003909 | 00:02 |
| 5 | 0.002986 | 0.002651 | 0.003982 | 00:02 |
| 6 | 0.002712 | 0.002917 | 0.004233 | 00:02 |
| 7 | 0.002638 | 0.002083 | 0.003457 | 00:02 |
| 8 | 0.001957 | 0.001940 | 0.003453 | 00:02 |
| 9 | 0.001730 | 0.001772 | 0.003378 | 00:02 |
```
-> ERROR running benchmark 'BenchmarkSilverbox_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
[3/4] Running: BenchmarkEMPS_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.069733 | 0.071184 | 0.081173 | 00:02 |
| 1 | 0.069433 | 0.071410 | 0.082529 | 00:02 |
| 2 | 0.069793 | 0.071213 | 0.082052 | 00:02 |
| 3 | 0.067787 | 0.067015 | 0.085624 | 00:03 |
| 4 | 0.059562 | 0.068800 | 0.084610 | 00:03 |
| 5 | 0.058254 | 0.063308 | 0.082248 | 00:03 |
| 6 | 0.057195 | 0.063520 | 0.080586 | 00:03 |
| 7 | 0.056593 | 0.062098 | 0.082033 | 00:03 |
| 8 | 0.055283 | 0.061372 | 0.080548 | 00:03 |
| 9 | 0.054668 | 0.061886 | 0.081729 | 00:02 |
```
-> ERROR running benchmark 'BenchmarkEMPS_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
[4/4] Running: BenchmarkCED_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.094108 | 0.165470 | 0.242301 | 00:02 |
| 1 | 0.066462 | 0.146954 | 0.214916 | 00:02 |
| 2 | 0.051098 | 0.128206 | 0.179753 | 00:02 |
| 3 | 0.045406 | 0.102076 | 0.145229 | 00:02 |
| 4 | 0.041550 | 0.094097 | 0.135932 | 00:02 |
| 5 | 0.041709 | 0.093857 | 0.132760 | 00:02 |
| 6 | 0.040035 | 0.096683 | 0.137633 | 00:02 |
| 7 | 0.036568 | 0.097940 | 0.137784 | 00:02 |
| 8 | 0.031412 | 0.096995 | 0.137803 | 00:02 |
| 9 | 0.028697 | 0.096767 | 0.137891 | 00:02 |
```
-> ERROR running benchmark 'BenchmarkCED_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
--- Benchmark run finished. 0/4 individual runs completed successfully. ---
```
Analyze Results¶
The results DataFrame collects the benchmark name, metric score, and training/test times for each benchmark. In the run above, however, every benchmark errored during evaluation -- the message indicates the trained model expected a single input channel while the evaluation harness passed two -- so no rows were collected and the DataFrame is empty.
```python
print(results)
```
```
Empty DataFrame
Columns: []
Index: []
```
Trying Different Configurations¶
One of IdentiBench's strengths is making it easy to compare different model architectures on the same benchmarks. Here we try a GRU with 2 layers instead of a single-layer LSTM.
```python
model_config_v2 = {
    'model_type': 'gru',
    'num_layers': 2,
    'hidden_size': 40,
}

results_v2 = idb.run_benchmarks(benchmarks, build_model, model_config_v2)
```
```
--- Starting benchmark run for 4 specifications, repeating each 1 times ---
-- Repetition 1/1 --
[1/4] Running: BenchmarkWH_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.011985 | 0.010200 | 0.013661 | 00:03 |
| 1 | 0.009718 | 0.016444 | 0.019905 | 00:03 |
| 2 | 0.007916 | 0.008811 | 0.010413 | 00:03 |
| 3 | 0.006667 | 0.004925 | 0.006617 | 00:03 |
| 4 | 0.007075 | 0.006787 | 0.008137 | 00:03 |
| 5 | 0.005515 | 0.005563 | 0.006927 | 00:03 |
| 6 | 0.006052 | 0.007912 | 0.010538 | 00:03 |
| 7 | 0.005136 | 0.005280 | 0.007136 | 00:03 |
| 8 | 0.002596 | 0.002249 | 0.003203 | 00:03 |
| 9 | 0.001502 | 0.001535 | 0.002477 | 00:02 |
```
-> ERROR running benchmark 'BenchmarkWH_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
[2/4] Running: BenchmarkSilverbox_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.004455 | 0.003050 | 0.004205 | 00:02 |
| 1 | 0.003158 | 0.002428 | 0.003692 | 00:02 |
| 2 | 0.003148 | 0.003758 | 0.004969 | 00:02 |
| 3 | 0.002930 | 0.002887 | 0.004110 | 00:02 |
| 4 | 0.002939 | 0.003219 | 0.004540 | 00:02 |
| 5 | 0.002938 | 0.002834 | 0.004091 | 00:02 |
| 6 | 0.002749 | 0.003133 | 0.004395 | 00:02 |
| 7 | 0.002402 | 0.002544 | 0.003900 | 00:02 |
| 8 | 0.002036 | 0.001876 | 0.003416 | 00:02 |
| 9 | 0.001760 | 0.001802 | 0.003383 | 00:02 |
```
-> ERROR running benchmark 'BenchmarkSilverbox_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
[3/4] Running: BenchmarkEMPS_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.069761 | 0.071430 | 0.081981 | 00:03 |
| 1 | 0.069618 | 0.071303 | 0.081248 | 00:02 |
| 2 | 0.068572 | 0.071255 | 0.082006 | 00:02 |
| 3 | 0.069085 | 0.071208 | 0.081936 | 00:02 |
| 4 | 0.069027 | 0.071020 | 0.082321 | 00:02 |
| 5 | 0.068211 | 0.068786 | 0.081111 | 00:03 |
| 6 | 0.055204 | 0.050874 | 0.067061 | 00:03 |
| 7 | 0.033857 | 0.033331 | 0.060670 | 00:03 |
| 8 | 0.034300 | 0.031959 | 0.055136 | 00:04 |
| 9 | 0.024241 | 0.019655 | 0.034394 | 00:03 |
```
-> ERROR running benchmark 'BenchmarkEMPS_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
[4/4] Running: BenchmarkCED_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.105475 | 0.164520 | 0.225912 | 00:03 |
| 1 | 0.049000 | 0.076640 | 0.111517 | 00:03 |
| 2 | 0.043063 | 0.080853 | 0.118647 | 00:03 |
| 3 | 0.038487 | 0.101073 | 0.145044 | 00:02 |
| 4 | 0.036761 | 0.106693 | 0.156260 | 00:02 |
| 5 | 0.035981 | 0.117225 | 0.173467 | 00:02 |
| 6 | 0.031315 | 0.128918 | 0.189117 | 00:02 |
| 7 | 0.031419 | 0.128531 | 0.191488 | 00:02 |
| 8 | 0.027531 | 0.131669 | 0.201200 | 00:02 |
| 9 | 0.024075 | 0.130144 | 0.200553 | 00:02 |
```
-> ERROR running benchmark 'BenchmarkCED_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
--- Benchmark run finished. 0/4 individual runs completed successfully. ---
```
```python
print(results_v2)
```
```
Empty DataFrame
Columns: []
Index: []
```
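Since both runs above errored, here is how two successful results DataFrames could be compared side by side, using mock scores. The column names are hypothetical; substitute whatever columns your IdentiBench version actually produces:

```python
import pandas as pd

# Mock scores standing in for two successful benchmark runs (hypothetical columns).
results = pd.DataFrame({'benchmark': ['WH', 'Silverbox'], 'score': [0.031, 0.012]})
results_v2 = pd.DataFrame({'benchmark': ['WH', 'Silverbox'], 'score': [0.027, 0.015]})

# Join on benchmark name and compute the per-benchmark score difference.
comparison = results.merge(results_v2, on='benchmark', suffixes=('_lstm', '_gru'))
comparison['improvement'] = comparison['score_lstm'] - comparison['score_gru']
print(comparison)
```

A positive `improvement` would mean the second configuration scored lower (better, for an error metric) on that benchmark.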
Key Takeaways¶
- IdentiBench provides standardized, reproducible benchmarks for fair comparison across system identification methods.
- The `build_model` function follows a simple API: receive a training context, build and train a model, return an `InferenceWrapper`.
- `create_dls_from_spec` handles dataset-specific configuration automatically -- column names, window sizes, and prediction settings are all extracted from the benchmark spec.
- Compare different architectures (LSTM vs. GRU, depth, width) on the same benchmarks with minimal code changes.
- Results are directly comparable with other methods in the IdentiBench ecosystem.