Example 11: Benchmarking with IdentiBench¶
IdentiBench provides standardized benchmarks for comparing system identification methods. This example shows how to run your TSFast models on IdentiBench benchmarks for fair, reproducible comparison with other methods.
Setup¶
```python
import identibench as idb
from tsfast.tsdata.benchmark import create_dls_from_spec
from tsfast.models.rnn import RNNLearner
from tsfast.inference import InferenceWrapper
from tsfast.training import fun_rmse
```
What is IdentiBench?¶
IdentiBench is a benchmarking framework that provides standardized datasets, evaluation protocols, and metrics for system identification. Each benchmark defines:
- A dataset with specified train/validation/test splits
- Input and output column names (e.g., voltage in, displacement out)
- Evaluation metrics (typically NRMSE -- normalized root mean square error)
- A standard API that all methods must follow, ensuring fair comparison
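The example does not spell out the NRMSE formula, but a common form normalizes the RMSE by the standard deviation of the true output. A minimal sketch of that convention (the exact normalization IdentiBench uses may differ):

```python
import numpy as np

def nrmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean square error normalized by the std of the true signal."""
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.std(y_true)

y_true = np.array([0.0, 1.0, 2.0, 3.0])
print(nrmse(y_true, y_true))  # a perfect prediction scores 0.0
```

Because the error is scaled by the output's spread, scores are comparable across benchmarks with very different signal magnitudes.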
The `workshop_benchmarks` dictionary contains the benchmarks used in the IdentiBench workshop -- a curated set covering different system types and difficulties.
The Build Model Function¶
IdentiBench requires a `build_model` function that takes a `TrainingContext` and returns a callable model for evaluation. The context provides:
- `context.spec` -- the benchmark specification (dataset path, column names, window sizes, metric function)
- `context.hyperparameters` -- your model's hyperparameters, passed through from the benchmark runner
The returned model must accept numpy arrays: model(u_test, y_init) for
simulation benchmarks, where u_test is the full input signal and
y_init is the initial output window.
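To make that call convention concrete, here is a dummy stand-in obeying the same contract; the exact array shapes are assumptions based on the description above (time steps along the first axis, channels along the second):

```python
import numpy as np

def dummy_model(u_test: np.ndarray, y_init: np.ndarray) -> np.ndarray:
    """Toy stand-in for a trained model: predicts zeros of the right shape.

    u_test: full input signal, assumed shape (n_samples, n_inputs)
    y_init: initial output window, assumed shape (init_window, n_outputs)
    """
    n_samples = u_test.shape[0]
    n_outputs = y_init.shape[1]
    return np.zeros((n_samples, n_outputs))

u_test = np.random.randn(500, 1)   # 500 steps, 1 input channel
y_init = np.random.randn(50, 1)    # 50-step initialization window
y_pred = dummy_model(u_test, y_init)
print(y_pred.shape)  # (500, 1)
```

Any callable with this numpy-in, numpy-out signature can be evaluated by the harness; in this example `InferenceWrapper` provides it for a trained learner.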
```python
def build_model(context: idb.TrainingContext):
    """Build and train a TSFast model for an IdentiBench benchmark."""
    dls = create_dls_from_spec(context.spec)
    lrn = RNNLearner(
        dls,
        rnn_type=context.hyperparameters.get('model_type', 'lstm'),
        num_layers=context.hyperparameters.get('num_layers', 1),
        hidden_size=context.hyperparameters.get('hidden_size', 40),
        n_skip=context.spec.init_window,
        metrics=[fun_rmse],
    )
    lrn.fit_flat_cos(n_epoch=10, lr=3e-3)
    return InferenceWrapper(lrn)
```
Key details:
- `create_dls_from_spec` automatically extracts column names, window sizes, and prediction settings from the benchmark spec. It also applies benchmark-specific DataLoader defaults (e.g., batch size, step size) from TSFast's `BENCHMARK_DL_KWARGS` table.
- `n_skip=context.spec.init_window` uses the benchmark-defined initialization window to skip the initial transient in the loss. This matches IdentiBench's evaluation protocol, which discards the first `init_window` timesteps.
- `InferenceWrapper` wraps the trained learner into a numpy-in, numpy-out callable that IdentiBench's evaluation harness can call directly.
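The effect of discarding the initialization window can be sketched in plain numpy (the `init_window` value here is illustrative, not taken from any benchmark):

```python
import numpy as np

def rmse_after_init(y_true: np.ndarray, y_pred: np.ndarray, init_window: int) -> float:
    """RMSE computed only on timesteps after the initialization window."""
    return float(np.sqrt(np.mean((y_true[init_window:] - y_pred[init_window:]) ** 2)))

y_true = np.ones(100)
y_pred = np.ones(100)
y_pred[:20] = 0.0  # large transient error confined to the first 20 steps
print(rmse_after_init(y_true, y_pred, init_window=20))  # 0.0 -- the transient is ignored
```

Skipping the same window in the training loss (via `n_skip`) keeps the objective aligned with how the model is ultimately scored.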
Configure and Run Benchmarks¶
We define a hyperparameter dictionary and pass it along with the
benchmarks to idb.run_benchmarks. The runner:
- Downloads each dataset (on first use)
- Calls `build_model` with the spec and hyperparameters
- Evaluates the returned model on the held-out test set
- Collects metrics into a pandas DataFrame
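Conceptually, the runner loop can be sketched as follows. This is a simplified illustration, not IdentiBench's actual implementation; the dict-based spec and `evaluate` field are hypothetical stand-ins:

```python
def run_benchmarks_sketch(benchmarks, build_model, hyperparameters):
    """Simplified picture of a benchmark runner: train, evaluate, collect."""
    rows = []
    for spec in benchmarks:
        context = {'spec': spec, 'hyperparameters': hyperparameters}
        model = build_model(context)          # train a model for this benchmark
        score = spec['evaluate'](model)       # score it on the held-out test set
        rows.append({'benchmark': spec['name'], 'score': score})
    return rows

# Toy benchmark: "evaluation" just calls the model on a fixed input.
toy_spec = {'name': 'Toy', 'evaluate': lambda m: m(3)}
toy_results = run_benchmarks_sketch([toy_spec], lambda ctx: (lambda u: u * 2), {})
print(toy_results)  # [{'benchmark': 'Toy', 'score': 6}]
```

The real runner additionally handles dataset downloads, repetitions, and error reporting, and returns the collected rows as a pandas DataFrame.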
```python
model_config = {
    'model_type': 'lstm',
    'num_layers': 1,
    'hidden_size': 40,
}

benchmarks = list(idb.workshop_benchmarks.values())
results = idb.run_benchmarks(benchmarks, build_model, model_config)
```
```
--- Starting benchmark run for 4 specifications, repeating each 1 times ---
-- Repetition 1/1 --
[1/4] Running: BenchmarkWH_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.013542 | 0.010710 | 0.014256 | 00:02 |
| 1 | 0.008203 | 0.007862 | 0.010135 | 00:02 |
| 2 | 0.007334 | 0.005773 | 0.007681 | 00:02 |
| 3 | 0.007540 | 0.007693 | 0.009679 | 00:02 |
| 4 | 0.005602 | 0.003809 | 0.005187 | 00:02 |
| 5 | 0.006091 | 0.006287 | 0.007872 | 00:02 |
| 6 | 0.006572 | 0.008107 | 0.010937 | 00:02 |
| 7 | 0.005290 | 0.005069 | 0.006561 | 00:02 |
| 8 | 0.002490 | 0.002136 | 0.003210 | 00:02 |
| 9 | 0.001850 | 0.001917 | 0.002910 | 00:02 |
```
-> ERROR running benchmark 'BenchmarkWH_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
[2/4] Running: BenchmarkSilverbox_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.005729 | 0.003897 | 0.005963 | 00:02 |
| 1 | 0.003457 | 0.003796 | 0.005148 | 00:02 |
| 2 | 0.003146 | 0.003109 | 0.004407 | 00:02 |
| 3 | 0.002784 | 0.003424 | 0.004669 | 00:02 |
| 4 | 0.002708 | 0.002582 | 0.003909 | 00:02 |
| 5 | 0.002986 | 0.002651 | 0.003982 | 00:02 |
| 6 | 0.002712 | 0.002917 | 0.004233 | 00:02 |
| 7 | 0.002638 | 0.002083 | 0.003457 | 00:02 |
| 8 | 0.001957 | 0.001940 | 0.003453 | 00:02 |
| 9 | 0.001730 | 0.001772 | 0.003378 | 00:02 |
```
-> ERROR running benchmark 'BenchmarkSilverbox_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
[3/4] Running: BenchmarkEMPS_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.069733 | 0.071184 | 0.081173 | 00:02 |
| 1 | 0.069433 | 0.071410 | 0.082529 | 00:02 |
| 2 | 0.069793 | 0.071213 | 0.082052 | 00:02 |
| 3 | 0.067787 | 0.067015 | 0.085624 | 00:03 |
| 4 | 0.059562 | 0.068800 | 0.084610 | 00:03 |
| 5 | 0.058254 | 0.063308 | 0.082248 | 00:03 |
| 6 | 0.057195 | 0.063520 | 0.080586 | 00:03 |
| 7 | 0.056593 | 0.062098 | 0.082033 | 00:03 |
| 8 | 0.055283 | 0.061372 | 0.080548 | 00:03 |
| 9 | 0.054668 | 0.061886 | 0.081729 | 00:02 |
```
-> ERROR running benchmark 'BenchmarkEMPS_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
[4/4] Running: BenchmarkCED_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.094108 | 0.165470 | 0.242301 | 00:02 |
| 1 | 0.066462 | 0.146954 | 0.214916 | 00:02 |
| 2 | 0.051098 | 0.128206 | 0.179753 | 00:02 |
| 3 | 0.045406 | 0.102076 | 0.145229 | 00:02 |
| 4 | 0.041550 | 0.094097 | 0.135932 | 00:02 |
| 5 | 0.041709 | 0.093857 | 0.132760 | 00:02 |
| 6 | 0.040035 | 0.096683 | 0.137633 | 00:02 |
| 7 | 0.036568 | 0.097940 | 0.137784 | 00:02 |
| 8 | 0.031412 | 0.096995 | 0.137803 | 00:02 |
| 9 | 0.028697 | 0.096767 | 0.137891 | 00:02 |
```
-> ERROR running benchmark 'BenchmarkCED_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
--- Benchmark run finished. 0/4 individual runs completed successfully. ---
```
Analyze Results¶
The results DataFrame collects the benchmark name, metric score, and training/test times for each benchmark. In the run above, however, every benchmark errored during evaluation -- the message indicates the trained model expected a single input channel while the evaluation harness passed two -- so no rows were collected and the DataFrame is empty.
```python
print(results)
```
```
Empty DataFrame
Columns: []
Index: []
```
Trying Different Configurations¶
One of IdentiBench's strengths is making it easy to compare different model architectures on the same benchmarks. Here we try a GRU with 2 layers instead of a single-layer LSTM.
```python
model_config_v2 = {
    'model_type': 'gru',
    'num_layers': 2,
    'hidden_size': 40,
}

results_v2 = idb.run_benchmarks(benchmarks, build_model, model_config_v2)
```
```
--- Starting benchmark run for 4 specifications, repeating each 1 times ---
-- Repetition 1/1 --
[1/4] Running: BenchmarkWH_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.011985 | 0.010200 | 0.013661 | 00:03 |
| 1 | 0.009718 | 0.016444 | 0.019905 | 00:03 |
| 2 | 0.007916 | 0.008811 | 0.010413 | 00:03 |
| 3 | 0.006667 | 0.004925 | 0.006617 | 00:03 |
| 4 | 0.007075 | 0.006787 | 0.008137 | 00:03 |
| 5 | 0.005515 | 0.005563 | 0.006927 | 00:03 |
| 6 | 0.006052 | 0.007912 | 0.010538 | 00:03 |
| 7 | 0.005136 | 0.005280 | 0.007136 | 00:03 |
| 8 | 0.002596 | 0.002249 | 0.003203 | 00:03 |
| 9 | 0.001502 | 0.001535 | 0.002477 | 00:02 |
```
-> ERROR running benchmark 'BenchmarkWH_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
[2/4] Running: BenchmarkSilverbox_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.004455 | 0.003050 | 0.004205 | 00:02 |
| 1 | 0.003158 | 0.002428 | 0.003692 | 00:02 |
| 2 | 0.003148 | 0.003758 | 0.004969 | 00:02 |
| 3 | 0.002930 | 0.002887 | 0.004110 | 00:02 |
| 4 | 0.002939 | 0.003219 | 0.004540 | 00:02 |
| 5 | 0.002938 | 0.002834 | 0.004091 | 00:02 |
| 6 | 0.002749 | 0.003133 | 0.004395 | 00:02 |
| 7 | 0.002402 | 0.002544 | 0.003900 | 00:02 |
| 8 | 0.002036 | 0.001876 | 0.003416 | 00:02 |
| 9 | 0.001760 | 0.001802 | 0.003383 | 00:02 |
```
-> ERROR running benchmark 'BenchmarkSilverbox_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
[3/4] Running: BenchmarkEMPS_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.069761 | 0.071430 | 0.081981 | 00:03 |
| 1 | 0.069618 | 0.071303 | 0.081248 | 00:02 |
| 2 | 0.068572 | 0.071255 | 0.082006 | 00:02 |
| 3 | 0.069085 | 0.071208 | 0.081936 | 00:02 |
| 4 | 0.069027 | 0.071020 | 0.082321 | 00:02 |
| 5 | 0.068211 | 0.068786 | 0.081111 | 00:03 |
| 6 | 0.055204 | 0.050874 | 0.067061 | 00:03 |
| 7 | 0.033857 | 0.033331 | 0.060670 | 00:03 |
| 8 | 0.034300 | 0.031959 | 0.055136 | 00:04 |
| 9 | 0.024241 | 0.019655 | 0.034394 | 00:03 |
```
-> ERROR running benchmark 'BenchmarkEMPS_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
[4/4] Running: BenchmarkCED_Simulation (Rep 1)
```
| epoch | train_loss | valid_loss | fun_rmse | time |
|---|---|---|---|---|
| 0 | 0.105475 | 0.164520 | 0.225912 | 00:03 |
| 1 | 0.049000 | 0.076640 | 0.111517 | 00:03 |
| 2 | 0.043063 | 0.080853 | 0.118647 | 00:03 |
| 3 | 0.038487 | 0.101073 | 0.145044 | 00:02 |
| 4 | 0.036761 | 0.106693 | 0.156260 | 00:02 |
| 5 | 0.035981 | 0.117225 | 0.173467 | 00:02 |
| 6 | 0.031315 | 0.128918 | 0.189117 | 00:02 |
| 7 | 0.031419 | 0.128531 | 0.191488 | 00:02 |
| 8 | 0.027531 | 0.131669 | 0.201200 | 00:02 |
| 9 | 0.024075 | 0.130144 | 0.200553 | 00:02 |
```
-> ERROR running benchmark 'BenchmarkCED_Simulation' (Rep 1): input.size(-1) must be equal to input_size. Expected 1, got 2
--- Benchmark run finished. 0/4 individual runs completed successfully. ---
```
```python
print(results_v2)
```
```
Empty DataFrame
Columns: []
Index: []
```
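Since both runs above errored, here is how two successful results DataFrames could be compared side by side, using mock scores. The column names are hypothetical; substitute whatever columns your IdentiBench version actually produces:

```python
import pandas as pd

# Mock scores standing in for two successful benchmark runs (hypothetical columns).
results = pd.DataFrame({'benchmark': ['WH', 'Silverbox'], 'score': [0.031, 0.012]})
results_v2 = pd.DataFrame({'benchmark': ['WH', 'Silverbox'], 'score': [0.027, 0.015]})

# Join on benchmark name and compute the per-benchmark score difference.
comparison = results.merge(results_v2, on='benchmark', suffixes=('_lstm', '_gru'))
comparison['improvement'] = comparison['score_lstm'] - comparison['score_gru']
print(comparison)
```

A positive `improvement` would mean the second configuration scored lower (better, for an error metric) on that benchmark.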
Key Takeaways¶
- IdentiBench provides standardized, reproducible benchmarks for fair comparison across system identification methods.
- The `build_model` function follows a simple API: receive a training context, build and train a model, return an `InferenceWrapper`.
- `create_dls_from_spec` handles dataset-specific configuration automatically -- column names, window sizes, and prediction settings are all extracted from the benchmark spec.
- Compare different architectures (LSTM vs. GRU, depth, width) on the same benchmarks with minimal code changes.
- Results are directly comparable with other methods in the IdentiBench ecosystem.