# Basic usage
import identibench as idb
from pathlib import Path
# Example: Download a single dataset
# Note: Always use a Path object, not a string
save_path = Path('./tmp/wh')
idb.datasets.workshop.dl_wiener_hammerstein(save_path)
# IdentiBench
IdentiBench is a Python library designed to streamline and standardize the benchmarking of system identification models. Evaluating and comparing dynamic models often requires repetitive setup for data handling, evaluation protocols, and metrics implementation, making fair comparisons and reproducing results challenging. IdentiBench tackles this by offering a collection of pre-defined benchmark specifications for simulation and prediction tasks, built upon common datasets. It automates data downloading and processing into a consistent format and provides standard evaluation metrics via a simple interface (`run_benchmark`). This allows you to focus your efforts on developing innovative models, while relying on IdentiBench for robust and reproducible evaluation.
## Key Features

- Access Many Benchmarks from Different Systems: Instantly use pre-configured benchmarks covering diverse domains such as electronics (Silverbox), mechanics (Industrial Robot), process control (Cascaded Tanks), and aerospace (Quadrotors), available for both simulation and prediction tasks.
- Automate Data Management: Forget manual downloading and processing; the library handles fetching data from various sources (web, Drive, Dataverse), extracting archives (ZIP, RAR, MAT, BAG), converting to a standard HDF5 format, and caching locally.
- Integrate Any Model: Plug in your custom models and evaluate them on all benchmarks, regardless of the Python framework used (NumPy, SciPy, PyTorch, TensorFlow, JAX, etc.), via a straightforward function interface (`build_model`) that receives all necessary context.
- Capture Comprehensive Results: Obtain detailed evaluation reports including standard metrics (RMSE, NRMSE, FIT%, etc.), task-specific scores, execution timings, configuration parameters (hyperparameters, seed), and raw model predictions for thorough analysis.
- Easily Define New Benchmarks: Go beyond the included datasets by creating your own benchmark specifications (`BenchmarkSpecSimulation`, `BenchmarkSpecPrediction`) for private data or unique tasks, leveraging the library's structure and transparent data format.
## Installation

You can install `identibench` using pip:

pip install identibench

To install the latest development version directly from GitHub, use:

pip install git+https://github.com/daniel-om-weber/identibench.git
The following example builds a `sysidentpy` FROLS model inside a `build_model` function and evaluates it on the Wiener-Hammerstein simulation benchmark:

from sysidentpy.model_structure_selection import FROLS
from sysidentpy.parameter_estimation import LeastSquares

def build_frols_model(context):
    u_train, y_train, _ = next(context.get_train_sequences())

    ylag = context.hyperparameters.get('ylag', 5)
    xlag = context.hyperparameters.get('xlag', 5)
    n_terms = context.hyperparameters.get('n_terms', 10)
    estimator = context.hyperparameters.get('estimator', LeastSquares())

    _model = FROLS(xlag=xlag, ylag=ylag, n_terms=n_terms, estimator=estimator)
    _model.fit(X=u_train, y=y_train)

    def model(u_test, y_init):
        nonlocal _model
        yhat_full = _model.predict(X=u_test, y=y_init[:_model.max_lag])
        y_pred = yhat_full[_model.max_lag:]
        return y_pred

    return model
hyperparams = {
    'ylag': 2,
    'xlag': 2,
    'n_terms': 10,  # Number of terms for FROLS
    'estimator': LeastSquares()
}

results = idb.run_benchmark(
    spec=idb.BenchmarkWH_Simulation,
    build_model=build_frols_model,
    hyperparameters=hyperparams
)
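The returned `results` object is a dictionary (described in detail under "Understanding Benchmark Results" below). As a quick illustration, the primary score can be read directly from it:

```python
# 'results' is a plain dict; the primary metric can be read directly.
print(results['metric_name'], results['metric_score'])
```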
## Simulation Benchmarks

| Key | Benchmark Name |
|---|---|
| WH_Sim | BenchmarkWH_Simulation |
| Silverbox_Sim | BenchmarkSilverbox_Simulation |
| Tanks_Sim | BenchmarkCascadedTanks_Simulation |
| CED_Sim | BenchmarkCED_Simulation |
| EMPS_Sim | BenchmarkEMPS_Simulation |
| NoisyWH_Sim | BenchmarkNoisyWH_Simulation |
| RobotForward_Sim | BenchmarkRobotForward_Simulation |
| RobotInverse_Sim | BenchmarkRobotInverse_Simulation |
| Ship_Sim | BenchmarkShip_Simulation |
| QuadPelican_Sim | BenchmarkQuadPelican_Simulation |
| QuadPi_Sim | BenchmarkQuadPi_Simulation |
## Prediction Benchmarks

| Key | Benchmark Name |
|---|---|
| WH_Pred | BenchmarkWH_Prediction |
| Silverbox_Pred | BenchmarkSilverbox_Prediction |
| Tanks_Pred | BenchmarkCascadedTanks_Prediction |
| CED_Pred | BenchmarkCED_Prediction |
| EMPS_Pred | BenchmarkEMPS_Prediction |
| NoisyWH_Pred | BenchmarkNoisyWH_Prediction |
| RobotForward_Pred | BenchmarkRobotForward_Prediction |
| RobotInverse_Pred | BenchmarkRobotInverse_Prediction |
| Ship_Pred | BenchmarkShip_Prediction |
| QuadPelican_Pred | BenchmarkQuadPelican_Prediction |
| QuadPi_Pred | BenchmarkQuadPi_Prediction |
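As a brief illustration (using only names that appear elsewhere in this document), a simulation benchmark specification can be referenced either directly as an attribute of `idb` or via its key in the `idb.simulation_benchmarks` dictionary; a corresponding lookup for the prediction benchmarks is not shown in this document.

```python
import identibench as idb

# Two ways of referring to the Wiener-Hammerstein simulation benchmark,
# both of which appear in the examples in this document.
spec_direct = idb.BenchmarkWH_Simulation
spec_by_key = idb.simulation_benchmarks['WH_Sim']
```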
# Workflow Details

This section provides more detail on the core concepts and components of the `identibench` workflow.
Benchmark Types
identibench
defines two main types of benchmark tasks, specified using different classes:
- Simulation (`BenchmarkSpecSimulation`):
  - Goal: Evaluate a model's ability to perform a free-run simulation, predicting the system's output over an extended period given the input sequence.
  - Typical Input to Predictor: The full input sequence (`u_test`) and potentially an initial segment of the output sequence (`y_test[:init_window]`) for warm-up or state initialization.
  - Expected Output from Predictor: The predicted output sequence (`y_pred`) corresponding to the input, usually excluding the warm-up period.
  - Use Case: Assessing models intended for long-term prediction, control simulation, or understanding overall system dynamics.
- Prediction (`BenchmarkSpecPrediction`):
  - Goal: Evaluate a model's ability to predict the system's output k steps into the future based on recent past data.
  - Typical Input to Predictor: Often windows of past inputs and outputs (e.g., `u[t:t+H]`, `y[t:t+H]`).
  - Expected Output from Predictor: The predicted output at a specific future time step (e.g., `y[t+H+k]`). The `pred_horizon` parameter defines k, and `pred_step` defines how frequently predictions are made.
  - Use Case: Evaluating models focused on short-to-medium-term forecasting, state estimation, or receding horizon control.
- `init_window`: Both benchmark types often use an `init_window`. This specifies the number of initial time steps whose data may be provided to the model for initialization or warm-up. Importantly, data within this window is typically excluded from the final performance metric calculation to ensure a fair evaluation of the model's predictive capabilities beyond the initial transient.
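To make the two interfaces concrete, here is a minimal sketch of the predictor callables described above; the argument names, shapes, and placeholder bodies are illustrative assumptions, not the library's exact signatures.

```python
import numpy as np

def simulation_predictor(u_test: np.ndarray, y_init: np.ndarray) -> np.ndarray:
    """Free-run simulation: map the full test input (plus an initial output
    segment for warm-up) to the predicted output sequence."""
    # A real model would simulate forward over the whole sequence here.
    return np.zeros(len(u_test) - len(y_init))  # placeholder, warm-up excluded

def k_step_predictor(u_window: np.ndarray, y_window: np.ndarray) -> np.ndarray:
    """k-step-ahead prediction: map a window of past inputs and outputs to the
    output pred_horizon steps after the window."""
    # A real model would predict a single future value from the window here.
    return np.zeros(1)  # placeholder
```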
## Model Interface (`build_model`)

The core of integrating your custom logic is the `build_model` function you provide to `run_benchmark`.

- Purpose: This function defines your model architecture, trains it on the provided data, and returns a callable predictor.
- Input (`context: TrainingContext`): Your `build_model` function receives a single argument, `context`, which is a `TrainingContext` object. It gives you access to:
  - `context.spec`: The full specification of the benchmark being run (including dataset paths, input/output columns, `init_window`, etc.).
  - `context.hyperparameters`: A dictionary containing any hyperparameters you passed to `run_benchmark`. Use this to configure your model or training process.
  - `context.seed`: A random seed for ensuring reproducibility.
  - Data access methods: Functions like `context.get_train_sequences()` and `context.get_valid_sequences()` provide iterators over the raw, full-length training and validation sequences (as tuples of NumPy arrays `(u, y, x)`). Note: you need to handle any batching or windowing required by your training algorithm within `build_model`.
- Output (predictor callable): `build_model` must return a callable (e.g., a function or an object's method) representing your trained model, ready for prediction or simulation. This callable is used internally by `run_benchmark` on the test set. Its expected signature depends on the benchmark type, but it typically accepts NumPy arrays for the test inputs (and potentially initial outputs) and returns a NumPy array of predictions. A minimal skeleton is sketched after this list.
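The following is a minimal skeleton of a `build_model` function based on the description above and the `sysidentpy` example earlier. The hyperparameter name `n_lags` and the placeholder predictor body are illustrative only, not part of the library's API.

```python
import numpy as np

def build_my_model(context):
    # Hyperparameters and seed supplied through run_benchmark
    n_lags = context.hyperparameters.get('n_lags', 5)  # illustrative hyperparameter (unused in this sketch)
    rng = np.random.default_rng(context.seed)          # use the provided seed for reproducibility

    # Raw training sequences arrive as tuples of NumPy arrays (u, y, x)
    for u_train, y_train, _ in context.get_train_sequences():
        pass  # fit your model here; batching/windowing is your responsibility

    # Return the trained predictor. For a simulation benchmark it receives the
    # test input and an initial output segment and returns the predicted output.
    def predictor(u_test, y_init):
        return rng.standard_normal(len(u_test) - len(y_init))  # placeholder output
    return predictor
```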
## Running Multiple Benchmarks

To evaluate a model across several scenarios efficiently, use the `run_benchmarks` function:
# Example: Run on a subset of benchmarks
specs_to_run = {
    'WH_Sim': idb.simulation_benchmarks['WH_Sim'],
    'Silverbox_Sim': idb.simulation_benchmarks['Silverbox_Sim']
}

# Reuse the 'build_frols_model' function defined above
all_results = idb.run_benchmarks(specs_to_run, build_model=build_frols_model, n_times=3)

all_results
--- Starting benchmark run for 2 specifications, repeating each 3 times ---
-- Repetition 1/3 --
[1/6] Running: BenchmarkWH_Simulation (Rep 1)
-> Success: BenchmarkWH_Simulation (Rep 1) completed.
[2/6] Running: BenchmarkSilverbox_Simulation (Rep 1)
-> Success: BenchmarkSilverbox_Simulation (Rep 1) completed.
-- Repetition 2/3 --
[3/6] Running: BenchmarkWH_Simulation (Rep 2)
-> Success: BenchmarkWH_Simulation (Rep 2) completed.
[4/6] Running: BenchmarkSilverbox_Simulation (Rep 2)
-> Success: BenchmarkSilverbox_Simulation (Rep 2) completed.
-- Repetition 3/3 --
[5/6] Running: BenchmarkWH_Simulation (Rep 3)
-> Success: BenchmarkWH_Simulation (Rep 3) completed.
[6/6] Running: BenchmarkSilverbox_Simulation (Rep 3)
-> Success: BenchmarkSilverbox_Simulation (Rep 3) completed.
--- Benchmark run finished. 6/6 individual runs completed successfully. ---
| | benchmark_name | dataset_id | hyperparameters | seed | training_time_seconds | test_time_seconds | benchmark_type | metric_name | metric_score | cs_multisine_rmse | cs_arrow_full_rmse | cs_arrow_no_extrapolation_rmse |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | BenchmarkWH_Simulation | wh | {} | 2406651230 | 4.944649 | 1.012850 | BenchmarkSpecSimulation | rmse_mV | 42.161572 | NaN | NaN | NaN |
| 1 | BenchmarkSilverbox_Simulation | silverbox | {} | 3813113752 | 2.839149 | 1.246224 | BenchmarkSpecSimulation | rmse_mV | 10.732386 | 8.501941 | 16.154317 | 7.5409 |
| 2 | BenchmarkWH_Simulation | wh | {} | 1950649438 | 4.801520 | 1.034119 | BenchmarkSpecSimulation | rmse_mV | 42.161572 | NaN | NaN | NaN |
| 3 | BenchmarkSilverbox_Simulation | silverbox | {} | 1560698088 | 2.880391 | 1.217932 | BenchmarkSpecSimulation | rmse_mV | 10.732386 | 8.501941 | 16.154317 | 7.5409 |
| 4 | BenchmarkWH_Simulation | wh | {} | 3258007268 | 4.916941 | 1.021927 | BenchmarkSpecSimulation | rmse_mV | 42.161572 | NaN | NaN | NaN |
| 5 | BenchmarkSilverbox_Simulation | silverbox | {} | 4194043971 | 2.937101 | 1.231710 | BenchmarkSpecSimulation | rmse_mV | 10.732386 | 8.501941 | 16.154317 | 7.5409 |
This function iterates through the provided list or dictionary of benchmark specifications, calling `run_benchmark` for each one using the same `build_model` function and hyperparameters.
# Calculate mean and std of the results
idb.aggregate_benchmark_results(all_results, agg_funcs=['mean', 'std'])
| benchmark_name | training_time_seconds (mean) | training_time_seconds (std) | test_time_seconds (mean) | test_time_seconds (std) | metric_score (mean) | metric_score (std) | cs_multisine_rmse (mean) | cs_multisine_rmse (std) | cs_arrow_full_rmse (mean) | cs_arrow_full_rmse (std) | cs_arrow_no_extrapolation_rmse (mean) | cs_arrow_no_extrapolation_rmse (std) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BenchmarkSilverbox_Simulation | 2.885547 | 0.049179 | 1.231955 | 0.014147 | 10.732386 | 0.0 | 8.501941 | 0.0 | 16.154317 | 0.0 | 7.5409 | 0.0 |
| BenchmarkWH_Simulation | 4.887703 | 0.075912 | 1.022966 | 0.010673 | 42.161572 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN |
## Data Handling & Format

Understanding how `identibench` organizes and stores data is helpful for direct interaction or for adding new datasets.

- Directory Structure: Datasets are stored under a root directory (default: `~/.identibench_data`, configurable via the `IDENTIBENCH_DATA_ROOT` environment variable). The structure follows `DATA_ROOT / [dataset_id] / [subset] / [experiment_file.hdf5]`.
- Subsets: Standard subset names are `train`, `valid`, and `test`. An optional `train_valid` directory may contain combined data.
- Download & Cache: Data is downloaded automatically when a benchmark requires it and cached locally to avoid re-downloads. The `identibench.datasets.download_all_datasets` function can fetch all datasets at once.
- File Format: Processed time-series data is stored in the HDF5 (`.hdf5`) format.
- HDF5 Structure (see the reading sketch after this list):
  - Each `.hdf5` file typically represents one experimental run.
  - Signals (inputs, outputs, states) are stored as separate 1-dimensional datasets within the file, named conventionally as `u0`, `u1`, ..., `y0`, `y1`, ..., `x0`, ...
  - Data is usually stored as `float32` NumPy arrays.
  - Metadata such as the sampling frequency (`fs`) and the suggested initialization window size (`init_sz`) are stored as attributes on the root group of the HDF5 file.
  - Example structure:

        my_dataset/
        └── train/
            └── train_run_1.hdf5
                ├── u0 (Dataset: shape=(N,), dtype=float32)
                ├── y0 (Dataset: shape=(N,), dtype=float32)
                └── Attributes:
                    └── fs (Attribute: float)

- Extensibility: Adhering to this HDF5 format ensures compatibility when adding new dataset loaders. Helper functions like `identibench.utils.write_array` facilitate creating files in the correct format.
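As a brief illustration of this layout, the following sketch reads one processed file with `h5py`. The dataset id and file name are taken from the example structure above and are hypothetical; only the signal names (`u0`, `y0`) and the `fs` attribute follow the convention just described.

```python
from pathlib import Path
import h5py

data_root = Path.home() / '.identibench_data'  # default data root
example_file = data_root / 'my_dataset' / 'train' / 'train_run_1.hdf5'  # hypothetical path

with h5py.File(example_file, 'r') as f:
    u0 = f['u0'][:]        # 1-D float32 input signal
    y0 = f['y0'][:]        # 1-D float32 output signal
    fs = f.attrs['fs']     # sampling frequency stored on the root group
    print(u0.shape, y0.shape, fs)
```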
## Understanding Benchmark Results

The `run_benchmark` function returns a dictionary containing detailed results of the experiment. Key entries include (a short access sketch follows this list):

- `benchmark_name` (`str`): The unique name of the benchmark specification used.
- `dataset_id` (`str`): Identifier for the dataset source.
- `hyperparameters` (`dict`): The hyperparameters dictionary passed to the run.
- `seed` (`int`): The random seed used for the run.
- `training_time_seconds` (`float`): Wall-clock time spent inside your `build_model` function.
- `test_time_seconds` (`float`): Wall-clock time spent evaluating the returned predictor on the test set.
- `benchmark_type` (`str`): The type of benchmark run (e.g., `'BenchmarkSpecSimulation'`).
- `metric_name` (`str`): The name of the primary metric function defined in the spec.
- `metric_score` (`float`): The calculated score for the primary metric on the test set (aggregated if there are multiple test files).
- `custom_scores` (`dict`): Any additional scores calculated by custom evaluation logic specific to the benchmark.
- `model_predictions` (`list`): A list containing the raw outputs. For simulation, it is typically `[(y_pred_test1, y_true_test1), (y_pred_test2, y_true_test2), ...]`. For prediction, the structure may be nested to reflect windowed predictions.
. For prediction, the structure might be nested reflecting windowed predictions.