identibench logo

IdentiBench


IdentiBench is a Python library designed to streamline and standardize the benchmarking of system identification models. Evaluating and comparing dynamic models often requires repetitive setup for data handling, evaluation protocols, and metrics implementation, making fair comparisons and reproducing results challenging. IdentiBench tackles this by offering a collection of pre-defined benchmark specifications for simulation and prediction tasks, built upon common datasets. It automates data downloading and processing into a consistent format and provides standard evaluation metrics via a simple interface (run_benchmark). This allows you to focus your efforts on developing innovative models, while relying on IdentiBench for robust and reproducible evaluation.

Key Features

  • Access Many Benchmarks from different systems: Instantly utilize pre-configured benchmarks covering diverse domains like electronics (Silverbox), mechanics (Industrial Robot), process control (Cascaded Tanks), aerospace (Quadrotors), and more, available for both simulation and prediction tasks.
  • Automate Data Management: Forget manual downloading and processing; the library handles fetching data from various sources (web, Drive, Dataverse), extracting archives (ZIP, RAR), converting raw formats such as MAT and BAG files to a standard HDF5 format, and caching locally.
  • Integrate Any Model to evaluate on all benchmarks: Plug in your custom models, regardless of the Python framework used (NumPy, SciPy, PyTorch, TensorFlow, JAX, etc.), using a straightforward function interface (build_model) that receives all necessary context.
  • Capture Comprehensive Results: Obtain detailed evaluation reports including standard metrics (RMSE, NRMSE, FIT%, etc.), task-specific scores, execution timings, configuration parameters (hyperparameters, seed), and raw model predictions for thorough analysis.
  • Easily Define New Benchmarks: Go beyond the included datasets by creating your own benchmark specifications (BenchmarkSpecSimulation, BenchmarkSpecPrediction) for private data or unique tasks, leveraging the library’s structure and transparent data format.

Installation

You can install identibench using pip:

pip install identibench

To install the latest development version directly from GitHub, use:

pip install git+https://github.com/daniel-om-weber/identibench.git

Basic Usage

import identibench as idb
from pathlib import Path

# Example: Download a single dataset
# Note: Always use a Path object, not a string
save_path = Path('./tmp/wh')
idb.datasets.workshop.dl_wiener_hammerstein(save_path)
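
# Example: build and evaluate a FROLS model from sysidentpy
# (requires the sysidentpy package: pip install sysidentpy)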
from sysidentpy.model_structure_selection import FROLS
from sysidentpy.parameter_estimation import LeastSquares
def build_frols_model(context):
    u_train, y_train, _ = next(context.get_train_sequences())
    
    ylag = context.hyperparameters.get('ylag', 5)
    xlag = context.hyperparameters.get('xlag', 5)
    n_terms = context.hyperparameters.get('n_terms', 10)
    estimator = context.hyperparameters.get('estimator', LeastSquares())

    _model = FROLS(xlag=xlag, ylag=ylag, n_terms=n_terms, estimator=estimator)
    _model.fit(X=u_train, y=y_train)

    def model(u_test, y_init):
        nonlocal _model
        yhat_full = _model.predict(X=u_test, y=y_init[:_model.max_lag])
        y_pred = yhat_full[_model.max_lag:]
        return y_pred
    
    return model
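
# The hyperparameters below are forwarded to build_frols_model via context.hyperparameters.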
hyperparams = {
    'ylag': 2,
    'xlag': 2,
    'n_terms': 10, # Number of terms for FROLS
    'estimator': LeastSquares()
}

results = idb.run_benchmark(
    spec=idb.BenchmarkWH_Simulation,
    build_model=build_frols_model,
    hyperparameters=hyperparams
)
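
run_benchmark returns a dictionary of scores and metadata (see "Understanding Benchmark Results" below); for instance, the primary test metric can be printed with:

print(results['metric_name'], results['metric_score'])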

Simulation Benchmarks

Key                 Benchmark Name
WH_Sim              BenchmarkWH_Simulation
Silverbox_Sim       BenchmarkSilverbox_Simulation
Tanks_Sim           BenchmarkCascadedTanks_Simulation
CED_Sim             BenchmarkCED_Simulation
EMPS_Sim            BenchmarkEMPS_Simulation
NoisyWH_Sim         BenchmarkNoisyWH_Simulation
RobotForward_Sim    BenchmarkRobotForward_Simulation
RobotInverse_Sim    BenchmarkRobotInverse_Simulation
Ship_Sim            BenchmarkShip_Simulation
QuadPelican_Sim     BenchmarkQuadPelican_Simulation
QuadPi_Sim          BenchmarkQuadPi_Simulation
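
The keys in the left column index the idb.simulation_benchmarks dictionary used in "Running Multiple Benchmarks" below, so a specification can be referenced either by attribute or by key:

spec = idb.BenchmarkWH_Simulation           # direct reference
spec = idb.simulation_benchmarks['WH_Sim']  # lookup by key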

Prediction Benchmarks

Key                 Benchmark Name
WH_Pred             BenchmarkWH_Prediction
Silverbox_Pred      BenchmarkSilverbox_Prediction
Tanks_Pred          BenchmarkCascadedTanks_Prediction
CED_Pred            BenchmarkCED_Prediction
EMPS_Pred           BenchmarkEMPS_Prediction
NoisyWH_Pred        BenchmarkNoisyWH_Prediction
RobotForward_Pred   BenchmarkRobotForward_Prediction
RobotInverse_Pred   BenchmarkRobotInverse_Prediction
Ship_Pred           BenchmarkShip_Prediction
QuadPelican_Pred    BenchmarkQuadPelican_Prediction
QuadPi_Pred         BenchmarkQuadPi_Prediction

Workflow Details

This section provides more detail on the core concepts and components of the identibench workflow.

Benchmark Types

identibench defines two main types of benchmark tasks, specified using different classes:

  • Simulation (BenchmarkSpecSimulation):
    • Goal: Evaluate a model’s ability to perform a free-run simulation, predicting the system’s output over an extended period given the input sequence.
    • Typical Input to Predictor: The full input sequence (u_test) and potentially an initial segment of the output sequence (y_test[:init_window]) for warm-up or state initialization.
    • Expected Output from Predictor: The predicted output sequence (y_pred) corresponding to the input, usually excluding the warm-up period.
    • Use Case: Assessing models intended for long-term prediction, control simulation, or understanding overall system dynamics.
  • Prediction (BenchmarkSpecPrediction):
    • Goal: Evaluate a model’s ability to predict the system’s output k steps into the future based on recent past data.
    • Typical Input to Predictor: Often involves windows of past inputs and outputs (e.g., u[t:t+H], y[t:t+H]).
    • Expected Output from Predictor: The predicted output at a specific future time step (e.g., y[t+H+k]). The pred_horizon parameter defines ‘k’, and pred_step defines how frequently predictions are made.
    • Use Case: Evaluating models focused on short-to-medium term forecasting, state estimation, or receding horizon control.
  • init_window: Both benchmark types often use an init_window. This specifies an initial number of time steps whose data might be provided to the model for initialization or warm-up. Importantly, data within this window is typically excluded from the final performance metric calculation to ensure a fair evaluation of the model’s predictive capabilities beyond the initial transient.

Model Interface (build_model)

The core of integrating your custom logic is the build_model function you provide to run_benchmark.

  • Purpose: This function is responsible for defining your model architecture, training it using the provided data, and returning a callable predictor function.
  • Input (context: TrainingContext): Your build_model function receives a single argument, context, which is a TrainingContext object. This object gives you access to:
    • context.spec: The full specification of the current benchmark being run (including dataset paths, input/output columns, init_window, etc.).
    • context.hyperparameters: A dictionary containing any hyperparameters you passed to run_benchmark. Use this to configure your model or training process.
    • context.seed: A random seed for ensuring reproducibility.
    • Data Access Methods: Functions like context.get_train_sequences() and context.get_valid_sequences() provide iterators over the raw, full-length training and validation data sequences (as tuples of NumPy arrays (u, y, x)). Note: You need to handle any batching or windowing required for your specific training algorithm within your build_model function.
  • Output (Predictor Callable): build_model must return a callable object (e.g., a function, an object’s method) that represents your trained model ready for prediction/simulation. This returned callable will be used internally by run_benchmark on the test set. Its expected signature depends on the benchmark type, but typically it accepts NumPy arrays for test inputs (and potentially initial outputs) and returns a NumPy array containing the predictions. A minimal sketch of this contract is shown below.
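
For orientation, here is a minimal, library-agnostic sketch of this contract: a toy baseline that simply memorizes the mean training output. The predictor signature matches the simulation example above; the exact output length expected is benchmark-specific, so treat the shape handling as an assumption:

import numpy as np

def build_mean_model(context):
    # context.spec, context.hyperparameters, and context.seed are available here;
    # this toy model ignores them and only memorizes the mean training output.
    y_train = np.concatenate([y for _, y, _ in context.get_train_sequences()])
    y_mean = y_train.mean(axis=0)

    def predictor(u_test, y_init):
        # Predict the training mean for every step after the warm-up window
        # (assumed length convention; adapt to the benchmark spec as needed).
        n_steps = len(u_test) - len(y_init)
        return np.broadcast_to(y_mean, (n_steps,) + np.shape(y_mean)).copy()

    return predictor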

Running Multiple Benchmarks

To evaluate a model across several benchmarks efficiently, use the run_benchmarks function:

# Example: Run on a subset of benchmarks
specs_to_run = {
    'WH_Sim': idb.simulation_benchmarks['WH_Sim'],
    'Silverbox_Sim': idb.simulation_benchmarks['Silverbox_Sim']
}

# Reuse the build function defined in the basic usage example above
all_results = idb.run_benchmarks(specs_to_run, build_model=build_frols_model, n_times=3)

all_results
--- Starting benchmark run for 2 specifications, repeating each 3 times ---

-- Repetition 1/3 --

[1/6] Running: BenchmarkWH_Simulation (Rep 1)
  -> Success: BenchmarkWH_Simulation (Rep 1) completed.

[2/6] Running: BenchmarkSilverbox_Simulation (Rep 1)
  -> Success: BenchmarkSilverbox_Simulation (Rep 1) completed.

-- Repetition 2/3 --

[3/6] Running: BenchmarkWH_Simulation (Rep 2)
  -> Success: BenchmarkWH_Simulation (Rep 2) completed.

[4/6] Running: BenchmarkSilverbox_Simulation (Rep 2)
  -> Success: BenchmarkSilverbox_Simulation (Rep 2) completed.

-- Repetition 3/3 --

[5/6] Running: BenchmarkWH_Simulation (Rep 3)
  -> Success: BenchmarkWH_Simulation (Rep 3) completed.

[6/6] Running: BenchmarkSilverbox_Simulation (Rep 3)
  -> Success: BenchmarkSilverbox_Simulation (Rep 3) completed.

--- Benchmark run finished. 6/6 individual runs completed successfully. ---
benchmark_name dataset_id hyperparameters seed training_time_seconds test_time_seconds benchmark_type metric_name metric_score cs_multisine_rmse cs_arrow_full_rmse cs_arrow_no_extrapolation_rmse
0 BenchmarkWH_Simulation wh {} 2406651230 4.944649 1.012850 BenchmarkSpecSimulation rmse_mV 42.161572 NaN NaN NaN
1 BenchmarkSilverbox_Simulation silverbox {} 3813113752 2.839149 1.246224 BenchmarkSpecSimulation rmse_mV 10.732386 8.501941 16.154317 7.5409
2 BenchmarkWH_Simulation wh {} 1950649438 4.801520 1.034119 BenchmarkSpecSimulation rmse_mV 42.161572 NaN NaN NaN
3 BenchmarkSilverbox_Simulation silverbox {} 1560698088 2.880391 1.217932 BenchmarkSpecSimulation rmse_mV 10.732386 8.501941 16.154317 7.5409
4 BenchmarkWH_Simulation wh {} 3258007268 4.916941 1.021927 BenchmarkSpecSimulation rmse_mV 42.161572 NaN NaN NaN
5 BenchmarkSilverbox_Simulation silverbox {} 4194043971 2.937101 1.231710 BenchmarkSpecSimulation rmse_mV 10.732386 8.501941 16.154317 7.5409

This function iterates through the provided list or dictionary of benchmark specifications, calling run_benchmark for each one (repeating each spec n_times, as above) with the same build_model function and hyperparameters, and collects the individual results.

# Calculate mean and std of the results
idb.aggregate_benchmark_results(all_results, agg_funcs=['mean', 'std'])
                               training_time_seconds   test_time_seconds        metric_score   cs_multisine_rmse   cs_arrow_full_rmse   cs_arrow_no_extrapolation_rmse
                                   mean       std         mean       std        mean      std       mean      std        mean       std          mean       std
benchmark_name
BenchmarkSilverbox_Simulation  2.885547  0.049179     1.231955  0.014147   10.732386      0.0   8.501941      0.0   16.154317       0.0        7.5409       0.0
BenchmarkWH_Simulation         4.887703  0.075912     1.022966  0.010673   42.161572      0.0        NaN      NaN         NaN       NaN           NaN       NaN

Data Handling & Format

Understanding how identibench organizes and stores data is helpful for direct interaction or adding new datasets.

  • Directory Structure: Datasets are stored under a root directory (default: ~/.identibench_data, configurable via the IDENTIBENCH_DATA_ROOT environment variable). The structure follows: DATA_ROOT / [dataset_id] / [subset] / [experiment_file.hdf5].
  • Subsets: Standard subset names are train, valid, and test. An optional train_valid directory might contain combined data.
  • Download & Cache: Data is downloaded automatically when a benchmark requires it and cached locally to avoid re-downloads. The identibench.datasets.download_all_datasets function can fetch all datasets at once.
  • File Format: Processed time-series data is stored in the HDF5 (.hdf5) format.
  • HDF5 Structure:
    • Each .hdf5 file typically represents one experimental run.
    • Signals (inputs, outputs, states) are stored as separate 1-dimensional datasets within the file, named conventionally as u0, u1, …, y0, y1, …, x0, …
    • Data is usually stored as float32 NumPy arrays.
    • Metadata like sampling frequency (fs) and suggested initialization window size (init_sz) are stored as attributes on the root group of the HDF5 file.
    • Example Structure:
      my_dataset/
      └── train/
          └── train_run_1.hdf5
              ├── u0 (Dataset: shape=(N,), dtype=float32)
              ├── y0 (Dataset: shape=(N,), dtype=float32)
              └── Attributes:
                  └── fs (Attribute: float)
  • Extensibility: Adhering to this HDF5 format ensures compatibility when adding new dataset loaders. Helper functions like identibench.utils.write_array facilitate creating files in the correct format; a minimal h5py sketch is shown below.
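
As an illustration (not the library’s own loader code), a file matching this layout can be written directly with h5py; the path, signal values, and attribute values below are placeholders:

import h5py
import numpy as np
from pathlib import Path

# Placeholder signals: one input and one output channel of length N.
N = 1000
u0 = np.random.randn(N).astype(np.float32)
y0 = np.random.randn(N).astype(np.float32)

file_path = Path('my_dataset/train/train_run_1.hdf5')
file_path.parent.mkdir(parents=True, exist_ok=True)

with h5py.File(file_path, 'w') as f:
    f.create_dataset('u0', data=u0)
    f.create_dataset('y0', data=y0)
    f.attrs['fs'] = 100.0    # sampling frequency in Hz (placeholder)
    f.attrs['init_sz'] = 50  # suggested initialization window (placeholder)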

Understanding Benchmark Results

The run_benchmark function returns a dictionary containing detailed results of the experiment. Key entries include:

  • benchmark_name (str): The unique name of the benchmark specification used.
  • dataset_id (str): Identifier for the dataset source.
  • hyperparameters (dict): The hyperparameters dictionary passed to the run.
  • seed (int): The random seed used for the run.
  • training_time_seconds (float): Wall-clock time spent inside your build_model function.
  • test_time_seconds (float): Wall-clock time spent evaluating the returned predictor on the test set.
  • benchmark_type (str): The type of benchmark run (e.g., 'BenchmarkSpecSimulation').
  • metric_name (str): The name of the primary metric function defined in the spec.
  • metric_score (float): The calculated score for the primary metric on the test set (aggregated if multiple test files).
  • custom_scores (dict): Any additional scores calculated by custom evaluation logic specific to the benchmark.
  • model_predictions (list): A list containing the raw outputs. For simulation, it’s typically [(y_pred_test1, y_true_test1), (y_pred_test2, y_true_test2), ...], as in the example below. For prediction, the structure might be nested, reflecting windowed predictions.
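
For example, for a simulation benchmark run such as the one in the basic usage example, the custom scores and raw predictions can be inspected as follows (a sketch assuming the tuple layout described above for simulation tasks):

import numpy as np

print(results['custom_scores'])                      # benchmark-specific extra scores, if any
for y_pred, y_true in results['model_predictions']:  # one (prediction, ground truth) pair per test file
    print('per-file RMSE:', np.sqrt(np.mean((y_pred - y_true) ** 2)))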