Benchmark

Core definitions for specifying and running system identification benchmarks: benchmark specifications for simulation and k-step prediction tasks, the training context handed to user code, and helpers to run benchmarks and aggregate their results.

Benchmark Specifications


source

BenchmarkSpecSimulation

 BenchmarkSpecSimulation (name:str, dataset_id:str, u_cols:list[str],
                          y_cols:list[str],
                          metric_func:collections.abc.Callable[[numpy.ndarray, numpy.ndarray], float],
                          x_cols:list[str]|None=None,
                          sampling_time:float|None=None,
                          download_func:collections.abc.Callable[[pathlib.Path, bool], None]|None=None,
                          test_model_func:collections.abc.Callable[[__main__.BenchmarkSpecBase, collections.abc.Callable], dict[str, typing.Any]]=<function _test_simulation>,
                          custom_test_evaluation=None,
                          init_window:int|None=None,
                          data_root:pathlib.Path|collections.abc.Callable[[], pathlib.Path]=<function get_default_data_root>)

*Specification for a simulation benchmark task.

Inherits common parameters from BaseBenchmarkSpec. Use this when the goal is to simulate the system’s output given the input u.*

| | Type | Default | Details |
|---|---|---|---|
| name | str | | Unique name identifying this benchmark task. |
| dataset_id | str | | Identifier for the raw dataset source. |
| u_cols | list | | List of column names for input signals (u). |
| y_cols | list | | List of column names for output signals (y). |
| metric_func | Callable | | Primary metric: func(y_true, y_pred). |
| x_cols | list[str] \| None | None | Optional state inputs (x). |
| sampling_time | float \| None | None | Optional sampling time (seconds). |
| download_func | collections.abc.Callable[[pathlib.Path, bool], None] \| None | None | Dataset preparation func. |
| test_model_func | Callable | _test_simulation | |
| custom_test_evaluation | NoneType | None | |
| init_window | int \| None | None | Steps for warm-up, potentially ignored in evaluation. |
| data_root | pathlib.Path \| collections.abc.Callable[[], pathlib.Path] | get_default_data_root | Root dir for dataset, may be a callable or path. |
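Any callable matching the documented signature `func(y_true, y_pred) -> float` can serve as `metric_func`. As a hedged illustration (a mean absolute error instead of the built-in `identibench.metrics.rmse`), a custom metric might look like this:

```python
import numpy as np

def mean_abs_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Illustrative custom metric matching the documented func(y_true, y_pred) contract;
    # the library's own metrics live in identibench.metrics.
    return float(np.mean(np.abs(y_true - y_pred)))
```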

source

BenchmarkSpecPrediction

 BenchmarkSpecPrediction (name:str, dataset_id:str, u_cols:list[str],
                          y_cols:list[str],
                          metric_func:collections.abc.Callable[[numpy.ndarray, numpy.ndarray], float],
                          pred_horizon:int, pred_step:int,
                          x_cols:list[str]|None=None,
                          sampling_time:float|None=None,
                          download_func:collections.abc.Callable[[pathlib.Path, bool], None]|None=None,
                          test_model_func:collections.abc.Callable[[__main__.BenchmarkSpecBase, collections.abc.Callable], dict[str, typing.Any]]=<function _test_prediction>,
                          custom_test_evaluation=None,
                          init_window:int|None=None,
                          data_root:pathlib.Path|collections.abc.Callable[[], pathlib.Path]=<function get_default_data_root>)

*Specification for a k-step ahead prediction benchmark task.

Inherits common parameters from BaseBenchmarkSpec and adds prediction-specific ones. Use this when the goal is to predict y some steps ahead based on past u and y.*

| | Type | Default | Details |
|---|---|---|---|
| name | str | | Unique name identifying this benchmark task. |
| dataset_id | str | | Identifier for the raw dataset source. |
| u_cols | list | | List of column names for input signals (u). |
| y_cols | list | | List of column names for output signals (y). |
| metric_func | Callable | | Primary metric: func(y_true, y_pred). |
| pred_horizon | int | | The 'k' in k-step ahead prediction (mandatory for this type). |
| pred_step | int | | Step size for k-step ahead prediction (e.g., predict y[t+k] using data up to t). |
| x_cols | list[str] \| None | None | Optional state inputs (x). |
| sampling_time | float \| None | None | Optional sampling time (seconds). |
| download_func | collections.abc.Callable[[pathlib.Path, bool], None] \| None | None | Dataset preparation func. |
| test_model_func | Callable | _test_prediction | |
| custom_test_evaluation | NoneType | None | |
| init_window | int \| None | None | Steps for warm-up, potentially ignored in evaluation. |
| data_root | pathlib.Path \| collections.abc.Callable[[], pathlib.Path] | get_default_data_root | Root dir for dataset, may be a callable or path. |
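To make the roles of `pred_horizon` and `pred_step` concrete, the sketch below slices one output sequence into k-step-ahead targets under one plausible reading: at each start index t, advanced by `pred_step`, the model must predict the next `pred_horizon` samples from data up to t. The windowing actually used at evaluation time is handled internally by `_test_prediction`; this is only an illustration.

```python
import numpy as np

# Hypothetical illustration of how pred_horizon (k) and pred_step could
# carve a single output channel into k-step-ahead prediction targets.
y = np.arange(20.0)              # one output channel, 20 samples
pred_horizon, pred_step = 5, 2

targets = [
    y[t + 1 : t + 1 + pred_horizon]   # y[t+1] .. y[t+pred_horizon]
    for t in range(0, len(y) - pred_horizon, pred_step)
]
print(len(targets), targets[0])       # -> 8 [1. 2. 3. 4. 5.]
```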
# Test: BenchmarkSpec basic initialization and defaults
_spec_sim = BenchmarkSpecSimulation(
    name='_spec_default', dataset_id='_dummy_default',
    u_cols=['u0'], y_cols=['y0'], metric_func=identibench.metrics.rmse, 
    download_func=_dummy_dataset_loader
)
test_eq(_spec_sim.init_window, None)
test_eq(_spec_sim.name, '_spec_default')
# Test: BenchmarkSpec initialization with prediction-related parameters
_spec_pred = BenchmarkSpecPrediction(
    name='_spec_pred_params', dataset_id='_dummy_pred_params',
    u_cols=['u0'], y_cols=['y0'], metric_func=identibench.metrics.rmse, 
    download_func=_dummy_dataset_loader, 
    init_window=20, pred_horizon=5, pred_step=2
)
test_eq(_spec_pred.init_window, 20)
test_eq(_spec_pred.pred_horizon, 5)
test_eq(_spec_pred.pred_step, 2)
# Test: BenchmarkSpec ensure_dataset_exists - first call (creation)
_spec_ensure = BenchmarkSpecSimulation(
    name='_spec_ensure', dataset_id='_dummy_ensure',
    u_cols=['u0'], y_cols=['y0'], metric_func=identibench.metrics.rmse, 
    download_func=_dummy_dataset_loader
)
_spec_ensure.ensure_dataset_exists()
_dataset_path_ensure = _spec_ensure.dataset_path
test_eq(_dataset_path_ensure.is_dir(), True)
test_eq((_dataset_path_ensure / 'train' / 'train_0.hdf5').is_file(), True)
# Test: BenchmarkSpec ensure_dataset_exists - second call (skip)
_mtime_before_skip = (_dataset_path_ensure / 'train' / 'train_0.hdf5').stat().st_mtime
time.sleep(0.1) 
_spec_ensure.ensure_dataset_exists() 
_mtime_after_skip = (_dataset_path_ensure / 'train' / 'train_0.hdf5').stat().st_mtime
test_eq(_mtime_before_skip, _mtime_after_skip)
# Test: BenchmarkSpec ensure_dataset_exists - third call (force_download=True)
_mtime_before_force = (_dataset_path_ensure / 'train' / 'train_0.hdf5').stat().st_mtime
time.sleep(0.1) 
_spec_ensure.ensure_dataset_exists(force_download=True) 
_mtime_after_force = (_dataset_path_ensure / 'train' / 'train_0.hdf5').stat().st_mtime
test_ne(_mtime_before_force, _mtime_after_force)
Preparing dataset for '_spec_ensure' at /Users/daniel/.identibench_data/_dummy_ensure...
Dataset '_spec_ensure' prepared successfully.
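A `download_func` must match `Callable[[pathlib.Path, bool], None]`: it receives the target dataset directory and a force flag, and is expected to materialize the dataset files there (the tests above look for `train/train_0.hdf5`). A minimal sketch, assuming h5py is available and a flat HDF5 layout with one dataset per column; the file structure shown here is an assumption for illustration, not the library's required format:

```python
from pathlib import Path
import h5py
import numpy as np

def _example_dataset_writer(target_dir: Path, force_download: bool) -> None:
    # Hypothetical sketch of a download_func: write one HDF5 training file
    # containing the u/y columns the spec expects. A real loader would
    # download and convert the raw dataset instead of generating random data.
    train_dir = target_dir / 'train'
    train_dir.mkdir(parents=True, exist_ok=True)
    with h5py.File(train_dir / 'train_0.hdf5', 'w') as f:
        f.create_dataset('u0', data=np.random.randn(1000))
        f.create_dataset('y0', data=np.random.randn(1000))
```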

Training Context


source

TrainingContext

 TrainingContext (spec:__main__.BenchmarkSpecBase,
                  hyperparameters:dict[str,typing.Any],
                  seed:int|None=None)

*Context object passed to the user’s training function (build_predictor).

Holds the benchmark specification, hyperparameters, and seed. Provides methods to access the raw, full-length training and validation data sequences. Windowing/batching for training must be handled within the user’s build_predictor function.*

| | Type | Default | Details |
|---|---|---|---|
| spec | BenchmarkSpecBase | | The benchmark specification. |
| hyperparameters | dict | | User-provided dictionary containing model and training hyperparameters. |
| seed | int \| None | None | Optional random seed for reproducibility. |
#todo: test
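The runtime functions below call the user's `build_model` function with a `TrainingContext` and expect a predictor callable back. A minimal sketch follows; `ctx.spec`, `ctx.hyperparameters`, and `ctx.seed` are documented fields, but the data-access method names and the predictor's calling convention are assumptions made only for illustration.

```python
import numpy as np

def build_constant_predictor(ctx):
    # Hedged sketch of a build_model function. The context documents that it
    # provides methods to access the raw training/validation sequences, but
    # their names are not shown in this section, so fitting is left as a comment.
    # train_data = ctx.get_train_sequences()   # hypothetical accessor name
    n_outputs = len(ctx.spec.y_cols)

    def predictor(u):
        # Assumed convention: input array in, prediction array out,
        # one row per input sample and one column per output channel.
        return np.zeros((len(u), n_outputs))

    return predictor
```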

Benchmark Runtime


source

run_benchmark

 run_benchmark (spec, build_model, hyperparameters={}, seed=None)
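Judging by the examples below, run_benchmark prepares the dataset for the given spec, draws a random seed when none is supplied, builds the predictor by calling build_model with a TrainingContext, evaluates it via the spec's test_model_func, and returns a result dictionary containing the benchmark name, dataset id, hyperparameters, seed, training and test wall-clock times, benchmark type, primary metric name and score, and any custom scores.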
# Example usage of run_benchmark
hyperparams = {'learning_rate': 0.01, 'epochs': 5} # Example hyperparameters

benchmark_results = run_benchmark(
    spec=_spec_sim, 
    build_model=_dummy_build_model,
    hyperparameters=hyperparams
)
Building model with spec: _spec_default, seed: 138830228
{'benchmark_name': '_spec_default',
 'dataset_id': '_dummy_default',
 'hyperparameters': {'learning_rate': 0.01, 'epochs': 5},
 'seed': 138830228,
 'training_time_seconds': 4.279200220480561e-05,
 'test_time_seconds': 0.0013009580434300005,
 'benchmark_type': 'BenchmarkSpecSimulation',
 'metric_name': 'rmse',
 'metric_score': 0.5644842382745956,
 'custom_scores': {}}
# Example usage of run_benchmark
benchmark_results = run_benchmark(
    spec=_spec_pred, 
    build_model=_dummy_build_model,
    hyperparameters=hyperparams
)
Building model with spec: _spec_pred_params, seed: 3900254360
{'benchmark_name': '_spec_pred_params',
 'dataset_id': '_dummy_pred_params',
 'hyperparameters': {'learning_rate': 0.01, 'epochs': 5},
 'seed': 3900254360,
 'training_time_seconds': 6.71250163577497e-05,
 'test_time_seconds': 0.0010067080147564411,
 'benchmark_type': 'BenchmarkSpecPrediction',
 'metric_name': 'rmse',
 'metric_score': 0.5594019958882623,
 'custom_scores': {}}
def custom_evaluation(results, spec):
    def get_max_abs_error(y_pred, y_test):
        return np.max(np.abs(y_test - y_pred))
    def get_max_error(y_pred, y_test):
        return np.max(y_test - y_pred)

    avg_max_abs_error = aggregate_metric_score(
        results, get_max_abs_error, score_name='avg_max_abs_error',
        sequence_aggregation_func=np.mean, window_aggregation_func=np.mean)
    # Note: this second score aggregates the signed maximum error (get_max_error)
    # but is reported under the name 'median_max_abs_error'.
    median_max_error = aggregate_metric_score(
        results, get_max_error, score_name='median_max_abs_error',
        sequence_aggregation_func=np.median, window_aggregation_func=np.median)
    return {**avg_max_abs_error, **median_max_error}
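As the definition above suggests, a custom_test_evaluation callable receives the test results and the spec and returns a dictionary mapping score names to values; these appear under custom_scores in the result dictionary and, when results are converted to a DataFrame below, as columns prefixed with cs_.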
spec_with_custom_test = BenchmarkSpecSimulation(
    name="CustomTestExampleBench",
    dataset_id="dummy_core_data_v1", # Same dataset ID as before
    download_func=_dummy_dataset_loader, 
    u_cols=['u0', 'u1'], 
    y_cols=['y0'],
    custom_test_evaluation=custom_evaluation,
    metric_func=identibench.metrics.rmse
)
# Run benchmark using the spec with the custom test function
hyperparams = {'model_type': 'dummy_v2'} 

benchmark_results = run_benchmark(
    spec=spec_with_custom_test, 
    build_model=_dummy_build_model,
    hyperparameters=hyperparams
)
Building model with spec: CustomTestExampleBench, seed: 1172241199
{'benchmark_name': 'CustomTestExampleBench',
 'dataset_id': 'dummy_core_data_v1',
 'hyperparameters': {'model_type': 'dummy_v2'},
 'seed': 1172241199,
 'training_time_seconds': 2.1415995433926582e-05,
 'test_time_seconds': 0.0015841670101508498,
 'benchmark_type': 'BenchmarkSpecSimulation',
 'metric_name': 'rmse',
 'metric_score': 0.5739597924041242,
 'custom_scores': {'avg_max_abs_error': 0.9934645593166351,
  'median_max_abs_error': 0.9934645593166351}}

source

benchmark_results_to_dataframe

 benchmark_results_to_dataframe (results_list:list[dict[str,typing.Any]])

Transforms a list of benchmark result dictionaries into a pandas DataFrame.

| | Type | Details |
|---|---|---|
| results_list | list | List of benchmark result dictionaries from run_benchmark. |
| Returns | DataFrame | |

source

run_benchmarks

 run_benchmarks (specs:list[__main__.BenchmarkSpecBase]|dict[str, __main__.BenchmarkSpecBase],
                 build_model:collections.abc.Callable[[__main__.TrainingContext], collections.abc.Callable],
                 hyperparameters:dict[str, typing.Any]|list[dict[str, typing.Any]]|None=None,
                 n_times:int=1, continue_on_error:bool=True,
                 return_dataframe:bool=True)

*Runs multiple benchmarks sequentially, with repetitions and flexible hyperparameters.

Returns either a pandas DataFrame summarizing the results (default) or a list of raw result dictionaries.*

| | Type | Default | Details |
|---|---|---|---|
| specs | list[BenchmarkSpecBase] \| dict[str, BenchmarkSpecBase] | | Collection of specs to run. |
| build_model | Callable | | User function to build the model/predictor. |
| hyperparameters | dict[str, typing.Any] \| list[dict[str, typing.Any]] \| None | None | Single dict, list of dicts (matching specs), or None. |
| n_times | int | 1 | Number of times to repeat each benchmark specification. |
| continue_on_error | bool | True | If True, continue running benchmarks even if one fails. |
| return_dataframe | bool | True | If True, return results as a pandas DataFrame, otherwise return a list of dicts. |
| Returns | pandas.core.frame.DataFrame \| list[dict[str, typing.Any]] | | |
benchmark_results = run_benchmarks(
    specs=[_spec_sim,_spec_pred,spec_with_custom_test], 
    build_model=_dummy_build_model,
    return_dataframe=False
)
benchmark_results_to_dataframe(benchmark_results)
--- Starting benchmark run for 3 specifications, repeating each 1 times ---

-- Repetition 1/1 --

[1/3] Running: _spec_default (Rep 1)
Building model with spec: _spec_default, seed: 2979218856
  -> Success: _spec_default (Rep 1) completed.

[2/3] Running: _spec_pred_params (Rep 1)
Building model with spec: _spec_pred_params, seed: 2767908549
  -> Success: _spec_pred_params (Rep 1) completed.

[3/3] Running: CustomTestExampleBench (Rep 1)
Building model with spec: CustomTestExampleBench, seed: 3139743514
  -> Success: CustomTestExampleBench (Rep 1) completed.

--- Benchmark run finished. 3/3 individual runs completed successfully. ---
| | benchmark_name | dataset_id | hyperparameters | seed | training_time_seconds | test_time_seconds | benchmark_type | metric_name | metric_score | cs_avg_max_abs_error | cs_median_max_abs_error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | _spec_default | _dummy_default | {} | 2979218856 | 0.000006 | 0.001325 | BenchmarkSpecSimulation | rmse | 0.564484 | NaN | NaN |
| 1 | _spec_pred_params | _dummy_pred_params | {} | 2767908549 | 0.000006 | 0.000844 | BenchmarkSpecPrediction | rmse | 0.559402 | NaN | NaN |
| 2 | CustomTestExampleBench | dummy_core_data_v1 | {} | 3139743514 | 0.000005 | 0.000521 | BenchmarkSpecSimulation | rmse | 0.573960 | 0.993465 | 0.993465 |
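The hyperparameters argument also accepts a list of dictionaries matching the order of specs, so each benchmark can use its own settings. A short sketch under that documented behaviour, reusing the specs and builder from above:

```python
# Sketch: one hyperparameter dict per spec, in matching order.
per_spec_hyperparams = [
    {'learning_rate': 0.01},    # for _spec_sim
    {'learning_rate': 0.05},    # for _spec_pred
    {'model_type': 'dummy_v2'}, # for spec_with_custom_test
]

results_per_spec = run_benchmarks(
    specs=[_spec_sim, _spec_pred, spec_with_custom_test],
    build_model=_dummy_build_model,
    hyperparameters=per_spec_hyperparams,
)
```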
results_multiple_runs = run_benchmarks(
    specs=[_spec_sim,_spec_pred,spec_with_custom_test], 
    build_model=_dummy_build_model,
    n_times=3
)
results_multiple_runs
--- Starting benchmark run for 3 specifications, repeating each 3 times ---

-- Repetition 1/3 --

[1/9] Running: _spec_default (Rep 1)
Building model with spec: _spec_default, seed: 30935737
  -> Success: _spec_default (Rep 1) completed.

[2/9] Running: _spec_pred_params (Rep 1)
Building model with spec: _spec_pred_params, seed: 2986847840
  -> Success: _spec_pred_params (Rep 1) completed.

[3/9] Running: CustomTestExampleBench (Rep 1)
Building model with spec: CustomTestExampleBench, seed: 1147267216
  -> Success: CustomTestExampleBench (Rep 1) completed.

-- Repetition 2/3 --

[4/9] Running: _spec_default (Rep 2)
Building model with spec: _spec_default, seed: 3191904871
  -> Success: _spec_default (Rep 2) completed.

[5/9] Running: _spec_pred_params (Rep 2)
Building model with spec: _spec_pred_params, seed: 1536587039
  -> Success: _spec_pred_params (Rep 2) completed.

[6/9] Running: CustomTestExampleBench (Rep 2)
Building model with spec: CustomTestExampleBench, seed: 3900899545
  -> Success: CustomTestExampleBench (Rep 2) completed.

-- Repetition 3/3 --

[7/9] Running: _spec_default (Rep 3)
Building model with spec: _spec_default, seed: 3797015292
  -> Success: _spec_default (Rep 3) completed.

[8/9] Running: _spec_pred_params (Rep 3)
Building model with spec: _spec_pred_params, seed: 3789263585
  -> Success: _spec_pred_params (Rep 3) completed.

[9/9] Running: CustomTestExampleBench (Rep 3)
Building model with spec: CustomTestExampleBench, seed: 851966748
  -> Success: CustomTestExampleBench (Rep 3) completed.

--- Benchmark run finished. 9/9 individual runs completed successfully. ---
| | benchmark_name | dataset_id | hyperparameters | seed | training_time_seconds | test_time_seconds | benchmark_type | metric_name | metric_score | cs_avg_max_abs_error | cs_median_max_abs_error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | _spec_default | _dummy_default | {} | 30935737 | 0.000009 | 0.001040 | BenchmarkSpecSimulation | rmse | 0.564484 | NaN | NaN |
| 1 | _spec_pred_params | _dummy_pred_params | {} | 2986847840 | 0.000004 | 0.000537 | BenchmarkSpecPrediction | rmse | 0.559402 | NaN | NaN |
| 2 | CustomTestExampleBench | dummy_core_data_v1 | {} | 1147267216 | 0.000004 | 0.000385 | BenchmarkSpecSimulation | rmse | 0.573960 | 0.993465 | 0.993465 |
| 3 | _spec_default | _dummy_default | {} | 3191904871 | 0.000003 | 0.000280 | BenchmarkSpecSimulation | rmse | 0.564484 | NaN | NaN |
| 4 | _spec_pred_params | _dummy_pred_params | {} | 1536587039 | 0.000003 | 0.000285 | BenchmarkSpecPrediction | rmse | 0.559402 | NaN | NaN |
| 5 | CustomTestExampleBench | dummy_core_data_v1 | {} | 3900899545 | 0.000003 | 0.000330 | BenchmarkSpecSimulation | rmse | 0.573960 | 0.993465 | 0.993465 |
| 6 | _spec_default | _dummy_default | {} | 3797015292 | 0.000003 | 0.000264 | BenchmarkSpecSimulation | rmse | 0.564484 | NaN | NaN |
| 7 | _spec_pred_params | _dummy_pred_params | {} | 3789263585 | 0.000003 | 0.000278 | BenchmarkSpecPrediction | rmse | 0.559402 | NaN | NaN |
| 8 | CustomTestExampleBench | dummy_core_data_v1 | {} | 851966748 | 0.000003 | 0.000531 | BenchmarkSpecSimulation | rmse | 0.573960 | 0.993465 | 0.993465 |

source

aggregate_benchmark_results

 aggregate_benchmark_results (results_df:pandas.core.frame.DataFrame,
                              group_by_cols:str|list[str]='benchmark_name',
                              agg_funcs:str|list[str]='mean')

Aggregates numeric results from a benchmark DataFrame, grouped by specified columns.

| | Type | Default | Details |
|---|---|---|---|
| results_df | DataFrame | | DataFrame returned by run_benchmarks (with return_dataframe=True). |
| group_by_cols | str \| list[str] | benchmark_name | Column(s) to group by before aggregation. |
| agg_funcs | str \| list[str] | mean | Aggregation function(s) ('mean', 'median', 'std', etc.) or list thereof. |
| Returns | DataFrame | | |
aggregate_benchmark_results(results_multiple_runs,agg_funcs=['mean','std'])
| benchmark_name | training_time_seconds (mean) | training_time_seconds (std) | test_time_seconds (mean) | test_time_seconds (std) | metric_score (mean) | metric_score (std) | cs_avg_max_abs_error (mean) | cs_avg_max_abs_error (std) | cs_median_max_abs_error (mean) | cs_median_max_abs_error (std) |
|---|---|---|---|---|---|---|---|---|---|---|
| CustomTestExampleBench | 0.000003 | 4.453506e-07 | 0.000415 | 0.000104 | 0.573960 | 0.0 | 0.993465 | 0.0 | 0.993465 | 0.0 |
| _spec_default | 0.000005 | 3.395723e-06 | 0.000528 | 0.000443 | 0.564484 | 0.0 | NaN | NaN | NaN | NaN |
| _spec_pred_params | 0.000003 | 3.254011e-07 | 0.000367 | 0.000147 | 0.559402 | 0.0 | NaN | NaN | NaN | NaN |
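Both parameters accept the other documented forms as well, for example grouping by several columns or using a single aggregation function. A sketch using only the arguments listed in the table above:

```python
# Group by benchmark and type, take the median of the numeric columns.
aggregate_benchmark_results(
    results_multiple_runs,
    group_by_cols=['benchmark_name', 'benchmark_type'],
    agg_funcs='median',
)
```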