```python
# Test: BenchmarkSpec basic initialization and defaults
_spec_sim = BenchmarkSpecSimulation(
    name='_spec_default', dataset_id='_dummy_default',
    u_cols=['u0'], y_cols=['y0'], metric_func=identibench.metrics.rmse,
    download_func=_dummy_dataset_loader
)
test_eq(_spec_sim.name, '_spec_default')
test_eq(_spec_sim.init_window, None)
```
# Benchmark

## Benchmark Specifications

### BenchmarkSpecSimulation
```
BenchmarkSpecSimulation (name:str, dataset_id:str, u_cols:list[str], y_cols:list[str],
                         metric_func:Callable[[np.ndarray, np.ndarray], float],
                         x_cols:list[str]|None=None, sampling_time:float|None=None,
                         download_func:Callable[[Path, bool], None]|None=None,
                         test_model_func:Callable[[BenchmarkSpecBase, Callable], dict[str, Any]]=_test_simulation,
                         custom_test_evaluation=None, init_window:int|None=None,
                         data_root:Path|Callable[[], Path]=get_default_data_root)
```
*Specification for a simulation benchmark task.*

*Inherits common parameters from `BenchmarkSpecBase`. Use this when the goal is to simulate the system's output given the input `u`.*
| | Type | Default | Details |
|---|---|---|---|
| name | str | | Unique name identifying this benchmark task. |
| dataset_id | str | | Identifier for the raw dataset source. |
| u_cols | list | | List of column names for input signals (u). |
| y_cols | list | | List of column names for output signals (y). |
| metric_func | Callable | | Primary metric: `func(y_true, y_pred)`. |
| x_cols | list[str] \| None | None | Optional state inputs (x). |
| sampling_time | float \| None | None | Optional sampling time (seconds). |
| download_func | Callable[[Path, bool], None] \| None | None | Dataset preparation function. |
| test_model_func | Callable | _test_simulation | |
| custom_test_evaluation | NoneType | None | |
| init_window | int \| None | None | Steps for warm-up, potentially ignored in evaluation. |
| data_root | Path \| Callable[[], Path] | get_default_data_root | Root directory for the dataset; may be a path or a callable. |
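
Of the parameters above, `download_func` is the hook that materializes the raw data: it receives the target directory and a boolean (presumably the `force_download` flag). A minimal sketch of such a function, assuming the `train/train_0.hdf5`-style layout exercised by the tests on this page (the real expected layout may differ):

```python
import numpy as np
import h5py
from pathlib import Path

def _my_dataset_loader(target_dir: Path, force_download: bool) -> None:
    # Hypothetical preparation function: writes one HDF5 file per split, with
    # one dataset per signal column. Skipping already-prepared datasets
    # appears to be handled by ensure_dataset_exists itself (see tests below).
    rng = np.random.default_rng(0)
    for split in ('train', 'test'):
        split_dir = target_dir / split
        split_dir.mkdir(parents=True, exist_ok=True)
        with h5py.File(split_dir / f'{split}_0.hdf5', 'w') as f:
            f.create_dataset('u0', data=rng.standard_normal(1000))
            f.create_dataset('y0', data=rng.standard_normal(1000))
```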
### BenchmarkSpecPrediction
```
BenchmarkSpecPrediction (name:str, dataset_id:str, u_cols:list[str], y_cols:list[str],
                         metric_func:Callable[[np.ndarray, np.ndarray], float],
                         pred_horizon:int, pred_step:int,
                         x_cols:list[str]|None=None, sampling_time:float|None=None,
                         download_func:Callable[[Path, bool], None]|None=None,
                         test_model_func:Callable[[BenchmarkSpecBase, Callable], dict[str, Any]]=_test_prediction,
                         custom_test_evaluation=None, init_window:int|None=None,
                         data_root:Path|Callable[[], Path]=get_default_data_root)
```
*Specification for a k-step ahead prediction benchmark task.*

*Inherits common parameters from `BenchmarkSpecBase` and adds prediction-specific ones. Use this when the goal is to predict `y` some steps ahead based on past `u` and `y`.*
| | Type | Default | Details |
|---|---|---|---|
| name | str | | Unique name identifying this benchmark task. |
| dataset_id | str | | Identifier for the raw dataset source. |
| u_cols | list | | List of column names for input signals (u). |
| y_cols | list | | List of column names for output signals (y). |
| metric_func | Callable | | Primary metric: `func(y_true, y_pred)`. |
| pred_horizon | int | | The 'k' in k-step ahead prediction (mandatory for this type). |
| pred_step | int | | Step size for k-step ahead prediction (e.g., predict y[t+k] using data up to t). |
| x_cols | list[str] \| None | None | Optional state inputs (x). |
| sampling_time | float \| None | None | Optional sampling time (seconds). |
| download_func | Callable[[Path, bool], None] \| None | None | Dataset preparation function. |
| test_model_func | Callable | _test_prediction | |
| custom_test_evaluation | NoneType | None | |
| init_window | int \| None | None | Steps for warm-up, potentially ignored in evaluation. |
| data_root | Path \| Callable[[], Path] | get_default_data_root | Root directory for the dataset; may be a path or a callable. |
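
The interplay of `pred_horizon` and `pred_step` can be made concrete with a small numpy sketch. This is illustrative only and is not the windowing that `_test_prediction` actually performs:

```python
import numpy as np

def kstep_windows(y: np.ndarray, pred_horizon: int, pred_step: int):
    # Illustrative: pair the data up to time t with the target y[t + pred_horizon],
    # advancing the anchor t by pred_step between samples.
    anchors = range(0, len(y) - pred_horizon, pred_step)
    return [(y[: t + 1], y[t + pred_horizon]) for t in anchors]

past, target = kstep_windows(np.arange(20.0), pred_horizon=5, pred_step=2)[0]
# past == array([0.]), target == 5.0: predict 5 steps ahead from data up to t=0.
```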
```python
# Test: BenchmarkSpec initialization with prediction-related parameters
_spec_pred = BenchmarkSpecPrediction(
    name='_spec_pred_params', dataset_id='_dummy_pred_params',
    u_cols=['u0'], y_cols=['y0'], metric_func=identibench.metrics.rmse,
    download_func=_dummy_dataset_loader,
    init_window=20, pred_horizon=5, pred_step=2
)
test_eq(_spec_pred.init_window, 20)
test_eq(_spec_pred.pred_horizon, 5)
test_eq(_spec_pred.pred_step, 2)
```
```python
# Test: BenchmarkSpec ensure_dataset_exists - first call (creation)
_spec_ensure = BenchmarkSpecSimulation(
    name='_spec_ensure', dataset_id='_dummy_ensure',
    u_cols=['u0'], y_cols=['y0'], metric_func=identibench.metrics.rmse,
    download_func=_dummy_dataset_loader
)
_spec_ensure.ensure_dataset_exists()
_dataset_path_ensure = _spec_ensure.dataset_path
test_eq(_dataset_path_ensure.is_dir(), True)
test_eq((_dataset_path_ensure / 'train' / 'train_0.hdf5').is_file(), True)
```
```python
# Test: BenchmarkSpec ensure_dataset_exists - second call (skip)
_mtime_before_skip = (_dataset_path_ensure / 'train' / 'train_0.hdf5').stat().st_mtime
time.sleep(0.1)
_spec_ensure.ensure_dataset_exists()
_mtime_after_skip = (_dataset_path_ensure / 'train' / 'train_0.hdf5').stat().st_mtime
test_eq(_mtime_before_skip, _mtime_after_skip)
```
```python
# Test: BenchmarkSpec ensure_dataset_exists - third call (force_download=True)
_mtime_before_force = (_dataset_path_ensure / 'train' / 'train_0.hdf5').stat().st_mtime
time.sleep(0.1)
_spec_ensure.ensure_dataset_exists(force_download=True)
_mtime_after_force = (_dataset_path_ensure / 'train' / 'train_0.hdf5').stat().st_mtime
test_ne(_mtime_before_force, _mtime_after_force)
```

```
Preparing dataset for '_spec_ensure' at /Users/daniel/.identibench_data/_dummy_ensure...
Dataset '_spec_ensure' prepared successfully.
```
## Training Context

### TrainingContext

```
TrainingContext (spec:BenchmarkSpecBase, hyperparameters:dict[str, Any], seed:int|None=None)
```
*Context object passed to the user's training function (`build_predictor`).*

*Holds the benchmark specification, hyperparameters, and seed. Provides methods to access the raw, full-length training and validation data sequences. Windowing/batching for training must be handled within the user's `build_predictor` function.*
| | Type | Default | Details |
|---|---|---|---|
| spec | BenchmarkSpecBase | | The benchmark specification. |
| hyperparameters | dict | | User-provided dictionary containing model and training hyperparameters. |
| seed | int \| None | None | Optional random seed for reproducibility. |
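
As a sketch of how a user-supplied `build_predictor` might consume this context: only `spec`, `hyperparameters`, and `seed` are documented above, so the data accessor named in the comment and the predictor's `u -> y` call signature below are hypothetical.

```python
import numpy as np

def build_zero_predictor(ctx):  # ctx is a TrainingContext
    # Documented attributes: ctx.spec, ctx.hyperparameters, ctx.seed.
    # A real model would seed its training with ctx.seed and fetch the raw
    # sequences here, e.g. via a hypothetical ctx.get_train_sequences();
    # windowing/batching is the user's responsibility per the docstring above.
    n_outputs = len(ctx.spec.y_cols)

    def predict(u: np.ndarray) -> np.ndarray:
        # Trivial "model": one zero column per output signal.
        return np.zeros((len(u), n_outputs))

    return predict
```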
## Benchmark Runtime

### run_benchmark

```
run_benchmark (spec, build_model, hyperparameters={}, seed=None)
```
```python
# Example usage of run_benchmark
hyperparams = {'learning_rate': 0.01, 'epochs': 5}  # Example hyperparameters

benchmark_results = run_benchmark(
    spec=_spec_sim,
    build_model=_dummy_build_model,
    hyperparameters=hyperparams
)
benchmark_results
```
```
Building model with spec: _spec_default, seed: 138830228

{'benchmark_name': '_spec_default',
 'dataset_id': '_dummy_default',
 'hyperparameters': {'learning_rate': 0.01, 'epochs': 5},
 'seed': 138830228,
 'training_time_seconds': 4.279200220480561e-05,
 'test_time_seconds': 0.0013009580434300005,
 'benchmark_type': 'BenchmarkSpecSimulation',
 'metric_name': 'rmse',
 'metric_score': 0.5644842382745956,
 'custom_scores': {}}
```
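
No `seed` was passed above, so one was drawn at random and recorded in the results for traceability. Since `run_benchmark` accepts a `seed` argument, pinning it should make a run repeatable, assuming the seed is forwarded to the `TrainingContext` and honored by `build_model`:

```python
# Sketch: pin the seed for a repeatable run.
benchmark_results = run_benchmark(
    spec=_spec_sim,
    build_model=_dummy_build_model,
    hyperparameters=hyperparams,
    seed=42
)
```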
```python
# Example usage of run_benchmark
benchmark_results = run_benchmark(
    spec=_spec_pred,
    build_model=_dummy_build_model,
    hyperparameters=hyperparams
)
benchmark_results
```
```
Building model with spec: _spec_pred_params, seed: 3900254360

{'benchmark_name': '_spec_pred_params',
 'dataset_id': '_dummy_pred_params',
 'hyperparameters': {'learning_rate': 0.01, 'epochs': 5},
 'seed': 3900254360,
 'training_time_seconds': 6.71250163577497e-05,
 'test_time_seconds': 0.0010067080147564411,
 'benchmark_type': 'BenchmarkSpecPrediction',
 'metric_name': 'rmse',
 'metric_score': 0.5594019958882623,
 'custom_scores': {}}
```
```python
def custom_evaluation(results, spec):
    def get_max_abs_error(y_pred, y_test):
        return np.max(np.abs(y_test - y_pred))
    def get_max_error(y_pred, y_test):
        return np.max(y_test - y_pred)
    avg_max_abs_error = aggregate_metric_score(results, get_max_abs_error, score_name='avg_max_abs_error',
                                               sequence_aggregation_func=np.mean, window_aggregation_func=np.mean)
    median_max_error = aggregate_metric_score(results, get_max_error, score_name='median_max_abs_error',
                                              sequence_aggregation_func=np.median, window_aggregation_func=np.median)
    return {**avg_max_abs_error, **median_max_error}
```
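
Judging from the merge above and the `custom_scores` field in the results below, `aggregate_metric_score` returns a dict of the form `{score_name: value}`, so a custom evaluation can compute several scores and combine them into a single dictionary.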
```python
spec_with_custom_test = BenchmarkSpecSimulation(
    name="CustomTestExampleBench",
    dataset_id="dummy_core_data_v1",  # Same dataset ID as before
    download_func=_dummy_dataset_loader,
    u_cols=['u0', 'u1'],
    y_cols=['y0'],
    custom_test_evaluation=custom_evaluation,
    metric_func=identibench.metrics.rmse
)
```
```python
# Run benchmark using the spec with the custom test function
hyperparams = {'model_type': 'dummy_v2'}

benchmark_results = run_benchmark(
    spec=spec_with_custom_test,
    build_model=_dummy_build_model,
    hyperparameters=hyperparams
)
benchmark_results
```
```
Building model with spec: CustomTestExampleBench, seed: 1172241199

{'benchmark_name': 'CustomTestExampleBench',
 'dataset_id': 'dummy_core_data_v1',
 'hyperparameters': {'model_type': 'dummy_v2'},
 'seed': 1172241199,
 'training_time_seconds': 2.1415995433926582e-05,
 'test_time_seconds': 0.0015841670101508498,
 'benchmark_type': 'BenchmarkSpecSimulation',
 'metric_name': 'rmse',
 'metric_score': 0.5739597924041242,
 'custom_scores': {'avg_max_abs_error': 0.9934645593166351,
                   'median_max_abs_error': 0.9934645593166351}}
```
### benchmark_results_to_dataframe

```
benchmark_results_to_dataframe (results_list:list[dict[str, Any]])
```

Transforms a list of benchmark result dictionaries into a pandas DataFrame.

| | Type | Details |
|---|---|---|
| results_list | list | List of benchmark result dictionaries from `run_benchmark`. |
| **Returns** | **DataFrame** | |
### run_benchmarks

```
run_benchmarks (specs:list[BenchmarkSpecBase]|dict[str, BenchmarkSpecBase],
                build_model:Callable[[TrainingContext], Callable],
                hyperparameters:dict[str, Any]|list[dict[str, Any]]|None=None,
                n_times:int=1, continue_on_error:bool=True,
                return_dataframe:bool=True)
```

*Runs multiple benchmarks sequentially, with repetitions and flexible hyperparameters.*

*Returns either a pandas DataFrame summarizing the results (default) or a list of raw result dictionaries.*

| | Type | Default | Details |
|---|---|---|---|
| specs | list[BenchmarkSpecBase] \| dict[str, BenchmarkSpecBase] | | Collection of specs to run. |
| build_model | Callable | | User function to build the model/predictor. |
| hyperparameters | dict[str, Any] \| list[dict[str, Any]] \| None | None | Single dict, list of dicts (matching specs), or None. |
| n_times | int | 1 | Number of times to repeat each benchmark specification. |
| continue_on_error | bool | True | If True, continue running benchmarks even if one fails. |
| return_dataframe | bool | True | If True, return results as a pandas DataFrame, otherwise return a list of dicts. |
| **Returns** | **DataFrame \| list[dict[str, Any]]** | | |
```python
benchmark_results = run_benchmarks(
    specs=[_spec_sim, _spec_pred, spec_with_custom_test],
    build_model=_dummy_build_model,
    return_dataframe=False
)
benchmark_results_to_dataframe(benchmark_results)
```
```
--- Starting benchmark run for 3 specifications, repeating each 1 times ---
-- Repetition 1/1 --
[1/3] Running: _spec_default (Rep 1)
Building model with spec: _spec_default, seed: 2979218856
-> Success: _spec_default (Rep 1) completed.
[2/3] Running: _spec_pred_params (Rep 1)
Building model with spec: _spec_pred_params, seed: 2767908549
-> Success: _spec_pred_params (Rep 1) completed.
[3/3] Running: CustomTestExampleBench (Rep 1)
Building model with spec: CustomTestExampleBench, seed: 3139743514
-> Success: CustomTestExampleBench (Rep 1) completed.
--- Benchmark run finished. 3/3 individual runs completed successfully. ---
```
| | benchmark_name | dataset_id | hyperparameters | seed | training_time_seconds | test_time_seconds | benchmark_type | metric_name | metric_score | cs_avg_max_abs_error | cs_median_max_abs_error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | _spec_default | _dummy_default | {} | 2979218856 | 0.000006 | 0.001325 | BenchmarkSpecSimulation | rmse | 0.564484 | NaN | NaN |
| 1 | _spec_pred_params | _dummy_pred_params | {} | 2767908549 | 0.000006 | 0.000844 | BenchmarkSpecPrediction | rmse | 0.559402 | NaN | NaN |
| 2 | CustomTestExampleBench | dummy_core_data_v1 | {} | 3139743514 | 0.000005 | 0.000521 | BenchmarkSpecSimulation | rmse | 0.573960 | 0.993465 | 0.993465 |
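
As documented above, `hyperparameters` can also be a list with one dict per spec. A sketch (the dict contents here are placeholders):

```python
# Sketch: one hyperparameter dict per spec; the list must match `specs`
# in length and order.
per_spec_hparams = [
    {'learning_rate': 0.01},     # for _spec_sim
    {'learning_rate': 0.005},    # for _spec_pred
    {'model_type': 'dummy_v2'},  # for spec_with_custom_test
]
run_benchmarks(
    specs=[_spec_sim, _spec_pred, spec_with_custom_test],
    build_model=_dummy_build_model,
    hyperparameters=per_spec_hparams
)
```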
```python
results_multiple_runs = run_benchmarks(
    specs=[_spec_sim, _spec_pred, spec_with_custom_test],
    build_model=_dummy_build_model,
    n_times=3
)
results_multiple_runs
```
```
--- Starting benchmark run for 3 specifications, repeating each 3 times ---
-- Repetition 1/3 --
[1/9] Running: _spec_default (Rep 1)
Building model with spec: _spec_default, seed: 30935737
-> Success: _spec_default (Rep 1) completed.
[2/9] Running: _spec_pred_params (Rep 1)
Building model with spec: _spec_pred_params, seed: 2986847840
-> Success: _spec_pred_params (Rep 1) completed.
[3/9] Running: CustomTestExampleBench (Rep 1)
Building model with spec: CustomTestExampleBench, seed: 1147267216
-> Success: CustomTestExampleBench (Rep 1) completed.
-- Repetition 2/3 --
[4/9] Running: _spec_default (Rep 2)
Building model with spec: _spec_default, seed: 3191904871
-> Success: _spec_default (Rep 2) completed.
[5/9] Running: _spec_pred_params (Rep 2)
Building model with spec: _spec_pred_params, seed: 1536587039
-> Success: _spec_pred_params (Rep 2) completed.
[6/9] Running: CustomTestExampleBench (Rep 2)
Building model with spec: CustomTestExampleBench, seed: 3900899545
-> Success: CustomTestExampleBench (Rep 2) completed.
-- Repetition 3/3 --
[7/9] Running: _spec_default (Rep 3)
Building model with spec: _spec_default, seed: 3797015292
-> Success: _spec_default (Rep 3) completed.
[8/9] Running: _spec_pred_params (Rep 3)
Building model with spec: _spec_pred_params, seed: 3789263585
-> Success: _spec_pred_params (Rep 3) completed.
[9/9] Running: CustomTestExampleBench (Rep 3)
Building model with spec: CustomTestExampleBench, seed: 851966748
-> Success: CustomTestExampleBench (Rep 3) completed.
--- Benchmark run finished. 9/9 individual runs completed successfully. ---
```
| | benchmark_name | dataset_id | hyperparameters | seed | training_time_seconds | test_time_seconds | benchmark_type | metric_name | metric_score | cs_avg_max_abs_error | cs_median_max_abs_error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | _spec_default | _dummy_default | {} | 30935737 | 0.000009 | 0.001040 | BenchmarkSpecSimulation | rmse | 0.564484 | NaN | NaN |
| 1 | _spec_pred_params | _dummy_pred_params | {} | 2986847840 | 0.000004 | 0.000537 | BenchmarkSpecPrediction | rmse | 0.559402 | NaN | NaN |
| 2 | CustomTestExampleBench | dummy_core_data_v1 | {} | 1147267216 | 0.000004 | 0.000385 | BenchmarkSpecSimulation | rmse | 0.573960 | 0.993465 | 0.993465 |
| 3 | _spec_default | _dummy_default | {} | 3191904871 | 0.000003 | 0.000280 | BenchmarkSpecSimulation | rmse | 0.564484 | NaN | NaN |
| 4 | _spec_pred_params | _dummy_pred_params | {} | 1536587039 | 0.000003 | 0.000285 | BenchmarkSpecPrediction | rmse | 0.559402 | NaN | NaN |
| 5 | CustomTestExampleBench | dummy_core_data_v1 | {} | 3900899545 | 0.000003 | 0.000330 | BenchmarkSpecSimulation | rmse | 0.573960 | 0.993465 | 0.993465 |
| 6 | _spec_default | _dummy_default | {} | 3797015292 | 0.000003 | 0.000264 | BenchmarkSpecSimulation | rmse | 0.564484 | NaN | NaN |
| 7 | _spec_pred_params | _dummy_pred_params | {} | 3789263585 | 0.000003 | 0.000278 | BenchmarkSpecPrediction | rmse | 0.559402 | NaN | NaN |
| 8 | CustomTestExampleBench | dummy_core_data_v1 | {} | 851966748 | 0.000003 | 0.000531 | BenchmarkSpecSimulation | rmse | 0.573960 | 0.993465 | 0.993465 |
### aggregate_benchmark_results

```
aggregate_benchmark_results (results_df:pandas.DataFrame,
                             group_by_cols:str|list[str]='benchmark_name',
                             agg_funcs:str|list[str]='mean')
```

Aggregates numeric results from a benchmark DataFrame, grouped by specified columns.

| | Type | Default | Details |
|---|---|---|---|
| results_df | DataFrame | | DataFrame returned by `run_benchmarks` (with `return_dataframe=True`). |
| group_by_cols | str \| list[str] | benchmark_name | Column(s) to group by before aggregation. |
| agg_funcs | str \| list[str] | mean | Aggregation function(s) ('mean', 'median', 'std', etc.) or list thereof. |
| **Returns** | **DataFrame** | | |
```python
aggregate_benchmark_results(results_multiple_runs, agg_funcs=['mean', 'std'])
```
| benchmark_name | training_time_seconds (mean) | training_time_seconds (std) | test_time_seconds (mean) | test_time_seconds (std) | metric_score (mean) | metric_score (std) | cs_avg_max_abs_error (mean) | cs_avg_max_abs_error (std) | cs_median_max_abs_error (mean) | cs_median_max_abs_error (std) |
|---|---|---|---|---|---|---|---|---|---|---|
| CustomTestExampleBench | 0.000003 | 4.453506e-07 | 0.000415 | 0.000104 | 0.573960 | 0.0 | 0.993465 | 0.0 | 0.993465 | 0.0 |
| _spec_default | 0.000005 | 3.395723e-06 | 0.000528 | 0.000443 | 0.564484 | 0.0 | NaN | NaN | NaN | NaN |
| _spec_pred_params | 0.000003 | 3.254011e-07 | 0.000367 | 0.000147 | 0.559402 | 0.0 | NaN | NaN | NaN | NaN |
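
Other groupings follow the same pattern; for example (illustrative, using only the documented `group_by_cols` and `agg_funcs` parameters):

```python
# Sketch: group by benchmark type instead, and take the median.
aggregate_benchmark_results(
    results_multiple_runs,
    group_by_cols='benchmark_type',
    agg_funcs='median'
)
```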