Benchmark

Core definitions for specifying and running system identification benchmarks: benchmark specifications for simulation and k-step prediction tasks, the training context handed to user code, and helpers to run benchmarks and aggregate their results.

Benchmark Specifications


source

BenchmarkSpecSimulation

 BenchmarkSpecSimulation (name:str, dataset_id:str, u_cols:list[str],
                          y_cols:list[str],
                          metric_func:collections.abc.Callable[[numpy.ndarray, numpy.ndarray], float],
                          x_cols:list[str]|None=None,
                          sampling_time:float|None=None,
                          download_func:collections.abc.Callable[[pathlib.Path, bool], None]|None=None,
                          test_model_func:collections.abc.Callable[[__main__.BenchmarkSpecBase, collections.abc.Callable], dict[str, typing.Any]]=<function _test_simulation>,
                          custom_test_evaluation=None,
                          init_window:int|None=None,
                          data_root:pathlib.Path|collections.abc.Callable[[], pathlib.Path]=<function get_default_data_root>)

*Specification for a simulation benchmark task.

Inherits common parameters from BaseBenchmarkSpec. Use this when the goal is to simulate the system’s output given the input u.*

| | Type | Default | Details |
|---|---|---|---|
| name | str | | Unique name identifying this benchmark task. |
| dataset_id | str | | Identifier for the raw dataset source. |
| u_cols | list | | List of column names for input signals (u). |
| y_cols | list | | List of column names for output signals (y). |
| metric_func | Callable | | Primary metric: func(y_true, y_pred). |
| x_cols | list[str] \| None | None | Optional state inputs (x). |
| sampling_time | float \| None | None | Optional sampling time (seconds). |
| download_func | collections.abc.Callable[[pathlib.Path, bool], None] \| None | None | Dataset preparation func. |
| test_model_func | Callable | _test_simulation | |
| custom_test_evaluation | NoneType | None | |
| init_window | int \| None | None | Steps for warm-up, potentially ignored in evaluation. |
| data_root | pathlib.Path \| collections.abc.Callable[[], pathlib.Path] | get_default_data_root | Root dir for dataset, may be a callable or path. |
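Any callable matching the documented signature `func(y_true, y_pred) -> float` can serve as `metric_func`. As a hedged illustration (a mean absolute error instead of the built-in `identibench.metrics.rmse`), a custom metric might look like this:

```python
import numpy as np

def mean_abs_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Illustrative custom metric matching the documented func(y_true, y_pred) contract;
    # the library's own metrics live in identibench.metrics.
    return float(np.mean(np.abs(y_true - y_pred)))
```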

source

BenchmarkSpecPrediction

 BenchmarkSpecPrediction (name:str, dataset_id:str, u_cols:list[str],
                          y_cols:list[str],
                          metric_func:collections.abc.Callable[[numpy.ndarray, numpy.ndarray], float],
                          pred_horizon:int, pred_step:int,
                          x_cols:list[str]|None=None,
                          sampling_time:float|None=None,
                          download_func:collections.abc.Callable[[pathlib.Path, bool], None]|None=None,
                          test_model_func:collections.abc.Callable[[__main__.BenchmarkSpecBase, collections.abc.Callable], dict[str, typing.Any]]=<function _test_prediction>,
                          custom_test_evaluation=None,
                          init_window:int|None=None,
                          data_root:pathlib.Path|collections.abc.Callable[[], pathlib.Path]=<function get_default_data_root>)

*Specification for a k-step ahead prediction benchmark task.

Inherits common parameters from BaseBenchmarkSpec and adds prediction-specific ones. Use this when the goal is to predict y some steps ahead based on past u and y.*

| | Type | Default | Details |
|---|---|---|---|
| name | str | | Unique name identifying this benchmark task. |
| dataset_id | str | | Identifier for the raw dataset source. |
| u_cols | list | | List of column names for input signals (u). |
| y_cols | list | | List of column names for output signals (y). |
| metric_func | Callable | | Primary metric: func(y_true, y_pred). |
| pred_horizon | int | | The 'k' in k-step ahead prediction (mandatory for this type). |
| pred_step | int | | Step size for k-step ahead prediction (e.g., predict y[t+k] using data up to t). |
| x_cols | list[str] \| None | None | Optional state inputs (x). |
| sampling_time | float \| None | None | Optional sampling time (seconds). |
| download_func | collections.abc.Callable[[pathlib.Path, bool], None] \| None | None | Dataset preparation func. |
| test_model_func | Callable | _test_prediction | |
| custom_test_evaluation | NoneType | None | |
| init_window | int \| None | None | Steps for warm-up, potentially ignored in evaluation. |
| data_root | pathlib.Path \| collections.abc.Callable[[], pathlib.Path] | get_default_data_root | Root dir for dataset, may be a callable or path. |
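To make the roles of `pred_horizon` and `pred_step` concrete, the sketch below slices one output sequence into k-step-ahead targets under one plausible reading: at each start index t, advanced by `pred_step`, the model must predict the next `pred_horizon` samples from data up to t. The windowing actually used at evaluation time is handled internally by `_test_prediction`; this is only an illustration.

```python
import numpy as np

# Hypothetical illustration of how pred_horizon (k) and pred_step could
# carve a single output channel into k-step-ahead prediction targets.
y = np.arange(20.0)              # one output channel, 20 samples
pred_horizon, pred_step = 5, 2

targets = [
    y[t + 1 : t + 1 + pred_horizon]   # y[t+1] .. y[t+pred_horizon]
    for t in range(0, len(y) - pred_horizon, pred_step)
]
print(len(targets), targets[0])       # -> 8 [1. 2. 3. 4. 5.]
```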
# Test: BenchmarkSpec basic initialization and defaults
_spec_sim = BenchmarkSpecSimulation(
    name='_spec_default', dataset_id='_dummy_default',
    u_cols=['u0'], y_cols=['y0'], metric_func=identibench.metrics.rmse, 
    download_func=_dummy_dataset_loader
)
test_eq(_spec_sim.init_window, None)
test_eq(_spec_sim.name, '_spec_default')
# Test: BenchmarkSpec initialization with prediction-related parameters
_spec_pred = BenchmarkSpecPrediction(
    name='_spec_pred_params', dataset_id='_dummy_pred_params',
    u_cols=['u0'], y_cols=['y0'], metric_func=identibench.metrics.rmse, 
    download_func=_dummy_dataset_loader, 
    init_window=20, pred_horizon=5, pred_step=2
)
test_eq(_spec_pred.init_window, 20)
test_eq(_spec_pred.pred_horizon, 5)
test_eq(_spec_pred.pred_step, 2)
# Test: BenchmarkSpec ensure_dataset_exists - first call (creation)
_spec_ensure = BenchmarkSpecSimulation(
    name='_spec_ensure', dataset_id='_dummy_ensure',
    u_cols=['u0'], y_cols=['y0'], metric_func=identibench.metrics.rmse, 
    download_func=_dummy_dataset_loader
)
_spec_ensure.ensure_dataset_exists()
_dataset_path_ensure = _spec_ensure.dataset_path
test_eq(_dataset_path_ensure.is_dir(), True)
test_eq((_dataset_path_ensure / 'train' / 'train_0.hdf5').is_file(), True)
# Test: BenchmarkSpec ensure_dataset_exists - second call (skip)
_mtime_before_skip = (_dataset_path_ensure / 'train' / 'train_0.hdf5').stat().st_mtime
time.sleep(0.1) 
_spec_ensure.ensure_dataset_exists() 
_mtime_after_skip = (_dataset_path_ensure / 'train' / 'train_0.hdf5').stat().st_mtime
test_eq(_mtime_before_skip, _mtime_after_skip)
# Test: BenchmarkSpec ensure_dataset_exists - third call (force_download=True)
_mtime_before_force = (_dataset_path_ensure / 'train' / 'train_0.hdf5').stat().st_mtime
time.sleep(0.1) 
_spec_ensure.ensure_dataset_exists(force_download=True) 
_mtime_after_force = (_dataset_path_ensure / 'train' / 'train_0.hdf5').stat().st_mtime
test_ne(_mtime_before_force, _mtime_after_force)
Preparing dataset for '_spec_ensure' at /Users/daniel/.identibench_data/_dummy_ensure...
Dataset '_spec_ensure' prepared successfully.
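A `download_func` must match `Callable[[pathlib.Path, bool], None]`: it receives the target dataset directory and a force flag, and is expected to materialize the dataset files there (the tests above look for `train/train_0.hdf5`). A minimal sketch, assuming h5py is available and a flat HDF5 layout with one dataset per column; the file structure shown here is an assumption for illustration, not the library's required format:

```python
from pathlib import Path
import h5py
import numpy as np

def _example_dataset_writer(target_dir: Path, force_download: bool) -> None:
    # Hypothetical sketch of a download_func: write one HDF5 training file
    # containing the u/y columns the spec expects. A real loader would
    # download and convert the raw dataset instead of generating random data.
    train_dir = target_dir / 'train'
    train_dir.mkdir(parents=True, exist_ok=True)
    with h5py.File(train_dir / 'train_0.hdf5', 'w') as f:
        f.create_dataset('u0', data=np.random.randn(1000))
        f.create_dataset('y0', data=np.random.randn(1000))
```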

Training Context


source

TrainingContext

 TrainingContext (spec:__main__.BenchmarkSpecBase,
                  hyperparameters:dict[str,typing.Any],
                  seed:int|None=None)

*Context object passed to the user’s training function (build_predictor).

Holds the benchmark specification, hyperparameters, and seed. Provides methods to access the raw, full-length training and validation data sequences. Windowing/batching for training must be handled within the user’s build_predictor function.*

| | Type | Default | Details |
|---|---|---|---|
| spec | BenchmarkSpecBase | | The benchmark specification. |
| hyperparameters | dict | | User-provided dictionary containing model and training hyperparameters. |
| seed | int \| None | None | Optional random seed for reproducibility. |
#todo: test
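The runtime functions below call the user's `build_model` function with a `TrainingContext` and expect a predictor callable back. A minimal sketch follows; `ctx.spec`, `ctx.hyperparameters`, and `ctx.seed` are documented fields, but the data-access method names and the predictor's calling convention are assumptions made only for illustration.

```python
import numpy as np

def build_constant_predictor(ctx):
    # Hedged sketch of a build_model function. The context documents that it
    # provides methods to access the raw training/validation sequences, but
    # their names are not shown in this section, so fitting is left as a comment.
    # train_data = ctx.get_train_sequences()   # hypothetical accessor name
    n_outputs = len(ctx.spec.y_cols)

    def predictor(u):
        # Assumed convention: input array in, prediction array out,
        # one row per input sample and one column per output channel.
        return np.zeros((len(u), n_outputs))

    return predictor
```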

Benchmark Runtime


source

run_benchmark

 run_benchmark (spec, build_model, hyperparameters={}, seed=None)
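Judging by the examples below, run_benchmark prepares the dataset for the given spec, draws a random seed when none is supplied, builds the predictor by calling build_model with a TrainingContext, evaluates it via the spec's test_model_func, and returns a result dictionary containing the benchmark name, dataset id, hyperparameters, seed, training and test wall-clock times, benchmark type, primary metric name and score, and any custom scores.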
# Example usage of run_benchmark
hyperparams = {'learning_rate': 0.01, 'epochs': 5} # Example hyperparameters

benchmark_results = run_benchmark(
    spec=_spec_sim, 
    build_model=_dummy_build_model,
    hyperparameters=hyperparams
)
Building model with spec: _spec_default, seed: 138830228
{'benchmark_name': '_spec_default',
 'dataset_id': '_dummy_default',
 'hyperparameters': {'learning_rate': 0.01, 'epochs': 5},
 'seed': 138830228,
 'training_time_seconds': 4.279200220480561e-05,
 'test_time_seconds': 0.0013009580434300005,
 'benchmark_type': 'BenchmarkSpecSimulation',
 'metric_name': 'rmse',
 'metric_score': 0.5644842382745956,
 'custom_scores': {}}
# Example usage of run_benchmark
benchmark_results = run_benchmark(
    spec=_spec_pred, 
    build_model=_dummy_build_model,
    hyperparameters=hyperparams
)
Building model with spec: _spec_pred_params, seed: 3900254360
{'benchmark_name': '_spec_pred_params',
 'dataset_id': '_dummy_pred_params',
 'hyperparameters': {'learning_rate': 0.01, 'epochs': 5},
 'seed': 3900254360,
 'training_time_seconds': 6.71250163577497e-05,
 'test_time_seconds': 0.0010067080147564411,
 'benchmark_type': 'BenchmarkSpecPrediction',
 'metric_name': 'rmse',
 'metric_score': 0.5594019958882623,
 'custom_scores': {}}
def custom_evaluation(results, spec):
    def get_max_abs_error(y_pred, y_test):
        return np.max(np.abs(y_test - y_pred))
    def get_max_error(y_pred, y_test):
        return np.max(y_test - y_pred)

    avg_max_abs_error = aggregate_metric_score(
        results, get_max_abs_error, score_name='avg_max_abs_error',
        sequence_aggregation_func=np.mean, window_aggregation_func=np.mean)
    # Note: this second score aggregates the signed maximum error (get_max_error)
    # but is reported under the name 'median_max_abs_error'.
    median_max_error = aggregate_metric_score(
        results, get_max_error, score_name='median_max_abs_error',
        sequence_aggregation_func=np.median, window_aggregation_func=np.median)
    return {**avg_max_abs_error, **median_max_error}
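As the definition above suggests, a custom_test_evaluation callable receives the test results and the spec and returns a dictionary mapping score names to values; these appear under custom_scores in the result dictionary and, when results are converted to a DataFrame below, as columns prefixed with cs_.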
spec_with_custom_test = BenchmarkSpecSimulation(
    name="CustomTestExampleBench",
    dataset_id="dummy_core_data_v1", # Same dataset ID as before
    download_func=_dummy_dataset_loader, 
    u_cols=['u0', 'u1'], 
    y_cols=['y0'],
    custom_test_evaluation=custom_evaluation,
    metric_func=identibench.metrics.rmse
)
# Run benchmark using the spec with the custom test function
hyperparams = {'model_type': 'dummy_v2'} 

benchmark_results = run_benchmark(
    spec=spec_with_custom_test, 
    build_model=_dummy_build_model,
    hyperparameters=hyperparams
)
Building model with spec: CustomTestExampleBench, seed: 1172241199
{'benchmark_name': 'CustomTestExampleBench',
 'dataset_id': 'dummy_core_data_v1',
 'hyperparameters': {'model_type': 'dummy_v2'},
 'seed': 1172241199,
 'training_time_seconds': 2.1415995433926582e-05,
 'test_time_seconds': 0.0015841670101508498,
 'benchmark_type': 'BenchmarkSpecSimulation',
 'metric_name': 'rmse',
 'metric_score': 0.5739597924041242,
 'custom_scores': {'avg_max_abs_error': 0.9934645593166351,
  'median_max_abs_error': 0.9934645593166351}}

source

benchmark_results_to_dataframe

 benchmark_results_to_dataframe (results_list:list[dict[str,typing.Any]])

Transforms a list of benchmark result dictionaries into a pandas DataFrame.

| | Type | Details |
|---|---|---|
| results_list | list | List of benchmark result dictionaries from run_benchmark. |
| Returns | DataFrame | |

source

run_benchmarks

 run_benchmarks (specs:list[__main__.BenchmarkSpecBase]|dict[str, __main__.BenchmarkSpecBase],
                 build_model:collections.abc.Callable[[__main__.TrainingContext], collections.abc.Callable],
                 hyperparameters:dict[str, typing.Any]|list[dict[str, typing.Any]]|None=None,
                 n_times:int=1, continue_on_error:bool=True,
                 return_dataframe:bool=True)

*Runs multiple benchmarks sequentially, with repetitions and flexible hyperparameters.

Returns either a pandas DataFrame summarizing the results (default) or a list of raw result dictionaries.*

| | Type | Default | Details |
|---|---|---|---|
| specs | list[BenchmarkSpecBase] \| dict[str, BenchmarkSpecBase] | | Collection of specs to run. |
| build_model | Callable | | User function to build the model/predictor. |
| hyperparameters | dict[str, typing.Any] \| list[dict[str, typing.Any]] \| None | None | Single dict, list of dicts (matching specs), or None. |
| n_times | int | 1 | Number of times to repeat each benchmark specification. |
| continue_on_error | bool | True | If True, continue running benchmarks even if one fails. |
| return_dataframe | bool | True | If True, return results as a pandas DataFrame, otherwise return a list of dicts. |
| Returns | pandas.core.frame.DataFrame \| list[dict[str, typing.Any]] | | |
benchmark_results = run_benchmarks(
    specs=[_spec_sim,_spec_pred,spec_with_custom_test], 
    build_model=_dummy_build_model,
    return_dataframe=False
)
benchmark_results_to_dataframe(benchmark_results)
--- Starting benchmark run for 3 specifications, repeating each 1 times ---

-- Repetition 1/1 --

[1/3] Running: _spec_default (Rep 1)
Building model with spec: _spec_default, seed: 2979218856
  -> Success: _spec_default (Rep 1) completed.

[2/3] Running: _spec_pred_params (Rep 1)
Building model with spec: _spec_pred_params, seed: 2767908549
  -> Success: _spec_pred_params (Rep 1) completed.

[3/3] Running: CustomTestExampleBench (Rep 1)
Building model with spec: CustomTestExampleBench, seed: 3139743514
  -> Success: CustomTestExampleBench (Rep 1) completed.

--- Benchmark run finished. 3/3 individual runs completed successfully. ---
| | benchmark_name | dataset_id | hyperparameters | seed | training_time_seconds | test_time_seconds | benchmark_type | metric_name | metric_score | cs_avg_max_abs_error | cs_median_max_abs_error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | _spec_default | _dummy_default | {} | 2979218856 | 0.000006 | 0.001325 | BenchmarkSpecSimulation | rmse | 0.564484 | NaN | NaN |
| 1 | _spec_pred_params | _dummy_pred_params | {} | 2767908549 | 0.000006 | 0.000844 | BenchmarkSpecPrediction | rmse | 0.559402 | NaN | NaN |
| 2 | CustomTestExampleBench | dummy_core_data_v1 | {} | 3139743514 | 0.000005 | 0.000521 | BenchmarkSpecSimulation | rmse | 0.573960 | 0.993465 | 0.993465 |
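The hyperparameters argument also accepts a list of dictionaries matching the order of specs, so each benchmark can use its own settings. A short sketch under that documented behaviour, reusing the specs and builder from above:

```python
# Sketch: one hyperparameter dict per spec, in matching order.
per_spec_hyperparams = [
    {'learning_rate': 0.01},    # for _spec_sim
    {'learning_rate': 0.05},    # for _spec_pred
    {'model_type': 'dummy_v2'}, # for spec_with_custom_test
]

results_per_spec = run_benchmarks(
    specs=[_spec_sim, _spec_pred, spec_with_custom_test],
    build_model=_dummy_build_model,
    hyperparameters=per_spec_hyperparams,
)
```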
results_multiple_runs = run_benchmarks(
    specs=[_spec_sim,_spec_pred,spec_with_custom_test], 
    build_model=_dummy_build_model,
    n_times=3
)
results_multiple_runs
--- Starting benchmark run for 3 specifications, repeating each 3 times ---

-- Repetition 1/3 --

[1/9] Running: _spec_default (Rep 1)
Building model with spec: _spec_default, seed: 30935737
  -> Success: _spec_default (Rep 1) completed.

[2/9] Running: _spec_pred_params (Rep 1)
Building model with spec: _spec_pred_params, seed: 2986847840
  -> Success: _spec_pred_params (Rep 1) completed.

[3/9] Running: CustomTestExampleBench (Rep 1)
Building model with spec: CustomTestExampleBench, seed: 1147267216
  -> Success: CustomTestExampleBench (Rep 1) completed.

-- Repetition 2/3 --

[4/9] Running: _spec_default (Rep 2)
Building model with spec: _spec_default, seed: 3191904871
  -> Success: _spec_default (Rep 2) completed.

[5/9] Running: _spec_pred_params (Rep 2)
Building model with spec: _spec_pred_params, seed: 1536587039
  -> Success: _spec_pred_params (Rep 2) completed.

[6/9] Running: CustomTestExampleBench (Rep 2)
Building model with spec: CustomTestExampleBench, seed: 3900899545
  -> Success: CustomTestExampleBench (Rep 2) completed.

-- Repetition 3/3 --

[7/9] Running: _spec_default (Rep 3)
Building model with spec: _spec_default, seed: 3797015292
  -> Success: _spec_default (Rep 3) completed.

[8/9] Running: _spec_pred_params (Rep 3)
Building model with spec: _spec_pred_params, seed: 3789263585
  -> Success: _spec_pred_params (Rep 3) completed.

[9/9] Running: CustomTestExampleBench (Rep 3)
Building model with spec: CustomTestExampleBench, seed: 851966748
  -> Success: CustomTestExampleBench (Rep 3) completed.

--- Benchmark run finished. 9/9 individual runs completed successfully. ---
| | benchmark_name | dataset_id | hyperparameters | seed | training_time_seconds | test_time_seconds | benchmark_type | metric_name | metric_score | cs_avg_max_abs_error | cs_median_max_abs_error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | _spec_default | _dummy_default | {} | 30935737 | 0.000009 | 0.001040 | BenchmarkSpecSimulation | rmse | 0.564484 | NaN | NaN |
| 1 | _spec_pred_params | _dummy_pred_params | {} | 2986847840 | 0.000004 | 0.000537 | BenchmarkSpecPrediction | rmse | 0.559402 | NaN | NaN |
| 2 | CustomTestExampleBench | dummy_core_data_v1 | {} | 1147267216 | 0.000004 | 0.000385 | BenchmarkSpecSimulation | rmse | 0.573960 | 0.993465 | 0.993465 |
| 3 | _spec_default | _dummy_default | {} | 3191904871 | 0.000003 | 0.000280 | BenchmarkSpecSimulation | rmse | 0.564484 | NaN | NaN |
| 4 | _spec_pred_params | _dummy_pred_params | {} | 1536587039 | 0.000003 | 0.000285 | BenchmarkSpecPrediction | rmse | 0.559402 | NaN | NaN |
| 5 | CustomTestExampleBench | dummy_core_data_v1 | {} | 3900899545 | 0.000003 | 0.000330 | BenchmarkSpecSimulation | rmse | 0.573960 | 0.993465 | 0.993465 |
| 6 | _spec_default | _dummy_default | {} | 3797015292 | 0.000003 | 0.000264 | BenchmarkSpecSimulation | rmse | 0.564484 | NaN | NaN |
| 7 | _spec_pred_params | _dummy_pred_params | {} | 3789263585 | 0.000003 | 0.000278 | BenchmarkSpecPrediction | rmse | 0.559402 | NaN | NaN |
| 8 | CustomTestExampleBench | dummy_core_data_v1 | {} | 851966748 | 0.000003 | 0.000531 | BenchmarkSpecSimulation | rmse | 0.573960 | 0.993465 | 0.993465 |

source

aggregate_benchmark_results

 aggregate_benchmark_results (results_df:pandas.core.frame.DataFrame,
                              group_by_cols:str|list[str]='benchmark_name',
                              agg_funcs:str|list[str]='mean')

Aggregates numeric results from a benchmark DataFrame, grouped by specified columns.

| | Type | Default | Details |
|---|---|---|---|
| results_df | DataFrame | | DataFrame returned by run_benchmarks (with return_dataframe=True). |
| group_by_cols | str \| list[str] | benchmark_name | Column(s) to group by before aggregation. |
| agg_funcs | str \| list[str] | mean | Aggregation function(s) ('mean', 'median', 'std', etc.) or list thereof. |
| Returns | DataFrame | | |
aggregate_benchmark_results(results_multiple_runs,agg_funcs=['mean','std'])
| benchmark_name | training_time_seconds (mean) | training_time_seconds (std) | test_time_seconds (mean) | test_time_seconds (std) | metric_score (mean) | metric_score (std) | cs_avg_max_abs_error (mean) | cs_avg_max_abs_error (std) | cs_median_max_abs_error (mean) | cs_median_max_abs_error (std) |
|---|---|---|---|---|---|---|---|---|---|---|
| CustomTestExampleBench | 0.000003 | 4.453506e-07 | 0.000415 | 0.000104 | 0.573960 | 0.0 | 0.993465 | 0.0 | 0.993465 | 0.0 |
| _spec_default | 0.000005 | 3.395723e-06 | 0.000528 | 0.000443 | 0.564484 | 0.0 | NaN | NaN | NaN | NaN |
| _spec_pred_params | 0.000003 | 3.254011e-07 | 0.000367 | 0.000147 | 0.559402 | 0.0 | NaN | NaN | NaN | NaN |
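Both parameters accept the other documented forms as well, for example grouping by several columns or using a single aggregation function. A sketch using only the arguments listed in the table above:

```python
# Group by benchmark and type, take the median of the numeric columns.
aggregate_benchmark_results(
    results_multiple_runs,
    group_by_cols=['benchmark_name', 'benchmark_type'],
    agg_funcs='median',
)
```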