Data Utilities

Utilities for exporting benchmark datasets to HDF5 files and for downloading and extracting dataset archives.

First, we examine how the datasets are loaded by the nonlinear_benchmarks library.



get_default_data_root

 get_default_data_root ()

*Returns the default root directory for datasets.

Checks the ‘IDENTIBENCH_DATA_ROOT’ environment variable first, otherwise defaults to ‘~/.identibench_data’.*
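The lookup order described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `resolve_data_root` is hypothetical, and treating an empty variable as unset is an assumption. Only the environment variable name and the fallback path come from the description above.

```python
import os
from pathlib import Path


def resolve_data_root() -> Path:
    """Hypothetical sketch: environment variable first, home-directory fallback otherwise."""
    env_value = os.environ.get("IDENTIBENCH_DATA_ROOT")
    if env_value:
        # An explicitly configured root takes precedence (assumption: empty means unset).
        return Path(env_value).expanduser()
    # Otherwise fall back to a hidden directory in the user's home.
    return Path.home() / ".identibench_data"
```

Setting `IDENTIBENCH_DATA_ROOT` before the call redirects all dataset storage without touching any code.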

Test utilities



hdf_files_from_path

 hdf_files_from_path (fpath:pathlib.Path)

data utilities

import matplotlib.pyplot as plt
import nonlinear_benchmarks

train_val, test = nonlinear_benchmarks.WienerHammerBenchMark()
plt.plot(train_val.y)
type(train_val)
nonlinear_benchmarks.utilities.Input_output_data

The data is stored in an Input_output_data class, which provides customized access. We want to write a function that exports the underlying data to HDF5 files.

u = train_val.atleast_2d().u
u.shape
(100000, 1)


write_dataset

 write_dataset (group:h5py._hl.files.File|h5py._hl.group.Group,
                ds_name:str, data:numpy.ndarray, dtype:str='f4',
                chunks:tuple[int,...]|None=None)

import os
from pathlib import Path

import h5py
from fastcore.test import test_eq, test_ne

tmp_dir = Path('./tmp/')
os.makedirs(tmp_dir,exist_ok=True)
tmp_file = tmp_dir / 'tmp.hdf5'
with h5py.File(tmp_file,'w') as f:
    write_dataset(f,'u',u)

with h5py.File(tmp_file,'r') as f:
    hdf_u = f['u'][:]
    test_ne(hdf_u.dtype,u.dtype)
    test_eq(hdf_u,u.astype('f4'))


write_array

 write_array (group:h5py._hl.files.File|h5py._hl.group.Group, ds_name:str,
              data:numpy.ndarray, dtype:str='f4',
              chunks:tuple[int,...]|None=None)

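Writes a 2D NumPy array to an HDF5 file. As the example below shows, each column is stored as a separate one-dimensional dataset named ds_name followed by the column index (here ‘u0’).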

with h5py.File(tmp_file,'w') as f:
    write_array(f,'u',u)

with h5py.File(tmp_file,'r') as f:
    hdf_u = f['u0'][:]
    test_ne(hdf_u.dtype,u.dtype)
    test_ne(hdf_u,u.astype('f4'))
    test_eq(hdf_u[:,None],u.astype('f4'))
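The naming convention checked above (column 0 of `u` ends up in dataset ‘u0’) can be sketched without h5py. The helper `split_columns` is hypothetical, and the names for further columns (‘u1’, ‘u2’, ...) are an assumption extrapolated from the single-column test; nested lists stand in for a 2D array.

```python
def split_columns(ds_name: str, rows: list[list[float]]) -> dict[str, list[float]]:
    """Hypothetical helper: map each column of a 2D table to '<ds_name><index>'."""
    if not rows:
        return {}
    n_cols = len(rows[0])
    # One 1D sequence per column, keyed by dataset name plus column index.
    return {
        f"{ds_name}{col}": [row[col] for row in rows]
        for col in range(n_cols)
    }


# Three samples of a two-channel signal.
u = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
columns = split_columns("u", u)
```

Storing each channel as its own 1D dataset means a single channel can be read back without loading the others.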


iodata_to_hdf5

 iodata_to_hdf5 (iodata:nonlinear_benchmarks.utilities.Input_output_data,
                 hdf_dir:pathlib.Path, f_name:str=None)
| Name | Type | Default | Details |
|---|---|---|---|
| iodata | Input_output_data | | data to save to file |
| hdf_dir | Path | | Export directory for hdf5 files |
| f_name | str | None | name of hdf5 file without ‘.hdf5’ ending |
| Returns | Path | | |

fname = iodata_to_hdf5(train_val,tmp_dir)

with h5py.File(fname,'r') as f:
    hdf_u = f['u0'][:]
    hdf_y = f['y0'][:]
    test_eq(hdf_u[:,None],train_val.atleast_2d().u.astype('f4'))
    test_eq(hdf_y[:,None],train_val.atleast_2d().y.astype('f4'))

Let us examine the general shape of the downloaded datasets.

for bench in nonlinear_benchmarks.all_splitted_benchmarks:
    train,test = bench(atleast_2d=True,always_return_tuples_of_datasets=True)
    print(type(train))
    print(type(train[0]))
    break
<class 'tuple'>
<class 'nonlinear_benchmarks.utilities.Input_output_data'>

With the correct flags set, all datasets return a consistent training tuple and test tuple, each holding one or more elements of type Input_output_data. We will transform these into training, validation, and test tuples, which we will then save with a single function.
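One way to derive a validation tuple from a training tuple is to cut the tail off every record. The sketch below is only an illustration under stated assumptions: `split_train_valid` and `valid_fraction` are hypothetical names, the records are assumed to support `len()` and slicing, and plain lists stand in for Input_output_data.

```python
def split_train_valid(records: tuple, valid_fraction: float = 0.1) -> tuple[tuple, tuple]:
    """Hypothetical helper: use the tail of every record as validation data."""
    if not 0.0 < valid_fraction < 1.0:
        raise ValueError("valid_fraction must lie strictly between 0 and 1")
    train_parts, valid_parts = [], []
    for record in records:
        # Number of trailing samples reserved for validation.
        n_valid = round(len(record) * valid_fraction)
        split_idx = len(record) - n_valid
        train_parts.append(record[:split_idx])
        valid_parts.append(record[split_idx:])
    return tuple(train_parts), tuple(valid_parts)


# Plain lists stand in for Input_output_data records here.
train_val = (list(range(10)), list(range(20)))
train, valid = split_train_valid(train_val, valid_fraction=0.2)
```

Splitting by position rather than by random sampling keeps each validation segment contiguous in time, which matters for sequence data.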

# for bench in nonlinear_benchmarks.all_not_splitted_benchmarks:
#     train = bench()
#     if len(train) == 2:
#         train, test = train
#         # print('\n'.join(map(str,train)))
#         if isinstance(train,list):
#             print(len(train))
#             print(train[0].name)

Only the datasets in nonlinear_benchmarks.all_splitted_benchmarks have a consistent output format. The other benchmarks have random splits.



dataset_to_hdf5

 dataset_to_hdf5 (train:tuple, valid:tuple, test:tuple,
                  save_path:pathlib.Path, train_valid:tuple=None)

Save a dataset consisting of training, validation, and test sets in HDF5 format in separate subdirectories.

| Name | Type | Default | Details |
|---|---|---|---|
| train | tuple | | tuple of Input_output_data for training |
| valid | tuple | | tuple of Input_output_data for validation |
| test | tuple | | tuple of Input_output_data for test |
| save_path | Path | | directory the files are written to, created if it does not exist |
| train_valid | tuple | None | optional tuple of unsplit Input_output_data for training and validation |
| Returns | None | | |

train_val, test = nonlinear_benchmarks.WienerHammerBenchMark()
split_idx = 90_000
train = train_val[:split_idx]
valid = train_val[split_idx:]
test = test
dataset_to_hdf5(train,valid,test,tmp_dir)
dataset_to_hdf5((train,),valid,test,tmp_dir)
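The directory layout implied by "separate subdirectories" can be sketched with pathlib alone. The helper `make_split_dirs` and the subdirectory names ‘train’, ‘valid’, and ‘test’ are assumptions for illustration; the names that dataset_to_hdf5 actually uses are not stated here.

```python
import tempfile
from pathlib import Path


def make_split_dirs(save_path: Path, splits=("train", "valid", "test")) -> dict[str, Path]:
    """Hypothetical helper: create one subdirectory per split and return their paths."""
    split_dirs = {}
    for split in splits:
        split_dir = save_path / split
        # parents=True also creates save_path itself if it does not exist yet.
        split_dir.mkdir(parents=True, exist_ok=True)
        split_dirs[split] = split_dir
    return split_dirs


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "wiener_hammerstein"
    dirs = make_split_dirs(root)
    created = sorted(p.name for p in root.iterdir())
```

Because `exist_ok=True` is set, calling the helper twice on the same path is harmless, which mirrors the repeated calls in the example above.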

Download Utilities



unzip_download

 unzip_download (url:str, extract_dir:pathlib.Path=Path('.'))

Downloads a zip archive into RAM and extracts it.

| Name | Type | Default | Details |
|---|---|---|---|
| url | str | | url to file to download |
| extract_dir | Path | . | directory the archive is extracted to |
| Returns | None | | |
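Downloading to RAM and then extracting can be sketched with the standard library. This is not the library's implementation: `extract_zip_bytes` and `unzip_from_url` are hypothetical names, and the use of urllib is an assumption. The demonstration builds a small archive in memory so that it runs without network access.

```python
import io
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def extract_zip_bytes(payload: bytes, extract_dir: Path = Path(".")) -> list[str]:
    """Extract a zip archive held in memory and return the member names."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(extract_dir)
        return archive.namelist()


def unzip_from_url(url: str, extract_dir: Path = Path(".")) -> list[str]:
    """Hypothetical sketch: fetch the archive into RAM, then extract it."""
    with urllib.request.urlopen(url) as response:
        payload = response.read()  # the whole archive is held in memory
    return extract_zip_bytes(payload, extract_dir)


# Build a tiny archive in memory and extract it into a temporary directory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data/readme.txt", "hello")

with tempfile.TemporaryDirectory() as tmp:
    names = extract_zip_bytes(buffer.getvalue(), Path(tmp))
    extracted_text = (Path(tmp) / "data" / "readme.txt").read_text()
```

Holding the archive in memory avoids a temporary file on disk, at the cost of needing enough RAM for the full download.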


unrar_download

 unrar_download (url:str, extract_dir:pathlib.Path=Path('.'))

Downloads a rar archive into RAM and extracts it.

| Name | Type | Default | Details |
|---|---|---|---|
| url | str | | url to file to download |
| extract_dir | Path | . | directory the archive is extracted to |
| Returns | None | | |


download

 download (url:str, target_dir:pathlib.Path=Path('.'))
| Name | Type | Default | Details |
|---|---|---|---|
| url | str | | url to file to download |
| target_dir | Path | . | |
| Returns | Path | | |
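A function with this signature could look roughly like the sketch below. It is written under assumptions: the name `download_sketch`, the rule of taking the file name from the last URL path segment, and the use of urllib are all illustrative and not confirmed by the signature above.

```python
import tempfile
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlparse


def download_sketch(url: str, target_dir: Path = Path(".")) -> Path:
    """Hypothetical sketch: save the file behind `url` into target_dir and return its path."""
    # Assumption: the file name is the last segment of the URL path.
    file_name = Path(unquote(urlparse(url).path)).name
    if not file_name:
        raise ValueError(f"cannot derive a file name from {url!r}")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / file_name
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        out.write(response.read())
    return target


# A file:// URL keeps the example runnable without network access.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.txt"
    source.write_text("benchmark")
    saved = download_sketch(source.as_uri(), Path(tmp) / "downloads")
    saved_name, saved_text = saved.name, saved.read_text()
```

Returning the path of the saved file lets the caller pass it straight to an extraction or conversion step.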