split_dataset_random
- ethology.datasets.split.split_dataset_random(dataset, list_fractions, seed=42, samples_coordinate='image_id')
Split an annotations dataset using random sampling.
Split an ethology annotations dataset into multiple subsets by randomly shuffling all samples and then partitioning them sequentially according to the specified fractions.
- Parameters:
dataset (xarray.Dataset) – The annotations dataset to split.
list_fractions (list[float, ...]) – The fractions of the input annotations dataset to allocate to each subset. The list must contain at least two elements, all elements must be between 0 and 1, and add up to 1.
seed (int, optional) – Seed to use for the random number generator. Default is 42.
samples_coordinate (str, optional) – The coordinate along which to split the dataset. Default is image_id.
- Returns:
The subsets of the input dataset. The subsets are returned in the same order as the input list of fractions.
- Return type:
tuple[xarray.Dataset, …]
- Raises:
ValueError – If list_fractions contains fewer than two elements, or if its elements are not between 0 and 1 or do not sum to 1.
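The three conditions above can be illustrated with a minimal sketch. The helper name check_fractions is hypothetical and is not part of ethology. The inclusive bounds and the floating-point tolerance used for the sum check are assumptions, since the documentation does not specify them.

```python
import math


def check_fractions(list_fractions):
    """Hypothetical helper mirroring the documented ValueError conditions."""
    # Condition 1: at least two subsets must be requested.
    if len(list_fractions) < 2:
        raise ValueError("list_fractions must contain at least two elements.")
    # Condition 2: every fraction lies in [0, 1] (inclusive bounds assumed).
    if any(f < 0 or f > 1 for f in list_fractions):
        raise ValueError("All fractions must be between 0 and 1.")
    # Condition 3: fractions sum to 1, compared with a tolerance (assumed)
    # so that inputs such as [0.7, 0.2, 0.1] are not rejected because of
    # floating-point rounding.
    if not math.isclose(sum(list_fractions), 1.0):
        raise ValueError("Fractions must sum to 1.")


check_fractions([0.7, 0.2, 0.1])  # valid input, no exception raised
```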
Examples
Split a dataset with a single data variable foo, with 100 values defined along the image_id dimension, into 70/20/10 splits.
>>> from ethology.datasets.split import split_dataset_random
>>> import numpy as np
>>> import xarray as xr
>>> ds = xr.Dataset(
...     data_vars=dict(
...         foo=("image_id", np.random.randint(0, 100, size=100)),
...     ),
...     coords=dict(
...         image_id=range(100),
...     ),
... )
>>> ds_train, ds_val, ds_test = split_dataset_random(
...     ds,
...     list_fractions=[0.7, 0.2, 0.1],
...     seed=42,
... )
>>> print(len(ds_train.image_id))  # 70
>>> print(len(ds_val.image_id))  # 20
>>> print(len(ds_test.image_id))  # 10
Notes
The function operates in two steps: first, it shuffles all sample indices along the samples_coordinate dimension using the provided random seed; then, it partitions the shuffled indices into contiguous blocks, one for each subset.
The size of each block is determined by rounding down (floor) the product of the subset’s fraction and the total number of samples. To ensure all samples are included, the last subset receives any remaining samples after the earlier subsets have been allocated. Due to this rounding behavior, the actual fraction for the last subset may differ slightly from the requested fraction.
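The shuffle-then-partition scheme described in these notes can be sketched on plain integer indices. This is an illustration only, not the library's implementation: the function name sketch_split_indices is hypothetical, and the standard-library random module stands in for whichever random number generator ethology actually uses, so the shuffled order will not match the library's output.

```python
import math
import random


def sketch_split_indices(n_samples, list_fractions, seed=42):
    """Hypothetical sketch: shuffle indices, then cut contiguous blocks."""
    # Step 1: shuffle all sample indices with a seeded generator.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Step 2: floor the size of every block except the last one.
    sizes = [math.floor(f * n_samples) for f in list_fractions[:-1]]
    # The last block receives whatever samples remain.
    sizes.append(n_samples - sum(sizes))

    # Cut the shuffled indices into contiguous blocks of those sizes.
    blocks, start = [], 0
    for size in sizes:
        blocks.append(indices[start:start + size])
        start += size
    return blocks


blocks = sketch_split_indices(105, [0.7, 0.2, 0.1])
print([len(b) for b in blocks])  # [73, 21, 11]
```

With 105 samples, the first two blocks are floored to 73 and 21 samples, so the last block holds 11 samples (about 10.5% of the data) rather than the requested 10%.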