split_dataset_random

ethology.datasets.split.split_dataset_random(dataset, list_fractions, seed=42, samples_coordinate='image_id')

Split an annotations dataset using random sampling.

Split an ethology annotations dataset into multiple subsets by randomly shuffling all samples and then partitioning them sequentially according to the specified fractions.

Parameters:
  • dataset (xarray.Dataset) – The annotations dataset to split.

  • list_fractions (list[float, ...]) – The fractions of the input annotations dataset to allocate to each subset. The list must contain at least two elements, all elements must be between 0 and 1, and add up to 1.

  • seed (int, optional) – Seed to use for the random number generator. Default is 42.

  • samples_coordinate (str, optional) – The coordinate along which to split the dataset. Default is image_id.

Returns:

The subsets of the input dataset. The subsets are returned in the same order as the input list of fractions.

Return type:

tuple[xarray.Dataset, …]

Raises:

ValueError – If list_fractions contains fewer than two elements, if any element is not between 0 and 1, or if the elements do not sum to 1.
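The three documented conditions on list_fractions can be sketched as a standalone check. This is only an illustration, not the library's code: the helper name check_fractions_sketch is hypothetical, the bounds are assumed to be inclusive, and comparing the sum to 1 with a floating-point tolerance (math.isclose) is an assumption, since the documentation does not say how the sum is tested.

```python
import math


def check_fractions_sketch(list_fractions):
    """Hypothetical validator mirroring the three documented conditions."""
    # Condition 1: at least two subsets must be requested.
    if len(list_fractions) < 2:
        raise ValueError("list_fractions must contain at least two elements")
    # Condition 2: every fraction lies between 0 and 1
    # (assumption: bounds are inclusive).
    if any(f < 0 or f > 1 for f in list_fractions):
        raise ValueError("all fractions must be between 0 and 1")
    # Condition 3: fractions sum to 1
    # (assumption: checked with a floating-point tolerance).
    if not math.isclose(sum(list_fractions), 1.0):
        raise ValueError("fractions must sum to 1")


check_fractions_sketch([0.7, 0.2, 0.1])  # valid input, no exception raised
```

A tolerance is used here because sums such as 0.7 + 0.2 + 0.1 are not exactly 1 in floating-point arithmetic.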

Examples

Split a dataset that has a single data variable foo, with 100 values defined along the image_id dimension, into 70/20/10 subsets.

>>> from ethology.datasets.split import split_dataset_random
>>> import numpy as np
>>> import xarray as xr
>>> ds = xr.Dataset(
...     data_vars=dict(
...         foo=("image_id", np.random.randint(0, 100, size=100)),
...     ),
...     coords=dict(
...         image_id=range(100),
...     ),
... )
>>> ds_train, ds_val, ds_test = split_dataset_random(
...     ds,
...     list_fractions=[0.7, 0.2, 0.1],
...     seed=42,
... )
>>> print(len(ds_train.image_id))  # 70
>>> print(len(ds_val.image_id))  # 20
>>> print(len(ds_test.image_id))  # 10

Notes

The function operates in two steps: first, it shuffles all sample indices along the samples_coordinate dimension using the provided random seed; then, it partitions the shuffled indices into contiguous blocks, one for each subset.

The size of each block is determined by rounding down (floor) the product of the subset’s fraction and the total number of samples. To ensure all samples are included, the last subset receives any remaining samples after the earlier subsets have been allocated. Due to this rounding behavior, the actual fraction for the last subset may differ slightly from the requested fraction.
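The shuffle-then-partition procedure described above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the library's implementation: the helper names partition_sizes_sketch and random_split_indices_sketch are hypothetical, and the use of np.random.default_rng is an assumption about how the seed is consumed, so the indices produced here need not match those chosen by split_dataset_random.

```python
import numpy as np


def partition_sizes_sketch(n_samples, list_fractions):
    """Hypothetical helper: floor every block size except the last one."""
    sizes = [int(np.floor(f * n_samples)) for f in list_fractions[:-1]]
    # The last subset receives whatever remains after the earlier blocks.
    sizes.append(n_samples - sum(sizes))
    return sizes


def random_split_indices_sketch(n_samples, list_fractions, seed=42):
    """Hypothetical helper: shuffle indices, then cut contiguous blocks."""
    # Assumption: a NumPy Generator seeded with `seed` does the shuffling.
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    sizes = partition_sizes_sketch(n_samples, list_fractions)
    # Cut points are the cumulative sizes of all blocks but the last.
    boundaries = np.cumsum(sizes)[:-1]
    return np.split(shuffled, boundaries)


sizes = partition_sizes_sketch(103, [0.7, 0.2, 0.1])
blocks = random_split_indices_sketch(103, [0.7, 0.2, 0.1], seed=42)
print(sizes)
print([len(b) for b in blocks])
```

With 103 samples and fractions 0.7/0.2/0.1, the first two blocks are floored to 72 and 20 samples, so the last block receives the remaining 11 samples, about 10.7% of the data rather than the requested 10%.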

Examples using split_dataset_random

Split an annotations dataset
