.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "examples/approximate_subset_sum_split.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_examples_approximate_subset_sum_split.py>`
        to download the full example code or to run this example in your
        browser via Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_examples_approximate_subset_sum_split.py:

Split an annotations dataset
=================================

Split an annotations dataset by grouping variable, and compare to random
splitting.

.. GENERATED FROM PYTHON SOURCE LINES 8-38

This example demonstrates two dataset splitting strategies:

- **Grouping-based split**: splits the input dataset into two subsets with
  approximately the requested fractions, while keeping the values of a
  user-defined grouping variable (such as "videos" or "species") entirely
  separate between subsets. We will explore two approaches: a group k-fold
  approach and an approximate subset-sum approach.
- **Random split**: splits the input dataset randomly into subsets with the
  requested fractions. It achieves precise split fractions but may mix
  values of variables across subsets (e.g., frames from the same video may
  be present in multiple subsets).

A grouping-based split is useful when defining a held-out test dataset with
a specified percentage. For example, you may want to hold out ~10% of the
annotated frames while ensuring that frames from the same video are not
present in both the training and test sets. In contrast, a random splitting
strategy divides the dataset into precise proportions but does not prevent
data leakage across subsets. This may be useful, for example, to generate
multiple train/validation splits with very similar content for
`cross-validation `_.

Both approaches can be useful in different situations, and this example
demonstrates how to apply them using ``ethology``. For more complex dataset
splits, we recommend going through `scikit-learn's cross-validation
functionalities `_, in particular the section on `grouped data `_.
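As a brief illustration of that kind of grouped split in scikit-learn, the
sketch below uses ``GroupKFold``. It is a minimal, self-contained example on
an invented toy dataset (the arrays ``X``, ``y`` and ``groups`` are made up
for illustration) and is not part of the ``ethology`` workflow shown in the
rest of this example.

.. code-block:: Python

    # Minimal sketch of a grouped split with scikit-learn (toy data,
    # illustration only).
    import numpy as np
    from sklearn.model_selection import GroupKFold

    rng = np.random.default_rng(0)
    X = rng.normal(size=(12, 2))  # 12 samples, 2 features (invented)
    y = rng.integers(0, 2, size=12)  # binary labels (invented)
    groups = np.repeat([0, 1, 2, 3], 3)  # 4 groups of 3 samples each

    # Each fold keeps whole groups together: no group appears in both the
    # train and test indices of the same fold.
    for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
        print("test groups:", np.unique(groups[test_idx]))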
.. GENERATED FROM PYTHON SOURCE LINES 40-42

Imports
-------

.. GENERATED FROM PYTHON SOURCE LINES 42-62

.. code-block:: Python

    import sys
    from collections import Counter
    from pathlib import Path

    import matplotlib.pyplot as plt
    import numpy as np
    import pooch
    import xarray as xr
    from loguru import logger

    from ethology.datasets.split import (
        split_dataset_group_by,
        split_dataset_random,
    )
    from ethology.io.annotations import load_bboxes

    # For interactive plots: install ipympl with `pip install ipympl` and uncomment
    # the following line in your notebook
    # %matplotlib widget

.. GENERATED FROM PYTHON SOURCE LINES 63-68

Configure logging for this example
------------------------------------------------------------
By default, ``ethology`` outputs log messages to ``stderr``. Here, we
configure the logger to output logs to ``stdout`` as well, so that we can
display the log messages produced in this example.

.. GENERATED FROM PYTHON SOURCE LINES 68-71

.. code-block:: Python

    _ = logger.add(sys.stdout, level="INFO")

.. GENERATED FROM PYTHON SOURCE LINES 72-83

Download dataset
----------------
For this example, we will use the `Australian Camera Trap Dataset `_, which
comprises images from camera traps across various sites in Victoria,
Australia. This dataset is licensed under CC BY 4.0
(https://creativecommons.org/licenses/by/4.0/) and authored by Sameeruddin
Muhammad, Scott Mann, Callum Luke, Chris Pocknee, Supriya Nair and Jay Nair.

We use the `pooch `_ library to download the dataset to the ``.ethology``
cache directory.

.. GENERATED FROM PYTHON SOURCE LINES 85-105

.. code-block:: Python

    data_source = {
        "url": "https://github.com/neuroinformatics-unit/ethology/releases/download/data-ACTD/ACTD_COCO_files.zip",
        "hash": "4019bb11cd360d66d13d9309928195638adf83e95ddec7b0b23e693ec8c7c26b",
    }

    # Define cache directory
    ethology_cache = Path.home() / ".ethology"
    ethology_cache.mkdir(exist_ok=True)

    # Download the dataset to the cache directory
    extracted_files = pooch.retrieve(
        url=data_source["url"],
        known_hash=data_source["hash"],
        fname="ACTD_COCO_files.zip",
        path=ethology_cache,
        processor=pooch.Unzip(extract_dir=ethology_cache / "ACTD_COCO_files"),
    )

    print(*extracted_files, sep="\n")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    /home/runner/.ethology/ACTD_COCO_files/3_Feral_animals_data_CCT.json
    /home/runner/.ethology/ACTD_COCO_files/2_Region_specific_data_CCT.json
    /home/runner/.ethology/ACTD_COCO_files/1_Terrestrial_group_data_CCT.json

.. GENERATED FROM PYTHON SOURCE LINES 106-113

Read as a single annotation dataset
------------------------------------
The dataset contains three different COCO annotation files. We can load them
as a single dataset using the
:func:`ethology.io.annotations.load_bboxes.from_files` function.

.. GENERATED FROM PYTHON SOURCE LINES 116-122

.. code-block:: Python

    ds_all = load_bboxes.from_files(extracted_files, format="COCO")

    print(ds_all)
    print(*ds_all.annotation_files, sep="\n")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    <xarray.Dataset> Size: 10MB
    Dimensions:  (image_id: 39426, space: 2, id: 6)
    Coordinates:
      * image_id  (image_id) int64 315kB 0 1 2 3 4 ... 39422 39423 39424 39425
      * space     (space) ...

Split by species: group k-fold approach
----------------------------------------
Suppose we want to evaluate the out-of-distribution performance of a species
classifier, that is, its performance on species not seen during training.
In this case we may want to split the dataset into train and test sets,
while ensuring that no species are present in both train and test sets.

To do this, we first need to compute a variable that holds the species per
image. Then we can split the images in the dataset based on the species they
contain. Note that only one species is defined per image. We use a helper
function to extract the information of interest from the image filenames.

.. GENERATED FROM PYTHON SOURCE LINES 176-204

.. code-block:: Python

    # Helper function
    def split_at_any_delimiter(text: str, delimiters: list[str]) -> list[str]:
        """Split a string at any of the specified delimiters if present."""
        for delimiter in delimiters:
            if delimiter in text:
                return text.split(delimiter)
        return [text]


    # Get species name per image
    species_per_image_id = np.array(
        [
            ds_all.map_image_id_to_filename[i].split("\\")[-2]
            for i in ds_all.image_id.values
        ]
    )

    # Add the species array to the dataset
    ds_all["specie"] = xr.DataArray(
        species_per_image_id,
        dims="image_id",
    )

    print(f"Total species: {len(np.unique(species_per_image_id))}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Total species: 15

.. GENERATED FROM PYTHON SOURCE LINES 205-207

We have 15 different species in the dataset. With a bar plot, we can
visualise their distribution in the dataset.

.. GENERATED FROM PYTHON SOURCE LINES 209-223
.. code-block:: Python

    count_per_specie = dict(Counter(ds_all["specie"].values).most_common())

    fig, ax = plt.subplots()
    ax.bar(
        count_per_specie.keys(),
        count_per_specie.values(),
    )
    ax.set_xticks(range(len(count_per_specie)))
    ax.set_xticklabels(count_per_specie.keys(), rotation=90)
    ax.set_ylabel("# images")
    ax.set_title("Image count per specie")
    plt.tight_layout()

.. image-sg:: /examples/images/sphx_glr_approximate_subset_sum_split_001.png
   :alt: Image count per specie
   :srcset: /examples/images/sphx_glr_approximate_subset_sum_split_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 224-228

We can now split the dataset by species using the
:func:`ethology.datasets.split.split_dataset_group_by` function. For
example, for a specific 30/70 split, we would do:

.. GENERATED FROM PYTHON SOURCE LINES 230-239

.. code-block:: Python

    fraction_1 = 0.3
    fraction_2 = 1 - fraction_1

    ds_species_1, ds_species_2 = split_dataset_group_by(
        ds_all,
        group_by_var="specie",
        list_fractions=[fraction_1, fraction_2],
    )

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-01-16 21:04:17.949 | INFO | ethology.datasets.split:split_dataset_group_by:211 - Using group k-fold method with 3 folds and seed=42.

.. GENERATED FROM PYTHON SOURCE LINES 240-246

By default, the ``method`` parameter of the function is set to ``auto``,
which automatically selects the appropriate splitting method based on the
number of unique groups and the requested split fractions. From the info
messages logged to the terminal, we can see that the automatically selected
method was the "group k-fold" method. To force the use of this method, we
can explicitly set the ``method`` parameter of the function to ``kfold``.

.. GENERATED FROM PYTHON SOURCE LINES 248-250

We can check how close the resulting split is to the requested fractions,
and verify that the subsets contain distinct species.

.. GENERATED FROM PYTHON SOURCE LINES 252-265

.. code-block:: Python

    print(f"User specified fractions:{[fraction_1, fraction_2]}")
    print(
        "Output split fractions: ["
        f"{len(ds_species_1.image_id.values) / len(ds_all.image_id.values):.3f}, "
        f"{len(ds_species_2.image_id.values) / len(ds_all.image_id.values):.3f}]"
    )
    print("--------------------------------")
    print(f"Subset 1 species: {np.unique(ds_species_1.specie.values)}")
    print(f"Subset 2 species: {np.unique(ds_species_2.specie.values)}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    User specified fractions:[0.3, 0.7]
    Output split fractions: [0.336, 0.664]
    --------------------------------
    Subset 1 species: ['Bandicoot' 'Kangaroo' 'Possum' 'Rabbit' 'Wallaby']
    Subset 2 species: ['Bird' 'Bushrat' 'Cat' 'Fox' 'Large animals' 'Mid-sized animals' 'Others' 'Potoroo' 'Small animals' 'Wombat']

.. GENERATED FROM PYTHON SOURCE LINES 266-270

When using the "group k-fold" method, we can also generate different splits
by setting a different value for the ``seed`` parameter. In the example
below, we set the ``method`` parameter to ``kfold`` and use two different
seed values to generate two different splits.

.. GENERATED FROM PYTHON SOURCE LINES 272-291

.. code-block:: Python

    # Split A with seed 42
    ds_species_1a, ds_species_2a = split_dataset_group_by(
        ds_all,
        group_by_var="specie",
        list_fractions=[fraction_1, fraction_2],
        method="kfold",
        seed=42,
    )

    # Split B with seed 43
    ds_species_1b, ds_species_2b = split_dataset_group_by(
        ds_all,
        group_by_var="specie",
        list_fractions=[fraction_1, fraction_2],
        method="kfold",
        seed=43,
    )

.. rst-class:: sphx-glr-script-out
 .. code-block:: none

    2026-01-16 21:04:17.971 | INFO | ethology.datasets.split:split_dataset_group_by:211 - Using group k-fold method with 3 folds and seed=42.
    2026-01-16 21:04:17.987 | INFO | ethology.datasets.split:split_dataset_group_by:211 - Using group k-fold method with 3 folds and seed=43.

.. GENERATED FROM PYTHON SOURCE LINES 292-296

We can verify that the split using the default value of the ``seed``
parameter (42) is the same as the first split computed above, but different
from the split obtained with a different seed value (43). The output
fractions in both cases are approximately the requested fractions.

.. GENERATED FROM PYTHON SOURCE LINES 298-317

.. code-block:: Python

    print(
        "Output split fractions for seed 42: ["
        f"{len(ds_species_1a.image_id.values) / len(ds_all.image_id.values):.3f}, "
        f"{len(ds_species_2a.image_id.values) / len(ds_all.image_id.values):.3f}]"
    )
    print(
        "Output split fractions for seed 43: ["
        f"{len(ds_species_1b.image_id.values) / len(ds_all.image_id.values):.3f}, "
        f"{len(ds_species_2b.image_id.values) / len(ds_all.image_id.values):.3f}]"
    )

    assert ds_species_1a.equals(ds_species_1)
    assert ds_species_2a.equals(ds_species_2)

    assert not ds_species_1a.equals(ds_species_1b)
    assert not ds_species_2a.equals(ds_species_2b)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Output split fractions for seed 42: [0.336, 0.664]
    Output split fractions for seed 43: [0.365, 0.635]

.. GENERATED FROM PYTHON SOURCE LINES 318-324

We have mentioned that, by default, the ``method`` parameter of the
:func:`ethology.datasets.split.split_dataset_group_by` function is set to
``auto``, which automatically selects the appropriate method based on the
number of unique groups and the requested number of folds. The number of
required folds is calculated as the closest integer to
``1 / min(list_fractions)``.

.. GENERATED FROM PYTHON SOURCE LINES 326-331

In this case, we have 15 unique species, and 1/0.3 ~ 3 folds. Since there
are more unique groups than folds, the ``auto`` setting defers to the
preferred "group k-fold" method. The "group k-fold" method is preferred
because it allows us to compute different disjoint splits for the same
requested fractions via the ``seed`` parameter.

.. GENERATED FROM PYTHON SOURCE LINES 333-338

If the number of unique groups is less than the requested number of folds,
the ``auto`` setting defers to the `approximate subset sum algorithm `_ to
compute a solution. We explore this case in the next section.
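To make the selection rule concrete, the short sketch below re-states it in
plain Python, using the numbers from this example. It mirrors the rule
described above rather than calling into ``ethology``'s internals, and the
tie-break when the number of groups exactly equals the number of folds is
not covered here.

.. code-block:: Python

    # Illustration of the ``auto`` selection rule described above
    # (not ethology's actual implementation).
    list_fractions = [0.3, 0.7]
    n_unique_groups = 15  # unique species in this dataset

    # Folds needed to realise the smallest requested fraction:
    # round(1 / 0.3) = 3
    n_folds = round(1 / min(list_fractions))

    if n_unique_groups > n_folds:
        method = "kfold"  # more unique groups than folds: group k-fold
    elif n_unique_groups < n_folds:
        method = "apss"  # fewer unique groups than folds: approximate subset-sum

    print(n_folds, method)  # 3 kfold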
.. GENERATED FROM PYTHON SOURCE LINES 341-349

Split by input annotation file: approximate subset-sum approach
----------------------------------------------------------------
Let's consider another case, in which we would like to split the images in
the dataset by the annotation file they come from. As before, we first
compute the annotation file per image, which we derive from the image
filename. Then we add the annotation file array to the dataset.

.. GENERATED FROM PYTHON SOURCE LINES 351-369

.. code-block:: Python

    # Get annotation file per image
    annotation_file_per_image_id = np.array(
        [
            split_at_any_delimiter(
                ds_all.map_image_id_to_filename[i],
                ["\\"],
            )[0]
            for i in ds_all.image_id.values
        ]
    )

    # Add to dataset
    ds_all["json_file"] = xr.DataArray(
        annotation_file_per_image_id, dims="image_id"
    )

.. GENERATED FROM PYTHON SOURCE LINES 370-374

We can now split the dataset by annotation file using the
:func:`ethology.datasets.split.split_dataset_group_by` function with the
``method`` parameter set to ``apss`` (approximate subset-sum).

.. GENERATED FROM PYTHON SOURCE LINES 376-383

.. code-block:: Python

    ds_annotations_1, ds_annotations_2 = split_dataset_group_by(
        ds_all,
        group_by_var="json_file",
        list_fractions=[fraction_1, fraction_2],
        method="apss",
    )

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-01-16 21:04:19.208 | INFO | ethology.datasets.split:split_dataset_group_by:219 - Using approximate subset-sum method with epsilon=0.

.. GENERATED FROM PYTHON SOURCE LINES 384-392

The log message confirms we have used the "approximate subset-sum" method.
It also mentions an optional ``epsilon`` parameter. This is the tolerance,
relative to the optimal solution, that the returned solution is guaranteed
to be within. If ``epsilon`` is 0 (the default), the solution will be the
best (optimal) solution for the requested fraction and grouping variable.
The algorithm builds the smaller subset so that its size is less than or
equal to the smallest requested fraction of the dataset.

.. GENERATED FROM PYTHON SOURCE LINES 394-406

.. code-block:: Python

    print(f"User specified fractions:{[fraction_1, fraction_2]}")

    output_fractions = [
        len(ds_annotations_1.image_id.values) / len(ds_all.image_id.values),
        len(ds_annotations_2.image_id.values) / len(ds_all.image_id.values),
    ]
    print(f"Output split fractions for epsilon 0: {output_fractions}")

    print(f"Subset 1 files: {np.unique(ds_annotations_1.json_file.values)}")
    print(f"Subset 2 files: {np.unique(ds_annotations_2.json_file.values)}")

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    User specified fractions:[0.3, 0.7]
    Output split fractions for epsilon 0: [0.18130167909501343, 0.8186983209049865]
    Subset 1 files: ['2_Region_specific_data']
    Subset 2 files: ['1_Terrestrial_group_classifier' '3_Feral_animals_data']

.. GENERATED FROM PYTHON SOURCE LINES 407-418

We can verify that the subsets contain distinct annotation files. Since we
used the default ``epsilon=0``, this split is the best solution we can get
within the specified constraints. In this case there are only three possible
splits of the dataset, since there are only three possible values for the
source annotation file.

The choice of ``epsilon`` involves a trade-off between accuracy and speed.
In cases with many possible splits, we may want to use a larger ``epsilon``
value, to reach a solution that is close enough to the optimal one more
quickly.
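To see why a split by annotation file can land far from the requested 30%,
the toy sketch below brute-forces the underlying subset-sum objective. The
three group sizes are invented for illustration (they are not the real image
counts per file), and the brute-force loop stands in for the approximate
algorithm used by ``ethology``.

.. code-block:: Python

    # Toy illustration of the epsilon=0 (exact) subset-sum objective:
    # choose whole groups whose combined size gets as close as possible to
    # the requested fraction without exceeding it.
    from itertools import combinations

    group_sizes = {"file_A": 7000, "file_B": 12000, "file_C": 20000}  # invented
    total = sum(group_sizes.values())
    target = 0.3 * total  # smallest requested fraction

    best_subset, best_size = None, 0
    for r in range(1, len(group_sizes)):
        for subset in combinations(group_sizes, r):
            size = sum(group_sizes[g] for g in subset)
            if best_size < size <= target:
                best_subset, best_size = subset, size

    # With only a few coarse groups, the optimum can be well below 0.3
    print(best_subset, round(best_size / total, 3))  # ('file_A',) 0.179

With so few groups, exhaustive enumeration is cheap and the optimum is
simply coarse; with many groups, enumerating subsets becomes infeasible,
which is where a non-zero ``epsilon`` lets the approximate algorithm trade
accuracy for speed.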
.. GENERATED FROM PYTHON SOURCE LINES 421-434

Split using random sampling
----------------------------
Very often we want to compute splits with specific fractions, and do not
care whether a grouping variable (such as "species" or "source annotation
file") is mixed across subsets. In this case, we can use random sampling
with the function :func:`ethology.datasets.split.split_dataset_random`.

This function shuffles the dataset and then partitions it according to the
specified fractions. By setting a different value for the ``seed``, we can
again get different splits for the same requested fractions.

.. GENERATED FROM PYTHON SOURCE LINES 436-449

.. code-block:: Python

    ds_species_1, ds_species_2 = split_dataset_random(
        ds_all,
        list_fractions=[fraction_1, fraction_2],
        seed=42,
    )

    print(f"User specified fractions:{[fraction_1, fraction_2]}")
    print("Split fractions:")
    print(len(ds_species_1.image_id.values) / len(ds_all.image_id.values))
    print(len(ds_species_2.image_id.values) / len(ds_all.image_id.values))

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    User specified fractions:[0.3, 0.7]
    Split fractions:
    0.2999797088215898
    0.7000202911784101

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 21.493 seconds)

.. _sphx_glr_download_examples_approximate_subset_sum_split.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/neuroinformatics-unit/ethology/v0.1.2?filepath=notebooks/examples/approximate_subset_sum_split.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: approximate_subset_sum_split.ipynb <approximate_subset_sum_split.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: approximate_subset_sum_split.py <approximate_subset_sum_split.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: approximate_subset_sum_split.zip <approximate_subset_sum_split.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_