Getting Started
Installation
Python
The BetterLoader library is hosted on PyPI and can be installed via pip.
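For example (assuming the package is published on PyPI under the name `BetterLoader`):

```bash
pip install BetterLoader
```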
From Source
For developers, BetterLoader's source may also be found at our GitHub repository. You can install BetterLoader from source as well, but if you're just trying to use the package, pip is a far better bet.
Why BetterLoader?
BetterLoader really shines when you're working with a dataset, and you want to load subsets of image classes conditionally. Say you have 3 folders of images, and you only want to load those images that conform to a specific condition, or those that are present in a pre-defined subset file. What if you want to load a specific set of crops per source image, given a set of source images? BetterLoader can do all this, and more.
Creating a BetterLoader
Using BetterLoader with its default parameters lets it function just like a regular PyTorch DataLoader. A few points worth noting:
- BetterLoader does not expect a nested folder structure. In its current iteration, all files are expected to be present in the root directory. This lets us use index files to define classes and labels dynamically, and vary them from experiment to experiment.
- Every instance of BetterLoader requires an index file to function. The default index file format maps class names to a list of image paths, but the index file can be any JSON file as long as you modify `train_test_val_instances` to parse it correctly; for example, you could instead map class names to regexes for the file paths and pass a `train_test_val_instances` that reads files based on those regexes. Sample index files may be found here.
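For illustration, a hypothetical index file in the default format (class names mapped to lists of image filenames in the root directory) could look like:

```json
{
    "cat": ["cat_001.jpg", "cat_002.jpg"],
    "dog": ["dog_001.jpg", "dog_002.jpg"]
}
```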
A sample use-case for BetterLoader may be found below. It's worth noting that at this point in time, the BetterLoader class has only one callable function.
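A minimal sketch of that use-case is below; the import path, the example file paths, and the `batch_size` keyword are assumptions based on common usage, so check them against the installed version:

```python
from betterloader import BetterLoader

# Illustrative paths: an image root directory and a default-format index file.
basepath = "./images/"
index_json_path = "./index.json"

loader = BetterLoader(basepath=basepath, index_json_path=index_json_path)

# The single callable mentioned above: it returns (dataloaders, sizes), two dicts
# keyed by "train", "test", and "val" (see the Usage section below).
dataloaders, sizes = loader.fetch_segmented_dataloaders(batch_size=32)
```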
Constructor Parameters
field | type | description | optional (default)
---|---|---|---
basepath | str | path to the image directory | no
index_json_path | str | path to the index file | yes (None)
index_object | dict | an object representation of an index file | yes (None)
num_workers | int | number of workers | yes (1)
subset_json_path | str | path to the subset JSON file | yes (None)
subset_object | dict | an object representation of the subset file | yes (None)
dataset_metadata | dict | metadata object for the dataset; optional metadata attributes to customise the BetterLoader (more below) | yes (None)
Usage
The BetterLoader class' `fetch_segmented_dataloaders` function allows a user to obtain a tuple of dictionaries, most commonly referenced as `(dataloaders, sizes)`. Each dictionary contains `train`, `test`, and `val` keys, allowing easy access to the dataloaders as well as their sizes. The function header for the same may be found below:
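A sketch of what that header looks like; the parameter names and defaults shown here (`batch_size`, `transform`) are assumptions and may differ in the released version:

```python
def fetch_segmented_dataloaders(self, batch_size=32, transform=None):
    """Return (dataloaders, sizes): two dicts keyed by 'train', 'test', and 'val'."""
```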
Metadata Parameters
BetterLoader accepts certain key-value pairs under the `dataset_metadata` parameter, in order to enable some custom functionality.
- pretransform (callable, optional): Allows a custom pretransform to be applied before images are loaded into the dataloader and transformed. For basic usage, a pretransform that does nothing (the default) is usually sufficient. An example use case for this customizability is listed below.
- classdata (callable, optional): Defines a custom mapping so that a custom-format index file can be read by the DatasetFolder class. Since the index file may have any structure, we need to ensure that the classes and a mapping from class to index are always available. Must return a tuple of (list of classes, dictionary mapping each class to its index); see the sketch after this list.
- split (tuple, optional): Defines a tuple of train, test, and val fractions, which must sum to one.
- train_test_val_instances (callable, optional): Defines a custom function to read values from the index file. The default expects an index that is a dict mapping classes to a list of file paths, so a custom version will need to be written for other index formats. It must always return train, test, and val splits, each of which is a list of tuples, with one tuple per datapoint. The first element of each tuple must be the filepath of the image for that datapoint. The default also puts the target class index as the second element of the tuple, which is suitable for most use cases. Each of these datapoint tuples is passed as the `values` argument to the pretransform, so any additional data needed to transform a datapoint before it is loaded can go in the datapoint tuple.
- supervised (bool, optional): Defines whether or not the experiment is supervised.
- custom_collator (callable, optional): Custom function that merges a list of samples to form a mini-batch of Tensors.
- drop_last (bool, optional): Defines whether to drop the last incomplete batch if the dataset size is not divisible by the batch size, to avoid sizing errors.
- pin_mem (bool, optional): Sets the data loader to copy tensors into CUDA pinned memory before returning them, provided your data elements are not of a custom type.
- sampler (`torch.utils.data.Sampler` or iterable, optional): Can be used to define a custom strategy to draw data from the dataset.
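As a sketch of the classdata hook referenced above: suppose the index file stores its class names under a top-level key instead of using them as dictionary keys. The index layout, the function name, and the assumption that the callable receives the parsed index object are illustrative here, not part of the documented API:

```python
def custom_classdata(index):
    # Hypothetical index layout: {"labels": ["cat", "dog"], "files": {...}}
    classes = index["labels"]
    # Map each class name to an integer index, as the classdata contract requires.
    class_to_idx = {cls: i for i, cls in enumerate(classes)}
    return classes, class_to_idx
```

It could then be passed in via the metadata, e.g. `dataset_metadata={"classdata": custom_classdata}`.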
Here is an example of a `pretransform` and a `train_test_val_instances` designed to allow a specified crop to be taken of each image.
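A rough sketch of that pairing is below. It assumes an index file that maps each class to a list of [filepath, crop box] entries; the index layout, the helper names, the argument order, and the exact signatures BetterLoader uses when calling these hooks are assumptions for illustration and may differ from the installed version:

```python
import random

def crop_train_test_val_instances(index, split=(0.7, 0.2, 0.1)):
    # Assumed index layout: {"cat": [["cat_001.jpg", [10, 10, 100, 100]], ...], ...}
    classes = sorted(index.keys())
    class_to_idx = {cls: i for i, cls in enumerate(classes)}

    # One (filepath, class index, crop box) tuple per datapoint: the filepath must
    # come first, and the crop box rides along so the pretransform can use it.
    instances = []
    for cls, entries in index.items():
        for path, crop_box in entries:
            instances.append((path, class_to_idx[cls], crop_box))
    random.shuffle(instances)

    # Carve the shuffled instances into train/test/val according to the split fractions.
    n = len(instances)
    n_train, n_test = int(split[0] * n), int(split[1] * n)
    return (instances[:n_train],
            instances[n_train:n_train + n_test],
            instances[n_train + n_test:])

def crop_pretransform(sample, values):
    # sample is the image loaded from the filepath in values[0] (assumed to be a
    # PIL image here); values is the full datapoint tuple built above.
    _, _, crop_box = values
    return sample.crop(tuple(crop_box))
```

Both would then be handed to the loader via the metadata dict, e.g. `dataset_metadata={"pretransform": crop_pretransform, "train_test_val_instances": crop_train_test_val_instances}`.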
- The internals of the loader dictate that the elements of the `instances` variables generated from `train_test_val_instances` will become the `values` argument for a pretransform call, and the `sample` argument for the pretransform is the image data loaded directly from the filepath in `values[0]` (or `instances[i][0]`).
- Since the index file here has a similar structure to the default, we can get away with using the default classdata function, but index files that don't have the classes as keys of a dictionary will need a custom way of determining the classes.