Finding Data Block Nirvana (a journey through the fastai data block API)


If you’re using the fastai library to train your PyTorch models, you’re using the data block API whether you realize it or not.

The API is straightforward enough when it comes to the basics, but as you get deeper into the docs and into what is happening at each point, it can be confusing how all the pieces fit together (at least it was for me). A number of classes play a part in the API, and different things happen at each step: pre-processing, data augmentation, splitting of the data, and so on.

How is one to really understand what these classes are and what is happening (and why) at each step?

May I suggest what worked for me … building your own custom ItemBase and ItemList subclasses following the Custom ItemList tutorial in the fastai docs. This is what I’m calling the pathway towards data block nirvana, and it is this approach I have tried to demonstrate in code (available from this GitHub repository) and from which I highlight lessons learned in the article before you.

NOTE: I am a firm believer that one will learn more about using this framework by first reading and running the associated code, and then coding it up by hand themselves (no copy & paste), than by any other means. This doesn’t mean that reading the docs and highlighting and underlining important concepts isn’t important (believe me, I do more than my fair share of this), only that for it to take solid hold in your brain, you have to do it. So get the code, run the code, and love the code. Using it and the contents of this article, make it your goal to demonstrate your understanding by coding everything up yourself.

The API from 50,000 feet

What is the data block API?

The fastai data block API defines a chainable mechanism for transforming raw data (e.g., image files, .csv files, pandas DataFrames, etc.) into the requisite PyTorch Datasets and DataLoaders that are fed into the forward function of your nn.Module subclass. In less than 10 lines of code, you can define everything you need to split your data into training, validation, and (optionally) test datasets, apply transforms, and package their respective DataLoaders into something called a DataBunch for training. The API also provides a way to save your transformations and pre-processing for use against future datasets/DataLoaders at inference time.
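For instance, saving and then reloading that state might look something like the sketch below. This assumes fastai v1’s export/load_empty inference API; data is a DataBunch built with the data block API (as in the example in the next section), and 'my-weights' is a placeholder for previously saved model weights.

data.export()                                 # a sketch: saves transforms/processors to path/'export.pkl'

empty_data = ImageDataBunch.load_empty(path)  # recreate an empty DataBunch from that file
learn = create_cnn(empty_data, models.resnet18).load('my-weights')
pred = learn.predict(open_image('some_image.jpg'))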

How does the data block API work?

Here’s an example of how it works from the core documentation:

data = (ImageItemList.from_folder(path)
        .split_by_folder()
        .label_from_folder()
        .add_test_folder()
        .transform(tfms, size=64)
        .databunch())

Essentially the process is this (an annotated, runnable version of these steps follows the list):

  1. Define the source of your inputs (that is, your X values) using an ItemList subclass pertinent to your data. This can be just about anything you have (e.g., a .csv file, a pandas DataFrame, a text file, images, etc.). There are several built-in ItemList subclasses that you can use out of the box for image, tabular, collaborative filtering, and text based data. You can also build your own, as we will do here, by subclassing one of these built-in ItemList classes or the ItemList class itself (a bare-bones skeleton of this appears after the list).
  2. Define how you want to split your inputs into training and validation datasets using one of the built-in mechanisms for doing so.
  3. Define the source of your targets (that is, your y values) and combine them with the inputs of your training and validation datasets in the form of fastai LabelList objects. LabelList subclasses the PyTorch Dataset class.
  4. Add a test dataset (optional).
  5. Add transforms to your LabelList objects (optional). Here you can apply data augmentation to your inputs, your targets, or both.
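Putting the steps together, here is an annotated version of the earlier example. It’s a minimal sketch that assumes a folder-per-class image dataset with train, valid, and test subfolders (the MNIST_TINY dataset bundled with fastai fits this layout); the comments map each line to the numbered steps above.

from fastai.vision import *

path = untar_data(URLs.MNIST_TINY)             # a small folder-per-class image dataset

data = (ImageItemList.from_folder(path)        # 1. inputs: an ItemList built from image files
        .split_by_folder()                     # 2. split: 'train' vs. 'valid' subfolders
        .label_from_folder()                   # 3. targets: parent folder name becomes the label
        .add_test_folder()                     # 4. optional: unlabeled test set from a 'test' folder
        .transform(get_transforms(), size=64)  # 5. optional: data augmentation + resizing
        .databunch(bs=32))                     # package the DataLoaders into a DataBunch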
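And as referenced in step 1, a custom ItemList starts from a skeleton like the one below. This is only a sketch of the structure the Custom ItemList tutorial walks through; MyItem, MyItemList, and the tensor() conversion are illustrative placeholders, not a working implementation for any particular data type.

from fastai.vision import *

class MyItem(ItemBase):
    def __init__(self, raw):
        # .obj is the thing itself (used for display); .data is the tensor
        # representation that actually gets fed to your model
        self.obj, self.data = raw, tensor(raw)
    def __str__(self): return str(self.obj)

class MyItemList(ItemList):
    def get(self, i):
        raw = super().get(i)  # the raw value stored for item i
        return MyItem(raw)    # wrap it in our ItemBase subclass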