Loading Unlabeled Images with ImageDataGenerator flow_from_directory in Keras
The ImageDataGenerator class in Keras is a really valuable tool. I’ve recently written about using it for training/validation splitting of images, and it’s also helpful for data augmentation: it applies random transformations (rotations, shifts, flips and so on) to your images in an effort to reduce overfitting and improve how well your models generalize.
The flow_from_directory method is particularly helpful, as it lets you load batches of images from a labeled directory structure, where the name of each subdirectory is the class name for a classification task. For instance, you might use a directory structure like so for the MNIST dataset:
train/
    0/
    1/
    2/
    3/
    4/
    5/
    6/
    7/
    8/
    9/
In this case, each subdirectory of the train directory contains the images for that particular class (i.e., classes 0 through 9). Loading this dataset with ImageDataGenerator could be accomplished like so:
from tensorflow.keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator()
train_data = datagen.flow_from_directory('./train')
You’ll then see output like so, indicating the number of images and classes discovered:
Found 1000 images belonging to 10 classes.
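To make the directory-to-class mapping concrete, here is a minimal sketch that mimics it in plain Python: the immediate subdirectories are sorted by name and numbered from zero. The discover_classes helper and the temporary directory are hypothetical and only for illustration. On a real generator the equivalent mapping is available as train_data.class_indices.

```python
import os
import tempfile


def discover_classes(directory):
    """Return {class_name: index} built from the sorted immediate subdirectories."""
    names = sorted(
        entry for entry in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, entry))
    )
    return {name: index for index, name in enumerate(names)}


# Build a throwaway MNIST-style layout (empty folders are enough for this demo).
root = tempfile.mkdtemp()
for digit in range(10):
    os.makedirs(os.path.join(root, "train", str(digit)))

class_indices = discover_classes(os.path.join(root, "train"))
print(class_indices)
```

Keep in mind that the sort is on strings, so a subdirectory named 10 would be numbered before one named 2.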
Frequently, however, you’ll also have a test directory with no subdirectories, because you don’t know the classes of those images. The test directory simply holds a variety of images of unknown class, and you can’t use flow_from_directory as we did above; you’ll end up with the following issue:
datagen = ImageDataGenerator()
test_data = datagen.flow_from_directory('./test')
Found 0 images belonging to 0 classes.
It can’t find any classes because test has no subdirectories, and without classes it can’t load your images, as the log output above shows. There is a workaround, however: point flow_from_directory at the parent directory of test and specify that you only want to load the test “class”:
datagen = ImageDataGenerator()
test_data = datagen.flow_from_directory('.', classes=['test'])
Found 1000 images belonging to 1 class.
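Why the parent-directory trick works can be sketched with the standard library alone. Assuming a hypothetical project folder that holds both train/ and a flat test/, the snippet below lists the immediate subdirectories that are candidates for classes at each level. The list_subdirectories helper is illustrative and not part of Keras.

```python
import os
import tempfile


def list_subdirectories(directory):
    """Return the sorted names of the immediate subdirectories of `directory`."""
    return sorted(
        entry for entry in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, entry))
    )


# Hypothetical project layout: labeled train/ folders plus a flat test/ folder.
project = tempfile.mkdtemp()
for digit in ("0", "1"):
    os.makedirs(os.path.join(project, "train", digit))
os.makedirs(os.path.join(project, "test"))
open(os.path.join(project, "test", "image_0001.png"), "wb").close()

# Pointing at test/ directly: no subdirectories, hence no classes.
inside_test = list_subdirectories(os.path.join(project, "test"))

# Pointing at the parent: test/ is now a candidate class, but so is train/.
inside_parent = list_subdirectories(project)

print(inside_test)
print(inside_parent)
```

Because the parent directory here also contains train, leaving out classes=['test'] would treat train and test as two separate classes, which is why the argument matters.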
The classes argument tells flow_from_directory to load only the subdirectories you list as classes, in this case the single test “class”. Now test_data contains just that one class, so you can iterate over your test dataset just as you would the labeled training data from above; the labels it yields are all the same placeholder and can be ignored.
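If the goal is to run predictions, two more arguments are worth setting. Here is a minimal sketch, assuming you want images only and need to match each prediction back to its file: class_mode=None makes the generator yield image batches without label arrays, and shuffle=False keeps the batch order aligned with the generator’s filenames attribute. The helper names make_test_generator and pair_with_filenames are hypothetical, and the 28x28 target size is an assumption that should match whatever your model expects.

```python
def make_test_generator(parent_dir, image_size=(28, 28), batch_size=32):
    """Build a generator over parent_dir/test that yields images only, in a fixed order."""
    # Imported inside the function so the pure-Python helper below
    # can be used even where TensorFlow is not installed.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator()
    return datagen.flow_from_directory(
        parent_dir,
        classes=['test'],        # load only the test "class"
        class_mode=None,         # yield image batches without labels
        shuffle=False,           # keep order aligned with generator.filenames
        target_size=image_size,  # assumption: must match the model's input size
        batch_size=batch_size,
    )


def pair_with_filenames(filenames, predictions):
    """Map each filename to the index of its highest-scoring class."""
    results = {}
    for name, scores in zip(filenames, predictions):
        scores = list(scores)
        results[name] = scores.index(max(scores))
    return results
```

With a trained model in hand (not shown here), usage would look roughly like test_data = make_test_generator('.'), then predictions = model.predict(test_data), then pair_with_filenames(test_data.filenames, predictions).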
In fact, you can also use this trick to exclude subdirectories of your dataset for whatever reason. Going back to the MNIST problem above, if you only cared about recognizing digits 0 through 3, you could do something like this:
datagen = ImageDataGenerator()
train_data = datagen.flow_from_directory('./train', classes=['0', '1', '2', '3'])
Found 400 images belonging to 4 classes.
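One thing to keep in mind when passing classes explicitly: as far as I can tell from the Keras source, the class indices then follow the order of the list you pass rather than the sorted directory names. The sketch below imitates that numbering in plain Python. The class_indices_for helper is hypothetical, and on a real generator you would read train_data.class_indices instead.

```python
def class_indices_for(classes):
    """Number the requested classes in the order they were listed."""
    return {name: index for index, name in enumerate(classes)}


class_indices = class_indices_for(['0', '1', '2', '3'])

# Invert the mapping to turn a predicted index back into a class name.
index_to_class = {index: name for name, index in class_indices.items()}

# Listing the same classes in a different order changes the numbering.
reordered = class_indices_for(['3', '2', '1', '0'])

print(class_indices)
print(reordered)
```

Building the inverse mapping once, as above, is a simple way to decode model outputs without relying on the directory names happening to equal the indices.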