Training/Validation Split with ImageDataGenerator in Keras
Keras comes bundled with many helpful utility functions and classes to accomplish all kinds of common tasks in your machine learning pipelines. One commonly used class is the ImageDataGenerator. As the documentation explains:
Generate batches of tensor image data with real-time data augmentation. The data will be looped over (in batches).
Until recently, though, you were on your own when it came to putting together your training and validation datasets, for instance by creating two separate folder structures for your images and pointing the flow_from_directory method at each one.
For example, the old way looked something like this:
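# Import path assumes the Keras bundled with TensorFlow
from tensorflow.keras.preprocessing.image import ImageDataGenerator

TRAIN_DIR = './datasets/training'
VALIDATION_DIR = './datasets/validation'

# One generator configuration, pointed at two separate folder trees
datagen = ImageDataGenerator(rescale=1./255)
train_generator = datagen.flow_from_directory(TRAIN_DIR)
val_generator = datagen.flow_from_directory(VALIDATION_DIR)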
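To make the "two separate folder structures" concrete: flow_from_directory expects one subfolder per class inside the directory you give it, so the old approach needs that class layout twice. The sketch below only builds the empty folders in a temporary location using the standard library. The class names cats and dogs and the make_layout helper are hypothetical, chosen purely for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical class names; each subfolder is treated as one class.
CLASSES = ["cats", "dogs"]


def make_layout(root, splits=("training", "validation"), classes=CLASSES):
    """Create root/<split>/<class>/ folders and return them in creation order."""
    created = []
    for split in splits:
        for cls in classes:
            folder = Path(root) / split / cls
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created


# Build the layout under a throwaway temp directory instead of ./datasets.
root = Path(tempfile.mkdtemp()) / "datasets"
folders = make_layout(root)

for folder in folders:
    print(folder.relative_to(root))
```

With this layout you have to decide yourself which image files go into the training tree and which go into the validation tree, and then copy or move them accordingly.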
More recently, however (here’s the pull request, if you’re curious), a new validation_split parameter was added to the ImageDataGenerator. It lets you hold out a subset of your training data as a validation set by specifying the fraction, between 0 and 1, that you want to allocate to it. Note that the split is deterministic rather than a random shuffle:
datagen = ImageDataGenerator(validation_split=0.2, rescale=1./255)
Then when you invoke flow_from_directory, you pass the subset parameter specifying which set you want:
train_generator = datagen.flow_from_directory(
    TRAIN_DIR,
    subset='training'
)
val_generator = datagen.flow_from_directory(
    TRAIN_DIR,
    subset='validation'
)
You’ll note that both generators are loaded from TRAIN_DIR. The only difference is that one uses the training subset and the other uses the validation subset.
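If it helps to picture what the two subsets contain, here is a plain-Python sketch of a per-class fractional split. It does not call Keras. It assumes that the files in each class are taken in sorted order and that the leading fraction is held out for validation, which is my understanding of the behavior rather than something stated above. The split_class_files helper and the file names are hypothetical.

```python
def split_class_files(filenames, validation_split, subset):
    """Return the part of one class's files that belongs to `subset`.

    Assumption: files are sorted by name and the leading fraction is
    held out for validation. This is an illustration, not Keras code.
    """
    if subset not in ("training", "validation"):
        raise ValueError("subset must be 'training' or 'validation'")
    files = sorted(filenames)
    cut = int(validation_split * len(files))
    return files[:cut] if subset == "validation" else files[cut:]


# Hypothetical dataset: two classes with ten images each.
dataset = {
    "cats": [f"cat_{i:02d}.jpg" for i in range(10)],
    "dogs": [f"dog_{i:02d}.jpg" for i in range(10)],
}

train = {c: split_class_files(f, 0.2, "training") for c, f in dataset.items()}
val = {c: split_class_files(f, 0.2, "validation") for c, f in dataset.items()}

for name in dataset:
    print(name, len(train[name]), "training /", len(val[name]), "validation")
```

The point of the sketch is that the two subsets never overlap and together cover every file, and that the split is applied per class, so each class keeps roughly the same proportion in both sets.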
And that’s all there is to it: just specify the two parameters as needed. Now that you have your training and validation sets, did you know you can also use the ImageDataGenerator to load unlabeled images, such as an unlabeled test dataset?