Multimodal 101: Images#
Daft has a rich multimodal type system with support for Images, URLs, Tensors, and more.
This tutorial will introduce you to the canonical ways of working with images in Daft.
#!pip install -U torch torchvision
import daft
Working with Images in DataFrames#
Now you might be wondering: how could we possibly work with images in a DataFrame, especially when tabular file formats like Parquet aren't designed for storing images?
Here’s how the magic works:
1. Images are stored separately, not inside your DataFrame: either locally on disk or at a stable URL.
2. Daft can download the data at these paths/URLs as bytes into your DataFrame.
3. If needed, Daft can also render the bytes into human-readable images inside your DataFrame.
4. You process your images as needed.
5. You store the results in tabular format, keeping the images as paths/URLs.
Storing images separately gives you a single source of truth. It is also more efficient: different users or teams can access the same image data from a central location rather than storing separate copies.
Let’s take a look at an actual example.
1. Loading Image Paths#
Let’s say we have some data about dogs and their owners, including images of the dogs, and we want to classify the breeds of these dogs.
The tabular data about owners' and dogs' names can all be stored in a DataFrame, along with the paths to the images (either local paths or remote URLs).
from daft.io import IOConfig, S3Config

io_config = IOConfig(
    s3=S3Config(
        region_name="eu-north-1",
    )
)
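This config relies on your environment's default AWS credentials. If you need to pass credentials explicitly, S3Config also accepts them directly. A minimal sketch, assuming your keys live in the standard AWS environment variables:

import os

from daft.io import IOConfig, S3Config

# Pass credentials explicitly instead of relying on the default AWS credential chain
io_config = IOConfig(
    s3=S3Config(
        region_name="eu-north-1",
        key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
        access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    )
)

For fully public buckets, S3Config(anonymous=True) skips the credential lookup altogether.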
Storing Images in Cloud Object Store#
You can also use from_glob_path to load images from paths in remote object stores, like an S3 bucket. This method creates a DataFrame from a collection of file paths and supports wildcards:

- * matches any number of any characters, including none
- ? matches any single character
- […] matches any single character in the brackets
- ** recursively matches any number of layers of directories

from_glob_path supports reading from local filesystems, S3, and other cloud-based storage.
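For example, the same patterns work against a local filesystem. The paths below are hypothetical, purely to illustrate the wildcards:

# Every .jpg file, recursing into subdirectories
df_all = daft.from_glob_path("data/images/**/*.jpg")
# Bracket and single-character wildcards: batch_0/ through batch_3/, one-character suffixes
df_some = daft.from_glob_path("data/images/batch_[0-3]/img_?.jpg")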
df = daft.from_glob_path(
    "s3://avriiil/images-dogs/*.jpg",  # substitute with a path to your own private bucket
    io_config=io_config,
)
df.show()
| path (Utf8) | size (Int64) | num_rows (Int64) |
| --- | --- | --- |
| s3://avriiil/images-dogs/53670606332_1ea5f2ce68_o.jpg | 23604 | None |
| s3://avriiil/images-dogs/53671698613_0230f8af3c_o.jpg | 22713 | None |
| s3://avriiil/images-dogs/53671700073_2c9441422e_o.jpg | 24990 | None |
| s3://avriiil/images-dogs/53671838039_b97411a441_o.jpg | 42836 | None |
| s3://avriiil/images-dogs/53671838774_03ba68d203_o.jpg | 13432 | None |
Check out the Data Access tutorial to learn more about configuring the IOConfig object.
Storing Images at URLs#
Your images can also be stored at a stable HTTP URL, commonly served by a CDN like Flickr's.
In that case, we can create a Daft DataFrame containing the data as follows:
df = daft.from_pydict(
    {
        "full_name": [
            "Ernesto Evergreen",
            "James Jale",
            "Wolfgang Winter",
            "Shandra Shamas",
            "Zaya Zaphora",
        ],
        "dog_name": ["Ernie", "Jackie", "Wolfie", "Shaggie", "Zadie"],
        "urls": [
            "https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg",
            "https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg",
            "https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg",
            "https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg",
            "https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg",
        ],
    }
)
df.collect()
| full_name (Utf8) | dog_name (Utf8) | urls (Utf8) |
| --- | --- | --- |
| Ernesto Evergreen | Ernie | https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg |
| James Jale | Jackie | https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg |
| Wolfgang Winter | Wolfie | https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg |
| Shandra Shamas | Shaggie | https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg |
| Zaya Zaphora | Zadie | https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg |
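If your URLs live in a file rather than in Python code, you could load them directly instead. A sketch, assuming a hypothetical dogs.csv with the same three columns:

# Hypothetical CSV with full_name, dog_name, and urls columns
df_from_csv = daft.read_csv("dogs.csv")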
2. Download Images as Bytes#
You can use the url.download() expression to download the bytes from a path or URL. Below we continue working with the images stored at stable URLs, but the same code works for your local or cloud object store file paths.

Let's store the image bytes in a new column using the with_column method:
df_img = df.with_column("image_bytes", df["urls"].url.download(on_error="null"))
df_img.show()
| full_name (Utf8) | dog_name (Utf8) | urls (Utf8) | image_bytes (Binary) |
| --- | --- | --- | --- |
| Ernesto Evergreen | Ernie | https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| James Jale | Jackie | https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| Wolfgang Winter | Wolfie | https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| Shandra Shamas | Shaggie | https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| Zaya Zaphora | Zadie | https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
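Because we passed on_error="null", any failed downloads show up as null bytes rather than raising an error. One way to drop such rows before further processing, as a sketch:

# Keep only the rows whose download succeeded
df_ok = df_img.where(df_img["image_bytes"].not_null())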
Great, all the image data is now available as bytes in our DataFrame. We can use this for further processing, which in our case means ML classification.

But… where's all the fluffiness? 🐶

As a data professional working with images, you often want to be able to see what you're working with.
3. Render Images#
You can turn the bytes into human-readable images using image.decode:
df_img = df_img.with_column("image", daft.col("image_bytes").image.decode())
df_img.show()
| full_name (Utf8) | dog_name (Utf8) | urls (Utf8) | image_bytes (Binary) | image (Image[MIXED]) |
| --- | --- | --- | --- | --- |
| Ernesto Evergreen | Ernie | https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | (image) |
| James Jale | Jackie | https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | (image) |
| Wolfgang Winter | Wolfie | https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | (image) |
| Shandra Shamas | Shaggie | https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | (image) |
| Zaya Zaphora | Zadie | https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | (image) |
That’s better! 🙂
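As a shortcut, if you don't need the raw bytes as a separate column, the download and decode steps can be chained into a single expression. A sketch:

# Download and decode in one go, skipping the intermediate bytes column
df_decoded = df.with_column("image", df["urls"].url.download(on_error="null").image.decode())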
4. Process your Images#
Now you’re all set to process your images.
Let's:

- create a thumbnail of each image
- use a PyTorch ML model to classify the dogs by breed
Create Thumbnails#
Expressions are a Daft API for defining computation that needs to happen over your columns. There are dedicated image.(...) Expressions for working with images.

You can use the image.resize Expression to create a thumbnail of each image:
df_img = df_img.with_column("thumbnail", daft.col("image").image.resize(32, 32))
df_img.show()
| full_name (Utf8) | dog_name (Utf8) | urls (Utf8) | image_bytes (Binary) | image (Image[MIXED]) | thumbnail (Image[MIXED]) |
| --- | --- | --- | --- | --- | --- |
| Ernesto Evergreen | Ernie | https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | (image) | (image) |
| James Jale | Jackie | https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | (image) | (image) |
| Wolfgang Winter | Wolfie | https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | (image) | (image) |
| Shandra Shamas | Shaggie | https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | (image) | (image) |
| Zaya Zaphora | Zadie | https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | (image) | (image) |
Take a look at the Daft documentation on Image Expressions to learn more.
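Resizing is not the only option. For example, Daft also ships an image.crop Expression that takes an (x, y, width, height) bounding box; a sketch, with an arbitrary box:

# Crop the top-left 200x200 pixels of each image
df_crop = df_img.with_column("cropped", daft.col("image").image.crop((0, 0, 200, 200)))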
ML Classification#
Now let’s classify the dogs by breed.
We'll define a function that uses the pre-trained PyTorch model ResNet50 to classify the dog pictures. We'll then pass the preprocessed image column to this model and store the classification predictions in a new column, classify_breed.

You will use Daft User-Defined Functions (UDFs) to do this. Daft UDFs are the best way to run computations over multiple rows or columns.
Setting up PyTorch#
Working with PyTorch adds some complexity unrelated to Daft. You can just run the cells below to perform the classification.
First, make sure to install and import some extra dependencies:
# Import additional libraries; these are necessary for PyTorch
import numpy as np
import torch
from PIL import Image
from daft import DataType, udf
Then, go ahead and define your ClassifyImages UDF.
Models are expensive to initialize and load, so we want to do this as few times as possible, and share a model across multiple invocations.
@udf(
    return_dtype=DataType.struct(
        {"top_prediction": DataType.string(), "confidence": DataType.float32()}
    )
)
class ClassifyImages:
    def __init__(self):
        # Perform expensive initializations once: create and load the pre-trained model
        self.model = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_resnet50", pretrained=True)
        self.utils = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_convnets_processing_utils")
        self.model.eval().to(torch.device("cpu"))

    def __call__(self, tensors):
        # Stack the incoming Daft Series into a single batch tensor
        tensors = torch.tensor(np.array(tensors.to_pylist()))
        with torch.no_grad():
            output = torch.nn.functional.softmax(self.model(tensors), dim=1)
        results = self.utils.pick_n_best(predictions=output, n=1)

        # Post-process results into the declared Struct format
        list_res = [result[0] for result in results]
        new_list = []
        for pred, conf in list_res:
            conf = float(conf.strip("%")) / 100  # e.g. "52.1%" -> 0.521
            new_list.append({"top_prediction": pred, "confidence": round(conf, 2)})
        return new_list
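This UDF pins the model to CPU. If you have a GPU available, you could select the device dynamically instead; a small sketch of the idea (pure PyTorch, not Daft-specific):

import torch

def pick_device() -> torch.device:
    # Prefer a CUDA GPU when present; fall back to CPU otherwise
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

# In __init__ you would then write self.model.eval().to(pick_device()),
# and in __call__ move each batch over with tensors.to(...) before inference.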
Preprocessing#
Now let's preprocess our image column to get the data into the right format:
from torchvision import transforms


def transform_image(image):
    # Convert the Daft image value into a PIL Image
    img = Image.fromarray(np.array(image))
    preprocess = transforms.Compose(
        [
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ]
    )
    tensor = preprocess(img)
    return tensor
df_pre = df_img.with_column(
    "tensor",
    df_img["image"].cast(DataType.tensor(dtype=DataType.uint8())),
).with_column(
    "transformed_tensor",
    daft.col("tensor").apply(transform_image, return_dtype=daft.DataType.tensor(daft.DataType.float32())),
)
df_pre.show()
| full_name (Utf8) | dog_name (Utf8) | urls (Utf8) | image_bytes (Binary) | image (Image[MIXED]) | thumbnail (Image[MIXED]) | tensor (Tensor(UInt8)) | transformed_tensor (Tensor(Float32)) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ernesto Evergreen | Ernie | https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | (image) | (image) | <Tensor shape=(400, 303, 3)> | <Tensor shape=(3, 224, 224)> |
| James Jale | Jackie | https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | (image) | (image) | <Tensor shape=(375, 500, 3)> | <Tensor shape=(3, 224, 224)> |
| Wolfgang Winter | Wolfie | https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | (image) | (image) | <Tensor shape=(500, 275, 3)> | <Tensor shape=(3, 224, 224)> |
| Shandra Shamas | Shaggie | https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | (image) | (image) | <Tensor shape=(432, 288, 3)> | <Tensor shape=(3, 224, 224)> |
| Zaya Zaphora | Zadie | https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | (image) | (image) | <Tensor shape=(375, 500, 3)> | <Tensor shape=(3, 224, 224)> |
Classify the Puppies! 🐶#
Nice work! Now you're all set to call this UDF on the transformed_tensor column and store the outputs in a new column called classify_breed:
df_classified = df_pre.with_column("classify_breed", ClassifyImages(daft.col("transformed_tensor")))
df_classified.select("dog_name", "image", "classify_breed").show()
| dog_name (Utf8) | image (Image[MIXED]) | classify_breed (Struct[top_prediction: Utf8, confidence: Float32]) |
| --- | --- | --- |
| Ernie | (image) | {top_prediction: boxer, confidence: 0.52} |
| Jackie | (image) | {top_prediction: American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier, confidence: 0.42} |
| Wolfie | (image) | {top_prediction: collie, confidence: 0.5} |
| Shaggie | (image) | {top_prediction: standard schnauzer, confidence: 0.3} |
| Zadie | (image) | {top_prediction: Rottweiler, confidence: 0.79} |
Nicely done!
It looks like our pre-trained model is more familiar with some specific breeds. You could do further work to fine-tune this model to improve performance.
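Since classify_breed is a Struct column, you can also pull out its fields with Daft's struct accessor, for example to keep only confident predictions. A sketch; the 0.5 threshold is arbitrary:

# Filter on a struct field and project it out as a plain column
df_confident = df_classified.where(daft.col("classify_breed").struct.get("confidence") >= 0.5)
df_confident.select("dog_name", daft.col("classify_breed").struct.get("top_prediction")).show()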
5. Store your Results#
Let's take the results we need (the breed classification) and store them with the rest of our tabular data. Let's also encode the thumbnail column as JPEG and store it as binary data in our dataframe.

We'll join the original df to our df_classified to get the breed classification data.
df_res = df.join(df_classified.select("dog_name", "classify_breed", "thumbnail"), on="dog_name", how="left")
df_res.show()
| dog_name (Utf8) | full_name (Utf8) | urls (Utf8) | classify_breed (Struct[top_prediction: Utf8, confidence: Float32]) | thumbnail (Image[MIXED]) |
| --- | --- | --- | --- | --- |
| Ernie | Ernesto Evergreen | https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg | {top_prediction: boxer, confidence: 0.52} | (image) |
| Jackie | James Jale | https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg | {top_prediction: American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier, confidence: 0.42} | (image) |
| Wolfie | Wolfgang Winter | https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg | {top_prediction: collie, confidence: 0.5} | (image) |
| Shaggie | Shandra Shamas | https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg | {top_prediction: standard schnauzer, confidence: 0.3} | (image) |
| Zadie | Zaya Zaphora | https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg | {top_prediction: Rottweiler, confidence: 0.79} | (image) |
And then encode the thumbnails as jpeg:
df_res = df_res.with_column("thumbnail_bin", daft.col("thumbnail").image.encode("jpeg")).exclude("thumbnail")
df_res.show()
| full_name (Utf8) | dog_name (Utf8) | urls (Utf8) | classify_breed (Struct[top_prediction: Utf8, confidence: Float32]) | thumbnail_bin (Binary) |
| --- | --- | --- | --- | --- |
| Ernesto Evergreen | Ernie | https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg | {top_prediction: boxer, confidence: 0.52} | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| James Jale | Jackie | https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg | {top_prediction: American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier, confidence: 0.42} | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| Wolfgang Winter | Wolfie | https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg | {top_prediction: collie, confidence: 0.5} | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| Shandra Shamas | Shaggie | https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg | {top_prediction: standard schnauzer, confidence: 0.3} | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| Zaya Zaphora | Zadie | https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg | {top_prediction: Rottweiler, confidence: 0.79} | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
Then we can write the results out to a storage format, such as Parquet or Delta Lake:
# write to Parquet
df_res.write_parquet("dogs_classified.parquet")
# write to Delta Lake
df_res.write_deltalake("delta/dogs_classified")
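To close the loop, you can read the results back and decode the stored thumbnails again; a sketch:

# Read the Parquet output back and turn the JPEG bytes back into images
df_back = daft.read_parquet("dogs_classified.parquet")
df_back = df_back.with_column("thumbnail", daft.col("thumbnail_bin").image.decode())
df_back.show()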
Great job!
A note about working with Images at URLs#
Storing images as URLs means that the size of your data can increase significantly during processing.
For example, a 10MB Parquet file with a few hundred thousand rows of integers and URLs seems small enough to process locally on your machine. But if those URLs point to high-definition images, your data volume will be much larger once you download the image bytes for processing.
This is important to keep in mind when choosing where and on what type of infrastructure to run your data workloads.