Multimodal 101: Images#

Daft has a rich multimodal type-system with support for Images, URLs, Tensors and more.

This tutorial will introduce you to the canonical ways of working with images in Daft.

CI = False
# Skip this notebook execution in CI because it uses torch and private buckets
if CI:
    import sys

    sys.exit()
#!pip install -U torch torchvision
import daft

Working with Images in DataFrames#

Now you might be wondering: how could we possibly work with images in a DataFrame, especially when tabular file formats like Parquet don’t support storing images?

Here’s how the magic works:

  1. Images are stored separately, not inside your DataFrame – either locally on disk or at a stable URL.

  2. Daft can download the data at these paths/URLs as bytes into your DataFrame.

  3. If needed, Daft can also render the bytes into human-readable images inside your DataFrame.

  4. You process your images as needed.

  5. You store the results in tabular format, keeping the Images as paths/URLs.

Storing images separately gives you a single source of truth. It is also more efficient. Different users or teams can access the same image data from a central location rather than storing separate copies.

Let’s take a look at an actual example.

1. Loading Image Paths#

Let’s say we have some data about dogs and their owners, including images of the dogs, and we want to classify the breeds of these dogs.

The tabular data about the owners’ and dogs’ names can all be stored in a DataFrame, along with the paths to the images (either local paths or remote URLs).

from daft.io import IOConfig, S3Config

io_config = IOConfig(
    s3=S3Config(
        region_name="eu-north-1",
    )
)

Storing Images in Cloud Object Store#

You can use from_glob_path to load image paths from remote object stores, such as an S3 bucket.

This method creates a DataFrame from a collection of file paths.

from_glob_path supports wildcards:

  • * matches any number of any characters including none

  • ? matches any single character

  • […] matches any single character in the brackets

  • ** recursively matches any number of layers of directories

It supports reading from local filesystems, S3 file-systems and other cloud-based storage.
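The wildcard rules above can be tried out locally. The sketch below uses Python's standard-library glob module on a temporary directory as an illustration of the same pattern syntax; it is not Daft's own matcher, so edge cases may differ. The file names and the make_tree and match helpers are hypothetical.

```python
import glob
import os
import tempfile


def make_tree(root):
    """Create a small, hypothetical directory tree of empty image files."""
    names = ["dog1.jpg", "dog2.jpg", "dog10.jpg", "cat1.jpg", "nested/deep/dog3.jpg"]
    for name in names:
        path = os.path.join(root, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()


def match(root, pattern):
    """Return the sorted paths (relative to root) that match a glob pattern."""
    hits = glob.glob(os.path.join(root, pattern), recursive=True)
    return sorted({os.path.relpath(hit, root).replace(os.sep, "/") for hit in hits})


with tempfile.TemporaryDirectory() as root:
    make_tree(root)
    star = match(root, "*.jpg")          # * : any number of characters
    question = match(root, "dog?.jpg")   # ? : exactly one character
    brackets = match(root, "[cd]*1.jpg")  # [cd] : one character from the set
    recursive = match(root, "**/*.jpg")  # ** : any number of directory levels

print(star, question, brackets, recursive, sep="\n")
```

Note that "dog?.jpg" skips dog10.jpg because ? stands for exactly one character, and only the ** pattern reaches the file in nested/deep.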

df = daft.from_glob_path(
    "s3://avriiil/images-dogs/*.jpg",  # substitute with a path to your own private bucket
    io_config=io_config,
)
df.show()
| path (Utf8) | size (Int64) | num_rows (Int64) |
|---|---|---|
| s3://avriiil/images-dogs/53670606332_1ea5f2ce68_o.jpg | 23604 | None |
| s3://avriiil/images-dogs/53671698613_0230f8af3c_o.jpg | 22713 | None |
| s3://avriiil/images-dogs/53671700073_2c9441422e_o.jpg | 24990 | None |
| s3://avriiil/images-dogs/53671838039_b97411a441_o.jpg | 42836 | None |
| s3://avriiil/images-dogs/53671838774_03ba68d203_o.jpg | 13432 | None |

(Showing first 5 of 5 rows)

Check out the Data Access tutorial to learn more about configuring the IOConfig object.

Storing Images at URLs#

Your images can also be stored at a stable HTTP URL. Commonly, this is a CDN such as Flickr’s.

In that case, we can create a Daft DataFrame containing the data as follows:

df = daft.from_pydict(
    {
        "full_name": [
            "Ernesto Evergreen",
            "James Jale",
            "Wolfgang Winter",
            "Shandra Shamas",
            "Zaya Zaphora",
        ],
        "dog_name": ["Ernie", "Jackie", "Wolfie", "Shaggie", "Zadie"],
        "urls": [
            "https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg",
            "https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg",
            "https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg",
            "https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg",
            "https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg",
        ],
    }
)

df.collect()
| full_name (Utf8) | dog_name (Utf8) | urls (Utf8) |
|---|---|---|
| Ernesto Evergreen | Ernie | https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg |
| James Jale | Jackie | https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg |
| Wolfgang Winter | Wolfie | https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg |
| Shandra Shamas | Shaggie | https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg |
| Zaya Zaphora | Zadie | https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg |

(Showing first 5 of 5 rows)

2. Download Images as Bytes#

You can use the url.download() expression to download the bytes from a path or URL. Below we will continue working with the images stored at stable URLs, but the same code will work for your local or cloud object store file paths.
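The tutorial's download call passes on_error="null", which leaves a null in the column when a download fails instead of raising. As a rough plain-Python analogy (not Daft's implementation), the sketch below uses a hypothetical download_or_none helper and local file:// URLs so that it runs without network access.

```python
import pathlib
import tempfile
import urllib.error
import urllib.request


def download_or_none(url, timeout=10.0):
    """Return the bytes at `url`, or None if the download fails for any reason."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, OSError, ValueError):
        # Swallow the failure and signal it with None, similar in spirit to a null cell
        return None


with tempfile.TemporaryDirectory() as tmp:
    good = pathlib.Path(tmp) / "dog.jpg"
    good.write_bytes(b"\xff\xd8\xff\xe0 fake jpeg payload")  # hypothetical file content
    missing = pathlib.Path(tmp) / "missing.jpg"              # never created on purpose

    urls = [good.as_uri(), missing.as_uri()]
    image_bytes = [download_or_none(url) for url in urls]

print(image_bytes)
```

The trade-off is that one broken link no longer stops the whole job, but later steps have to tolerate missing values.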

Let’s store the Image Bytes in a new column using the with_column method:

df_img = df.with_column("image_bytes", df["urls"].url.download(on_error="null"))
df_img.show()
| full_name (Utf8) | dog_name (Utf8) | urls (Utf8) | image_bytes (Binary) |
|---|---|---|---|
| Ernesto Evergreen | Ernie | https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| James Jale | Jackie | https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| Wolfgang Winter | Wolfie | https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| Shandra Shamas | Shaggie | https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| Zaya Zaphora | Zadie | https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |

(Showing first 5 of 5 rows)

Great, all the Image data is now available as bytes in our DataFrame. We can use this for further processing, which in our case means ML Classification.
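The byte prefix shown in the image_bytes column is the start of a JPEG file in JFIF form: a start-of-image marker (FF D8), an APP0 marker (FF E0), a two-byte segment length, and the identifier "JFIF". A minimal standard-library sketch for checking that prefix is shown below; parse_jfif_header is a hypothetical helper, and it only inspects the first 12 bytes rather than validating the whole file.

```python
import struct


def parse_jfif_header(data):
    """Parse the first 12 bytes of a JFIF-style JPEG, or return None if they do not match."""
    if data is None or len(data) < 12:
        return None
    # Three big-endian 16-bit values: SOI marker, APP0 marker, APP0 segment length
    soi, app0, length = struct.unpack(">HHH", data[:6])
    if soi != 0xFFD8 or app0 != 0xFFE0 or data[6:11] != b"JFIF\x00":
        return None
    return {"app0_length": length, "major_version": data[11]}


# The prefix displayed in the image_bytes column above
prefix = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"
header = parse_jfif_header(prefix)
print(header)
```

A check like this can be a cheap sanity test before decoding, for example to spot a URL that returned an HTML error page instead of an image.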

But…where’s all the fluffiness? 🐶

As a data professional working with Images, you often want to be able to see what you’re working with.

3. Render Images#

You can turn the bytes into human-readable images using image.decode:

df_img = df_img.with_column("image", daft.col("image_bytes").image.decode())
df_img.show()
| full_name (Utf8) | dog_name (Utf8) | urls (Utf8) | image_bytes (Binary) | image (Image[MIXED]) |
|---|---|---|---|---|
| Ernesto Evergreen | Ernie | https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | <Image> |
| James Jale | Jackie | https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | <Image> |
| Wolfgang Winter | Wolfie | https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | <Image> |
| Shandra Shamas | Shaggie | https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | <Image> |
| Zaya Zaphora | Zadie | https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | <Image> |

(Showing first 5 of 5 rows)

That’s better! 🙂

4. Process your Images#

Now you’re all set to process your images.

Let’s:

  • create a thumbnail of each image

  • use a PyTorch ML model to classify the dogs by breed

Create Thumbnails#

Expressions are a Daft API for defining computation that needs to happen over your columns. There are dedicated image.(...) Expressions for working with images.

You can use the image.resize Expression to create a thumbnail of each image:

df_img = df_img.with_column("thumbnail", daft.col("image").image.resize(32, 32))
df_img.show()
| full_name (Utf8) | dog_name (Utf8) | urls (Utf8) | image_bytes (Binary) | image (Image[MIXED]) | thumbnail (Image[MIXED]) |
|---|---|---|---|---|---|
| Ernesto Evergreen | Ernie | https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | <Image> | <Image> |
| James Jale | Jackie | https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | <Image> | <Image> |
| Wolfgang Winter | Wolfie | https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | <Image> | <Image> |
| Shandra Shamas | Shaggie | https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | <Image> | <Image> |
| Zaya Zaphora | Zadie | https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | <Image> | <Image> |

(Showing first 5 of 5 rows)

Take a look at the Daft documentation on Image Expressions to learn more.

ML Classification#

Now let’s classify the dogs by breed.

We’ll define a function that uses a pre-trained PyTorch model ResNet50 to classify the dog pictures. We’ll then pass the image column to this PyTorch model and send the classification predictions to a new column classify_breed.

You will use Daft User-Defined Functions (UDFs) to do this. Daft UDFs are the best way to run computations over multiple rows or columns.

Setting up PyTorch#

Working with PyTorch adds some complexity unrelated to Daft. You can just run the cells below to perform the classification.

First, make sure to install and import some extra dependencies:

# import additional libraries, these are necessary for PyTorch
import numpy as np
import torch
from PIL import Image

from daft import DataType, udf

Then, go ahead and define your ClassifyImages UDF.

Models are expensive to initialize and load, so we want to do this as few times as possible, and share a model across multiple invocations.

# Return a struct holding the top predicted label and its confidence score
@udf(return_dtype=DataType.struct({"top_prediction": DataType.string(), "confidence": DataType.float32()}))
class ClassifyImages:
    def __init__(self):
        # Perform expensive initializations - create and load the pre-trained model
        self.model = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_resnet50", pretrained=True)
        self.utils = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_convnets_processing_utils")
        self.model.eval().to(torch.device("cpu"))

    def __call__(self, tensors):
        tensors = torch.tensor(np.array(tensors.to_pylist()))  # get tensors into correct format

        with torch.no_grad():
            output = torch.nn.functional.softmax(self.model(tensors), dim=1)

        results = self.utils.pick_n_best(predictions=output, n=1)

        # post-process results into StructType format
        list_res = [result[0] for result in results]
        new_list = []
        for pred, conf in list_res:
            conf = float(conf.strip("%")) / 100
            new_list.append({"top_prediction": pred, "confidence": round(conf, 2)})

        return new_list
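The reason ClassifyImages is a class rather than a plain function is that the expensive setup lives in __init__ and is reused by every __call__. That pattern can be sketched in plain Python with a stand-in model that counts how often it is loaded. CountingModel and Scorer are hypothetical names, and how many instances Daft actually creates depends on how the query is executed.

```python
class CountingModel:
    """Stand-in for an expensive model; counts how many times it has been loaded."""

    loads = 0

    def __init__(self):
        CountingModel.loads += 1
        self.weights = {"bias": 1.0}  # pretend these took a long time to load

    def predict(self, batch):
        return [value + self.weights["bias"] for value in batch]


class Scorer:
    """Callable object: load the model once, then reuse it for every batch."""

    def __init__(self):
        self.model = CountingModel()

    def __call__(self, batch):
        return self.model.predict(batch)


scorer = Scorer()
batches = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
outputs = [scorer(batch) for batch in batches]

print(outputs)
print("model loads:", CountingModel.loads)
```

Three batches are processed, but the stand-in model is constructed only once.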

Preprocessing#

Now let’s preprocess our image column to prepare the data into the right format:

from torchvision import transforms


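def transform_image(image):
    # Convert the incoming (height, width, channels) uint8 array into a PIL image
    img = Image.fromarray(np.array(image))
    # Resize the shorter side to 256, center-crop to 224x224, scale to [0, 1],
    # then normalize each channel with the ImageNet mean and standard deviation
    preprocess = transforms.Compose(
        [
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ]
    )
    tensor = preprocess(img)
    return tensor
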
df_pre = df_img.with_column(
    "tensor",
    df_img["image"].cast(DataType.tensor(dtype=DataType.uint8())),
).with_column(
    "transformed_tensor",
    daft.col("tensor").apply(transform_image, return_dtype=daft.DataType.tensor(daft.DataType.float32())),
)
df_pre.show()
| full_name (Utf8) | dog_name (Utf8) | urls (Utf8) | image_bytes (Binary) | image (Image[MIXED]) | thumbnail (Image[MIXED]) | tensor (Tensor(UInt8)) | transformed_tensor (Tensor(Float32)) |
|---|---|---|---|---|---|---|---|
| Ernesto Evergreen | Ernie | https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | <Image> | <Image> | <Tensor shape=(400, 303, 3)> | <Tensor shape=(3, 224, 224)> |
| James Jale | Jackie | https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | <Image> | <Image> | <Tensor shape=(375, 500, 3)> | <Tensor shape=(3, 224, 224)> |
| Wolfgang Winter | Wolfie | https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | <Image> | <Image> | <Tensor shape=(500, 275, 3)> | <Tensor shape=(3, 224, 224)> |
| Shandra Shamas | Shaggie | https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | <Image> | <Image> | <Tensor shape=(432, 288, 3)> | <Tensor shape=(3, 224, 224)> |
| Zaya Zaphora | Zadie | https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... | <Image> | <Image> | <Tensor shape=(375, 500, 3)> | <Tensor shape=(3, 224, 224)> |

(Showing first 5 of 5 rows)
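The per-pixel arithmetic in the preprocessing step can be checked by hand. ToTensor scales each uint8 value into the range [0, 1], and Normalize then computes (value - mean) / std for each channel. The sketch below reproduces that arithmetic for a single hypothetical RGB pixel using the mean and std values from transform_image; it leaves out the resize, crop, and channel reordering.

```python
# Channel statistics copied from the transforms.Normalize call above
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def to_unit_range(pixel):
    """Scale uint8 channel values (0-255) into floats in [0, 1]."""
    return tuple(value / 255.0 for value in pixel)


def normalize(pixel01):
    """Apply (value - mean) / std to each channel."""
    return tuple((value - mean) / std for value, mean, std in zip(pixel01, MEAN, STD))


pixel = (255, 128, 0)  # hypothetical RGB pixel
scaled = to_unit_range(pixel)
normalized = normalize(scaled)

print(scaled)
print(normalized)
```

After normalization the values are no longer confined to [0, 1]: a fully saturated channel lands above 2, and an empty channel becomes negative.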

Classify the Puppies! 🐶#

Nice work. Now you’re all set to call the ClassifyImages UDF on the transformed_tensor column and store the outputs in a new column called classify_breed:

df_classified = df_pre.with_column("classify_breed", ClassifyImages(daft.col("transformed_tensor")))

df_classified.select("dog_name", "image", "classify_breed").show()
| dog_name (Utf8) | image (Image[MIXED]) | classify_breed (Struct[top_prediction: Utf8, confidence: Float32]) |
|---|---|---|
| Ernie | <Image> | {top_prediction: boxer, confidence: 0.52} |
| Jackie | <Image> | {top_prediction: American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier, confidence: 0.42} |
| Wolfie | <Image> | {top_prediction: collie, confidence: 0.5} |
| Shaggie | <Image> | {top_prediction: standard schnauzer, confidence: 0.3} |
| Zadie | <Image> | {top_prediction: Rottweiler, confidence: 0.79} |

(Showing first 5 of 5 rows)

Nicely done!

It looks like our pre-trained model is more familiar with some specific breeds. You could do further work to fine-tune this model to improve performance.
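The confidence values come from applying softmax to the model's raw scores and keeping the highest probability. A dependency-free sketch of that step follows. The logits and the three-label list are made up for illustration, and top_prediction is a hypothetical helper rather than the pick_n_best utility used in the UDF.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_prediction(logits, labels):
    """Return the most likely label and its probability, rounded to two decimals."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"top_prediction": labels[best], "confidence": round(probs[best], 2)}


labels = ["boxer", "collie", "Rottweiler"]  # hypothetical label subset
logits = [2.0, 1.0, 0.1]                    # hypothetical raw model scores
result = top_prediction(logits, labels)

print(result)
```

Because the probabilities are spread over every class the model knows, a top confidence such as 0.3 means the model split its belief across several similar breeds.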

5. Store your Results#

Let’s take the results we need (breed classification) and store them with the rest of our tabular data. Let’s also encode the thumbnail column as JPEG and store it as binary data in our DataFrame.

We’ll join the original data df to our df_classified to get the breed classification data.

df_res = df.join(df_classified.select("dog_name", "classify_breed", "thumbnail"), on="dog_name", how="left")
df_res.show()
| dog_name (Utf8) | full_name (Utf8) | urls (Utf8) | classify_breed (Struct[top_prediction: Utf8, confidence: Float32]) | thumbnail (Image[MIXED]) |
|---|---|---|---|---|
| Ernie | Ernesto Evergreen | https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg | {top_prediction: boxer, confidence: 0.52} | <Image> |
| Jackie | James Jale | https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg | {top_prediction: American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier, confidence: 0.42} | <Image> |
| Wolfie | Wolfgang Winter | https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg | {top_prediction: collie, confidence: 0.5} | <Image> |
| Shaggie | Shandra Shamas | https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg | {top_prediction: standard schnauzer, confidence: 0.3} | <Image> |
| Zadie | Zaya Zaphora | https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg | {top_prediction: Rottweiler, confidence: 0.79} | <Image> |

(Showing first 5 of 5 rows)

Then encode the thumbnails as JPEG and drop the original thumbnail column:

df_res = df_res.with_column("thumbnail_bin", daft.col("thumbnail").image.encode("jpeg")).exclude("thumbnail")
df_res.show()
| full_name (Utf8) | dog_name (Utf8) | urls (Utf8) | classify_breed (Struct[top_prediction: Utf8, confidence: Float32]) | thumbnail_bin (Binary) |
|---|---|---|---|---|
| Ernesto Evergreen | Ernie | https://live.staticflickr.com/65535/53671838774_03ba68d203_o.jpg | {top_prediction: boxer, confidence: 0.52} | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| James Jale | Jackie | https://live.staticflickr.com/65535/53671700073_2c9441422e_o.jpg | {top_prediction: American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier, confidence: 0.42} | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| Wolfgang Winter | Wolfie | https://live.staticflickr.com/65535/53670606332_1ea5f2ce68_o.jpg | {top_prediction: collie, confidence: 0.5} | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| Shandra Shamas | Shaggie | https://live.staticflickr.com/65535/53671838039_b97411a441_o.jpg | {top_prediction: standard schnauzer, confidence: 0.3} | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |
| Zaya Zaphora | Zadie | https://live.staticflickr.com/65535/53671698613_0230f8af3c_o.jpg | {top_prediction: Rottweiler, confidence: 0.79} | b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"... |

(Showing first 5 of 5 rows)

Then we can write the results out to a storage format, such as Parquet or Delta Lake:

# write to Parquet
df_res.write_parquet("dogs_classified.parquet")
# write to Delta Lake
df_res.write_deltalake("delta/dogs_classified")

Great job!

A note about working with Images at URLs#

Storing images as URLs may mean that the size of your data increases significantly during processing.

For example, a 10MB Parquet file with a few hundred thousand rows of integers and URLs seems small enough to process locally on your machine. But if these URLs point to high-definition images, then your data volume will be much higher once you download the image bytes for processing.
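A back-of-the-envelope estimate makes this concrete. The sketch below averages the five file sizes listed in the from_glob_path output earlier in this tutorial and scales them to an assumed 300,000 rows. The row count and the estimate_download_bytes helper are illustrative assumptions, and real high-definition images would be far larger than these small JPEGs.

```python
def estimate_download_bytes(num_rows, sample_sizes):
    """Estimate total downloaded bytes from the average size of a sample of images."""
    average = sum(sample_sizes) / len(sample_sizes)
    return int(num_rows * average)


# File sizes in bytes, taken from the `size` column of the from_glob_path output
sample_sizes = [23604, 22713, 24990, 42836, 13432]

num_rows = 300_000                # assumption: "a few hundred thousand" rows
parquet_bytes = 10 * 1000 ** 2    # the 10MB Parquet file from the text

estimated = estimate_download_bytes(num_rows, sample_sizes)
growth = estimated / parquet_bytes

print(f"estimated image bytes: {estimated / 1e9:.2f} GB")
print(f"growth over the Parquet file: {growth:.0f}x")
```

Even with images of roughly 25 KB each, the downloaded bytes come to several gigabytes, hundreds of times the size of the Parquet file that holds the URLs.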

This is important to keep in mind when choosing where and on what type of infrastructure to run your data workloads.