Preprocess with Nvidia Dali

Introduction

NVIDIA DALI (Data Loading Library) is NVIDIA's library for loading and preprocessing data such as images or video, using hardware acceleration when it is available.

Installation

1. Download repo: git clone --recursive https://github.com/NVIDIA/DALI

2. cd DALI

For Jetson

On Jetson platforms the only way to install DALI is to build it from source. NVIDIA provides a Docker container that cross-compiles the build on another machine.

1. Inside Dali folder: sudo docker build -t nvidia/dali:builder_aarch64-linux -f docker/Dockerfile.build.aarch64-linux .

2. Build: sudo docker run -v $(pwd):/dali nvidia/dali:builder_aarch64-linux

3. When the build finishes, the generated wheels are in the wheelhouse folder. Copy them to the target device.

4. Install them both with: pip3 install WHEEL_NAME
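
If the exact wheel names are not at hand, the install commands for step 4 can be generated from the folder contents. The following is a minimal sketch using only the Python standard library; pip_install_commands is a hypothetical helper, and the folder name passed to it is assumed to be wherever the wheelhouse folder was copied on the target.

```python
from pathlib import Path


def pip_install_commands(wheelhouse):
    """Return one 'pip3 install' command per wheel file found in the folder."""
    wheels = sorted(Path(wheelhouse).glob("*.whl"))
    return [f"pip3 install {wheel}" for wheel in wheels]
```

Calling pip_install_commands("wheelhouse") and printing each entry gives the commands to run, one per wheel.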

For x86 platforms

1. mkdir build

2. cd build

3. cmake -D CMAKE_BUILD_TYPE=Release ..

4. make -j"$(nproc)"

5. Still inside the build folder, install the Python package: pip install dali/python
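
After installing on either platform, it can be worth confirming that Python actually finds the package. The sketch below assumes the package is importable as nvidia.dali (the module name used in the examples further down); dali_version is a hypothetical helper, not part of DALI.

```python
import importlib
import importlib.util


def dali_version():
    """Return the installed DALI version string, or None if DALI is missing."""
    try:
        spec = importlib.util.find_spec("nvidia.dali")
    except ModuleNotFoundError:
        # Raised when the parent package "nvidia" itself is not installed.
        return None
    if spec is None:
        return None
    module = importlib.import_module("nvidia.dali")
    return getattr(module, "__version__", "unknown")


print("DALI version:", dali_version())
```

A result of None means the wheels were not installed into the Python environment that is currently in use.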

Usage

The examples below were written for Python 3.8 with CUDA 10.2 and JetPack 4.6.2. See the DALI Getting Started guide for an introduction.

DALI groups its operations into pipelines. A pipeline covers everything from data loading to data processing, which lets DALI optimize the whole chain, and it returns the processed data at the end.

Data can be loaded from a variety of sources, from file lists with paths to code-generated data; see the DALI documentation for the full list. The example here uses code-generated data, which is fed into the pipeline through an external source element:

    fn.external_source(source=SOURCE, num_outputs=N, dtype=types.TYPE, device=DEV)

This element accepts many kinds of input and tries to load them. SOURCE is where the data is fetched from, num_outputs is the number of outputs the source produces on each iteration, and device selects where the data is placed: 'cpu' for host memory or 'gpu' for GPU memory. The 'mixed' value is used by some other DALI operators, such as image decoders, that take CPU input and produce GPU output.

After that, the data can be used by more elements of the pipeline.

This example shows a simple scaler pipeline that takes an image, scales it, crops it and then normalizes it.

    @dali.pipeline_def
    def scaler_pipe():
        # With num_outputs set, external_source returns a list of outputs, so unpack the single one
        (img,) = fn.external_source(source=sample_source, num_outputs=1, dtype=types.UINT8, device="gpu", batch=True)
        scaled_img = fn.resize(img, resize_x=455, resize_y=256, interp_type=dali.types.INTERP_LINEAR, antialias=False, dtype=types.UINT8, device="gpu")
        cropped_img = fn.crop(scaled_img, crop_w=256, crop_h=256, dtype=types.FLOAT, device="gpu")
        divided_img = cropped_img / 255.0
        normalized_img = fn.normalize(divided_img, mean=0.45, stddev=0.225, dtype=types.FLOAT, device="gpu")
        return normalized_img
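
As a quick sanity check of the last two steps of scaler_pipe, the arithmetic can be reproduced for a single pixel value in plain Python. This sketch assumes that fn.normalize, given a scalar mean and stddev, computes (x - mean) / stddev; normalize_pixel is a hypothetical helper and does not call DALI.

```python
def normalize_pixel(value, mean=0.45, stddev=0.225):
    """Mimic the last two pipeline steps for one 8-bit pixel value."""
    scaled = value / 255.0           # same as divided_img = cropped_img / 255.0
    return (scaled - mean) / stddev  # assumed formula behind fn.normalize


lowest = normalize_pixel(0)     # darkest possible input
highest = normalize_pixel(255)  # brightest possible input
print(lowest, highest)
```

Under that assumption the pipeline output falls roughly between -2.0 and 2.44, which is a handy range to compare against when inspecting real results.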

The scaler_pipe pipeline above uses only GPU operations. Some operations are available on the CPU only; check the DALI documentation for more information. Also make sure each dtype is set correctly, otherwise the pipeline will likely throw an error. Another thing to take into account is sample_source, the object the data is taken from. This data source needs to implement the Python iterator protocol (__iter__ and __next__):


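    import numpy

    class SimpleDataSRC:
        sample_size = 30  # number of random frames to generate

        def __init__(self):
            # Random 1080p RGB frames; high is exclusive, so values span 0..255
            self.frames = numpy.ascontiguousarray(numpy.random.randint(
                low=0, high=256, size=(self.sample_size, 1080, 1920, 3), dtype=numpy.uint8))

        def __iter__(self):
            self.i = -1
            self.n = self.sample_size
            return self

        def __next__(self):
            self.i += 1
            if self.i >= self.n:
                raise StopIteration  # all frames have been served
            # Outer list: one entry per output; inner list: the batch (one sample)
            return [[self.frames[self.i]]]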

Two notes on this class. Wrapping the data in ascontiguousarray is recommended, because passing non-contiguous arrays sometimes raises errors. The __next__ method returns the sample wrapped in two lists because the external source element expects that layout: with num_outputs=1 and batch=True, the outer list holds one entry per output and the inner list is the batch. Without the wrapping, the H or W dimension of the sample would be interpreted as the number of samples, which is not correct.
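
The effect of the nesting can be illustrated without DALI. The sketch below uses plain nested lists as a stand-in for an image and splits the dimensions the way the text describes (outputs first, then batch, then the sample itself). Both functions are hypothetical helpers written for this illustration; they mimic the convention and are not DALI's actual parsing code.

```python
def nested_shape(data):
    """Return the shape of a regularly nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0]
    return tuple(shape)


def read_like_external_source(returned):
    """Split a shape into outputs, batch and sample dimensions, in that order."""
    shape = nested_shape(returned)
    return {"num_outputs": shape[0], "batch_size": shape[1], "sample_shape": shape[2:]}


# A tiny stand-in "image": 4 rows, 6 columns, 3 channels.
sample = [[[0, 0, 0] for _ in range(6)] for _ in range(4)]

print(read_like_external_source([[sample]]))  # wrapped twice, as in __next__
print(read_like_external_source([sample]))    # wrapped once: the rows become the batch
```

With two levels of wrapping the sample keeps its full (4, 6, 3) shape in a batch of one. With a single level, the 4 rows are read as a batch of 4 samples of shape (6, 3), which is the misinterpretation the nested return value avoids.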

After setting up the data source, the pipeline needs to be initialized:

    import nvidia.dali.fn as fn
    import nvidia.dali.types as types
    from nvidia import dali

    src = SimpleDataSRC()

    @dali.pipeline_def
    def scaler_pipe():
        # batch=True: each call to the source returns a whole batch, not a single sample
        # With num_outputs set, external_source returns a list of outputs, so unpack the single one
        (img,) = fn.external_source(source=src, num_outputs=1, dtype=types.UINT8, device="gpu", batch=True)
        scaled_img = fn.resize(img, resize_x=455, resize_y=256, interp_type=dali.types.INTERP_LINEAR, antialias=False, dtype=types.UINT8, device="gpu")
        cropped_img = fn.crop(scaled_img, crop_w=256, crop_h=256, dtype=types.FLOAT, device="gpu")
        divided_img = cropped_img / 255.0
        normalized_img = fn.normalize(divided_img, mean=0.45, stddev=0.225, dtype=types.FLOAT, device="gpu")
        return normalized_img

    # Create the pipeline from the decorated function.
    # batch_size is 1 because one sample is processed at a time;
    # set num_threads and device_id to match your setup.
    pipe = scaler_pipe(batch_size=1, num_threads=1, device_id=0)
    pipe.build()  # build the pipeline before the first run

After that the pipeline can be used like so:

    res = pipe.run()
    actual_res = res[0]  # run() always returns a tuple with one entry per pipeline output
    converted_res = actual_res.as_cpu()  # copy the result to CPU memory; results are TensorListGPU or TensorListCPU objects

Some operations can use this result as is. To get a NumPy array for use elsewhere:

    import numpy as np
    batch = converted_res.as_array()  # whole batch as one NumPy array (all samples must have the same shape)
    np_result = batch[0]  # the single 256x256x3 sample
