How to enable eIQ on IMX8MP

This document provides detailed instructions for enabling eIQ on the I-Pi SMARC IMX8M Plus.

TensorFlow Lite
OpenCV

Prerequisites

Download the prebuilt full Yocto image (2G/4G). Use any of the following methods to flash the Yocto image to the target device:

Flashing the image to an SD card: click here.
Flashing the image to eMMC: click here.
Flashing the image to eMMC using the UUU tool: click here.

Once the Yocto image has been flashed to the target device, power it on and open the Yocto terminal, located at the top-left corner of the screen.

According to NXP, the NPU is currently enabled only for Arm NN, ONNX, TensorFlow Lite, and DeepViewRT; the remaining frameworks run on the CPU only.

Enabling NPU and CPU on TensorFlow Lite

TensorFlow Lite is an open-source software library focused on running machine learning models on mobile and embedded devices. It enables on-device machine learning inference with low latency and a small binary size.

Features

TensorFlow Lite v2.5.0
Multithreaded computation with acceleration using Arm Neon SIMD instructions on Cortex-A cores.
Parallel computation using GPU/NPU hardware acceleration.
C++ and Python API (supported Python version 3).

A Yocto Linux BSP image with the machine learning layer included by default contains a simple preinstalled example called 'label_image', usable with image classification models. The example is located at /usr/bin/tensorflow-lite-2.5.0/examples.

Go to:

cd /usr/bin/tensorflow-lite-2.5.0/examples

This is the sample image we used to test the CPU and NPU. You can use your own image, but it must be in (.bmp) format.

To run the example with the MobileNet model on the CPU, use the following command:

./label_image -m mobilenet_v1_1.0_224_quant.tflite -i grace_hopper.bmp -l labels.txt

The output of a successful classification for the 'grace_hopper.bmp' input image is as follows:

Loaded model mobilenet_v1_1.0_224_quant.tflite
resolved reporter
invoked
average time: 39.271 ms
0.780392: 653 military uniform
0.105882: 907 Windsor tie
0.0156863: 458 bow tie
0.0117647: 466 bulletproof vest
0.00784314: 835 suit

To run the example application on the CPU using the XNNPACK delegate, add the -x 1 switch:

./label_image -m mobilenet_v1_1.0_224_quant.tflite -i grace_hopper.bmp -l labels.txt -x 1

To run the example with the same model on the GPU/NPU hardware accelerator, add the -a 1 (NNAPI delegate) or -V 1 (VX delegate) command-line argument. To differentiate between the 3D GPU and the NPU, use the USE_GPU_INFERENCE switch. For example, to run the model accelerated on the NPU hardware using the NNAPI delegate, use this command:

USE_GPU_INFERENCE=0 ./label_image -m mobilenet_v1_1.0_224_quant.tflite -i grace_hopper.bmp -l labels.txt -a 1

The output with NPU acceleration enabled should be as follows:

Loaded model mobilenet_v1_1.0_224_quant.tflite
resolved reporter
INFO: Created TensorFlow Lite delegate for NNAPI.
Applied NNAPI delegate.
invoked
average time: 2.967 ms
0.74902: 653 military uniform
0.121569: 907 Windsor tie
0.0196078: 458 bow tie
0.0117647: 466 bulletproof vest
0.00784314: 835 suit
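The same classification flow can also be scripted through the TensorFlow Lite Python API listed in the features above. The sketch below is a minimal, unofficial equivalent of label_image, assuming the tflite_runtime Python package and the OpenCV Python bindings are available on the image; the VX delegate library path (/usr/lib/libvx_delegate.so) is an assumption and may differ on your BSP. Run it from /usr/bin/tensorflow-lite-2.5.0/examples so the relative model, label, and image paths resolve.

#!/usr/bin/env python3
# Minimal sketch of the label_image flow using the TensorFlow Lite Python API.
# Assumptions: tflite_runtime and the OpenCV Python bindings are installed on
# the image; the VX delegate path below is a guess and may differ on your BSP.
import numpy as np
import cv2
from tflite_runtime.interpreter import Interpreter, load_delegate

MODEL = "mobilenet_v1_1.0_224_quant.tflite"
LABELS = "labels.txt"
IMAGE = "grace_hopper.bmp"
USE_NPU = False  # set True to try the VX delegate (assumed path below)

delegates = []
if USE_NPU:
    delegates = [load_delegate("/usr/lib/libvx_delegate.so")]

interpreter = Interpreter(model_path=MODEL, experimental_delegates=delegates)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
_, height, width, _ = inp["shape"]

# Load the .bmp test image and resize it to the model's input resolution.
img = cv2.cvtColor(cv2.imread(IMAGE), cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (width, height)).astype(np.uint8)
interpreter.set_tensor(inp["index"], np.expand_dims(img, 0))
interpreter.invoke()

# The quantized model outputs uint8 scores; print the top-5 classes.
scores = interpreter.get_tensor(out["index"])[0]
labels = [line.strip() for line in open(LABELS)]
for i in scores.argsort()[-5:][::-1]:
    print(f"{scores[i] / 255.0:.6f}: {i} {labels[i]}")

With USE_NPU left at False this stays on the CPU; switching it to True only helps if the VX delegate library is actually present at the assumed path.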
Running Benchmark Applications

A Yocto Linux BSP image with the machine learning layer included by default contains a preinstalled benchmarking application. It performs a simple TensorFlow Lite model inference and prints benchmarking information. The application binary file is located at /usr/bin/tensorflow-lite-2.5.0/examples.

Go to the directory:

cd /usr/bin/tensorflow-lite-2.5.0/examples

To run the benchmark with computation on the CPU, use the following command:

./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite

You can optionally specify the number of threads with the --num_threads=X parameter to run the inference on multiple cores. For highest performance, set X to the number of cores available.

The output of the benchmarking application should be similar to:

STARTING!
Duplicate flags: num_threads
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [1]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input layer values files: []
Allow fp16 : [0]
Require full delegation : [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
Enable platform-wide tracing: [0]
#threads used for CPU inference: [1]
Max number of delegated partitions : [0]
Min nodes per partition : [0]
Loaded model mobilenet_v1_1.0_224_quant.tflite
The input model file size (MB): 4.27635
Initialized session in 93.252ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=4 first=147477 curr=140410 min=140279 max=147477 avg=142382 std=2971
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=140422 curr=140269 min=140269 max=140532 avg=140391 std=67
Inference timings in us: Init: 93252, First inference: 147477, Warmup (avg): 142382, Inference (avg): 140391
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=3.14062 overall=10.043
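If you want to time inference from Python instead of the prebuilt benchmark_model binary, the rough sketch below averages the latency of repeated invocations. It is only an approximation of benchmark_model (it has none of the tool's warm-up and run-duration logic), it assumes the tflite_runtime package is installed, and the thread count of 4 simply mirrors the --num_threads advice above.

#!/usr/bin/env python3
# Rough Python analogue of benchmark_model: average latency over repeated runs.
# Assumption: the image ships the tflite_runtime Python package.
import time
import numpy as np
from tflite_runtime.interpreter import Interpreter

MODEL = "mobilenet_v1_1.0_224_quant.tflite"
RUNS = 50
THREADS = 4  # like --num_threads=4; set to the number of available cores

interpreter = Interpreter(model_path=MODEL, num_threads=THREADS)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

# Feed random uint8 data of the right shape; the content does not matter for timing.
shape = tuple(inp["shape"])
interpreter.set_tensor(inp["index"], np.random.randint(0, 256, shape, dtype=np.uint8))

interpreter.invoke()  # single warm-up run, excluded from the average

start = time.monotonic()
for _ in range(RUNS):
    interpreter.invoke()
elapsed_us = (time.monotonic() - start) / RUNS * 1e6
print(f"Inference (avg) over {RUNS} runs: {elapsed_us:.0f} us")

Expect the numbers to differ somewhat from benchmark_model, which controls warm-up and run duration more carefully.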
To run the inference using the XNNPACK delegate, add the --use_xnnpack=true switch:

./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_xnnpack=true

To run the inference using the GPU/NPU hardware accelerator, add the --use_nnapi=true (NNAPI delegate) or --use_vxdelegate=true (VX delegate) switch:

./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true

The output with GPU/NPU module acceleration enabled should be similar to:

STARTING!
Duplicate flags: num_threads
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [1]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input layer values files: []
Allow fp16 : [0]
Require full delegation : [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
Enable platform-wide tracing: [0]
#threads used for CPU inference: [1]
Max number of delegated partitions : [0]
Min nodes per partition : [0]
Loaded model mobilenet_v1_1.0_224_quant.tflite
INFO: Created TensorFlow Lite delegate for NNAPI.
Applied NNAPI delegate, and the model graph will be completely executed w/ the delegate.
The input model file size (MB): 4.27635
Initialized session in 18.648ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=5969598
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=306 first=3321 curr=3171 min=3161 max=3321 avg=3188.46 std=18
Inference timings in us: Init: 18648, First inference: 5969598, Warmup (avg): 5.9696e+06, Inference (avg): 3188.46
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=7.60938 overall=33.7773

Enabling CPU on OpenCV

OpenCV is an open-source computer vision library, and one of its modules, called ML, provides traditional machine learning algorithms. OpenCV offers a unified solution for both neural network inference (DNN module) and classic machine learning algorithms (ML module).

Features

OpenCV 4.5.2
C++ and Python API (supported Python version 3)
Only CPU computation is supported
Input image or live camera (webcam) is supported

OpenCV DNN demos (binaries) are located at /usr/share/OpenCV/samples/bin. Input data and model configurations are located at /usr/share/opencv4/testdata/dnn.

To check the OpenCV sample algorithms, you do not need to install anything new; everything is already installed. Before running the DNN model, copy all the content from the link below into a new file called model.yaml, and copy the classes from the second link as well. Using a USB pen drive, copy the model.yaml file and the classes file, then connect the USB pen drive to the carrier module.
To download model.yaml, click here.
To download the classes for the model, click here.

To copy the model.yaml and classes files from the USB pen drive to Yocto, follow these commands:

df -h
cd /run/media/sda1 (or sdb1)
cp model.yaml /usr/share/OpenCV/samples/bin (for model.yaml)
cp object_detection_classes_yolov3.txt /usr/share/opencv4/testdata/dnn (for classes)

YOLO object detection example

Running the C++ example with image input from the default location /usr/share/OpenCV/samples/data/dnn.

Executing with an image or video file:

./example_dnn_object_detection --config=[PATH-TO-DARKNET]/cfg/yolo.cfg --model=[PATH-TO-DARKNET]/yolo.weights --classes=[PATH-TO-DARKNET]/object_detection_classes_yolov3.txt --width=416 --height=416 --scale=0.00392 --input=[PATH-TO-IMAGE-OR-VIDEO-FILE] --rgb

For our case:

./example_dnn_object_detection --config=/usr/share/opencv4/testdata/dnn/yolov3.cfg --model=/usr/share/opencv4/testdata/dnn/yolov3.weights --classes=/usr/share/opencv4/testdata/dnn/object_detection_classes_yolov3.txt --width=416 --height=416 --scale=0.00392 --input=/usr/share/opencv4/testdata/dnn/dog416.png --rgb

Executing with a live webcam:

./example_dnn_object_detection --config=[PATH-TO-DARKNET]/cfg/yolo.cfg --model=[PATH-TO-DARKNET]/yolo.weights --classes=[PATH-TO-DARKNET]/object_detection_classes_yolov3.txt --width=416 --height=416 --scale=0.00392 --device=[CAMERA_DEV_NUMBER] --rgb

For our case:

./example_dnn_object_detection --config=/usr/share/opencv4/testdata/dnn/yolov3.cfg --model=/usr/share/opencv4/testdata/dnn/yolov3.weights --classes=/usr/share/opencv4/testdata/dnn/object_detection_classes_yolov3.txt --width=416 --height=416 --scale=0.00392 --device=1 --rgb
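Because the image also provides the OpenCV Python API, the same YOLOv3 pipeline can be scripted directly. The following is a minimal sketch rather than the shipped sample: it assumes yolov3.cfg, yolov3.weights, and object_detection_classes_yolov3.txt are already at the testdata paths used above, and it runs on the CPU only, in line with the feature list.

#!/usr/bin/env python3
# Minimal Python sketch of the YOLOv3 pipeline run above with
# example_dnn_object_detection. Assumes the cfg/weights/classes files are at
# the testdata paths used in this section; CPU-only, as noted in the features.
import cv2
import numpy as np

BASE = "/usr/share/opencv4/testdata/dnn"
net = cv2.dnn.readNetFromDarknet(f"{BASE}/yolov3.cfg", f"{BASE}/yolov3.weights")
classes = open(f"{BASE}/object_detection_classes_yolov3.txt").read().splitlines()

img = cv2.imread(f"{BASE}/dog416.png")
h, w = img.shape[:2]

# Same preprocessing as the C++ sample: scale=0.00392 (1/255), 416x416, RGB.
blob = cv2.dnn.blobFromImage(img, 0.00392, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outs = net.forward(net.getUnconnectedOutLayersNames())

# Keep detections with a class score above 0.5 and print them.
for out in outs:
    for det in out:
        scores = det[5:]
        class_id = int(np.argmax(scores))
        conf = float(scores[class_id])
        if conf > 0.5:
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            print(f"{classes[class_id]}: {conf:.2f} at "
                  f"({int(cx - bw / 2)}, {int(cy - bh / 2)}, {int(bw)}, {int(bh)})")

Box post-processing such as non-maximum suppression (for example via cv2.dnn.NMSBoxes) and drawing are left out to keep the sketch short; the C++ sample above handles both.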