How to enable eIQ on IMX8MP

This document provides detailed instructions for enabling eIQ on the I-Pi SMARC IMX8M Plus.

TensorFlow Lite
OpenCV

Prerequisites

Download the prebuilt full Yocto image (2G/4G). Use any of the following methods to flash the Yocto image to the target device:

Flashing the image to an SD card: click here.
Flashing the image to eMMC: click here.
Flashing the image to eMMC using the UUU tool: click here.

Once the Yocto image has been flashed to the target device, power it on and open the Yocto terminal, located at the top-left corner of the screen.

According to NXP, the NPU is currently enabled only for Arm NN, ONNX, TensorFlow Lite, and DeepViewRT; the remaining frameworks run on the CPU only.

Enabling NPU and CPU on TensorFlow Lite

TensorFlow Lite is an open-source software library focused on running machine learning models on mobile and embedded devices. It enables on-device machine learning inference with low latency and a small binary size.

Features

TensorFlow Lite v2.5.0
Multithreaded computation with acceleration using Arm Neon SIMD instructions on Cortex-A cores.
Parallel computation using GPU/NPU hardware acceleration.
C++ and Python API (supported Python version 3).

A Yocto Linux BSP image with the machine learning layer included by default contains a simple preinstalled example called 'label_image', usable with image classification models. The example is located at /usr/bin/tensorflow-lite-2.5.0/examples.

Go to:

cd /usr/bin/tensorflow-lite-2.5.0/examples

This is the sample image we used to test the CPU and NPU. You can use your own image, but it must be in (.bmp) format.

To run the example with the MobileNet model on the CPU, use the following command:

./label_image -m mobilenet_v1_1.0_224_quant.tflite -i grace_hopper.bmp -l labels.txt

The output of a successful classification for the 'grace_hopper.bmp' input image is as follows:

Loaded model mobilenet_v1_1.0_224_quant.tflite
resolved reporter
invoked
average time: 39.271 ms
0.780392: 653 military uniform
0.105882: 907 Windsor tie
0.0156863: 458 bow tie
0.0117647: 466 bulletproof vest
0.00784314: 835 suit

To run the example application on the CPU using the XNNPACK delegate, add the -x 1 switch:

./label_image -m mobilenet_v1_1.0_224_quant.tflite -i grace_hopper.bmp -l labels.txt -x 1

To run the example with the same model on the GPU/NPU hardware accelerator, add the -a 1 (NNAPI delegate) or -V 1 (VX delegate) command-line argument. To differentiate between the 3D GPU and the NPU, use the USE_GPU_INFERENCE switch. For example, to run the model accelerated on the NPU hardware using the NNAPI delegate, use this command:

USE_GPU_INFERENCE=0 ./label_image -m mobilenet_v1_1.0_224_quant.tflite -i grace_hopper.bmp -l labels.txt -a 1

The output with NPU acceleration enabled should be as follows:

Loaded model mobilenet_v1_1.0_224_quant.tflite
resolved reporter
INFO: Created TensorFlow Lite delegate for NNAPI.
Applied NNAPI delegate.
invoked
average time: 2.967 ms
0.74902: 653 military uniform
0.121569: 907 Windsor tie
0.0196078: 458 bow tie
0.0117647: 466 bulletproof vest
0.00784314: 835 suit
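The same classification flow can also be scripted through the TensorFlow Lite Python API listed in the features above. The sketch below is a minimal, unofficial equivalent of label_image, assuming the tflite_runtime Python package and the OpenCV Python bindings are available on the image; the VX delegate library path (/usr/lib/libvx_delegate.so) is an assumption and may differ on your BSP. Run it from /usr/bin/tensorflow-lite-2.5.0/examples so the relative model, label, and image paths resolve.

#!/usr/bin/env python3
# Minimal sketch of the label_image flow using the TensorFlow Lite Python API.
# Assumptions: tflite_runtime and the OpenCV Python bindings are installed on
# the image; the VX delegate path below is a guess and may differ on your BSP.
import numpy as np
import cv2
from tflite_runtime.interpreter import Interpreter, load_delegate

MODEL = "mobilenet_v1_1.0_224_quant.tflite"
LABELS = "labels.txt"
IMAGE = "grace_hopper.bmp"
USE_NPU = False  # set True to try the VX delegate (assumed path below)

delegates = []
if USE_NPU:
    delegates = [load_delegate("/usr/lib/libvx_delegate.so")]

interpreter = Interpreter(model_path=MODEL, experimental_delegates=delegates)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
_, height, width, _ = inp["shape"]

# Load the .bmp test image and resize it to the model's input resolution.
img = cv2.cvtColor(cv2.imread(IMAGE), cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (width, height)).astype(np.uint8)
interpreter.set_tensor(inp["index"], np.expand_dims(img, 0))
interpreter.invoke()

# The quantized model outputs uint8 scores; print the top-5 classes.
scores = interpreter.get_tensor(out["index"])[0]
labels = [line.strip() for line in open(LABELS)]
for i in scores.argsort()[-5:][::-1]:
    print(f"{scores[i] / 255.0:.6f}: {i} {labels[i]}")

With USE_NPU left at False this stays on the CPU; switching it to True only helps if the VX delegate library is actually present at the assumed path.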
Running Benchmark Applications

A Yocto Linux BSP image with the machine learning layer included by default contains a preinstalled benchmarking application. It performs a simple TensorFlow Lite model inference and prints benchmarking information. The application binary file is located at /usr/bin/tensorflow-lite-2.5.0/examples.

Go to the directory:

cd /usr/bin/tensorflow-lite-2.5.0/examples

To run the benchmark with computation on the CPU, use the following command:

./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite

You can optionally specify the number of threads with the --num_threads=X parameter to run the inference on multiple cores. For highest performance, set X to the number of cores available.

The output of the benchmarking application should be similar to:

STARTING!
Duplicate flags: num_threads
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [1]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input layer values files: []
Allow fp16 : [0]
Require full delegation : [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
Enable platform-wide tracing: [0]
#threads used for CPU inference: [1]
Max number of delegated partitions : [0]
Min nodes per partition : [0]
Loaded model mobilenet_v1_1.0_224_quant.tflite
The input model file size (MB): 4.27635
Initialized session in 93.252ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=4 first=147477 curr=140410 min=140279 max=147477 avg=142382 std=2971
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=140422 curr=140269 min=140269 max=140532 avg=140391 std=67
Inference timings in us: Init: 93252, First inference: 147477, Warmup (avg): 142382, Inference (avg): 140391
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=3.14062 overall=10.043
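If you want to time inference from Python instead of the prebuilt benchmark_model binary, the rough sketch below averages the latency of repeated invocations. It is only an approximation of benchmark_model (it has none of the tool's warm-up and run-duration logic), it assumes the tflite_runtime package is installed, and the thread count of 4 simply mirrors the --num_threads advice above.

#!/usr/bin/env python3
# Rough Python analogue of benchmark_model: average latency over repeated runs.
# Assumption: the image ships the tflite_runtime Python package.
import time
import numpy as np
from tflite_runtime.interpreter import Interpreter

MODEL = "mobilenet_v1_1.0_224_quant.tflite"
RUNS = 50
THREADS = 4  # like --num_threads=4; set to the number of available cores

interpreter = Interpreter(model_path=MODEL, num_threads=THREADS)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

# Feed random uint8 data of the right shape; the content does not matter for timing.
shape = tuple(inp["shape"])
interpreter.set_tensor(inp["index"], np.random.randint(0, 256, shape, dtype=np.uint8))

interpreter.invoke()  # single warm-up run, excluded from the average

start = time.monotonic()
for _ in range(RUNS):
    interpreter.invoke()
elapsed_us = (time.monotonic() - start) / RUNS * 1e6
print(f"Inference (avg) over {RUNS} runs: {elapsed_us:.0f} us")

Expect the numbers to differ somewhat from benchmark_model, which controls warm-up and run duration more carefully.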
To run the inference using the XNNPACK delegate, add the --use_xnnpack=true switch:

./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_xnnpack=true

To run the inference using the GPU/NPU hardware accelerator, add the --use_nnapi=true (NNAPI delegate) or --use_vxdelegate=true (VX delegate) switch:

./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true

The output with GPU/NPU module acceleration enabled should be similar to:

STARTING!
Duplicate flags: num_threads
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [1]
Use caching: [0]
Benchmark name: []
Output prefix: []
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
Input layers: []
Input shapes: []
Input value ranges: []
Input layer values files: []
Allow fp16 : [0]
Require full delegation : [0]
Enable op profiling: [0]
Max profiling buffer entries: [1024]
CSV File to export profiling data to: []
Enable platform-wide tracing: [0]
#threads used for CPU inference: [1]
Max number of delegated partitions : [0]
Min nodes per partition : [0]
Loaded model mobilenet_v1_1.0_224_quant.tflite
INFO: Created TensorFlow Lite delegate for NNAPI.
Applied NNAPI delegate, and the model graph will be completely executed w/ the delegate.
The input model file size (MB): 4.27635
Initialized session in 18.648ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=5969598
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=306 first=3321 curr=3171 min=3161 max=3321 avg=3188.46 std=18
Inference timings in us: Init: 18648, First inference: 5969598, Warmup (avg): 5.9696e+06, Inference (avg): 3188.46
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=7.60938 overall=33.7773

Enabling CPU on OpenCV

OpenCV is an open-source computer vision library, and one of its modules, called ML, provides traditional machine learning algorithms. OpenCV offers a unified solution for both neural network inference (DNN module) and classic machine learning algorithms (ML module).

Features

OpenCV 4.5.2
C++ and Python API (supported Python version 3)
Only CPU computation is supported
Input image or live camera (webcam) is supported

OpenCV DNN demos (binaries) are located at /usr/share/OpenCV/samples/bin. Input data and model configurations are located at /usr/share/opencv4/testdata/dnn.

To check the OpenCV sample algorithms, you do not need to install anything new; everything is already installed. Before running the DNN model, copy all the content from the link below into a new file called model.yaml, and copy the classes from the second link as well. Using a USB pen drive, copy the model.yaml file and the classes file, then connect the USB pen drive to the carrier module.
To download model.yaml, click here.
To download the classes for the model, click here.

To copy the model.yaml and classes files from the USB pen drive to Yocto, follow these commands:

df -h
cd /run/media/sda1 (or sdb1)
cp model.yaml /usr/share/OpenCV/samples/bin (for model.yaml)
cp object_detection_classes_yolov3.txt /usr/share/opencv4/testdata/dnn (for classes)

YOLO object detection example

Running the C++ example with image input from the default location /usr/share/OpenCV/samples/data/dnn.

Executing with an image or video file:

./example_dnn_object_detection --config=[PATH-TO-DARKNET]/cfg/yolo.cfg --model=[PATH-TO-DARKNET]/yolo.weights --classes=[PATH-TO-DARKNET]/object_detection_classes_yolov3.txt --width=416 --height=416 --scale=0.00392 --input=[PATH-TO-IMAGE-OR-VIDEO-FILE] --rgb

For our case:

./example_dnn_object_detection --config=/usr/share/opencv4/testdata/dnn/yolov3.cfg --model=/usr/share/opencv4/testdata/dnn/yolov3.weights --classes=/usr/share/opencv4/testdata/dnn/object_detection_classes_yolov3.txt --width=416 --height=416 --scale=0.00392 --input=/usr/share/opencv4/testdata/dnn/dog416.png --rgb

Executing with a live webcam:

./example_dnn_object_detection --config=[PATH-TO-DARKNET]/cfg/yolo.cfg --model=[PATH-TO-DARKNET]/yolo.weights --classes=[PATH-TO-DARKNET]/object_detection_classes_yolov3.txt --width=416 --height=416 --scale=0.00392 --device=[CAMERA_DEV_NUMBER] --rgb

For our case:

./example_dnn_object_detection --config=/usr/share/opencv4/testdata/dnn/yolov3.cfg --model=/usr/share/opencv4/testdata/dnn/yolov3.weights --classes=/usr/share/opencv4/testdata/dnn/object_detection_classes_yolov3.txt --width=416 --height=416 --scale=0.00392 --device=1 --rgb
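Because the image also provides the OpenCV Python API, the same YOLOv3 pipeline can be scripted directly. The following is a minimal sketch rather than the shipped sample: it assumes yolov3.cfg, yolov3.weights, and object_detection_classes_yolov3.txt are already at the testdata paths used above, and it runs on the CPU only, in line with the feature list.

#!/usr/bin/env python3
# Minimal Python sketch of the YOLOv3 pipeline run above with
# example_dnn_object_detection. Assumes the cfg/weights/classes files are at
# the testdata paths used in this section; CPU-only, as noted in the features.
import cv2
import numpy as np

BASE = "/usr/share/opencv4/testdata/dnn"
net = cv2.dnn.readNetFromDarknet(f"{BASE}/yolov3.cfg", f"{BASE}/yolov3.weights")
classes = open(f"{BASE}/object_detection_classes_yolov3.txt").read().splitlines()

img = cv2.imread(f"{BASE}/dog416.png")
h, w = img.shape[:2]

# Same preprocessing as the C++ sample: scale=0.00392 (1/255), 416x416, RGB.
blob = cv2.dnn.blobFromImage(img, 0.00392, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outs = net.forward(net.getUnconnectedOutLayersNames())

# Keep detections with a class score above 0.5 and print them.
for out in outs:
    for det in out:
        scores = det[5:]
        class_id = int(np.argmax(scores))
        conf = float(scores[class_id])
        if conf > 0.5:
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            print(f"{classes[class_id]}: {conf:.2f} at "
                  f"({int(cx - bw / 2)}, {int(cy - bh / 2)}, {int(bw)}, {int(bh)})")

Box post-processing such as non-maximum suppression (for example via cv2.dnn.NMSBoxes) and drawing are left out to keep the sketch short; the C++ sample above handles both.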