PyTorch AI Framework Optimized for AMPERE®

Prerequisites
Please make sure to install Docker.

Overview
The Ampere Optimized PyTorch inference acceleration engine is fully integrated with the PyTorch framework. PyTorch models and software written with the PyTorch API run as-is, without modifications.

PyTorch Framework
Python is installed with Ampere Optimized PyTorch and all dependencies. No additional installation steps are needed.

Version Compatibility
This release is based on PyTorch 2.0.0 and comes with the compatible Torchvision 0.15.1 installed.

Python
PyTorch 2.0.0 is built for Python 3.10, supporting Ubuntu 22.04. Regarding other Python versions, please contact your Ampere sales representative. If you are using the software through a third party, contact their customer support team for help. You can also contact the AI team at ai-support@amperecomputing.com.

Configurations
The Ampere Optimized PyTorch inference engine can be configured by a set of environment variables for performance and debugging purposes. They can be set on the command line when running PyTorch models (e.g., AIO_NUM_THREADS=16 python run.py -p fp32) or in the shell initialization script.

AIO_PROCESS_MODE
Controls whether the Ampere Optimized PyTorch inference engine is used to run the PyTorch model:
• 0: disabled.
• 1: enabled (default).

AIO_CPU_BIND
Enables core binding. If enabled, each Ampere Optimized PyTorch thread binds itself to a single core:
• 0: core binding disabled.
• 1: core binding enabled (default).

AIO_MEM_BIND
Binds memory to NUMA (non-uniform memory access) node 0. For optimal performance, numactl (https://linux.die.net/man/8/numactl) is preferred. A numactl bind affects both the PyTorch framework buffers and the optimized framework buffers, while the optimized framework cannot affect buffers allocated by the PyTorch framework:
• 0: membind disabled.
• 1: membind to node 0 (default).

AIO_NUMA_CPUS
Selects the cores that Ampere Optimized PyTorch should bind to (if AIO_CPU_BIND is enabled):
• Not set: use the first N cores of the machine, excluding hyper-threaded cores (default).
• Set: use the first N cores from the list for N threads. The list is in space-separated, 0-based number format. For example, to select cores 0 to 1: AIO_NUMA_CPUS="0 1".

AIO_DEBUG_MODE
Controls the verbosity of debug messages:
• 0: no messages
• 1: errors only
• 2: basic information, warnings, and errors (default)
• 3: most messages
• 4: all messages

Quickstart
The following instructions run on Altra/Altra Max Linux machines with Docker installed. If you are already using a virtual machine pre-installed with the version of Ampere Optimized PyTorch you need (e.g., on a cloud service provider), you can skip the step of launching the Docker container.
Note: This Docker image is developed for benchmarking and evaluation purposes, not for deployment into a production environment. We will provide the required Debian, RPM, and Python packages as needed for your production deployment.

Launching the Docker Container
Pull the Docker image from the Docker Hub repository:
$ docker pull amperecomputingai/pytorch:1.7.0
Launch the Docker container:
$ docker run --privileged=true --rm --name pytorch-aio --network host -it amperecomputingai/pytorch:1.7.0
Warning: This user has, by default, root privileges with Docker. Please limit permissions according to your security policy.
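Once the container is running, a quick sanity check is to confirm the bundled framework versions and run one of the example scripts under /workspace/aio-examples/ (described in the Running Examples section below) with a few of the configuration variables set explicitly. This is a minimal sketch; the thread count and variable values shown are illustrative and should be adjusted to your setup.

$ python3 -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"
$ cd /workspace/aio-examples/classification/resnet_50_v1
$ AIO_PROCESS_MODE=1 AIO_NUM_THREADS=16 AIO_DEBUG_MODE=2 python3 run.py -p fp32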
Running Examples
You can try Ampere Optimized PyTorch by running either the Jupyter Notebook examples or the Python scripts at the CLI level.

To run the Jupyter Notebook QuickStart examples, follow the instructions below.
Set AIO_NUM_THREADS to the requested value first:
$ export AIO_NUM_THREADS=16; export OMP_NUM_THREADS=16
$ cd /workspace/aio-examples/
$ bash start_notebook.sh
If you run the Jupyter Notebook Quickstart on a cloud instance, make sure your machine has port 8080 open, and on your local device run:
$ ssh -N -L 8080:localhost:8080 -i <ssh_key> your_user@xxx.xxx.xxx.xxx
Point a browser to the URL printed out by the Jupyter Notebook launcher. You will find the Jupyter Notebook examples (examples.ipynb) under the /classification and /object_detection folders. The examples run through several inference models, visualize the results they produce, and present the performance numbers.

To use the CLI-level scripts, set AIO_NUM_THREADS to the requested value first:
$ export AIO_NUM_THREADS=16; export OMP_NUM_THREADS=16
$ cd /workspace/aio-examples/
Go to the directory of choice, e.g.:
$ cd classification/resnet_50_v1
Evaluate the model:
$ numactl --physcpubind=0-15 python3 run.py -p fp32

Ampere Optimized PyTorch Programming Guide

Overview
Ampere Optimized PyTorch is powered by the Ampere® AI backend, which accelerates Deep Learning (DL) operations on the Ampere® Altra family of processors. Ampere Optimized PyTorch accelerates DL operations through model optimization, highly vectorized compute kernels, and multi-threaded operations that are automatically tuned to deliver the best latency and throughput on Ampere Altra processors. It delivers 2-5x gains over alternative backend solutions.

Supported Inference Ops
Ampere Optimized PyTorch accelerates the most common PyTorch ops used in various types of models. The accelerated ops and formats are listed below (note: non-accelerated ops will still run without a problem, at the original framework operator speed).

| Layer | FP32 | Explicit FP16 (model defined) | Implicit FP16 (automatic on-the-fly conversion) | Notes |
|---|---|---|---|---|
| Conv2d | Y | | Y | |
| Deconv2d | Y | | | Without bias |
| Linear | Y | | Y | |
| MaxPool2d | Y | | | |
| AvgPool2d | Y | | | |
| AdaptiveAvgPool2d | Y | | | |
| Relu | Y | | Y | |
| Relu6 | Y | | Y | |
| LeakyRelu | Y | | Y | |
| Softmax | Y | | Y | |
| LogSoftmax | Y | | Y | |
| Gelu | Y | | Y | |
| Silu | Y | | Y | |
| Sigmoid | Y | | | |
| Tanh | Y | | Y | |
| Transpose | Y | | Y | |
| Permute | Y | | Y | |
| BatchNorm | Y | | Y | |
| LayerNorm | Y | | | |
| GroupNorm | Y | | | |
| InstanceNorm | Y | | | |
| Add | Y | Y | Y | Int version not optimized |
| Mul | Y | Y | Y | Int version not optimized |
| Div | Y | Y | Y | Int version not optimized |
| Pow | Y | Y | Y | Int version not optimized |
| Matmul | Y | | Y | |
| MM | Y | | Y | |
| BMM | Y | | Y | |
| PixelShuffle | Y | | | |
| View | Y | Y | Y | |
| Reshape | Y | Y | Y | |
| Squeeze | Y | Y | Y | |
| Unsqueeze | Y | Y | Y | |
| Flatten | Y | Y | Y | |
| Contiguous | Y | | Y | |
| Size | Y | Y | Y | One dimension case |
| EmbeddingBag | Y | Y | Y | Sum mode |
| Embedding | Y | | Y | |
| Split | Y | | Y | |
| Chunk | Y | | Y | |
| Sqrt | Y | | Y | |
| Rsqrt | Y | | Y | |
| Exp | Y | | Y | |
| Log | Y | | Y | |
| Zeros_like | Y | | | |
| Mean | Y | | Y | |
| Baddbmm | Y | | Y | |
| Slice | Y | | Y | |
| Neg | Y | | Y | |
| Split with sizes | Y | | Y | |
| Index | Y | | Y | Limited support |
| Max | Y | | Y | Elementwise |
| Min | Y | | Y | Elementwise |
| Sub | Y | | Y | |

PyTorch JIT Trace
While PyTorch eager execution provides an excellent model building, programming, and debugging experience, it is slower than graph execution, so TorchScript is typically used for inference deployment. In the current version of Ampere Optimized PyTorch, TorchScript mode is also accelerated. To use Ampere Optimized PyTorch, the PyTorch module must be converted to TorchScript, using either the torch.jit.script() or the torch.jit.trace(input) API call. See https://pytorch.org/docs/stable/jit.html for more details. After converting to TorchScript, the user should call torch.jit.freeze() to freeze the model and enable model optimizations for inference.
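The snippet below is a minimal sketch of the TorchScript conversion flow described above, using a torchvision ResNet-50 as a stand-in; the model, input shape, and warm-up loop are illustrative and not part of the Ampere API.

import torch
import torchvision

# Illustrative model: any PyTorch module can be used here.
model = torchvision.models.resnet50(weights=None).eval()
example_input = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    # Convert to TorchScript (torch.jit.script(model) is the alternative path),
    # then freeze to enable inference-time optimizations.
    traced = torch.jit.trace(model, example_input)
    frozen = torch.jit.freeze(traced)

    # The first two passes trigger runtime compilation (see Programming Tips below),
    # so they are expected to be slower than subsequent passes.
    for _ in range(2):
        frozen(example_input)
    output = frozen(example_input)

print(output.shape)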
Torch Compile (beta)
Ampere Optimized PyTorch supports the torch.compile API introduced in the PyTorch 2.0 release. This is a new mode for optimizing a model for inference. To take advantage of it, compile the model with the AIO backend: compiled_model = torch.compile(model, backend="aio", options={"modelname": "model"}). It is important to explicitly select the "aio" backend and to pass the additional options parameter with the "modelname" field. See https://pytorch.org/get-started/pytorch-2.0/ for more information; a minimal sketch is also given at the end of this guide. Note: In this release this is a beta feature, and TorchScript is likely to be faster than torch.compile.

Threading
Ampere Optimized PyTorch controls the number of intra-op threads with torch.set_num_threads(). This sets the number of threads both for ops delegated to Ampere Optimized PyTorch and for ops running on the default CPU backend. Some default CPU backend (non-AIO) ops also require the OMP_NUM_THREADS environment variable to be set to control their intra-op threads. If the model contains nodes not supported by Ampere Optimized PyTorch, we recommend setting the following environment variable: AIO_SKIP_MASTER_THREAD=1

Programming Tips
In the first two inference passes, Ampere Optimized PyTorch performs runtime compilation of the PyTorch script and prepares the Ampere Optimized PyTorch network, so the latency of the first two passes is expected to be longer; subsequent passes are accelerated. Ampere Optimized PyTorch provides much better latency scaling as the core count increases, compared to other platforms. You can easily experiment with the number of cores using the torch.set_num_threads() function described above to find the best price/performance point while still meeting your latency requirements. Models are optimized for the tensor shapes used during the compilation phase (see above). Passing tensors of different shapes will work but is suboptimal; to get the best performance, pad varying-shape tensors when running inference. If any issues occur, the Ampere AI team is ready to help. Typically, the first step is to collect more debug logs and send them to ai-support@amperecomputing.com. Please set the environment variable AIO_DEBUG_MODE=4 to capture low-level logs.

Limitations
Ampere Optimized PyTorch does not support dynamic tensor ranks (a different rank in subsequent passes). Dynamic tensor shapes are supported but not recommended; ideally, pad the inputs to the network to get the best performance. We can also provide more in-depth profiling of your model to help enhance performance to meet your needs.
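For reference, here is the minimal sketch of the torch.compile path mentioned in the Torch Compile (beta) section, combined with the torch.set_num_threads() control from the Threading section. The model, input shape, thread count, and "modelname" value are placeholders chosen for illustration; only the backend="aio" and options={"modelname": ...} arguments come from the guide above.

import torch
import torchvision

# Thread count is illustrative; tune it for your machine (see Threading above).
torch.set_num_threads(16)

# Placeholder model and input; substitute your own module and tensor shapes.
model = torchvision.models.resnet50(weights=None).eval()
example_input = torch.rand(1, 3, 224, 224)

# Explicitly select the "aio" backend and pass the options dict with a "modelname" field,
# as described in the Torch Compile (beta) section.
compiled_model = torch.compile(model, backend="aio", options={"modelname": "resnet50"})

with torch.no_grad():
    # The first passes include compilation overhead; later passes run at full speed.
    for _ in range(2):
        compiled_model(example_input)
    output = compiled_model(example_input)

print(output.shape)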