
Best Practices for Optimizing AMD Versal AI Engine Designs

15 October 2024

By Peifang Zhou, Versal AI Engine Design Expert

Versal is the latest generation of AMD FPGA offerings built on the TSMC 7nm FinFET process technology. Versal Adaptive SoCs represent a significant advancement in FPGA technology with their impressive architecture and outstanding capabilities. The integration of the Processing System (PS), Programmable Logic (PL), and AI Engine (AIE) interconnected by a high-bandwidth Network-on-Chip (NoC) offers a versatile and powerful platform to build a wide range of AI and DSP applications. For a deeper dive into the Versal platform’s architecture and capabilities, check out our comprehensive guide: Versal FPGA Platform: A Comprehensive Guide.

A key highlight of Versal Adaptive SoCs is the AI Engine (AIE), designed to handle complex computations efficiently. Featuring a 2D array of VLIW (Very Long Instruction Word) and SIMD (Single Instruction Multiple Data) vector processors (compute tiles), the AIE can contain up to 400 compute tiles in a single Versal device. This architecture offers impressive throughput and latency performance, making it an ideal solution for AI and DSP tasks.

C/C++ Techniques for Boosting Run-time Performance

This section discusses C/C++ features that enhance run-time performance through compile-time optimizations.

Using inline functions

Inlining is a simple technique for optimizing performance. By declaring functions as inline, you can eliminate the overhead associated with function calls, which is particularly beneficial for small, frequently called functions. This can lead to faster execution times and more efficient use of resources. Here’s a simple inline example in C/C++:
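(A representative sketch; the add function and values below are illustrative.)

    // Declaring add as inline allows the compiler to expand its body
    // directly at each call site instead of emitting a function call.
    inline int add(int a, int b) {
        return a + b;
    }

    int main() {
        int sum = add(3, 4);  // compiled as if written: int sum = 3 + 4;
        return sum;
    }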

In this example, the add function is defined as inline, which instructs the compiler to insert the function’s code directly at the point of each call rather than performing a traditional function call. Note that inlining can increase register pressure and program size, so it is generally worthwhile only when the functions to be inlined are short.

Using constexpr specifiers

The constexpr specifier enables the evaluation of variables and functions at compile-time. It is a great technique for enhancing run-time performance by shifting numerical calculations from run-time to compile-time. Here’s a simple example of using constexpr in C/C++:
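(A representative sketch; the square function matches the description that follows.)

    #include <iostream>

    // constexpr makes square eligible for evaluation at compile time.
    constexpr int square(int x) {
        return x * x;
    }

    int main() {
        constexpr int result = square(8);  // computed at compile time: 64
        std::cout << result << std::endl;
        return 0;
    }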

In this case, when you make a square(8) function call, it is evaluated at compile-time and the result (64) is directly embedded in the code, eliminating the need for a function call to calculate the result at run-time. Since the compiler does the calculations at compile-time, there isn’t any cost at run-time, saving precious AIE cycles for computations that need to be done at run-time. The technique also applies to templated functions.

Templates and “if constexpr”

Templates combined with the “if constexpr” construct provide a powerful one-two punch for writing generic, reusable, and efficient kernel code. Templates let you parameterize types and values, making your kernel code adaptable to different data types and configurations without duplicating code.

The “if constexpr” construct enables compile-time decision-making to control the flow of code execution. When a kernel is instantiated, the compiler evaluates each “if constexpr” condition and, based on the result, compiles only the relevant branch while discarding all irrelevant branches at compile-time.

This approach ensures that only the necessary code is compiled and executed, leading to optimized executables and increased performance at run-time. Here is a simple example of using a template and the “if constexpr” construct to write generic and efficient kernel code in a multi-tile AIE design:
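(A representative sketch; do_enter(), do_common(), do_exit(), and NUM_TILES are illustrative placeholders, with simple stubs standing in for real vector-processing code.)

    #include <cstdio>

    // Illustrative placeholder stages; a real AIE kernel would perform
    // vector computations here.
    void do_enter()  { std::printf("enter stage\n"); }
    void do_common() { std::printf("common stage\n"); }
    void do_exit()   { std::printf("exit stage\n"); }

    constexpr int NUM_TILES = 4;  // assumed tile count for this sketch

    template <int tile_id>
    void kernel() {
        if constexpr (tile_id == 0) {
            do_enter();            // compiled in for the first kernel only
        }
        do_common();               // compiled in for every kernel
        if constexpr (tile_id == NUM_TILES - 1) {
            do_exit();             // compiled in for the last kernel only
        }
    }

    int main() {
        kernel<0>();               // instantiates do_enter() + do_common()
        kernel<1>();               // instantiates do_common() only
        kernel<NUM_TILES - 1>();   // instantiates do_common() + do_exit()
        return 0;
    }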

In this example, the tile_id is parameterized, making the kernel function adaptable to different tiles without duplicating code. As instructed by the if constexpr constructs, the do_enter() and do_common() functions will be included and compiled for the first kernel, the do_common() and do_exit() functions for the last kernel, and the do_common() function only for all other kernels. Since all the if constexpr constructs are evaluated at compile-time, there is no code branching at run-time.

Restrict keyword

The restrict keyword in C/C++ is another powerful tool for optimization. Pointer aliasing refers to scenarios in which the same memory location can be accessed through different pointers. When you declare a pointer with restrict, you are telling the compiler that this pointer is the only way to access the object it points to, so the strict no-aliasing rule applies. This allows the compiler to make more aggressive optimizations than it could otherwise safely make, because it can assume that no other pointer will modify the object. Here is a simple example of using the restrict keyword:
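(A representative sketch; restrict is standard only in C99, and C++ compilers typically expose it as the __restrict extension, which is what AMD’s AIE kernel examples use.)

    // Each pointer is declared __restrict, promising the compiler that
    // a, b, and c refer to non-overlapping memory.
    void vector_add(const float* __restrict a,
                    const float* __restrict b,
                    float* __restrict c,
                    int n) {
        // With no aliasing possible, the compiler is free to reorder
        // and vectorize this loop aggressively.
        for (int i = 0; i < n; ++i) {
            c[i] = a[i] + b[i];
        }
    }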

In this function, the compiler knows that a, b, and c point to separate memory locations, allowing it to optimize the loop more aggressively.

Hierarchical Design Setup for Maximizing Efficiency and Maintainability

The “Divide and Conquer” strategy is a powerful approach when implementing complex data processing algorithms in AIE. It streamlines development, debugging, and testing, making your AIE development more efficient and manageable. By breaking the algorithm down into smaller, manageable building blocks, you can focus on developing and optimizing each component individually. Once all individual blocks are functioning correctly and meeting their performance targets, you can integrate them to form the complete system.

This method not only simplifies the development process but also makes debugging and testing more efficient. If an issue arises, you can isolate it to a specific block rather than sifting through the entire algorithm. In addition, running AIE simulations for a single block is much faster than running simulations for the entire system. Another benefit of this modular approach is that it facilitates the reuse of individual blocks across projects, saving time and effort. Lastly, it allows easier updates and scalability, as you can modify or replace individual blocks without overhauling the entire system.

The methodology for setting up a two-level hierarchy in AIE designs is as follows:

Block-Level Design

  • Foundation Level: This is where each individual block is defined. Each block is responsible for implementing a specific functionality or a set of related functions. For instance, one block might handle data input, another might process the data, and a third might handle data output.
  • Implementation: Focus on developing and optimizing each block independently. Ensure that each block meets its performance targets and functions correctly before moving on to the next.

Top-Level Design

  • Aggregation: This level involves stitching together all the blocks into a cohesive system.
  • Integration: Ensure that the blocks interact seamlessly and that data flows smoothly between them. This might involve setting up communication protocols, managing data dependencies, and handling any synchronization issues.

Here is a simplified example to demonstrate the directory setup of a two-level hierarchy in an AIE design called CDPA (Complex Data Processing Algorithm):
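One possible layout (directory and block names here are illustrative):

    CDPA/
    ├── blocks/               -- block-level designs, one folder per block
    │   ├── data_input/       -- kernels, graph, and testbench for data input
    │   ├── data_process/     -- kernels, graph, and testbench for processing
    │   └── data_output/      -- kernels, graph, and testbench for data output
    └── top/                  -- top-level graph stitching the blocks together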

This hierarchical setup facilitates block-level development and top-level integration in AIE designs.

Conclusion 

This blog introduced a few best practices for AIE designs. By following them, you will be able to create AIE designs that are not only efficient and high-performing but also scalable and easy to maintain.

At Fidus, we bring decades of experience in FPGA design and embedded systems development. Our team has worked across industries, including telecommunications, aerospace, defense, and consumer electronics, delivering customized solutions tailored to meet specific project requirements. We understand the complexities involved in integrating FPGAs into embedded systems and have the expertise to ensure optimal performance and reliability. 

Optimize FPGA Designs with AMD Versal Adaptive SoCs

Want to learn more about optimizing Versal AI designs? Watch our on-demand webinar: Strategies for Optimizing FPGA Designs with AMD Versal Adaptive SoCs. This session explores techniques for efficient data movement using NoC, CIPS, DFX, and AI Engines, as well as optimizing transfer speeds for high-performance systems.
