
Best Practices for Optimizing AMD Versal AI Engine Designs

15 October 2024

By Peifang Zhou, Versal AI Engine Design Expert

Versal is the latest generation of AMD FPGA offerings built on the TSMC 7nm FinFET process technology. Versal Adaptive SoCs represent a significant advancement in FPGA technology with their impressive architecture and outstanding capabilities. The integration of the Processing System (PS), Programmable Logic (PL), and AI Engine (AIE) interconnected by a high-bandwidth Network-on-Chip (NoC) offers a versatile and powerful platform to build a wide range of AI and DSP applications. For a deeper dive into the Versal platform’s architecture and capabilities, check out our comprehensive guide: Versal FPGA Platform: A Comprehensive Guide.

A key highlight of Versal Adaptive SoCs is the AI Engine (AIE), designed to handle complex computations efficiently. Featuring a 2D array of VLIW (Very Long Instruction Word) and SIMD (Single Instruction Multiple Data) vector processors (compute tiles), the AIE can contain up to 400 compute tiles in a single Versal device. This architecture offers impressive throughput and latency performance, making it an ideal solution for AI and DSP tasks.

C/C++ Techniques for Boosting Run-time Performance

This section discusses C/C++ features that enhance run-time performance through compile-time optimizations.

Using inline functions

Inlining is a simple technique for optimizing performance. By declaring functions as inline, you can eliminate the overhead associated with function calls, which is particularly beneficial for small, frequently called functions. This can lead to faster execution times and more efficient use of resources. Here’s a simple inline example in C/C++:
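(A representative sketch; the add function and values below are illustrative.)

    // Declaring add as inline allows the compiler to expand its body
    // directly at each call site instead of emitting a function call.
    inline int add(int a, int b) {
        return a + b;
    }

    int main() {
        int sum = add(3, 4);  // compiled as if written: int sum = 3 + 4;
        return sum;
    }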

In this example, the add function is defined as inline, which instructs the compiler to insert the function’s code directly at the point of each call rather than performing a traditional function call. Note that inlining can increase register pressure and program size, so it is generally worthwhile only when the functions to be inlined are short.

Using constexpr specifiers

The constexpr specifier enables the evaluation of variables and functions at compile-time. It is a great technique for enhancing run-time performance by shifting numerical calculations from run-time to compile-time. Here’s a simple example of using constexpr in C/C++:
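(A representative sketch; the square function matches the description that follows.)

    #include <iostream>

    // constexpr makes square eligible for evaluation at compile time.
    constexpr int square(int x) {
        return x * x;
    }

    int main() {
        constexpr int result = square(8);  // computed at compile time: 64
        std::cout << result << std::endl;
        return 0;
    }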

In this case, when you make a square(8) function call, it is evaluated at compile-time and the result (64) is directly embedded in the code, eliminating the need for a function call to calculate the result at run-time. Since the compiler does the calculations at compile-time, there isn’t any cost at run-time, saving precious AIE cycles for computations that need to be done at run-time. The technique also applies to templated functions.

Templates and “if constexpr”

Templates combined with the “if constexpr” construct provide a powerful one-two punch for writing generic, reusable, and efficient kernel code. Templates let you parameterize types and values, making your kernel code adaptable to different data types and configurations without duplicating code.

The “if constexpr” construct enables compile-time decision-making to control the flow of code execution. When a kernel is instantiated, the compiler evaluates each “if constexpr” condition and, based on the result, compiles only the relevant branch while discarding all irrelevant branches at compile-time.

This approach ensures that only the necessary code is compiled and executed, leading to optimized executables and increased performance at run-time. Here is a simple example of using a template and the “if constexpr” construct to write generic and efficient kernel code in a multi-tile AIE design:
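(A representative sketch; do_enter(), do_common(), do_exit(), and NUM_TILES are illustrative placeholders, with simple stubs standing in for real vector-processing code.)

    #include <cstdio>

    // Illustrative placeholder stages; a real AIE kernel would perform
    // vector computations here.
    void do_enter()  { std::printf("enter stage\n"); }
    void do_common() { std::printf("common stage\n"); }
    void do_exit()   { std::printf("exit stage\n"); }

    constexpr int NUM_TILES = 4;  // assumed tile count for this sketch

    template <int tile_id>
    void kernel() {
        if constexpr (tile_id == 0) {
            do_enter();            // compiled in for the first kernel only
        }
        do_common();               // compiled in for every kernel
        if constexpr (tile_id == NUM_TILES - 1) {
            do_exit();             // compiled in for the last kernel only
        }
    }

    int main() {
        kernel<0>();               // instantiates do_enter() + do_common()
        kernel<1>();               // instantiates do_common() only
        kernel<NUM_TILES - 1>();   // instantiates do_common() + do_exit()
        return 0;
    }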

In this example, the tile_id is parameterized, making the kernel function adaptable to different tiles without duplicating code. As instructed by the if constexpr constructs, the do_enter() and do_common() functions will be included and compiled for the first kernel, the do_common() and do_exit() functions for the last kernel, and the do_common() function only for all other kernels. Since all the if constexpr constructs are evaluated at compile-time, there is no code branching at run-time.

Restrict keyword

The restrict keyword in C/C++ is another powerful tool for optimization. Pointer aliasing refers to scenarios in which the same memory location can be accessed through different pointers. When you declare a pointer with restrict, you are telling the compiler that this pointer is the only way to access the object it points to, so the strict no-aliasing rule applies. This allows the compiler to make more aggressive optimizations than it could otherwise safely make, because it can assume that no other pointer will modify the object. Here is a simple example of using the restrict keyword:
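(A representative sketch; restrict is standard only in C99, and C++ compilers typically expose it as the __restrict extension, which is what AMD’s AIE kernel examples use.)

    // Each pointer is declared __restrict, promising the compiler that
    // a, b, and c refer to non-overlapping memory.
    void vector_add(const float* __restrict a,
                    const float* __restrict b,
                    float* __restrict c,
                    int n) {
        // With no aliasing possible, the compiler is free to reorder
        // and vectorize this loop aggressively.
        for (int i = 0; i < n; ++i) {
            c[i] = a[i] + b[i];
        }
    }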

In this function, the compiler knows that a, b, and c point to separate memory locations, allowing it to optimize the loop more aggressively.

Hierarchical Design Setup for Maximizing Efficiency and Maintainability

The “Divide and Conquer” strategy is a powerful approach when implementing complex data processing algorithms in AIE. It streamlines development, debugging, and testing, making your AIE development more efficient and manageable. By breaking the algorithm down into smaller, manageable building blocks, you can focus on developing and optimizing each component individually. Once all individual blocks are functioning correctly and meeting their performance targets, you can integrate them to form the complete system.

This method not only simplifies the development process but also makes debugging and testing more efficient. If an issue arises, you can isolate it to a specific block rather than sifting through the entire algorithm. In addition, running AIE simulations for a single block is much faster than running simulations for the entire system. Another benefit of this modular approach is that it facilitates the reuse of individual blocks across projects, saving time and effort. Lastly, it allows easier updates and scalability, as you can modify or replace individual blocks without overhauling the entire system.

The methodology for setting up a two-level hierarchy in AIE designs is as follows:

Block-Level Design

  • Foundation Level: This is where each individual block is defined. Each block is responsible for implementing a specific functionality or a set of related functions. For instance, one block might handle data input, another might process the data, and a third might handle data output.
  • Implementation: Focus on developing and optimizing each block independently. Ensure that each block meets its performance targets and functions correctly before moving on to the next.

Top-Level Design

  • Aggregation: This level involves stitching together all the blocks into a cohesive system.
  • Integration: Ensure that the blocks interact seamlessly and that data flows smoothly between them. This might involve setting up communication protocols, managing data dependencies, and handling any synchronization issues.

Here is a simplified example to demonstrate the directory setup of a two-level hierarchy in an AIE design called CDPA (Complex Data Processing Algorithm):
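One possible layout (directory and block names here are illustrative):

    CDPA/
    ├── blocks/               -- block-level designs, one folder per block
    │   ├── data_input/       -- kernels, graph, and testbench for data input
    │   ├── data_process/     -- kernels, graph, and testbench for processing
    │   └── data_output/      -- kernels, graph, and testbench for data output
    └── top/                  -- top-level graph stitching the blocks together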

This hierarchical setup facilitates block-level development and top-level integration in AIE designs.

Conclusion 

This blog introduced a few best practices for AIE designs. By following them, you will be able to create AIE designs that are not only efficient and high-performing but also scalable and easy to maintain.

At Fidus, we bring decades of experience in FPGA design and embedded systems development. Our team has worked across industries, including telecommunications, aerospace, defense, and consumer electronics, delivering customized solutions tailored to meet specific project requirements. We understand the complexities involved in integrating FPGAs into embedded systems and have the expertise to ensure optimal performance and reliability. 

Optimize FPGA Designs with AMD Versal Adaptive SoCs

Want to learn more about optimizing Versal AI designs? Watch our on-demand webinar: Strategies for Optimizing FPGA Designs with AMD Versal Adaptive SoCs. This session explores techniques for efficient data movement using NoC, CIPS, DFX, and AI Engines, as well as optimizing transfer speeds for high-performance systems.
