By Peifang Zhou, Versal AI Engine Design Expert
This blog will introduce two key best practices for optimizing AIE designs: C/C++ techniques for enhancing run-time performance and a hierarchical design setup. These practices are crucial for creating high-performance and maintainable AIE designs.
Versal is the latest generation of AMD FPGA offerings built on the TSMC 7nm FinFET process technology. Versal Adaptive SoCs represent a significant advancement in FPGA technology with their impressive architecture and outstanding capabilities. The integration of the Processing System (PS), Programmable Logic (PL), and AI Engine (AIE) interconnected by a high-bandwidth Network-on-Chip (NoC) offers a versatile and powerful platform to build a wide range of AI and DSP applications. For a deeper dive into the Versal platform’s architecture and capabilities, check out our comprehensive guide: Versal FPGA Platform: A Comprehensive Guide.
A key highlight of Versal Adaptive SoCs is the AI Engine (AIE), designed to handle complex computations efficiently. Featuring a 2D array of VLIW (Very Long Instruction Word) and SIMD (Single Instruction Multiple Data) vector processors (compute tiles), the AIE can contain up to 400 compute tiles in a single Versal device. This architecture offers impressive throughput and latency performance, making it an ideal solution for AI and DSP tasks.
This section discusses the use of C/C++ features to enhance the run-time performance with compile-time optimizations.
Inlining is a simple technique for optimizing performance. By declaring functions as inline, you can eliminate the overhead associated with function calls, which is particularly beneficial for small and frequently called functions. This can lead to faster execution times and more efficient use of resources. Here’s a simple inline example in C/C++:
In this example, the add function is defined as inline, which instructs the compiler to insert the function’s code directly at the point of each call, rather than performing a traditional function call. Note that inlining can increase register consumption and program size, so it may not pay off unless the functions being inlined are short.
The constexpr specifier enables the evaluation of variables and functions at compile-time. It is a great technique for enhancing run-time performance by shifting numerical calculations from run-time to compile-time. Here’s a simple example of using constexpr in C/C++:
In this case, when you make a square(8) function call, it is evaluated at compile-time and the result (64) is directly embedded in the code, eliminating the need for a function call to calculate the result at run-time. Since the compiler does the calculations at compile-time, there isn’t any cost at run-time, saving precious AIE cycles for computations that need to be done at run-time. The technique also applies to templated functions.
Templates combined with the “if constexpr” construct provide a powerful one-two punch for writing generic, reusable, and efficient kernel code. They let you parameterize types and values, making your kernel code adaptable to different data types and configurations without duplicating code.
The “if constexpr” construct enables compile-time decision-making to control the flow of code execution. When a kernel is instantiated, the compiler evaluates the “if constexpr” constructs. Depending on the evaluation result, it decides which functions to include for compilation and excludes all irrelevant code branches at compile-time.
This approach ensures that only the necessary code is compiled and executed, leading to optimized executables and increased performance at run-time. Here is a simple example of using a template and the “if constexpr” construct to write generic and efficient kernel code in a multi-tile AIE design:
In this example, the tile_id is parameterized, making the kernel function adaptable to different tiles without duplicating code. As instructed by the if constexpr constructs, the do_enter() and do_common() functions will be included and compiled for the first kernel, the do_common() and do_exit() functions for the last kernel, and the do_common() function only for all other kernels. Since all the if constexpr constructs are evaluated at compile-time, there is no code branching at run-time.
The restrict keyword in C/C++ is another powerful tool for optimization. When you declare a pointer with restrict, you’re telling the compiler that this pointer is the only way to access the object it points to. This allows the compiler to make more aggressive optimizations that it could not safely make otherwise, because it can assume that no other pointer will modify the object. Pointer aliasing refers to scenarios in which the same memory location can be accessed by different pointers. The strict no-aliasing rule applies when the restrict keyword is used. Here is a simple example of using the restrict keyword:
In this function, the compiler knows that a, b, and c point to separate memory locations, allowing it to optimize the loop more aggressively.
The “Divide and Conquer” strategy is a powerful approach when implementing complex data processing algorithms in AIE. This approach streamlines development, debugging, and testing, making your AIE development more efficient and manageable. By breaking down the algorithm into smaller and manageable building blocks, you can focus on developing and optimizing each component individually. Once all individual blocks are functioning correctly and meeting the performance targets, you can then integrate them to form the complete system.
This method not only simplifies the development process, but also makes debugging and testing more efficient. If an issue arises, you can isolate it to a specific block rather than sifting through the entire algorithm. In addition, running AIE simulations for a single block is much faster than running simulations for the entire system. Another benefit of this modular approach is to facilitate the reuse of individual blocks across different projects to save time and effort. Lastly, it allows easier updates and scalability, as you can modify or replace individual blocks without overhauling the entire system.
The methodology for setting up a two-level hierarchy in AIE designs is as follows:
Here is a simplified example to demonstrate the directory setup of a two-level hierarchy in an AIE design called CDPA (Complex Data Processing Algorithm):
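One possible layout (the directory and block names are illustrative, not the original project structure):

```
cdpa/                        # top-level integration
├── aie/
│   ├── block_a/             # block-level kernels, graph, and testbench
│   │   ├── src/
│   │   └── test/
│   ├── block_b/
│   │   ├── src/
│   │   └── test/
│   └── top/                 # top-level graph connecting the blocks
│       ├── src/
│       └── test/
└── Makefile
```

Each block directory is self-contained, so it can be simulated and verified on its own before being wired into the top-level graph.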
This hierarchical setup facilitates block-level development and top-level integration in AIE designs.
This blog introduces a few best practices in AIE designs. By following these best practices, you will be able to create AIE designs that are not only efficient and high-performing but also scalable and easy to maintain.
At Fidus, we bring decades of experience in FPGA design and embedded systems development. Our team has worked across industries, including telecommunications, aerospace, defense, and consumer electronics, delivering customized solutions tailored to meet specific project requirements. We understand the complexities involved in integrating FPGAs into embedded systems and have the expertise to ensure optimal performance and reliability.
Want to learn more about optimizing Versal AI designs? Watch our on-demand webinar: Strategies for Optimizing FPGA Designs with AMD Versal Adaptive SoCs. This session explores techniques for efficient data movement using NoC, CIPS, DFX, and AI Engines, as well as optimizing transfer speeds for high-performance systems.