Timestamps for the Webinar
0:24 – Welcome and Introduction
2:47 – Overview of the MUSIC Algorithm
6:30 – MUSIC Algorithm Workloads and Computational Complexity
9:20 – MUSIC Implementation on AMD Versal AI Engines
12:42 – AMD and Fidus Collaboration Overview
13:43 – Versal AI Engine Architecture and Tile Design
20:33 – Introduction to Fidus Systems
21:50 – System Modeling and MUSIC Algorithm Breakdown
26:40 – Mapping MUSIC Operations to AI Engine Tiles
29:14 – QRD and Spectrum Processing Implementation
34:01 – AI Engine Design Principles and Parallelism
39:00 – Development Flow and Simulation Strategy
43:33 – System Integration and Hardware-in-the-Loop Testing
50:00 – Real-Time Demonstration and Performance Results
53:33 – Key Takeaways and Conclusion
55:08 – Q&A Session Begins
0:24 – Introduction and Webinar Overview
Jeff: All right, we are live. Thank you, everyone, for joining us today. My name is Jeffrey, and I will be your host for today’s webinar.
Today, we’re going to be talking about offloading Multiple Signal Classification, or MUSIC, to AMD Versal AI Engines. We have a packed presentation today for you, so we’re probably going to go roughly 50 minutes, with a little bit of time for Q&A. We have a rotating cast of speakers, so we’ll be able to answer your questions via text during the presentation.
If you have any questions, go ahead and ask those in the question box on the right-hand side of your console, and we’ll do our best to get to you over there. You will also find a survey—if you could fill that out before you leave here today, that would be super helpful. We try to bring you the best webinars that we can.
Above those two modules, you will see a Resources tab. In there, you’ll find a bunch of links to help you learn a little more about what we’re talking about today, as well as blogs from Fidus about the different initiatives they have with AMD. In the Documents sub-tab of the Resources, you will find a PDF copy of today’s slides. There are also a few extra links in there that you might want to check out.
And with that, I just want to do a couple of housekeeping things. This webinar will be made available for on-demand viewing, and you can access the recording later today. We’ll also be having a second session, which is the same presentation offered at a different time for other time zones. So, no need to come to both unless you’re really eager to get a question answered or talk to us a little bit more—it is going to be the same presentation.
All right, with that housekeeping out of the way, I will go ahead and throw the ball over to Mark from AMD to take the presentation away. Here we go.
2:47 – Introduction to the MUSIC Algorithm
Welcome to our webinar: Offload Multiple Signal Classification (MUSIC) to AMD Versal AI Engines. This is a joint project between AMD and Fidus, and we’re glad to be with you today.
Here’s an overview of our presentation. We will look at the MUSIC algorithm, provide an overview, and talk a little bit about the AMD–Fidus collaboration on this project. Then, we’ll introduce the AMD Versal AI Engine technology, and finally, have Fidus provide an overview and detailed description of the design that was built.
Now to the MUSIC algorithm—what is the MUSIC algorithm? MUSIC is an acronym that stands for Multiple Signal Classification. This is a very popular algorithm for frequency estimation and radar direction finding. In this webinar, we will focus on the direction-finding application that identifies and tracks the angle of arrival of closely spaced signals in radar, sonar, and other wireless systems.
MUSIC achieves its benefits using a subspace decomposition. This leverages advanced linear algebra techniques to identify the signal and noise subspaces inherent in every system. Once we have identified those subspaces, we can use them to estimate our direction of arrival. This is possible because the steering vector corresponding to an actual signal is orthogonal to the noise subspace, and it is a known function of our unknown direction parameters.
We can simply sweep the steering vector based on that relationship and find the direction parameter values where it becomes orthogonal to the noise subspace. MUSIC delivers improved resolution and accuracy over conventional methods for array processing applications. It is not limited in resolution due to wavelength, per the Rayleigh criterion. MUSIC is more accurate than Fourier-based methods of frequency estimation when signal frequencies are closely spaced. And finally, it’s more robust than other methods, such as Capon’s beamformer, when we have imprecise knowledge of the signal statistics.
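In standard textbook notation (our summary, not a formula from the slides), the MUSIC pseudospectrum built from a steering vector a(θ) and a noise-subspace basis E_n is

$$
P_{\mathrm{MUSIC}}(\theta) = \frac{1}{\mathbf{a}^{H}(\theta)\,\mathbf{E}_{n}\mathbf{E}_{n}^{H}\,\mathbf{a}(\theta)},
$$

which peaks at exactly those angles where a(θ) becomes orthogonal to the noise subspace, that is, where the denominator approaches zero.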
MUSIC is adopted in many applications in radar processing and electronic warfare systems. It is used to estimate the direction of arrival of closely spaced signals surrounded by clutter. This is the dominant application and leverages antenna array technology with many radiating elements for beamforming.
MUSIC can also solve frequency and spectral estimation problems found in wireless, seismology, and radio astronomy. Separating signal and noise subspaces is useful for many purposes. In structural engineering, MUSIC may be applied to modal analysis to identify modal frequencies and shapes and assist with vibration analysis of various structures. Finally, MUSIC provides robust solutions in biomedical signal processing for blind source separation in audio processing, EEG, and MEG scans.
6:30 – MUSIC on AMD Versal AI Engines
The performance advantages of MUSIC come at a cost—its computational demands are significant. The complexity depends on these three system parameters:
- M is the depth of the data snapshots (i.e., how many time-domain samples are used for each MUSIC computation).
- N is the number of antenna elements in the system. This can range from a dozen to several hundred or more.
- L is the number of discrete spectrum points or arrival angles we search during the sweep of our steering vector.
The textbook form of MUSIC performs these three compute workloads:
- Compute the covariance matrix of the received data snapshot.
- Compute an eigen-decomposition of that covariance matrix.
- Sweep the MUSIC spectrum and perform a search to identify the angle of arrival of our targets.
The computational complexity of each workload is shown here. This can be quite a heavy burden. Various system requirements can drive these workloads to tens or hundreds of giga-operations per second. We may have hundreds of snapshot samples, dozens of array elements, hundreds of spectral bins to search, and all of it may need to be computed repeatedly within tens to hundreds of microseconds of real time.
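For reference, the usual back-of-the-envelope complexity of those three workloads in terms of M, N, and L (our estimate, not the exact figures from the slide) is roughly

$$
\underbrace{\mathcal{O}(MN^{2})}_{\text{covariance}} \;+\; \underbrace{\mathcal{O}(N^{3})}_{\text{eigen-decomposition}} \;+\; \underbrace{\mathcal{O}(LN^{2})}_{\text{spectrum sweep}} \quad\text{operations per snapshot.}
$$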
Versal AI Engines offer significant advantages to address the computational workload challenge of MUSIC. They have a hardened, vectorized floating point datapath, offering eight giga-operations per second of compute per tile, and hundreds of tiles. The AI Engine offers massive bandwidth for moving data through compute workloads. The array also offers a large, distributed memory footprint that can be used efficiently for matrix storage. Finally, these compute workloads are delivered by VLIW SIMD vector processors. We have a software-based solution with no timing closure challenges.
Now let’s move to the specific MUSIC solution we have implemented on the Versal VC1902 device.
Here is a table of the system parameters of the solution. For simplicity, we use a uniform linear array with eight elements at half-wavelength spacing. Our data snapshots are 128 by 8. We assume a target throughput of one snapshot per microsecond. This is the arrival rate of the data snapshots, and our MUSIC solution must keep up with it in real time.
9:20 – MUSIC Algorithm Implementation Approach
We use a QRD plus SVD subspace algorithm to identify the noise subspace—we’ll see more details on that shortly. Finally, the MUSIC spectrum sweep will use 256 bins, with limited use of the edge bins corresponding to broadside angles of arrival to our array.
A high-level view of the MUSIC algorithm adopted for AI Engines is shown here. We use a QRD plus SVD approach. We first compute the QR decomposition of the data snapshot matrix A and then perform an SVD on R to identify the noise subspace. This subspace is represented by the matrix VR.
Second, we compute the denominator of the MUSIC spectrum, shown here on the bottom right. This is a function of the steering vector for our uniform linear array. Finally, we simplify the workload by computing a null search of the MUSIC spectrum denominator, rather than performing a peak search on the actual spectrum. This avoids a costly division workload and gives the same result.
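A compact way to see why the QRD-plus-SVD route and the null search are equivalent to the textbook formulation (our notation, summarizing the relationships the slide relies on):

$$
A = QR,\qquad R = U_{R}\,\Sigma\,V_{R}^{H} \;\Rightarrow\; A = (Q\,U_{R})\,\Sigma\,V_{R}^{H},
$$

so the right singular vectors V_R of the small 8-by-8 matrix R are also the right singular vectors of the 128-by-8 snapshot A, and the noise subspace V_n can be read directly from V_R. The null search then looks for minima of the denominator

$$
d(\theta) = \mathbf{a}^{H}(\theta)\,V_{n}V_{n}^{H}\,\mathbf{a}(\theta) = \lVert V_{n}^{H}\mathbf{a}(\theta)\rVert^{2}
$$

instead of maxima of 1/d(θ), which avoids the division.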
Now we’d like to say a few words about the collaboration between AMD and Fidus on this Versal MUSIC demo. This project was a close collaboration. Why does AMD collaborate in this way?
Well, we get early feedback on our AI Engine tools and design flows. We get to test drive and validate our new system methodology frameworks before rolling them out to customers. We also develop strong partnerships with service providers like Fidus, who help us support customers embarking on their Versal adoption journey.
This MUSIC project was technically challenging, and Fidus delivered on AI Engine innovation. They tackled heavy pipelining and loop unrolling of the QRD and SVD to achieve the target throughput. They vectorized and parallelized the MUSIC spectrum search across the array. They delivered a fully integrated hardware-in-the-loop design spanning our VCK190 evaluation board and a host PC running MATLAB.
The project has exceeded our expectations. We delivered a design code base as a Vitis GitHub tutorial last October. The demo was included at our recent AMD Signal Processing Working Group (SPWG) in Ireland last November. It was also shown at the Association of Old Crows International Symposium last December.
Fidus is AMD Adaptive Computing Partner of the Year, and this project demonstrates why.
AMD proposed this MUSIC project with aggressive goals: to implement MUSIC solely on AI Engines, with a one-microsecond target, operating in a closed-loop, Ethernet-based demo on our eval board. We provided a minimal set of MATLAB models and some early QRD/SVD prototypes that were not meeting the one-microsecond target.
12:42 – AMD and Fidus Collaboration
Fidus staffed the project with only three designers: a system algorithm expert, an AI engine expert, and an embedded FPGA expert. The Fidus approach was refreshing, to say the least. They dug in and quickly took ownership. Every week they made steady progress and solved technical challenges with innovative solutions. They provided transparent and honest weekly reports, and there were no surprises. Minimal steering was required by AMD. In the end, Fidus delivered a suite of high-quality outputs, including simulation models, design code, the full hardware-in-the-loop system, and a complete set of documentation. Our experience with this project was fantastic. Consider Fidus as your partner for your next Versal development project.
Next, we’ll hand it over to Udyan for an overview of the Versal AI Engine technology.
13:43 – AMD Versal AI Engine Architecture Overview
Thank you, Mark. Now let’s briefly overview the architecture of AMD Versal AI engine technology. AI engine technology was first introduced in the Versal AI Core series of the seven-nanometer Versal adaptive SoC. This technology was created to keep up with the market needs for cost and power reduction with more compute and programmability. The AI engine is based on VLIW processors running at one gigahertz on the slowest speed grade. This is 100% software programmable, enabling a higher level of abstraction and allowing for faster functional simulation. Also, since AI engines are hardened, there is no timing closure required, supporting shorter development times.
One of the key architectural benefits is how we build up arrays of these cores—up to 472 in our largest devices and down to smaller devices with fewer than 10. These AI engines give tremendous scalability through the range of devices. There is high interface bandwidth to get data in and out of the AI engine array and from tile to tile within the array. This, along with a customizable memory hierarchy, enables very efficient application acceleration.
So, now we will look at the AI engine array at the top level and then drill down the hierarchy. The AI engine array integrates a two-dimensional array of AI engine tiles. Each AI engine tile includes the AI engine, local memory, and interconnect. Each AI engine is an ISA-based vector processor that is software programmable. Each tile integrates a very long instruction word, or VLIW, processor, integrated memory, and interconnects for streaming, configuration, and debug. The AI engine array can also include memory tiles—tiles that provide additional memory local to the array.
Now let’s look at each AI engine tile more closely. The AI engine is a very long instruction word (VLIW) processor with single instruction, multiple data, or SIMD, vector units. This VLIW vector processor is hardened in seven nanometers and runs at one gigahertz at the lowest speed grade, increasing for faster speed grades. It is software programmable, so you can write C or C++ code, and a compiler will schedule and compile all instructions. The AI engine has a 32-bit scalar unit and a vector unit that supports both fixed-point and floating-point precision. The AI engine also has 16 kilobytes of program memory, an instruction fetch and decode unit, two load units, one store unit, and a control, debug, and trace unit. The control, debug, and trace unit is used for debug, trace, and profile functionality. The debugger connects to the platform management controller (PMC) on an AI engine-enabled Versal device using either a JTAG connection or a high-speed debug port (HSDP) connection.
Now let’s look at the interconnect resources with each AI engine tile. There is first a memory interface that allows each AI engine tile to use the memory embedded in the neighboring tiles. Of course, remember that each tile also has its own local memory. There is also a cascade interface that enables one to transfer partial results from one AI engine to another if more computation is required. There is an interconnect block that is a deterministic, 32-bit interconnect that allows communication with other AI engines and between AI engine and programmable logic. Finally, the local memory block also has data movers that allow non-neighbor data communication.
The AI engine array has large bandwidth between the array and the programmable logic. These are shown as red lines in the diagram on the right. When going to the PL, you can see that the path goes through the switch and some clock-domain-crossing logic into the PL. The dashed arrow connection goes directly to the programmable logic. It is important to understand that there is large bandwidth here for interfacing the AI engine array with the PL, with many paths providing terabytes per second of aggregate bandwidth. The PS, or processing system, communicates with the AI engine through the NoC, short for network on chip. All registers and memories are memory-mapped and can be accessed through the NoC from the PS or the PMC, which stands for platform management controller. Lastly, accessing external memory uses the NoC via the DMA.
And finally, let’s look at the bandwidth numbers for a few select devices. Let’s examine the bandwidth between the AI engine array and the programmable logic; this will be the main route between the PL and the AI engine for all data. The interface is configurable with 32-bit, 64-bit, and 128-bit width operations. As shown in the red table, there are eight AXI-Stream channels per column interface in the north direction and six AXI-Stream channels per column interface in the south direction, each 64 bits wide. This gives a bandwidth of 32 gigabytes per second in the north direction and 24 gigabytes per second in the south direction. Note that not all columns are available. For example, of the 50 columns on the VC1902, you have 39 columns available, giving a total available bandwidth of 1.3 terabytes per second north and one terabyte per second south. This interface runs at half the frequency of the AI engine. Now looking at the dark blue table, this shows the bandwidth within the AI engine array. When using the 32-bit width option, there are six north streams and four south streams available. However, these run at the AI engine array frequency, which is twice the speed of the programmable logic interface side.
Thus, the loss of bandwidth through the interface is less pronounced in this mode. The numbers presented here are for the lowest speed grade. The key item to remember from this slide is that there is a high amount of bandwidth available for data transfer between the programmable logic and the AI engine array, and within the AI engine array itself. And with that overview, I will now hand it over to Bachir from Fidus.
20:33 – Introduction to Fidus Systems and Project Goal
Thanks, Udyan. Welcome, everyone, to this webinar. I am Bachir Berkane, system and algorithm architect at Fidus Systems, together with Peifang, our senior embedded software designer, and together we will present the offloading of the MUSIC operation onto the AIE accelerator. Allow me first to briefly introduce Fidus Systems.
21:00 – Company Overview
Fidus is a full-service electronic systems design company with expertise covering FPGA and ASIC front-end design and verification, embedded software, hardware design, signal and power integrity, IC packaging, and thermal and mechanical design. Fidus is the trusted engineering partner for companies innovating in areas such as data centers, aerospace, communication, ML, insurance, and beyond. Whether supplying expert engineers or managing full projects end to end, we help our customers bring complex designs to life efficiently and reliably. Over more than 20 years, Fidus has built a proven track record of over 4000 successful projects, with 95% of our customers returning. Note that approximately 80% of these projects are FPGA-related.
Allow me to mention that we were proud to be recognized last year as the AMD Adaptive Computing Partner of the Year, which is a testament to our leadership in high-performance FPGA design. Our partnership with AMD began over a decade ago, when we became the first-ever Premier Design Services Member in North America, with the largest team of AMD-certified engineers outside of AMD. Fidus provides unmatched expertise in areas such as AMD Adaptive SoCs, high-performance embedded platforms, and DSP and acceleration. Using proven methodologies and deep FPGA engineering experience, we help customers reduce risk, accelerate development, and maximize the performance of AMD’s cutting-edge technology.
21:50 – System and Algorithm Modeling
Back to the topic of our webinar. The path to map the MUSIC operation onto the AIE accelerator and develop a real-time demonstration starts with system and task definition. These two tables summarize the main items of both the system under consideration and the MUSIC algorithm. In summary, a 128-sample snapshot matrix from an eight-element uniform linear antenna array is fed to a discretized, grid-based MUSIC algorithm. Note also that the target platform to run the MUSIC operation is the VC1902 Adaptive SoC with the integrated AIE accelerator.
Given the system and algorithm spec, the first step consists of building a system model, which includes both the data and the MUSIC models. Note that the models provide the golden data—that is, both the input synthetic data and the golden intermediate/output data produced by the MUSIC operations.
Next, we define the target architecture and how the operations of MUSIC are mapped to the compute resources of the AIE engine tiles to meet the target performance of one microsecond per snapshot. The mapping is based on initial profiling numbers, subject to further refinement at the next stage.
Then the AIE software kernels are developed according to the previously defined mapping. These kernels are simulated at different levels to demonstrate correct operation and performance as shown. The refinement of the mapping takes place at this stage.
And finally, at the bottom of the pyramid, to demonstrate and test MUSIC on the AIE in real time, we architected and developed a hardware-in-the-loop test bench. Let’s go through the system model briefly. The model developed in MATLAB consists of both the data model and a model for the MUSIC algorithm. The data model generates snapshot matrices that represent the response of the eight ULA antenna elements in the presence of a number of emitting or echoing sources. The antenna response also includes the modeling of thermal noise at each antenna element. Note that the snapshot matrix values are assumed to be complex, single-precision floats.
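In standard narrowband array-processing notation (our sketch; the exact MATLAB model may differ in its details), the snapshot generation follows

$$
\mathbf{x}(t) = \mathbf{A}(\boldsymbol{\theta})\,\mathbf{s}(t) + \mathbf{n}(t),\qquad
[\mathbf{a}(\theta)]_{n} = e^{\,j\pi n\sin\theta},\; n = 0,\dots,7,
$$

where A(θ) collects one steering vector per source, s(t) holds the source waveforms, and n(t) is the per-element thermal noise; stacking 128 time samples of x(t) gives the 128-by-8 complex single-precision snapshot matrix.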
26:40 – Mapping MUSIC to AI Engine Tiles
The MUSIC model, on the other hand, consists of three main components. The first one processes the snapshots with QR decomposition. The QRD decomposes the snapshot matrix into an orthogonal matrix Q and an upper triangular R matrix. The Q matrix is dropped, and only R is passed to the downstream stages of MUSIC.
Next, the noise subspace is identified through a singular value decomposition of the R matrix, where R is decomposed into three components containing the singular values and the singular vectors that span both the signal and the noise subspaces. The noise subspace corresponds to the P smallest singular values—already sorted by construction in our case—where P is either a known parameter, namely the number of antenna elements minus the known number of sources, or is derived from an ad hoc threshold defined through system simulation.
The last block is the space spectrum calculation and DOA identification. This consists of dividing the field of view of the ULA into a grid of points—256 points in our case—and computing the space spectrum metric for each point on the grid. The DOAs correspond to the null bins in the space spectrum. As an example, the four peaks and corresponding nulls in the space spectrum shown on the plot indicate the location of the azimuth angles of the four sources on the 256-point field-of-view grid.
Note, as mentioned by Mark, that for performance reasons we search for nulls of the spectrum denominator rather than for peaks of the spectrum itself. The mapping of the MUSIC operations onto the AIE processing tiles is performed through loop unrolling and pipelining. The building blocks of MUSIC are iterative processes: QRD and SVD consist of nested loops, while the space spectrum and DOA identification is just a simple loop that goes through the 256-point grid to compute the spectrum and identify the potential DOAs when a condition is met.
To meet the processing requirement, the loops of the different blocks of the workflow are unfolded and pipelined. In the steady state with a pipeline of depth N, all stages operate in parallel processing the snapshot data. Movement from one pipeline stage to the other is through sharing the data in the local memory of the tiles.
Note that some loops are data-independent and could be parallelized. However, we opted to pipeline and unroll in all cases to simplify the implementation.
In the next two slides, we are going to illustrate loop unfolding/pipelining of both the Gram-Schmidt QRD and the spectrum calculation.
29:14 – QRD and Spectrum Computation on AIE
The Gram-Schmidt QRD consists of an outer loop and a decreasing number of inner-loop iterations for each iteration of the outer loop, as shown. At each iteration k of the outer loop, where k ranges from zero to seven, a norm process computes the diagonal element R(k,k) and column k of the orthogonal matrix Q.
The inner loops of the QRD, with eight minus k minus one iterations, execute a QR process. The QR processes compute the remaining eight minus k minus one elements of row k of R and update the corresponding eight minus k minus one columns of Q.
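To make that loop structure concrete, here is a plain scalar C++ reference of classical Gram-Schmidt on the 128-by-8 snapshot; it only illustrates the loops that get unrolled and pipelined across tiles (our own sketch, not the vectorized AIE kernel code, and without the numerical safeguards a production kernel would carry):

```cpp
#include <complex>
#include <cmath>

using cf = std::complex<float>;
constexpr int M = 128, N = 8;

// Classical Gram-Schmidt: A (M x N) is overwritten with Q, R is N x N upper triangular.
void gram_schmidt_qrd(cf A[M][N], cf R[N][N]) {
    for (int k = 0; k < N; ++k) {
        // "Norm" process: R(k,k) and column k of Q.
        float nrm = 0.0f;
        for (int i = 0; i < M; ++i) nrm += std::norm(A[i][k]);
        R[k][k] = cf(std::sqrt(nrm), 0.0f);
        for (int i = 0; i < M; ++i) A[i][k] /= R[k][k];

        // 8 - k - 1 "QR" processes: R(k,j) and the update of column j of Q.
        for (int j = k + 1; j < N; ++j) {
            cf r(0.0f, 0.0f);
            for (int i = 0; i < M; ++i) r += std::conj(A[i][k]) * A[i][j];
            R[k][j] = r;
            for (int i = 0; i < M; ++i) A[i][j] -= r * A[i][k];
        }
    }
}
```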
Mapping the QRD operations to the AIE compute resources is done through loop unrolling, as illustrated. The outer loop, iteration zero, is unfolded into a green norm process that computes the first column of Q and the first diagonal element of R, and into seven blue QR processes that compute the remaining seven elements of row zero of R and update the next seven columns of Q.
Note here that each process performs a task and passes the data on to the next process downstream. Similarly, outer loop one is unfolded into one green norm process and six blue QR processes—one less than the previous iteration—and so on. And finally, the last iteration of the outer loop consists only of one green norm process that computes the last element of the R diagonal.
Note that columns of Q that are not required by downstream processes can be dropped, as shown. Each green or blue process is mapped to a kernel that runs on a specific AIE tile. Each tile communicates its data to the next tile in the pipeline through the local data memory.
The total number of tiles required to perform the QRD process is related to the number of antenna elements, n, as n times n plus one divided by two—that is, 8 + 7 + … + 1 = 36 tiles when n is equal to eight.
The next slide illustrates how the operations of the spectrum calculation and DOA identification are mapped onto the AIE tiles. First, to compute the spectrum, the field of view of the ULA is divided into a grid of 256 points. For each point on the grid, a metric referred to as the null space spectrum is computed. This metric evaluates the orthogonality between the noise subspace and a signal steered at the azimuth angle that coincides with the grid point. The grid points that correspond to a null are identified as the azimuth angles of sources emitting or reflecting a signal from those directions.
To meet the target performance, the space spectrum sub-process is mapped to a 1D pipeline of M stages, with each stage computing the spectrum of 256 divided by M grid points. Through profiling, we determined that the vector compute resources of a single AIE tile can calculate the spectrum of up to four grid points within the one-microsecond target.
The null search sub-process, on the other hand, is mapped to—without going into details—an N-stage 1D pipeline that localizes minus-to-plus gradient changes (local minima) on the 256-point spectrum using a two-step scheme. Each stage of the 1D pipelines performing the spectrum calculation and DOA identification is mapped to a single AIE tile. As shown, 64 tiles perform the spectrum calculations and 18 tiles perform the DOA search.
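For illustration, here is a plain scalar C++ sketch of the work one spectrum tile would do for its 4 of the 256 grid points. It is our own reference code: the field-of-view gridding, dimensions, and names are assumptions, and the real kernels are vectorized.

```cpp
#include <complex>
#include <cmath>

using cf = std::complex<float>;
constexpr float kPi = 3.14159265358979f;

// d(theta) = || Vn^H * a(theta) ||^2 for one steering vector of an
// 8-element, half-wavelength ULA. Vn holds up to 4 noise-subspace columns.
float null_spectrum_point(const cf Vn[8][4], int noiseDim, float sinTheta) {
    float d = 0.0f;
    for (int c = 0; c < noiseDim; ++c) {               // one noise-subspace column at a time
        cf proj(0.0f, 0.0f);
        for (int n = 0; n < 8; ++n) {
            float phase = kPi * n * sinTheta;           // half-wavelength element spacing
            cf a(std::cos(phase), std::sin(phase));     // steering-vector element a_n(theta)
            proj += std::conj(Vn[n][c]) * a;            // (Vn column c)^H * a(theta)
        }
        d += std::norm(proj);                           // accumulate |.|^2
    }
    return d;                                           // nulls of d(theta) mark the DOAs
}

// One spectrum tile evaluates 4 consecutive bins of the 256-point field-of-view grid.
void spectrum_tile(const cf Vn[8][4], int noiseDim, int firstBin, float out[4]) {
    for (int k = 0; k < 4; ++k) {
        float theta = -0.5f * kPi + (firstBin + k) * (kPi / 256.0f);  // assumed gridding
        out[k] = null_spectrum_point(Vn, noiseDim, std::sin(theta));
    }
}
```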
With this, I’ll hand the presentation over to Peifang, who will walk us through the AIE design principle and flow. Thank you.
34:01 – AI Engine Design Principles
Thank you, Bachir. In the next two slides, we’re going to highlight some of the key design principles for AI engine–based designs. The emphasis is on efficient data movement and parallelism.
While the AI engine is software programmable in C and C++, its approach is fundamentally different from conventional C/C++ software development. First and foremost, we need to focus on data movement and parallelism.
In typical software development, where a layered structure is often deployed, the application software at the upper level tends to be portable—meaning it can run on a wide variety of processors without any tight integration with the underlying hardware. This provides flexibility and portability.
On the other hand, the C/C++ code running on the AI engine is designed to be tightly coupled with hardware—meaning it is highly optimized for specific operations and can achieve much greater performance.
36:00 – Parallelism at Engine, Data, and Instruction Levels
Traditional software development is often associated with the control plane, which is responsible for managing resources and orchestrating execution of tasks related to operations, administration, and maintenance. It focuses on the correct implementation of operational logic and the control protocols to achieve desired functionalities.
The data plane, however, focuses on managing the data flow, performing computations on the actual data, and running any pre- or post-processing if needed. It is where the raw data is moved and processed in real time, often in a highly parallel and distributed manner across multiple hardware units.
In our case, the hardware unit is the AI engine with a 2D array of up to 400 compute tiles. Traditional software tends to be linear and sequential and is run by a CPU. In contrast, high-performance systems—like those using the AI engine—require a fundamentally different approach.
It’s all about leveraging massive parallelism and advanced data movement strategies. In AI engine–based systems, tasks are executed simultaneously, making the system capable of handling a large number of operations in parallel. This is particularly important for real-time applications like DSP, where rapid data processing is critical.
The metaphor of keeping the AI engine rolling with 400 cylinders firing refers to ensuring all available tiles are fully utilized. By moving and processing data simultaneously across all available tiles, we ensure that the system is highly efficient, with idle time minimized and throughput maximized. This is exactly how we achieved a breakthrough performance of a one-microsecond throughput rate.
We continue in this slide on the parallelism at three levels of the AI engine: from the top engine level to the middle data level, and all the way down to the instruction level.
At the engine level, we have a 2D array of up to 400 tiles. In the MUSIC implementation, we take advantage of these 400 tiles and deploy a deep pipeline architecture to maximize throughput. The deep pipeline design allows continuous data processing, which is critical for achieving high performance in tasks like AI and DSP.
For the middle data-level optimization, we use the SIMD (Single Instruction, Multiple Data) vectorization. The SIMD approach allows the system to process multiple data elements in parallel with a single instruction—leveraging the parallel capabilities of the hardware to perform computations faster.
In our MUSIC design, we use vectorized complex float data types and APIs for 8-lane SIMD computations. This means that instead of processing one value at a time, the system can process multiple complex numbers simultaneously, speeding up computations like matrix multiplication, which are very common in AI and DSP.
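To make the 8-lane vectorization concrete, here is a hedged sketch of a complex-float inner product written with the public AIE API (aie::vector, aie::load_v, aie::mul, aie::reduce_add). It is our illustration of the coding style, not code taken from the MUSIC kernels, and the exact types and accumulation strategy in the real design may differ.

```cpp
#include <aie_api/aie.hpp>

// 8-lane complex-float inner product (no conjugation shown; a Hermitian dot
// product would conjugate one operand). Assumes n is a multiple of 8.
cfloat dot8(const cfloat* __restrict a, const cfloat* __restrict b, int n) {
    aie::vector<cfloat, 8> sum = aie::zeros<cfloat, 8>();
    for (int i = 0; i < n; i += 8) {
        aie::vector<cfloat, 8> va = aie::load_v<8>(a + i);              // 8 complex samples per load
        aie::vector<cfloat, 8> vb = aie::load_v<8>(b + i);
        sum = aie::add(sum, aie::mul(va, vb).to_vector<cfloat>());      // 8 complex multiplies per iteration
    }
    return aie::reduce_add(sum);                                        // horizontal sum across the 8 lanes
}
```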
39:00 – Development Workflow and Simulation Strategy
At the bottom instruction level, we have 7-way VLIW (Very Long Instruction Word). The AI engine allows up to seven instructions to be issued in one cycle, enabling extreme parallelism at the instruction level. These seven instructions include:
- One scalar operation
- Up to two move operations (for moving data between registers or memory)
- Two load operations (for loading data from memory to tile)
- One vector SIMD operation (for executing SIMD computations)
- One store operation (for storing processed data back to memory)
In our MUSIC design, we use compiler directives to guide the AI engine compiler to generate optimized 7-way VLIW assembly code to maximize the utilization of hardware parallel processing features.
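As an example of such directives, here is a hypothetical loop annotated the way AMD's AIE kernel coding guidance describes. The function, names, and trip count are our own and only illustrate the mechanism; they are not an excerpt from the MUSIC kernels.

```cpp
// The chess_* keywords below are AIE-compiler directives (not standard C++).
void scale_buffer(const float* __restrict in, float* __restrict out, float gain) {
    for (int i = 0; i < 128; ++i)
        chess_prepare_for_pipelining   // request software pipelining of this loop
        chess_loop_range(128, 128)     // declare the trip count so the scheduler can plan VLIW slots
    {
        out[i] = gain * in[i];         // trivial body; real kernels issue vector operations here
    }
}
```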
The next key design principle is divide and conquer. We divide the MUSIC processing chain into distinct individual blocks—which, as mentioned by Bachir, are QRD for QR decomposition, SVD for singular value decomposition, spectrum calculation, and finally, the DOA (Direction of Arrival) identification.
Then we use loop unrolling and pipelining within each block to optimize the performance. Once each individual block is designed and tested to make sure it meets performance targets, we then stitch together all the individual blocks for the final design.
Fidus has also published a blog outlining two best design practices for AI engine designs. The first one is the C/C++ techniques for boosting runtime performance with compile-time optimizations. The second one is the logical design setup for maximizing efficiency and maintainability. These two best practices were drawn from our MUSIC experience, but are written in a generic form without any specific reference to MUSIC.
Now we give a high-level overview of how the MUSIC development flowed during our MUSIC project execution. We first set up the processing chain in terms of breaking down the entire processing chain into individual blocks. This modular approach allows for easier debugging, faster simulation, and parallel development for different teams working on separate blocks.
After the processing chain is designed, we delve into individual blocks, which consist of kernel development and testing.
For the kernel development, we first determine the input and output of the block for integration with the other blocks. Then we design efficient data movement and memory access patterns within the block. Then we start writing the custom kernel functions to exploit SIMD and VLIW parallelism and address bottlenecks during the simulation and testing.
For the simulation and testing, we need to set up test benches. These will define test cases that exercise all possible edge cases and performance-critical paths of the design. We then first run the x86 simulations to make sure that each individual block is functionally correct before we look for performance bottlenecks.
Next, we run the AIE simulations to address any potential bottlenecks in performance. These simulations are crucial to check the performance levels. During the AIE simulations, Vitis Analyzer is an indispensable tool to check the interactions between the kernels and the tiles.
43:33 – Final Integration and Hardware-in-the-Loop Testing
After all individual blocks are tested to meet individual performance targets, we can proceed to stitch together all these blocks for the final design. We call it the top-level design.
At the top level, we integrate all the blocks for the final design, and then as usual, we run x86 simulations to check the system-level integration issues such as data movement, data movement bottlenecks, memory access problems, etc., verifying the design is working at the system level.
Once the x86 functional correctness is verified, we run the AIE simulations to catch performance issues at the system level to ensure that the overall system performance targets are met before deploying the final design to hardware. In our case, it is the VCK190 evaluation kit.
And then we run the hardware-in-the-loop testing. This step deploys the design to the actual hardware, and it validates the system functionality and the performance running real-world tests.
Here is a diagram of the directory structure of the MUSIC reference design on the AMD GitHub website. So we have six individual functions—
50:00 – Demonstration Example and Performance Validation
With that, I’ll hand control back to Bachir.
Thank you, Peifang. As I mentioned in the introduction, and also mentioned by Peifang, to demonstrate the AIE running MUSIC in real time, we developed a hardware-in-the-loop test bench that enables real-time testing of the MUSIC algorithm on the AIE accelerator.
The end-to-end hardware-in-the-loop system consists of a host machine connected to the VCK190 board through a TCP/IP link. The MATLAB models running on the host machine generate both synthetic snapshots and reference data for MUSIC. As an example, in the following visualization, a batch of 16 successive snapshots is generated at regular intervals to capture and track two moving sources in the field of view of the ULA.
The VC1902 SoC, connected to DDR4 on the VCK190, has a bandwidth of 25 gigabytes per second. That is sufficient for real-time testing. The main components of the VC1902 SoC include the PS subsystem, the AIE accelerator, the fabric, and the DDR controller. All components connect to the NoC interconnect system.
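As a rough check on that claim (our arithmetic, using the snapshot format described earlier):

$$
128 \times 8 \ \text{complex floats} \times 8\ \text{bytes} \approx 8\ \text{KB per snapshot},\qquad
8\ \text{KB} \,/\, 1\ \mu\text{s} \approx 8\ \text{GB/s} \ll 25\ \text{GB/s}.
$$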
A HIL app running on the PS extracts the snapshot data received through the TCP/IP link from the host computer and stores it in the DDR4. Then, to meet the bandwidth requirement, data movers in the PL push the snapshots, in the order received from the host, to the AI Engine array using the PLIO interface. The AI Engine running the MUSIC app returns both the estimated DOAs and the spectrum data after an initial latency corresponding to the depth of the pipeline. As shown, the MUSIC data is stored in the DDR4 through the data movers.
Once the AIE completes the MUSIC operations on the entire snapshot batch, the HIL app reads the MUSIC results from the DDR4 and sends them through the TCP/IP link to the host computer for both visualization and correctness checks. As an example, in the following visualization, the MUSIC result of a 16-batch snapshot is shown, where the triangles are the reference position of the moving sources, while the colored lines depict the direction of sources as detected by MUSIC running on the AIE engine.
Furthermore, note on the bottom left side of the output diagram the accelerator performance numbers that are estimated using the AIE profiling counters. These performance numbers are received with the MUSIC results. Finally, note also that the data is exchanged between the host and the board via an ad hoc protocol over TCP/IP.
With this, I will hand the presentation back over to Peifang, who will conclude. Thank you.
53:33 – Key Takeaways and Conclusion
Thank you, Bachir. I would like to highlight the key takeaways from today’s webinar.
The Versal family represents AMD’s latest FPGA offerings, with its AI engine recognized as a groundbreaking innovation in the industry. As demonstrated in this session, the AI engine is a disruptive force capable of accelerating complex, compute-intensive algorithms like MUSIC—significantly boosting performance.
The call to action is clear: unlock the full potential of the AI engine to enhance and accelerate your unique algorithm. Fidus brings extensive expertise in supporting Versal designs, ensuring optimal utilization of both the AI engine and the traditional PL/PS within FPGA systems. Visit the Fidus website and click the link below to connect with our team of Versal experts.
Thank you very much. [Applause]
55:08 – Q&A Session
All right, sweet. All right. Thank you, everybody. Thank you for joining us and sticking around here for the Q&A. So we do have a few minutes for some questions. Real quick before we get there, just want to give a couple of reminders.
So you’ll be able to come back and watch this on demand anytime using the same link that brought you here. So we’ll have the recording available later today. And again, Session Two is just a replay of this same presentation. We wanted to give some people a couple of options for getting access for the Q&A. So if you want to come back, you can—you can watch the presentation again, but it’ll be the same content that you just heard.
So I see one question off the top asking about the source code, and I saw somebody already responded to that with the link to the GitHub. But if anybody else is interested, you can find that link in the Q&A, or you can also go into the Resources. We have the GitHub tutorial for this webinar linked there.
Okay, so let me go ahead and jump into this first question. Let me make sure…
Okay, so the first question is—maybe Mark can take this one—what motivated using the one-sided Jacobi for the SVD part of the algorithm and not divide-and-conquer approaches?
56:50
Yeah, thanks. Thanks, Jeffrey. This is a really good question and kind of targets the aspect of system partitioning of algorithms onto the AI engine.
One of the main reasons we chose this one was we actually had used it in our library block design, so we were familiar with it. But it’s a good choice for data snapshots of this size. We have 128 by 8. We can fit the full matrix in the local tile memory as we build it. But if you were to apply MUSIC to a much larger array with, you know, hundreds or thousands of points, then I think you’d run into this situation where we wouldn’t have enough local tile memory to handle the full matrix at the same time. And then that’s going to drive you toward different partitions.
And so divide-and-conquer approach is a great approach, and we haven’t investigated that one in detail too much yet, but we do plan to extend our library to handle larger matrices and stuff, and I think this will be, you know, right at the top of the list when we go there. So this is a great question. I hope that helps to answer it.
58:03
I think so. Let’s see, so we got another question—maybe Bachir, you can touch on this one. Are we considering switching from complex float to complex 16-bit integers to improve performance?
58:19
Very good question. First, to answer—coming back to the throughput—it’s right that if we handled complex 16-bit integers instead of complex single-precision floats, it would be a performance improvement of roughly four times. However, working with integers or fixed point is always a headache, and you really have to be careful. The work has to start at the system level: you have to make sure that the accuracy and robustness of the solution are acceptable. That is on one hand.
And on the other hand, once you start implementing fixed point, you have to make sure everything is correct with respect to the golden model, and you have to handle all the scaling carefully. So my remark would be that we actually recommend not going that route.
Remember that all the operations are multiplications followed by additions. Each time you multiply two 16-bit values, you need an additional 16 bits to represent the product. And with each addition you need roughly one more bit—the growth is logarithmic in the number of terms. So you have to handle all of that scaling carefully. Again, I would not recommend it. But if the system-level work is done correctly, then yes, it’s possible and you will gain throughput. Thank you.
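A quick worked version of the bit growth described here (standard fixed-point bookkeeping, not figures quoted in the answer):

$$
16\text{-bit} \times 16\text{-bit} \rightarrow 32\text{-bit product},\qquad
\text{summing } K \text{ such products adds} \approx \lceil \log_2 K\rceil \text{ bits (e.g. } K = 128 \Rightarrow 7 \text{ extra bits).}
$$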
1:00:13
Sweet. All right, let me see here. So we are at time, but we do have a few more questions if we’re ready to stick around here a little bit. There was a question: do we have a performance table with the different sizes?
1:01:11
No, we don’t have that. We don’t have that kind of data. We built the design based on the fixed system specs that we presented in the webinar. But you can download the code and you can use the profiling features of the tools to, you know, get an idea of the clock cycles that each of the stages are taking. And then you could—you could build a simple spreadsheet, you know, by scaling out the sizes of the column vectors and such to sort of estimate, you know, what you might be able to achieve with a larger or smaller design.
1:02:18
May I add something? This is with respect to the scaling. As you know, the algorithm, as we presented it, has three parts: QRD, SVD, and spectrum.
The QRD part depends directly on the number of samples and also the number of antenna elements, and it scales as N squared with the number of antenna elements—actually N(N+1)/2.
And the SVD is also similar. However, since we are doing the one-sided Jacobi, that has iteration numbers on top of it. So it’s N squared times the number of iterations.
And lastly, the spectrum calculation is linear with respect to the number of grid points, or the angular resolution. So, for instance, if you go from 256 points, which is about 0.7 degrees, to half that resolution—0.35 degrees—you will also have to double the number of tiles that you need.
So all I’m saying here is that scaling up will require more tiles, and at some point the design may not fit—that is on one hand.
1:02:31
Okay, all right.
1:02:49
Well, we have a few more questions that we didn’t get to. Are there any other questions you want to touch on before we close out here? We can try to follow up offline for the ones we can’t get to now.
1:03:08
A quick one—just somebody mentioned that the latency was one millisecond. We never said that. Actually, the performance is one microsecond, and the latency of the system is in the tens of microseconds.
1:03:40
Got it. All right, yeah—Lucas’s question here, he asks if there’s more throughput on the table here. In a sense, there is. We only ended up using about 150 tiles in the array on this, and so you could probably double the performance if you, you know, pipelined out the kernels across the full array of resources like the Fidus guys showed us here.
1:04:00
There’s one last question actually: “Is it possible to drop in a DNN accelerator and train the models?” For training, I don’t know the answer, but for inference, it’s the same operations, actually.
1:04:18
There’s a tutorial on the GitHub that shows inference in the context of this MNIST handwritten digit database. So you can get an idea of how to map convolutional neural networks onto the Versal AI Edge devices from that tutorial. You’ll find it on the GitHub.
1:04:24
There, great, cool.
1:04:31
And that is on the same GitHub link that we said?
1:04:40
Yeah, it’s the same area. In the same repository, there’s a folder for AIE-ML designs, and then there’s a different folder for AIE designs.
1:04:49
All right, great. Well, thanks everybody for sticking around here to the end of the Q&A. And if you could fill that survey out before you leave here today, that would be super helpful.
But other than that, I hope you have a great rest of your day, and we’ll see you in the next one. Thank you very much. And I want to give a big thank you to the Fidus team for their help on this one and on the collaboration in general.
Thank you very much. Have a great rest of your day. I know you…
Featured Speakers:
Speaker Biographies
Bachir Berkane
System and Algorithm Architect, Fidus Systems
Bachir Berkane is a senior system and algorithm architect with over 32 years of experience in ASIC design, digital signal processing, and system modeling. He has led complex projects across high-performance computing, radar, and telecommunications, with deep expertise in hardware acceleration, AI-driven edge inference, and protocol-based system architectures. Throughout his career at Fidus Systems, IDT, and PMC-Sierra, Bachir has developed advanced algorithms, optimized DSP implementations, and contributed to cutting-edge FPGA and SoC designs. He holds a Ph.D. in Electronic Systems and is a published author with multiple patents. Bachir has also taught university-level courses in electrical engineering, solidifying his reputation as a key innovator in embedded systems and signal processing.
Peifang Zhou
Senior Embedded Software Designer, Fidus Systems
Peifang Zhou is an embedded software expert specializing in AMD SoC FPGAs, C/C++ development, real-time operating systems (RTOS), and MCU integration. His career spans advanced embedded software design and systems optimization, where he consistently delivers scalable, efficient solutions. Peifang holds B.Sc., M.Sc., M.A.Sc., and Ph.D. degrees in Electrical and Computer Engineering. He is widely recognized for his ability to bridge low-level embedded design with system-level integration, and is a frequent speaker and contributor in the field of embedded systems engineering.
Mark Rollins, Ph.D.
Sr. Manager, Technical Marketing, AMD
Dr. Mark Rollins leads technical marketing efforts at AMD, providing strategic direction and technical leadership in the development of DSP-focused hardware and software solutions. With a Ph.D. and decades of experience in embedded systems, signal processing, and semiconductor product development, Mark is instrumental in defining and promoting AMD’s adaptive computing platforms. He collaborates closely with partners and engineering teams to deliver real-time, high-throughput solutions using AMD's Versal architecture.