
Strategies for optimizing FPGA designs with AMD Versal Adaptive SoCs

WEBINAR RECORDING ON-DEMAND


Key takeaways

  • Deep dive into AMD Versal: Explore the architecture and capabilities of the VC1902 adaptive SoC and its application in FPGA designs.
  • Hands-on demonstration: Gain practical insights through a walkthrough of a recent AMD Vivado project setup, showcasing integration and configuration techniques.
  • Expert insights: Learn effective FPGA design strategies and the advantages of using the Network on Chip (NOC) for design partitioning and data movement from Normand Leclerc.
  • Understanding the AMD Versal AI Core Series: Discover the features and potential of the AMD Versal AI Core series, crucial for advancing your FPGA designs.
  • Integrating Key System Components: Learn how to connect essential components like CIPS, DDR, and AI engines using the NOC, enhancing system performance and efficiency.
  • Optimizing Data Transfer: Develop skills to optimize data movement across the FPGA chip, employing the NOC for efficient design partitioning, which is key to managing complex designs.
  • Visualizing and Configuring NOC Structures: Gain insights into the NOC’s configuration and learn techniques to visualize its structure, enabling better understanding and use in FPGA projects.

Transcript

Timestamps:

  • [1:52] About Fidus
  • [4:10] Presentation: Strategies for Optimizing FPGA Designs
  • [4:34] High-Speed Data Moving Application Example
  • [4:54] Understanding Versal SoCs
  • [7:52] Using the AMD VCK190 Evaluation Kit
  • [8:27] Project Configuration
  • [9:14] Detailed NoC Configuration
  • [12:02] Connectivity and QoS Configuration
  • [15:14] Examining the NoC from the API Perspective
  • [16:01] Implementing the NoC
  • [19:09] Dynamic Function Exchange (DFX)
  • [23:52] Conclusion
  • [24:32] Q&A Session

 


 

[0:01] Introduction

Welcome, everyone, to today’s webinar, “Strategies for Optimizing FPGA Designs on AMD Versal Adaptive SoCs and FPGAs.” We’re thrilled to have you join us as we delve into various techniques and strategies for harnessing the power of the Versal product family in FPGA design. Whether you’re a seasoned engineer or just starting out in the field, today’s session is designed to equip you with the knowledge and insights needed to enhance your Versal-based projects’ efficiency and effectiveness, based on our firsthand experience with the product.

[0:37] Housekeeping Items

Before we begin, let’s go over some quick housekeeping items to ensure a smooth experience throughout today’s webinar. We will be hosting a Q&A session at the end, so you can submit your questions at any time during the webinar using the Q&A button on your control panel. Feel free to type them in as they arise during the session, and we’ll address them later on. Should you encounter any technical issues or need assistance, please send a message to the chat, and one of our administrative team members will be here to help you.

[1:09] Our Expertise

We’re here today to talk about our experience with AMD Versal. But why do we feel we have the authority to speak on this topic? Well, in our first decade, Fidus did a ton of varied AMD Xilinx-related work. In 2011, Xilinx asked Fidus to become their first-ever premier design services member in North America. To this date, Fidus and AMD collaborate closely on opportunities and training together on the latest technologies, with the end goal of helping our clients get their solutions to market as quickly and smoothly as possible. We also happen to boast the largest team of AMD-certified engineers and professionals in North America.

[1:52] About Fidus

So, who is Fidus? Founded in 2001, we have grown into a 150-plus person, all-North-American electronic system design and development services company. We serve all industries and markets and have three brick-and-mortar locations: two in Canada and one in Silicon Valley, USA. We’ve been working with FPGAs since 2002, and it would be safe to say that more than 80% of our projects have included FPGA content. We do an incredible amount of work with AMD products and also have a go-to partnership with Altera (formerly Intel) and many years of experience with other manufacturers like Lattice and NXP, to name a few. Our proudest achievement is our repeat customer rate—close to 95% of our customers return year after year with new projects and new business, a testament to the quality and efficiency of our work in improving our customers’ time to market.

[2:55] Our Services

As a full-service electronic systems design firm, our professionals and engineers cover a multitude of service disciplines, including FPGA design, embedded software, hardware, signal and power integrity, ASIC RTL design, and verification. We help our customers by supplying these expert skills directly or by managing complete projects from start to finish.

[3:16] Featured Speaker

Now, I’d like to welcome Normand Leclerc, a distinguished senior FPGA designer and team leader with over two decades of experience in FPGA design, simulation, and DSP architecture. He has extensive experience integrating complex electronic designs and optimizing DSP algorithms, with a deep focus on video and image transport and high-speed network infrastructures. Normand has extensive experience with AMD platforms such as Vivado and Versal and is a certified member of the AMD Alliance program. Normand’s experience has been crucial in pushing the boundaries of FPGA capabilities and implementing advanced network protocols on FPGA systems here at Fidus. We’re grateful to have him here talking about Versal today.

[4:10] Presentation: Strategies for Optimizing FPGA Designs

Thank you for attending. We hope you find this webinar informative and engaging. Please make sure you post those questions. On that note, over to you, Normand.

[4:16] Normand Leclerc:

Thank you, Alecia, for the introduction. Today I will present strategies for optimizing FPGA designs on AMD Versal Adaptive SoCs and FPGAs.

[4:34] High-Speed Data Moving Application Example

In this webinar, we will go through an example of a high-speed data moving application. We will cover its basic structure and configurations, look at the implementation, and discuss the expected performance of the system. Finally, we will glimpse how we can change the system to work under dynamic function exchange.

[4:54] Understanding Versal SoCs

Versal SoCs are a mix of scalar, adaptable, and intelligent engines designed to target a wide range of applications. They feature high-speed interfaces such as PCIe, multi-Gigabit Ethernet, and memory controllers, along with a good number of I/Os. These structures are distributed around the die to leave the adaptable engine, programmable logic (PL), in the center. This layout can make the die large, with scalar engines and memory controllers far from the I/Os and AI engines far from the memory.

To make things worse, some Versal devices comprise a few interconnected dies called super logic regions, or SLRs. How can we efficiently move data from one component to another when they can be placed on two opposite sides of the SoC? For this, Versal includes a high-speed network on chip (NoC) within the PL. The NoC can transfer data at a frequency of up to one gigahertz on a 128-bit data bus. It has integrated high-efficiency DDR4 and LPDDR4 memory controllers, and some Versal models also have HBM memory controllers. The NoC is hard-wired and therefore always runs at optimum speed, requiring minimal effort from the placement-and-routing engine. The NoC supports bandwidth allocation and three levels of quality of service. It can generate parity, transfer the poison bit, and do address translation for easy connections. The NoC uses memory-mapped AXI and AXI-Stream interfaces to connect to the attached peripherals.
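As a quick sanity check on those figures (the arithmetic below is our own illustration, not from the talk), the peak throughput of a single 128-bit channel at 1 GHz works out as:

```python
# Peak throughput of one NoC physical channel, using the figures quoted
# in the talk: a 128-bit data bus clocked at up to 1 GHz.
bus_width_bits = 128
clock_hz = 1_000_000_000  # 1 GHz

peak_bits_per_s = bus_width_bits * clock_hz      # raw bit rate
peak_gbytes_per_s = peak_bits_per_s / 8 / 1e9    # convert to GB/s

print(peak_bits_per_s / 1e9)   # 128.0  (Gb/s)
print(peak_gbytes_per_s)       # 16.0   (GB/s)
```

At 16 GB/s per channel, a single NoC path comfortably exceeds the 6 GB/s video stream discussed later in the webinar.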

[7:52] Using the AMD VCK190 Evaluation Kit

In this webinar, we will be using the AMD VCK190, an AI evaluation kit that should be on everyone’s birthday wish list. This kit has multiple interfaces, such as PCIe, Ethernet in both RJ45 and SFP formats, and HDMI ports for both input and output. It has eight gigabytes of soldered-on LPDDR4 memory and another eight gigabytes of DDR4 DIMM memory. For expansion, the card has an FMC expansion port. The kit targets data center, radio, automotive, and radar applications. The project we are presenting is an HDMI 2.1 video frame grabber. We wanted to capture 8K video at 60 frames per second, which translates to 48 gigabits per second of video data. We will imagine this system as part of a larger design and thus artificially contain it in a portion of the FPGA using a Pblock. On the VCK190, the HDMI is connected to the GT transceivers on bank 106, so we will use the NoC to transfer data from the HDMI transceivers in that corner down to the LPDDR controller and its memory.
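The 48 Gb/s figure can be reproduced with a quick back-of-the-envelope calculation; the 24-bits-per-pixel assumption below is ours, not stated in the talk (an actual HDMI 2.1 link may use a different pixel format):

```python
# Rough bandwidth estimate for the 8K60 capture described in the talk.
# Assumption (ours): 24 bits per pixel, e.g. 8-bit RGB.
width, height, fps = 7680, 4320, 60
bits_per_pixel = 24

gbits_per_s = width * height * fps * bits_per_pixel / 1e9
print(round(gbits_per_s, 1))  # 47.8 — i.e. the ~48 Gb/s quoted
```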

[8:27] Project Configuration

The project we are using is an IPI design. I decided to do it this way because it’s easier to show. Although I configured the CIPS and the NoC in IP integrator here, you could very well instantiate them directly in RTL; it’s just that easy to use the NoC in RTL code. So, we have the CIPS, the NoC, the AI engine, and the HDMI transceivers. I added a video frame buffer writer to transfer the video to memory. We have supporting blocks for clock generation and reset generation, and we have an AXI SmartConnect.

[9:14] Detailed NoC Configuration

Zooming in on the CIPS and NoC, we see that they share seven AXI interfaces: four cache-coherent interfaces, one full-power-domain interface, one low-power-domain interface, and one for the PMC. We also have the video feed coming into the NoC for the DDR memory. On the master side, we have one master for the AI engine control and another master for the HDMI frame grabber control. Let’s have a quick look at the NoC configuration. For a more detailed description, refer to PG313, the product guide for the NoC.

The configuration opens on the Board tab, which has been populated from the board selection, the VCK190. Moving to the General tab, this is where we select the numbers of slave and master AXI interfaces and the number of clocks. We also have the inter-NoC interfaces, which we don’t use in our design, so that is zero. In the memory controller configuration, we have a single memory controller and its memory controller ports, and we selected the lower base address at zero.

[10:53] Input and Output Configuration

In the Inputs tab, we associate each slave AXI interface with the peripheral it is attached to. The first ones were the cache-coherent interfaces, so we selected that category. The next one was the full-power-domain interface, which is non-coherent, so we selected that. Then came the LPD and the PMC, which have their own categories, so we selected those as well. The eighth input was the video feed connected to the PL fabric. Moving to the Outputs tab, the first master is connected to the AI engines, which have their own category, so we selected that. The second master goes to the PL fabric and is the HDMI frame grabber control.

[12:02] Connectivity and QoS Configuration

In the Connectivity tab, we have our connectivity matrix. This is where we tell the tool which slave AXI has access to which master. Here, our interfaces will be able to write to memory, so we selected a memory controller port for each one of them. We wanted the full power domain to control the HDMI frame grabber, so we selected that as well. Our video feed, which comes from the HDMI frame grabber, will only be able to talk to the memory, so it will write to the memory controller.
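The connectivity matrix described above can be sketched as a simple table (the interface names below are our shorthand, not the tool’s): each slave AXI is granted access only to the masters it needs.

```python
# Sketch of the NoC connectivity matrix from the talk: which slave AXI
# interface can reach which master. Names are hypothetical shorthand.
connectivity = {
    "cips_cc0":   {"ddr_mc"},                            # cache-coherent
    "cips_cc1":   {"ddr_mc"},
    "cips_cc2":   {"ddr_mc"},
    "cips_cc3":   {"ddr_mc"},
    "cips_fpd":   {"ddr_mc", "hdmi_frame_grabber_ctrl"}, # FPD also controls the grabber
    "cips_lpd":   {"ddr_mc"},
    "cips_pmc":   {"ddr_mc"},
    "video_feed": {"ddr_mc"},                            # video only writes to memory
}

def can_reach(slave, master):
    """Return True if the matrix allows this slave-to-master path."""
    return master in connectivity.get(slave, set())

print(can_reach("video_feed", "ddr_mc"))                   # True
print(can_reach("video_feed", "hdmi_frame_grabber_ctrl"))  # False
```

Restricting each slave to the masters it actually needs keeps the NoC compiler from provisioning paths that will never carry traffic.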

Regarding quality of service (QoS), all our slaves are best effort except the video feed, which requires a certain amount of bandwidth: 48 gigabits per second, or six gigabytes per second. So, we entered six gigabytes per second here and gave it the isochronous priority, which guarantees a maximum latency. Address remap is for address translation; in our case, we don’t need any.
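A quick unit check on that QoS entry (our own arithmetic): the NoC QoS tab takes bytes, so the bit rate from the video link must be divided by eight.

```python
# Convert the video link bit rate to the byte rate entered in the QoS tab.
video_gbits_per_s = 48
gbytes_per_s = video_gbits_per_s / 8

print(gbytes_per_s)  # 6.0 — the value entered in the NoC QoS tab
```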

[13:39] Final Configuration and Placement

This is the basic configuration for the clock and the DDR controller type; we’re using LPDDR4. Most of the options here are grayed out because we selected the board, the VCK190, and told the tool to use a specific DDR controller, so there’s not much more to say here. Now let’s move to DDR Memory. This is where we enter all the timings for the DDR chips and whether we’re using ECC memory. Then, DDR Address Mapping: here we have three predefined address mappings, or we can define our own.

In the last tab, DDR Advanced, we have more ECC options, refresh options, power-saving options, and migration options. All the DDR configurations were left at their defaults; they were populated by the board settings selected when we created the project. So, we can close this and regenerate the layout.

[15:14] Examining the NoC from the API Perspective

Let’s have a look at the NoC from the API perspective. Here we have an unplaced NoC; this is roughly what it will look like once it is placed. We can see the NMUs and NSUs that are in use: the NMUs are the NoC master units, the slave AXI interfaces into the NoC, and the NSUs are the NoC slave units, the master AXI interfaces out of the NoC.

From here, we can select each site and manually place the NoC components with LOC constraints. We can’t drag them around, but we can open the properties and place them manually here. I’ve never had any issue with the tool placing them automatically, but if you want to place them manually, this is where it is done. Next, I highlighted all the nodes of the NoC to see where it sits on the die.

 

 

[16:01] Implementing the NoC

Here we see all the structures of the NoC on the die. It’s all a bit messy, so let’s clean it up and show just the tiles. Looking at just the tiles, we get a few dots here and there, plus the vertical lines; this is how the whole NoC is laid out on the die. Now let’s look at our implementation. We have the HDMI frame grabber here, and the CIPS and the DDR controller right here. I constrained the HDMI frame grabber to one part of the chip on purpose so it doesn’t spread around. This keeps it as far as possible from the DDR memory, showing that the NoC can move data efficiently with dynamic allocation.

[17:49] Quality of Service

Now let’s look at the NoC placed and routed. If you remember, before placement it looked like this. Now it is placed, with our NMUs and NSUs and the DDR controller in the right places. We have the HDMI frame grabber here, extended over here because the master interface is here. Now let’s look at the quality of service we requested. In the NoC QoS view, we see that we required six gigabytes per second and the estimate is also six gigabytes per second, which is fine, and we see the estimated latency in cycles. The latencies here should always be consistent.

[19:09] Dynamic Function Exchange (DFX)

That’s why I wanted to constrain the HDMI into a Pblock: to enable dynamic function exchange (DFX), so that this region could be a partially reconfigurable region. I did that in another project with the exact same configuration, except it was modified for DFX. Now we have two NoCs: one for the static region, which has the DDR, and the other for the PR region. Each NoC is connected back to the initial NoC through inter-NoC interfaces. Here we see that we actually have one slave and one master inter-NoC interface: the master for control and the slave for the video feed coming back.

In the DFX version, we can look at the QoS. The QoS we see here comes from the reconfigurable module, and we can change it. If we go back to the design and open the NoC for the PR region, we have our master and slave interfaces: the slave is the control, the master is the video output, and another slave interface here is the video input, which comes from the AXI slave to the NoC master interface.

[22:58] Connectivity and Configuration

In the Connectivity tab, we have the AXI PL interface going to the master inter-NoC interface. In the QoS tab, we see the QoS on the master interface set to six gigabytes per second. Only a few small changes were needed; everything is handled by the NoC, it’s just that easy. Of course, we had other signals, so I had to add some other logic to enable PR, but that is the subject of another webinar.

[23:52] Conclusion

In this webinar, we have seen that Versal devices are large FPGAs with an integrated network on chip (NoC) that can transfer data at high speed between different components. It spans the entire PL region and has multiple entry and exit points. Finally, it can manage quality of service and bandwidth allocation. I thank you for listening and hope I have taught you a thing or two. Giving control back to you, Alecia.

[24:32] Q&A Session

Alecia:

Thank you so much for that, Normand. I am now going to bring you to the Q&A part of the webinar. Let’s see, Normand, hopefully you can turn on your camera. There we go. There is Normand. I’m also going to invite Mike Brown, another senior FPGA designer here at Fidus, to join us and answer some of the questions that have come in today. Thank you, Normand, for the presentation.

Now, I’m just going to encourage everybody once again. I have a couple of questions that have come through. However, if you have a question for Normand or Mike, please do pop it into the Q&A. Let’s jump straight into them.

Mark S: Could you discuss some of the challenges you’ve faced with integrating CIPS, DDR, and AI engines within Versal? How do you recommend overcoming these challenges?

Normand: One of the challenges was assigning the DDR pins, the NoC DDR interfaces, using the XPIOs; you have to properly select the pins on the FPGA. The other challenge is understanding which component of the CIPS you’re using and through which interface it connects to the NoC. The CIPS has multiple interfaces to the NoC: the cache-coherent interface, the non-coherent interface, and the PMC interface. You have to know which channel the communication goes through so you can connect the proper interfaces.

Mike: For devices with more than one super logic region, as Normand mentioned, timing can be quite challenging. Particularly for signals that are not going through the NoC, it is important to look at registering those signals that are crossing SLR boundaries.

Pooja: Can you give some examples of strategies for optimizing data transfer across the chip using the NoC? Do you have any examples of how you have successfully employed those strategies?

Normand: First, you have to know the traffic you’re dealing with. You have to know how much bandwidth you require and how often you require it. For example, I’m using the HDMI frame grabber. The video interface is a periodic signal, which has blanking, so it’s not constant. But it requires a certain amount of latency. You need your video at a certain time, or you’re going to skip a frame. That’s why I used isochronous QoS. You have to know your traffic and play with QoS and probably placement, but you might be constrained by your IO placements.
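Normand’s point about periodic traffic can be sketched numerically; the 75% active-video fraction below is a hypothetical figure for illustration, not a measured HDMI timing:

```python
# Illustration (hypothetical numbers): periodic video traffic has blanking
# intervals, so the burst bandwidth during active lines exceeds the average
# bandwidth entered in the QoS tab.
average_gbytes_per_s = 6.0   # the QoS figure from the talk
active_fraction = 0.75       # hypothetical: 75% of the frame period is active video

burst_gbytes_per_s = average_gbytes_per_s / active_fraction
print(burst_gbytes_per_s)  # 8.0 — what the path must absorb during active lines
```

This is one reason to size QoS around traffic shape, not just the average rate: the isochronous class bounds the latency of those bursts.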

Mike: The NoC resources are finite. Make sure you understand where you’re not wasting those resources. Categorize the important high-bandwidth, low-latency type traffic and assign those to the NoC. Lower bandwidth, controller status interfaces can still use the general PL fabric, reserving the NoC for the really important stuff.
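Mike’s triage rule could be sketched as a simple classifier; the interface names, bandwidth figures, and threshold below are all hypothetical:

```python
# Sketch of the triage rule above: route high-bandwidth or latency-
# sensitive traffic over the NoC, keep low-rate control/status on the
# general PL fabric. All names and numbers are hypothetical.
interfaces = {
    "video_feed":  {"gbytes_per_s": 6.0,   "latency_sensitive": True},
    "ai_engine":   {"gbytes_per_s": 4.0,   "latency_sensitive": False},
    "ctrl_status": {"gbytes_per_s": 0.001, "latency_sensitive": False},
}

NOC_THRESHOLD_GBS = 1.0  # hypothetical bandwidth cut-off

def route(props):
    """Pick NoC for heavy or latency-sensitive traffic, PL fabric otherwise."""
    if props["latency_sensitive"] or props["gbytes_per_s"] >= NOC_THRESHOLD_GBS:
        return "NoC"
    return "PL fabric"

for name, props in interfaces.items():
    print(name, "->", route(props))
```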

Steve: When setting the QoS parameters, isochronous was used for the video data. Can you explain what this means? Did it change the NoC implementation?

Normand: Isochronous means that you’re guaranteed a maximum latency, so the signal won’t linger in the NoC forever. This is crucial for video frame buffers because you need your frame at a certain time. It’s not high priority, because control signals might be prioritized over the video signal, but you need to be sure your traffic goes through at a certain time. High priority means the signal goes over everything else, which might not be the best way for video. Best effort is not ideal because it doesn’t guarantee timing.

Mike: No additional thoughts here. Normand covered everything.

Alecia: Excellent. Thank you, Normand and Mike. We’re going to wrap up now. Please visit our website at fidus.com/partners/AMD. We’d love to speak to you more about any of your Versal challenges. We will follow up with you, and you will receive a copy of this recording in your inbox shortly. Thank you, Normand and Mike. We will be back with more Versal content soon. Thank you to all.

 

Speaker: Normand Leclerc, Senior FPGA designer

Normand Leclerc is a distinguished senior FPGA designer and team leader with over two decades of experience in FPGA design, simulation, and DSP architecture. He has extensive experience integrating complex electronic designs and optimizing DSP algorithms, with a deep focus on video and image transport and high-speed network infrastructures. Normand has extensive experience with AMD platforms such as Vivado and Versal and is a certified member of the AMD Alliance program. Normand's experience has been crucial in pushing the boundaries of FPGA capabilities and implementing advanced network protocols on FPGA systems here at Fidus.

Additional Resources

Expand your knowledge with these additional resources from our website: