Large Multi-National Mil/Aero Company Custom Camera Signal Processing Device

Project Challenges

– A group within the company was designing next-generation image processing technology with very high data rates and gigapixel-per-second throughput

– Data input was required to arrive over (16) 2 Gbit/s SerDes channels using a custom protocol

– Data output was required to leave over (4) 10.3125 Gbit/s SerDes channels using the Camera Link HS (CLHS) X protocol for compatibility with off-the-shelf frame grabbers

– Primary control input came from (1) 10.3125 Gbit/s SerDes channel and/or from a secondary USB 2.0 port in the event high-speed SerDes was not available for control

– The production plan was to develop a digital ASIC to perform the video processing algorithms. A desire for more system-level customization of the algorithms led to the selection of an FPGA for flexibility. This project would provide a platform that could be used across many different design applications in future developments.

– The architecture was designed around the latest generation of FPGAs from Xilinx (AMD), the Virtex UltraScale+ VU9P (VCU118 development board). The design used the 'regular' programmable fabric plus high-performance external DRAMs, processing cores, and high-speed SerDes cores.

– Due to the high data rates required for processing, a combination of on-die SRAM and 2 channels of external DDR4 (x64) DRAM was used to meet the extreme memory bandwidth requirements of the architecture

– Processing the video data stream required custom video processing algorithms

  • While the algorithms were similar to standard video filters, the analysis called for writing custom code for the filter logic to meet system performance targets within the available hardware resources
  • More than 30 clock domains had to be managed in the design
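The aggregate link rates implied by the lane counts above can be worked out with a few lines of arithmetic. This is a minimal sketch, not part of the project code: the lane counts and raw line rates come from the requirements, while the 64b/66b coding efficiency (`kAssumedCodingEfficiency`) is an assumption for illustration only, and protocol framing overhead is ignored.

```cpp
#include <cstdio>

// Lane counts and raw line rates as stated in the requirements.
constexpr int    kInputLanes     = 16;
constexpr double kInputLaneGbps  = 2.0;
constexpr int    kOutputLanes    = 4;
constexpr double kOutputLaneGbps = 10.3125;

// ASSUMPTION (not from the project): 64b/66b line coding on the output lanes.
constexpr double kAssumedCodingEfficiency = 64.0 / 66.0;

// Raw aggregate rates: lanes multiplied by per-lane line rate.
constexpr double raw_input_gbps()  { return kInputLanes * kInputLaneGbps; }
constexpr double raw_output_gbps() { return kOutputLanes * kOutputLaneGbps; }

// Output payload rate under the assumed line coding (framing overhead ignored).
constexpr double assumed_output_payload_gbps() {
    return raw_output_gbps() * kAssumedCodingEfficiency;
}

// Prints the three figures so they can be compared side by side.
void print_link_budget() {
    std::printf("raw input:              %.3f Gbit/s\n", raw_input_gbps());
    std::printf("raw output:             %.3f Gbit/s\n", raw_output_gbps());
    std::printf("assumed output payload: %.3f Gbit/s\n", assumed_output_payload_gbps());
}
```

Under these assumptions the raw input is 32 Gbit/s and the raw output is 41.25 Gbit/s, so the output links have headroom over the input even after line coding.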

Solutions

- HLS (High-Level Synthesis) in the AMD (Xilinx) design software suite was leveraged:

  • to code and verify video processing, using an OpenCV-based test bench for comparison with the customer's legacy designs
  • to verify imaging in a smaller FPGA with very short build and test times, then move to the UltraScale+ fabric with minimal effort, so that design, memory management, and performance trade-offs could be analyzed much more quickly and efficiently
  • to support complex algorithms and bit packing with minimal development cost
  • to enhance capabilities beyond the original scope within the budget and timeframe
  • to add custom debugging features when helpful, by writing C++ code that was included in the FPGA through HLS

- HLS was coded in C++ and used only for single-clock-domain functions, a design decision made for simplicity and speed of development that saved months of design effort

- Filters and DMA were included, primarily with AXI and AXI4-Stream interfaces

  • Blocks with multiple DMA interfaces were easily managed and optimized by using HLS stream interfaces

- Development time was minimized while achieving a throughput of 8 pixels per clock

- Multi-clock-domain functions and clock domain crossings were managed with traditional RTL coding techniques and AMD (Xilinx) library elements
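As an illustration of the 8-pixels-per-clock structure and of checking it against a reference model, here is a minimal sketch in portable C++. It is not the project's code: the 16-bit pixel type, the offset-subtraction filter, and all function names are hypothetical, and `std::vector` stands in for the stream interfaces a real HLS block would use. HLS directives (pipelining the outer loop, unrolling the inner one) are left out so the sketch compiles with any standard compiler.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kPixelsPerBeat = 8;                // 8 pixels handled per clock
using Beat = std::array<std::uint16_t, kPixelsPerBeat>;  // hypothetical 16-bit pixels

// Hypothetical per-pixel operation: subtract an offset and clamp at zero.
inline std::uint16_t subtract_offset(std::uint16_t px, std::uint16_t offset) {
    return px > offset ? static_cast<std::uint16_t>(px - offset) : 0;
}

// Scalar reference model: one pixel per iteration (the "golden" side of a test bench).
std::vector<std::uint16_t> reference_filter(const std::vector<std::uint16_t>& in,
                                            std::uint16_t offset) {
    std::vector<std::uint16_t> out;
    out.reserve(in.size());
    for (std::uint16_t px : in) out.push_back(subtract_offset(px, offset));
    return out;
}

// Packs a flat pixel vector into 8-pixel beats; the length must be a multiple of 8.
std::vector<Beat> pack_beats(const std::vector<std::uint16_t>& px) {
    assert(px.size() % kPixelsPerBeat == 0);
    std::vector<Beat> beats(px.size() / kPixelsPerBeat);
    for (std::size_t i = 0; i < px.size(); ++i) {
        beats[i / kPixelsPerBeat][i % kPixelsPerBeat] = px[i];
    }
    return beats;
}

// Beat-wide version: each outer iteration models one clock and consumes one beat.
// The inner loop covers the 8 lanes that hardware would process in parallel.
std::vector<Beat> beat_filter(const std::vector<Beat>& in, std::uint16_t offset) {
    std::vector<Beat> out(in.size());
    for (std::size_t beat = 0; beat < in.size(); ++beat) {
        for (std::size_t lane = 0; lane < kPixelsPerBeat; ++lane) {
            out[beat][lane] = subtract_offset(in[beat][lane], offset);
        }
    }
    return out;
}
```

A test bench in this style feeds the same pixels to `reference_filter` and `beat_filter` and compares the results element by element, which is the same idea as comparing HLS output against a legacy or OpenCV-based model.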

Results

– The customer now has HLS code that can be used in future projects to allow quick analysis of trade-offs for different architectures. Designs can be synthesized and analyzed quickly to target different requirements and performance characteristics

– The AMD (Xilinx) toolset, with its integrated HLS support, provided that capability and is optimized for FPGA implementations

- HLS using C++ provided an optimal environment for:

  • Customer review and algorithm signoff
  • Simplified embedded code development
  • Translating code from the test bench to the embedded code

- Taking advantage of the existing IP cores (memories, processors, I/O interfaces) provided a quick solution that is compliant with existing standard products to allow scaling and other benefits within the AMD (Xilinx) design environment

Additional Feedback on HLS Debug

It was convenient to write blocks in C and to use C tools and C debugging on them. Large simulations run significantly faster in C code than in RTL simulation, and C-level breakpoints simplified debug.

Knowing what was in the C code gave us insight into what was going to be synthesized into hardware.
Data alignment checks could be quickly inserted into the code to verify multi-stream data, then easily commented out or removed for implementation to keep the implemented gate count within what was available in the FPGA.
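One way such a removable alignment check might look is sketched below. This is an illustrative assumption, not the project's code: the `Sample` struct, its sideband counters, and the `combine` function are hypothetical, and `std::vector` stands in for the stream interfaces. The check is guarded with the `__SYNTHESIS__` macro, which the HLS tool defines during synthesis, so the check runs in C simulation but adds no logic to the hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stream element: pixel data plus position counters used for checking.
struct Sample {
    std::uint16_t data;
    std::uint16_t line;    // line index within the frame
    std::uint16_t column;  // column index within the line
};

// True when both samples refer to the same pixel position.
inline bool aligned(const Sample& a, const Sample& b) {
    return a.line == b.line && a.column == b.column;
}

// Adds two streams element by element and returns the number of misaligned
// pairs seen. When the check is compiled out, the return value is always 0.
std::size_t combine(const std::vector<Sample>& a, const std::vector<Sample>& b,
                    std::vector<std::uint32_t>& out) {
    assert(a.size() == b.size());
    std::size_t mismatches = 0;
    out.clear();
    for (std::size_t i = 0; i < a.size(); ++i) {
#ifndef __SYNTHESIS__
        // Simulation-only check: excluded from the synthesized design.
        if (!aligned(a[i], b[i])) ++mismatches;
#endif
        out.push_back(static_cast<std::uint32_t>(a[i].data) + b[i].data);
    }
    return mismatches;
}
```

Guarding the check with a macro rather than commenting it out keeps a single source file for both simulation and implementation builds.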

We added debug ports to output very large amounts of data for capture and analysis, and could also capture waveforms using the ILA (Integrated Logic Analyzer).
We added custom DMA controllers to attach critical metadata for debug.

We extensively used HLS constraints and clock rates to meet gate count, timing, and latency targets when synthesizing the full design in the FPGA. This made it significantly faster to analyze the design, without having to write new RTL.

HLS allowed us to pretty much ignore clock counting for data transfers. It dealt with managing the extra clocks of a transfer without us having to go through and calculate everything to the exact clock.

We added DRAM idle-time counters in HLS that were readable by the microprocessor to analyze the read/write traffic to the DRAM. This let us tune the memory read/write traffic to ensure enough bandwidth for robust, complex DMA management.
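The idle-counter idea can be modeled in a few lines of C++. This is a sketch under stated assumptions, not the project's implementation: the `PortActivity` and `DramTrafficCounters` types and the per-clock `tick` model are hypothetical, and in hardware the counters would be registers exposed to the processor rather than struct members.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-clock snapshot of DRAM port activity.
struct PortActivity {
    bool read_active;
    bool write_active;
};

// Hypothetical counter set; in hardware these would be processor-readable registers.
struct DramTrafficCounters {
    std::uint32_t total_cycles = 0;
    std::uint32_t read_cycles  = 0;
    std::uint32_t write_cycles = 0;
    std::uint32_t idle_cycles  = 0;

    // Called once per modeled clock: a cycle is idle when neither port is active.
    void tick(const PortActivity& p) {
        ++total_cycles;
        if (p.read_active)  ++read_cycles;
        if (p.write_active) ++write_cycles;
        if (!p.read_active && !p.write_active) ++idle_cycles;
    }

    // Idle share of the observed window as an integer percentage (rounded down).
    std::uint32_t idle_percent() const {
        assert(total_cycles > 0);
        return static_cast<std::uint32_t>(
            (static_cast<std::uint64_t>(idle_cycles) * 100u) / total_cycles);
    }
};

// Runs the counters over a recorded activity trace.
DramTrafficCounters measure(const std::vector<PortActivity>& trace) {
    DramTrafficCounters counters;
    for (const PortActivity& p : trace) counters.tick(p);
    return counters;
}
```

Reading the idle percentage before and after a change to the DMA schedule gives a quick indication of whether the change freed up or consumed memory bandwidth.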