Speedcore Gen4 increases performance by 60%, reduces power by 50% and cuts die area by 65%, while retaining the original Speedcore eFPGA IP’s ability to bring programmable hardware acceleration to a broad range of compute, networking and storage systems. Target applications include interface protocol bridging and switching, algorithmic acceleration and packet processing.
With the Speedcore Gen4 architecture, Achronix has added a Machine Learning Processor (MLP) to the library of available blocks and delivers 300% higher system performance for artificial intelligence and machine learning (AI/ML) applications.
MLP blocks are highly flexible compute engines, tightly coupled with embedded memories, that give the highest performance per watt and the lowest-cost solution for AI/ML applications.
“Achronix Speedcore eFPGA with Gen4 architecture provides an optimal balance of hardware acceleration previously found only in ASIC implementations,” said Robert Blake, president and CEO of Achronix Semiconductor. “Our new architecture adds the flexibility and reprogrammability of our proven FPGA technology to support exploding demand for new AI/ML and high data bandwidth applications.”
The dramatic increase in fixed and wireless network bandwidth, coupled with the redistribution of processing and the emergence of billions of IoT devices, means that multicore CPUs and SoCs cannot meet their requirements unaided. They need hardware accelerators, often reprogrammable, to pre-process and offload computations and so increase the systems’ overall compute performance.
In addition to the general requirements of compute and networking infrastructure, AI/ML demands a significant increase in high-density, targeted computing. The Achronix MLP exploits the specific attributes of AI/ML processing and increases performance for these applications by 300% compared to previous Achronix FPGAs. This is done through multiple architectural innovations that increase operating performance and the number of operations per clock cycle.
The MLP is a complete AI/ML compute engine. Each MLP includes a local cyclical register file that leverages temporal locality for optimal reuse of stored weights or data. The MLPs are tightly coupled with neighbouring MLP blocks and larger embedded memory blocks to deliver the highest processing performance, the highest operations per second and the lowest power profile. The MLPs support multiple fixed-point and floating-point precisions, including Bfloat16, 16-bit half-precision floating point, 24-bit floating point and block floating point (BFP). Users can select the precision that gives their application the best trade-off between performance, power and area.
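To make two of the number formats named above more concrete, the following Python sketch shows how a value can be reduced to Bfloat16 and how a block of values can share one exponent in block floating point. It is only an illustration of the formats themselves, not of how the MLP implements them: the function names, the 8-bit mantissa width, the truncating conversion and the clamping behaviour are all assumptions made for this example.

```python
import math
import struct


def to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit Bfloat16 pattern for x by truncating a float32.

    Bfloat16 keeps the sign bit, the 8 exponent bits and the top 7
    mantissa bits of an IEEE 754 float32. Truncation is an assumption
    made for simplicity; hardware may round to nearest instead.
    """
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16


def from_bfloat16_bits(bits16: int) -> float:
    """Expand a 16-bit Bfloat16 pattern back to a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", (bits16 & 0xFFFF) << 16))
    return value


def bfp_quantize(values, mantissa_bits=8):
    """Quantize a block of floats to block floating point (BFP).

    Returns (shared_exponent, mantissas) so that each input is
    approximated by mantissa * 2**shared_exponent. Every mantissa is a
    signed integer that fits in mantissa_bits bits (hypothetical width).
    """
    largest = max(abs(v) for v in values)
    if largest == 0.0:
        return 0, [0 for _ in values]
    # frexp returns (m, e) with largest == m * 2**e and 0.5 <= m < 1.
    _, e = math.frexp(largest)
    shared_exponent = e - (mantissa_bits - 1)
    limit = (1 << (mantissa_bits - 1)) - 1
    mantissas = []
    for v in values:
        q = round(v / 2.0 ** shared_exponent)
        # Clamp to the signed range in case rounding overflows it.
        mantissas.append(max(-limit - 1, min(limit, q)))
    return shared_exponent, mantissas


def bfp_dequantize(shared_exponent, mantissas):
    """Reconstruct approximate floats from a BFP block."""
    return [m * 2.0 ** shared_exponent for m in mantissas]
```

In this sketch the block [1.0, 0.5, -0.25, 0.0] is stored as four small integers plus a single exponent, so a multiply-accumulate over the block needs only integer arithmetic and one final scaling step. That is the general appeal of BFP for dense AI/ML workloads.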
To complement the MLP and increase AI/ML compute density, Speedcore Gen4 look-up tables (LUTs) can implement multipliers that are twice as efficient as those in any standalone or embedded FPGA product in the industry. Leading FPGAs implement a 6x6 multiplier in 21 LUTs, whereas Speedcore Gen4 implements a 6x6 multiplier in 11 LUTs and can operate at 1 GHz.
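The LUT counts quoted above can be turned into a quick back-of-the-envelope comparison. The short Python sketch below computes the efficiency ratio and the number of 6x6 multipliers that fit in a fixed LUT budget. The 10,000-LUT budget is a hypothetical figure chosen for illustration, and the calculation ignores routing, placement and timing effects.

```python
LUTS_PER_MULT_LEADING_FPGA = 21  # figure quoted in the text
LUTS_PER_MULT_SPEEDCORE_GEN4 = 11  # figure quoted in the text


def lut_efficiency_ratio(baseline_luts: int, new_luts: int) -> float:
    """How many times fewer LUTs the new implementation needs."""
    return baseline_luts / new_luts


def multipliers_in_budget(lut_budget: int, luts_per_multiplier: int) -> int:
    """Whole multipliers that fit in a LUT budget (logic only)."""
    return lut_budget // luts_per_multiplier


if __name__ == "__main__":
    ratio = lut_efficiency_ratio(
        LUTS_PER_MULT_LEADING_FPGA, LUTS_PER_MULT_SPEEDCORE_GEN4
    )
    budget = 10_000  # hypothetical LUT budget, not from the source
    print(f"efficiency ratio: {ratio:.2f}x")
    print("leading FPGA:", multipliers_in_budget(budget, LUTS_PER_MULT_LEADING_FPGA))
    print("Speedcore Gen4:", multipliers_in_budget(budget, LUTS_PER_MULT_SPEEDCORE_GEN4))
```

The ratio works out to roughly 1.9, which is consistent with the "twice as efficient" claim once rounded.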
The new Speedcore Gen4 architecture has many architectural innovations that increase overall operating performance by 60% compared to the previous generation. All aspects of the LUTs have been enhanced to increase area efficiency and reduce resource usage, which in turn reduces power and die size and increases performance. Changes include doubling the size of the ALUs, doubling the registers per LUT, support for 7-bit functions and some 8-bit functions in a single level-of-logic delay, and dedicated high-speed connections for shift registers.
The routing architecture has also been enhanced with an independent, dedicated bus routing structure that includes dynamically selectable bus muxing, which effectively creates a distributed, run-time-configurable switching network. This is the first time that run-time logic functionality has been available in the routing structure, and it provides an optimal solution for high-bandwidth and low-latency applications.
Achronix’s ACE design tools include pre-configured Speedcore Gen4 eFPGA example instances that users can use to evaluate Speedcore Gen4 quality of results in terms of performance, resource usage and compile times.