Hardware architectural options for artificial intelligence systems

30 April 2024 Editor's Choice AI & ML

By Efinix, www.efinixinc.com.

With smart sensors creating data at an ever-increasing rate, it is becoming exponentially more difficult to consume and make sense of the data to extract relevant insight. This is providing the impetus behind the rapidly developing field of artificial intelligence (AI). When most of us think about AI, we picture video feeds with green boxes drawn around people, suspicious objects, or faces, to alert monitoring personnel of a situation that needs further attention. This may indeed be where AI got its first foothold, but increasingly, it is evolving and being used in everything from language translation to DNA sequencing.

Training and inference

In discussing AI we must first draw a distinction between training an AI model and using the trained model for inference on a real-world data stream. Training is many orders of magnitude more compute intensive than inference, and requires the compute resources of a data centre to achieve results in a reasonable time frame. The near-unlimited compute capability, large power budget and virtualised, time-sliced business models of the data centre mean that economics permit the use of expensive and highly parallel compute resources to achieve the training task. Here we see entire racks of GPUs and high-end FPGAs that specialise in massively parallel operations at the expense of power and budget. Given the increasing complexity of AI models and the massively asymmetric use case of training vs inference, it is likely that training will remain firmly in the data centre, although incremental training algorithms are being deployed that reduce the asymmetry and place some of the burden of training out at the network edge.

Inference, on the other hand, while still extremely computationally intensive, can be achieved outside a data centre and closer to the network edge where the data is being created. Here, power, physical space and budgets are constrained. If AI inference is to be deployed at the edge, at scale, it must be cost effective, run on available power, and operate in the harsh environments found at the network's edge. These constraints are seemingly at odds with each other and severely limit the choices of a designer seeking to deploy a sufficiently performant AI capability at the edge.

AI architectures

AI architectures are evolving rapidly. What seemed impossibly complex, and the subject of academic research just a few years ago, today seems mundane. Efforts to provision sufficient compute resources to run state-of-the-art AI models have largely been successful, only for the models themselves to progress to new levels of complexity and accuracy. The following is a very high-level discussion of three major categories of AI architectures and the challenges they present.

Convolutional Neural Networks (CNNs)

CNNs have been the workhorse of AI for many years. In their simplest form, they take a frame of input data and step a smaller analysis window across the data in two dimensions, performing progressive mathematical convolutions. Subsequent steps of normalising the result, and resolution reduction of the input data before repeating the convolution process, have the effect of reducing the data to a mathematical abstraction. With sufficient convolution parameters, repeated steps, and the use of fully connected and hidden neural network layers, the presence of features within the input data frame can be numerically extracted.
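As a rough illustration of the stepping-window process described above, the following Python sketch runs one convolution over a tiny frame, then reduces its resolution. It is a minimal sketch, not a production implementation: the function names, the edge-detecting kernel and the use of a ReLU as a stand-in for the normalising step are all assumptions made for this example.

```python
def conv2d(frame, kernel):
    """Step `kernel` across `frame` (no padding, stride 1), summing the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(frame) - kh + 1
    out_w = len(frame[0]) - kw + 1
    return [
        [
            sum(frame[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Assumed stand-in for the normalising step: clamp negative values to zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool2x2(fmap):
    """Resolution reduction: keep the largest value in each 2x2 block."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]


# Hypothetical 5x5 frame with a vertical edge between columns 1 and 2.
frame = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical 2x2 kernel that responds to a dark-to-light step, left to right.
kernel = [[-1, 1],
          [-1, 1]]

features = max_pool2x2(relu(conv2d(frame, kernel)))
# features -> [[2, 0], [2, 0]]: the edge survives as a strong response on the left.
```

Every output position of `conv2d` depends only on the input frame and the kernel, never on another output, which is why the calculation can be spread across as many parallel compute units as the hardware provides.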

CNNs lend themselves to data sets such as video, where a two-dimensional array of input data (and subsequent video frames) can be exploited to expose massive parallelism in the algorithms and greatly increase performance. CNNs then benefit from processing architectures that can expose parallel processing capability such as FPGAs and GPUs, rather than sequential CPU and MCU architectures.

Recurrent Neural Networks (RNNs)

While CNNs perform well on arrays of data that have a limited time dependency between sets, there are applications such as language translation, mathematical biology and financial analysis where insight can be gained from the order of data elements over time. For these applications we need to include some form of memory in the AI algorithm so that past data can be used to gain insight from subsequent readings. This is the realm of RNNs. RNNs include a feedback path in their computation so that past state is carried forward and alters the result calculated for the current input. What is fed back can be tuned to change the characteristics of the model and make it somewhat forgetful of its past. The sequential nature of an RNN's data, along with the need to process each data element in turn with regard to the past, means that it is more difficult to exploit parallelism in RNN models, so sequential architectures such as CPUs and MCUs can be applied.
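The feedback path can be sketched in a few lines of Python. This is a deliberately simplified, single-value recurrent cell; the function name `rnn_run` and the weights `w_x` and `w_h` are hypothetical, and real RNN layers operate on vectors with learned weight matrices.

```python
import math


def rnn_run(inputs, w_x, w_h, bias=0.0, h0=0.0):
    """Process `inputs` in order, feeding each state back into the next step."""
    h = h0
    states = []
    for x in inputs:
        # The previous state h re-enters the calculation through w_h.
        h = math.tanh(w_x * x + w_h * h + bias)
        states.append(h)
    return states


# A single pulse followed by silence (hypothetical data).
pulse = [1.0, 0.0, 0.0]

remembering = rnn_run(pulse, w_x=1.0, w_h=0.9)  # the pulse fades gradually
forgetful = rnn_run(pulse, w_x=1.0, w_h=0.0)    # the pulse is gone after one step
```

With `w_h` at 0.9 the first input still influences the later states, though more weakly at each step; with `w_h` at zero the model has no memory at all. The loop also makes the parallelism problem visible: each step needs the state from the step before it, so the steps cannot be computed side by side.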

Transformers

The memory capability of RNNs makes them good at processing sequential data. The feedback mechanism though makes for a leaky memory where the further back in time an event occurred, the less likely it is to have an impact on the current calculation. This is particularly problematic for tasks such as natural language processing or translation. It is the nature of translations for example that not every word in one language can be sequentially translated into an equivalent in another. In most cases, the context of an entire sentence must be known before an equivalent can be successfully derived. This is the realm of transformers.

Transformers were introduced in the 2017 Google paper ‘Attention Is All You Need’ and have all but displaced the other techniques since. Transformer algorithms are built around the concept of self-attention. In language applications for example, each word of an input sentence is analysed in parallel. Its position within the sentence is recorded, as well as the potential importance of the word to the context of the sentence. Words that are critical to the context of the sentence are given ‘attention’. This is similar to the way a human might consider a sentence before translating it, and results in a translation that preserves the context and meaning of the original rather than a word-for-word rendering. Clearly transformers contain massive parallelism. They are more complex and have higher compute requirements than their CNN predecessors, and benefit from the massive compute and parallel architectures of FPGAs and GPUs.
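A minimal Python sketch of self-attention, in its scaled dot-product form, may help make this concrete. It is a simplification under stated assumptions: the word vectors are used directly as queries, keys and values, whereas a real transformer applies learned projections and adds positional information first. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Return (outputs, weights) for a list of equal-length token vectors."""
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        # Score this token against every token in the sentence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights.append(softmax(scores))
    # Each output is a blend of all tokens, weighted by the attention paid to them.
    outputs = [
        [sum(w * value[d] for w, value in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


# Three hypothetical two-dimensional word vectors; the first and third are alike.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(tokens)
```

In this toy case the first word pays equal attention to itself and to the matching third word, and less to the second. Unlike the step-by-step recurrence of an RNN, the row of weights for each word is computed independently of every other row, which is where the parallelism comes from.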

Transformers have revolutionised language processing, but also find use across a myriad of applications where the context of data in a time-sensitive stream is critical. Their ability to extract context and propose additional content with similar intent has led to generative transformers that specialise in creating ‘original’ content. Generative Pre-trained Transformers (GPTs) are being made commercially available in a wide variety of applications. Some specialise in human-type interactions, most notably ChatGPT, which has an almost uncanny ability to produce original, human-like content in response to a spoken or written input.

FPGAs and edge inference

Solutions for the edge must have low power consumption, be cost effective and be able to operate in harsh environments. Faced with an increasing need for parallel compute at near hardware speeds, the options for designers are becoming increasingly sparse.

GPUs and traditional FPGAs offer a parallel compute capability, but consume too much power and are too expensive. Microcontrollers and microprocessors are sequential in nature, and lack the performance for full AI models. Custom silicon is prohibitively expensive to develop, and comes with design times longer than entire generations of AI architectures. The only viable option for a designer is an Efinix FPGA.

It is worth an aside here to mention that the open source community has put a lot of work into developing tools that can quantise AI models and reduce their complexity to a point where running on a microcontroller can, in some cases, deliver the desired performance. Tools such as TensorFlow Lite can create models that, with the aid of runtime libraries, can execute on microcontroller architectures. Efinix has embraced this effort, and has developed a series of custom instruction libraries that run on the Sapphire embedded RISC-V SoC inside Efinix FPGAs. In this way, quantised models can be made to run many hundreds of times faster, delivering a high-performance AI capability in a small, low-power footprint. This is the basis of the Efinix TinyML platform, which is available free of charge on the Efinix GitHub.
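To give a feel for what quantising a model involves, the Python sketch below maps floating-point weights onto 8-bit integers using a single scale factor. This is a generic illustration of symmetric quantisation, not the method used by TensorFlow Lite or the Efinix TinyML platform; the function names and the example weights are hypothetical.

```python
def quantise_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    peak = max(abs(v) for v in values)
    scale = peak / 127 if peak else 1.0  # guard against an all-zero input
    quantised = [max(-127, min(127, round(v / scale))) for v in values]
    return quantised, scale


def dequantise(quantised, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantised]


# Hypothetical layer weights.
weights = [0.4, -1.0, 0.26, 0.0]

q_weights, scale = quantise_int8(weights)
restored = dequantise(q_weights, scale)

# Largest rounding error introduced; it stays within about half a scale step.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight now fits in one byte instead of four, and the multiply-accumulate operations can run in integer arithmetic, at the cost of a small, bounded rounding error. That trade is what brings some models within reach of microcontroller-class hardware.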

As AI models increasingly migrate from traditional CNN architectures to transformers, the compute requirements and need for parallelism will push them out of the reach of microcontrollers, and will leave the ultra-efficient Efinix architecture as the only viable option for state-of-the-art AI at the edge.

Transformers and the future of AI

The utility, accuracy and complexity of AI models will continue to increase, and at an exponential rate. As they do, the efficiency of the revolutionary Efinix FPGA architecture will come into its own. The parallelism and compute capability of the FPGA will continue to deliver the performance needed with ultra-low power consumption to provide the platform for this exciting field of innovation.

Credit(s)

Tel:	+27 11 608 0144
www:	www.nuvisionelec.com