In this article, we'll look at three of the most common choices of core processor architecture for artificial intelligence systems: scalar, vector, and spatial. For each, we will summarize some of its performance characteristics and the kinds of algorithms it is best suited to. In a later article, we'll discuss in more depth how they are implemented and how they perform on different types of AI workloads.
No explanation of processor architecture is complete without the popular "Flynn taxonomy", because its nomenclature is so widely used. Its original intent was to describe how a Harvard architecture computer consumes instruction and data streams, and it makes the most sense in that context. Even so, modern processors are usually closer to one category than to the others, so we often refer to them in this way, but assuming that any modern processor fits entirely within one of these types would be a serious oversimplification. What is presented here is a slightly more modern reading of the taxonomy.
SISD: Single Instruction Single Data
The simplest form of CPU falls into this class. On each cycle, the CPU takes an instruction and a data element and processes them to modify the global state. This concept is the foundation of computer science, so most programming languages compile to a set of instructions for this architecture. Most modern CPUs also emulate SISD operation, although very different concepts may be used in the software and hardware underneath.
SIMD: Single Instruction Multiple Data
The simplest SIMD architecture is a vector processor, which resembles a SISD architecture with a wider data type, so that each instruction operates on multiple consecutive data elements. Slightly more complicated is thread parallelism, where a single instruction operates on multiple thread states, which gives a more general programming model.
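To make the "wider data type" idea concrete, here is a minimal Python sketch that counts how many instructions each style would issue for the same element-wise addition. The lane width of 4 and the function names are assumptions chosen for illustration; real vector widths depend on the hardware.

```python
LANES = 4  # assumed vector width, for illustration only


def sisd_add(a, b):
    """Add two equal-length lists, one element per instruction (SISD style)."""
    result, instructions = [], 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1
    return result, instructions


def simd_add(a, b, lanes=LANES):
    """Add two equal-length lists, `lanes` consecutive elements per instruction."""
    result, instructions = [], 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        # In hardware, all lanes of this chunk are added by one instruction.
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return result, instructions


a = list(range(8))
b = [10] * 8
print(sisd_add(a, b))  # same sums, 8 instructions
print(simd_add(a, b))  # same sums, 2 instructions
```

Both functions return the same sums; only the instruction count differs, which is the whole point of the wider data path.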
MISD: Multiple Instruction Single Data
There is no general consensus on what counts as an MISD processor, so I will not be strict about the definition here. Consider an architecture that can execute multiple arbitrary instructions in a single cycle on a single data input. This essentially requires chaining the output of one operation into the input of the next without storing intermediate results. Later, we will see the advantages of this kind of architecture.
MIMD: Multiple Instruction Multiple Data
Again without being strict, a very long instruction word (VLIW) processor fits this category best. The purpose of such a processor is to expose a programming model that more accurately matches the processor's available resources. A VLIW instruction can send data to all execution units simultaneously, which gives great performance advantages through instruction-level parallelism (ILP), but the compiler must be aware of the architecture and perform all the scheduling optimizations. In general, this has proven to be challenging.
Scalar (CPUs): Mixed Performance
The modern CPU is a very complex system designed to perform well on a wide variety of tasks. Its elements cover every category of Flynn's taxonomy. You can of course program it as a SISD machine, and it will give you output as if the program had been executed in the order you wrote it. However, each CISC instruction is typically translated into a chain of multiple RISC instructions that execute on a single data element (MISD). It will also look ahead over the instructions and data you provide and schedule them to execute in parallel on many different execution units (MIMD). There are also many operations, such as those in the AVX instruction set, that perform the same calculation on many aligned data elements in parallel (SIMD). In addition, since multiple cores run in parallel and multiple threads share the resources of a single core simultaneously, almost any type of parallelism in the Flynn taxonomy can be achieved.
If the CPU ran in simple SISD mode, fetching each instruction and data element from memory one at a time, it would be very slow no matter how high the clock frequency. In modern processors, only a relatively small portion of the die area is dedicated to actually performing arithmetic and logic. The rest is devoted to predicting what the program will do next, and arranging the instructions and data for efficient execution without violating any causal constraints. Perhaps the feature most relevant when comparing CPU performance with other architectures is the handling of conditional branches. Instead of waiting for a branch to resolve, the CPU predicts which direction it will go and then fully restores the processor state when the prediction turns out to be wrong. Hundreds of such tricks have been etched into silicon, tested on a wide variety of workloads, and they offer great advantages when executing highly complex arbitrary code.
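As a rough illustration of branch prediction, the sketch below simulates a two-bit saturating counter, a classic textbook scheme. This is an assumption made for the example: the article does not say which predictor any particular CPU uses, and production predictors are far more elaborate.

```python
def count_correct_predictions(outcomes):
    """Simulate a 2-bit saturating counter on a sequence of branch outcomes.

    `outcomes` is a list of booleans (True = branch taken).
    Counter states 0 and 1 predict "not taken"; 2 and 3 predict "taken".
    """
    counter = 2  # assumed initial state: weakly "taken"
    correct = 0
    for taken in outcomes:
        prediction = counter >= 2
        if prediction == taken:
            correct += 1
        # Move toward the observed outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct


# A loop branch taken nine times and then exiting is predicted well.
print(count_correct_predictions([True] * 9 + [False]))  # 9 of 10 correct
# A strictly alternating branch defeats this simple scheme half the time.
print(count_correct_predictions([True, False] * 5))     # 5 of 10 correct
```

Every wrong prediction corresponds to the state restoration described above, so regular branch patterns cost little while irregular ones are expensive.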
Moore's Law Philosophy
In my first job, I was assigned to integrate a very expensive application-specific integrated circuit (ASIC), which was considered necessary for real-time decoding of satellite images. I noticed that the design had been around for a few years, and some calculations told me that I could get almost the same computing power from an Intel processor. Before the ASIC was available, I wrote the algorithm in C and demonstrated it on a Pentium III CPU. At that time, "Dennard scaling" was progressing so fast that general-purpose processors overtook the need for dedicated processors within a short period. Perhaps the biggest advantage of choosing a general-purpose processor is that it is easy to program, making it the platform of choice for algorithm development and system integration. An algorithm can be optimized for a more specialized processor, but the CPU is already very good at doing this for you. In my particular case, the first version of the satellite used Reed-Solomon codes, but future designs were considering Turbo codes. Downlink sites using the ASIC would have to replace the entire system, while our site would need only a simple software update and routine CPU upgrades. So you can spend your time optimizing your code, or you can spend it on innovative applications. The corollary of Moore's Law is that it will soon be fast enough.
Vectors (GPU and TPU): Simple and Parallel
In many ways, vector processors are the simplest modern architecture: a very limited computational unit that is repeated many times on the chip to perform the same operation on large amounts of data. These were first popularized for graphics, hence the term GPU. In general, GPUs do not have the predictive gymnastics that CPUs use to optimize complex arbitrary code, and they have a limited instruction set that supports only certain types of computation. Much of the advancement in GPU performance has come through basic technology scaling of density, area, frequency, and memory bandwidth.
A recent trend is to extend the GPU instruction set to support general-purpose (GP) computing. These GP instructions must be adapted to run on the SIMD architecture, which exposes some advantages and disadvantages depending on the algorithm. Many algorithms that are written as a repeated loop on the CPU actually perform the same operation on each adjacent element of an array in every iteration. With some effort from the programmer, they can be parallelized easily, sometimes on a massive scale, on the GPU.
It is worth noting that if there is a condition on any element, then all branches must be run on all elements. For complex code, this can mean that computation time grows exponentially with the number of nested conditions, relative to the CPU. The GPU has a very wide memory bus that provides excellent streaming data performance, but if memory accesses are not aligned with the vector processor elements, each data element requires a separate request on the memory bus, whereas the CPU has a very sophisticated predictive caching mechanism that can largely compensate for this.
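The cost of a conditional on a vector machine can be sketched in a few lines of Python. The example below is a hypothetical model, not any GPU's actual instruction stream: every lane computes both sides of the condition and a mask selects the result, while the scalar version computes only the side it needs. The operations chosen (doubling and negating) are arbitrary placeholders.

```python
def scalar_select(values, threshold=0):
    """CPU style: each element runs only the branch it actually takes."""
    result, ops = [], 0
    for v in values:
        result.append(v * 2 if v > threshold else -v)
        ops += 1
    return result, ops


def masked_select(values, threshold=0):
    """Vector style: both branches run on every lane, then a mask selects."""
    mask = [v > threshold for v in values]
    then_side = [v * 2 for v in values]  # computed for all lanes
    else_side = [-v for v in values]     # also computed for all lanes
    result = [t if m else e for m, t, e in zip(mask, then_side, else_side)]
    ops = len(then_side) + len(else_side)
    return result, ops


values = [3, -1, 4, -5]
print(scalar_select(values))  # ([6, 1, 8, 5], 4)
print(masked_select(values))  # ([6, 1, 8, 5], 8)
```

With one condition the work doubles; with n nested two-way conditions there are up to 2^n paths to evaluate on every lane, which is where the exponential growth comes from.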
The memory itself is very fast and very small, and depends on transfers over the PCIe bus for access to data. In general, developing GPGPU algorithms is much more difficult than developing for the CPU. However, this challenge is addressed to some extent by discovering and optimizing efficient parallel algorithms that achieve the same results through uniform execution of branches and aligned memory accesses. Often, these algorithms are less efficient in terms of raw operation count, but faster on parallel architectures.
Many of the algorithms popular in AI are based on linear algebra, and much of the progress in the field has come from massively scaling up the parameter matrices. GPU parallelism allows large-scale acceleration of the most basic linear algebra, so it suits AI researchers as long as they stay within dense linear algebra on matrices that are large enough to occupy most of the processing elements and small enough to fit in GPU memory. However, the acceleration is so great that, to date, great progress has been made in deep learning even within these limitations.
The two main thrusts of modern GPU development are Tensor Processing Units (TPUs), which perform full matrix operations in one cycle, and multi-GPU interconnects, which are used to handle larger networks.
Today, we are seeing ever greater differences between hardware architectures dedicated to graphics and hardware designed for AI. The simplest difference is in precision: AI is developing techniques based on low-precision floating-point and integer arithmetic. Slightly more subtle are the shortcuts that GPUs use to render convincing complex scenes in real time, often using very specialized computing units. The similarity between the architectures therefore ends at the highest level of optimization for each.
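To give a feel for what low-precision integer arithmetic involves, here is a minimal sketch of symmetric 8-bit quantization. The scheme (a single scale factor mapping the largest magnitude to 127) is an assumption picked for simplicity; the article does not specify which low-precision techniques it has in mind.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    peak = max(abs(v) for v in values)
    scale = peak / 127 if peak else 1.0  # avoid dividing by zero for all-zero input
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value differs from the original by at most half a step.
print(q, [round(r, 3) for r in restored])
```

The integers need a quarter of the storage of 32-bit floats and much simpler arithmetic units, at the price of a bounded rounding error per value.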
ASICs or FPGAs can be designed for any type of computing architecture, but here we focus on one specific type: a spatial dataflow architecture that differs somewhat from the other options and is relevant to artificial intelligence. In a clocked architecture such as a CPU or GPU, each clock cycle loads a data element from a register, moves the data to a processing element, waits for the operation to complete, and then stores the result back to a register for the next operation. In a spatial dataflow architecture, operations are physically connected on the processor so that the next operation is performed as soon as the result is computed, and the result is not stored in a register. When moderately complex units that hold their own state in registers local to the processing element are linked together in this way, we call them "systolic arrays".
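The following Python sketch simulates one possible systolic array computing a matrix product. The layout is an assumption made for illustration (an "output-stationary" grid in which each processing element keeps its own accumulator, rows of A flow left to right, and columns of B flow top to bottom); the article does not prescribe a specific arrangement. The software loop over time steps stands in for what the hardware does in parallel.

```python
def systolic_matmul(A, B):
    """Multiply A (n x m) by B (m x p) on a simulated n x p systolic grid."""
    n, m, p = len(A), len(B), len(B[0])
    acc = [[0] * p for _ in range(n)]    # local accumulator in each element
    a_reg = [[0] * p for _ in range(n)]  # value each element passes to the right
    b_reg = [[0] * p for _ in range(n)]  # value each element passes downward
    for t in range(n + m + p - 2):       # time steps until the last product is done
        # Visit elements bottom-right first so neighbours still hold last step's values.
        for i in reversed(range(n)):
            for j in reversed(range(p)):
                if j == 0:
                    k = t - i            # row i is fed in with a delay of i steps
                    a_in = A[i][k] if 0 <= k < m else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    k = t - j            # column j is fed in with a delay of j steps
                    b_in = B[k][j] if 0 <= k < m else 0
                else:
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in  # multiply-accumulate on local state
                a_reg[i][j], b_reg[i][j] = a_in, b_in
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Note that operands are read from outside the grid only at its left and top edges; everything else moves directly between neighbouring elements, which is the property the paragraph above describes.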
Power, Latency, and Throughput
Some immediate advantages are easy to see. In register-based processors, power consumption is dominated by storing data in registers and transferring it between them. In a dataflow design, the only energy consumed is in the processing elements and in passing data to the next stage. Another major advantage is the latency between elements, which is no longer tied to the clock cycle. There are also potential advantages in throughput, because data can be clocked into a systolic array at a rate limited only by the slowest processing stage. Data comes out at the same rate at the other end, after an initial delay while the data stream is established. Depending on the goals of the architecture, this can be more energy efficient and/or faster than the synchronous load-execute-store loop.
If the CPU is the easiest to program and the GPU presents a bigger challenge, then an FPGA requires considerable effort and skill, while an ASIC requires far greater cost and engineering investment. Nevertheless, the benefits for specific algorithms can still be great.
To see how large the advantage can be, consider that the "normalized delay" of one inverter driving another in modern silicon technology is measured in picoseconds, while clock cycles are close to a nanosecond. Similarly, transmission energy is a function of resistance and capacitance, which can be calculated from the length of the interconnect, and the distance between processing elements can be several orders of magnitude shorter than the distance to the registers that hold data between clock cycles. FPGAs do not enjoy as much of an advantage, because there are additional fan-out, switching, and transmission delays between components, but they provide the flexibility to implement a variety of dataflow architectures on a single chip. Although any type of algorithm can be implemented, there are limits on complexity, because a conditional requires both branches to be laid out, which greatly increases area and reduces utilization. FPGAs and ASICs can also optimize the tradeoff between layout efficiency and speed by mixing synchronous and systolic structures.
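The latency and throughput argument can be made concrete with a small timing model. This is a simplified sketch under stated assumptions: the stage delays and the clock period are invented numbers, and the clocked model simply rounds every stage up to a whole number of cycles.

```python
import math


def dataflow_timing(stage_delays_ps, n_items):
    """Timing of a chain of directly connected stages (all times in picoseconds)."""
    latency = sum(stage_delays_ps)   # time for one item to pass through every stage
    interval = max(stage_delays_ps)  # the slowest stage limits the input rate
    total = latency + (n_items - 1) * interval
    return latency, interval, total


def clocked_latency(stage_delays_ps, clock_period_ps):
    """Latency when every stage must wait for a whole number of clock cycles."""
    return sum(math.ceil(d / clock_period_ps) * clock_period_ps
               for d in stage_delays_ps)


stages = [100, 250, 150]            # hypothetical stage delays
print(dataflow_timing(stages, 5))   # (500, 250, 1500)
print(clocked_latency(stages, 1000))  # 3000
```

In this toy case the chained stages deliver the first result in 500 ps, where a register-to-register design with a 1 ns clock would need three full cycles.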
Dataflow Systems
The most common type of systolic array in AI implementations is the tensor core, which is integrated into a synchronous architecture as part of a TPU or GPU. Many different types of convolution cores have also been proposed. Complete dataflow implementations of entire deep learning architectures (such as ResNet-50) have been built on FPGA systems, achieving state-of-the-art performance in latency and power efficiency. Customizability also allows for arbitrary bit-width precision, which reduces layout size and processing latency, but the precision must be tuned carefully to meet the statistical performance requirements of the system. The main unique capability, however, is that the real-time nature of the processing allows AI to be integrated with other signal processing components in real-time systems.
When selecting an AI processor for a particular system, it is important to understand the relative advantages of each architecture in the context of the algorithms used, as well as the system requirements and performance objectives. In the following sections, we will introduce some considerations and examples. We will see that each of these processor architectures has an advantage over the others for certain system-level considerations.