In "Summary of Tsinghua AI Chip Report, Part One", we covered the basics of AI chips, their current status, and their development in China. Now let us turn to AI scholars.
Based on data from the AMiner talent pool of Tsinghua University, the global distribution of scholars in the field of artificial intelligence chips is shown in the figure. Scholars in this field are concentrated mainly in North America, followed closely by Europe; talent in South America, Africa, and Oceania is relatively scarce.
▲ Global distribution of research scholars in the field of artificial intelligence chips
By country, the United States is the core of technological development in the field of artificial intelligence chips, with the United Kingdom close behind. Other experts are mainly located in China, Germany, Canada, Italy, and Japan.
▲Global distribution of research scholars in the field of artificial intelligence chips
A statistical analysis of the migration routes of the 1,000 most influential scholars in the field of artificial intelligence chips worldwide yields the comparison of national talent surpluses and deficits shown in the following figure.
▲ Comparison of talent surplus and deficit by country
It can be seen that the loss and introduction of talent in each country is relatively balanced. The United States leads in talent mobility, with both inflow and outflow at the top. Countries such as the United Kingdom, China, Germany, and Switzerland follow the United States, but the differences in talent mobility among countries are not pronounced.
Artificial intelligence chips currently follow two development paths. One continues the traditional computing architecture and accelerates hardware computing capability, represented mainly by three types of chips (GPU, FPGA, and ASIC), although the CPU still plays an irreplaceable role. The other subverts the classic von Neumann architecture and uses brain-like neural structures to improve computing power, represented by IBM's TrueNorth chip.
The computer industry has used the term CPU since the early 1960s. The CPU has since changed dramatically in form, design, and implementation, but its basic working principle has changed little. A CPU usually consists of two main components: the controller and the arithmetic unit. The traditional internal structure of the CPU is shown in Figure 3. Essentially, only a single ALU (arithmetic logic unit) module performs the data calculation; the other modules exist to ensure that instructions execute one after another in an orderly manner. This general-purpose structure is ideal for traditional program computing, and performance can be raised by increasing the CPU's clock frequency (increasing the number of instructions executed per unit time). However, for deep learning, which requires relatively few program instructions but massive data operations, this structure falls short. In particular, under power constraints, instruction execution cannot be sped up by raising the operating frequency of the CPU and memory without limit. This situation creates an insurmountable bottleneck for the CPU-based approach.
▲ Traditional CPU internal structure diagram (only the ALU module performs calculation)
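To make the single-ALU bottleneck concrete, here is a minimal Python sketch (our own illustration, not from the report; the function name `serial_dot` is hypothetical). It models a CPU core in which every multiply-accumulate must pass through the one ALU in turn, so an N-element operation costs N serial steps no matter how much data-level parallelism the workload offers.

```python
def serial_dot(a, b):
    """Dot product on a single-ALU model: one multiply-accumulate
    per 'cycle', so N elements always cost N serial cycles."""
    acc = 0
    cycles = 0
    for x, y in zip(a, b):
        acc += x * y   # the lone ALU does exactly one MAC this cycle
        cycles += 1
    return acc, cycles

# A 4-element dot product occupies the ALU for 4 consecutive cycles.
result, cycles = serial_dot([1, 2, 3, 4], [5, 6, 7, 8])
# result = 1*5 + 2*6 + 3*7 + 4*8 = 70, in 4 serial cycles
```

Deep learning layers perform millions of such multiply-accumulates, which is why a structure built around one ALU and a fixed clock ceiling cannot keep up.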
The GPU was the first chip used for parallel accelerated computing; it is faster than the CPU and more flexible than other accelerator chips.
The traditional CPU is poorly suited to executing artificial intelligence algorithms, mainly because its instructions follow a serial execution model and cannot exploit the chip's full potential. In contrast, the GPU has a highly parallel structure and is more efficient than the CPU at processing graphics data and complex algorithms. Comparing their structures, most of the CPU's area is devoted to control logic and registers, while the GPU devotes more area to ALUs (arithmetic logic units) for data processing, a structure well suited to parallel processing of dense data. The structural comparison between CPU and GPU is shown in the figure. Programs often run tens or even thousands of times faster on a GPU than on a single-core CPU. As companies such as NVIDIA and AMD continue to advance support for massively parallel GPU architectures, GPUs for general-purpose computing (GPGPUs, general-purpose graphics processing units) have become an important means of accelerating parallelizable applications.
▲ CPU and GPU structure comparison chart (from NVIDIA CUDA documentation)
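The speedup from many ALUs can be sketched with a toy model (our own, not from the report; it ignores memory traffic, scheduling, and divergence). If every lane executes the same instruction on different data, an element-wise workload of N items finishes in about ceil(N / lanes) steps:

```python
import math

def parallel_cycles(n_elements, n_alus):
    """Ideal step count when n_alus lanes each execute the same
    instruction on different data (a simplified SIMT-style model)."""
    return math.ceil(n_elements / n_alus)

# 4096 element-wise operations:
# a single-ALU core needs 4096 steps,
# a chip with 128 ALUs needs only 32.
assert parallel_cycles(4096, 1) == 4096
assert parallel_cycles(4096, 128) == 32
```

This idealized ratio is the source of the "tens to thousands of times" speedups quoted above; real gains depend heavily on how parallelizable the workload is.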
The development of the GPU can be divided into three stages, as shown in the figure:
First-generation GPUs (before 1999) offloaded part of the work from the CPU and implemented hardware acceleration. Represented by the GE (Geometry Engine), they could only accelerate 3D image processing and had no software programmability.
Second-generation GPUs (1999-2005) added further hardware acceleration and limited programmability. In 1999, NVIDIA released the GeForce 256, an image processing chip "for complex mathematical and geometric calculations", which used more transistors as execution units rather than the complex control units and caches of CPUs. It moved functions such as T&L (transform and lighting) off the CPU and implemented fast transforms, marking the birth of the true GPU. In the following years, GPU technology developed rapidly and computing speed quickly exceeded that of the CPU. In 2001, NVIDIA and ATI introduced the GeForce 3 and Radeon 8500 respectively; the graphics hardware pipeline was defined as a stream processor, vertex-level programmability appeared, and the pixel level also gained limited programmability. Overall, however, GPU programmability was still relatively limited.
Third-generation GPUs (after 2006) provide convenient programming environments so programs can be written for them directly. In 2006, NVIDIA and ATI introduced the CUDA (Compute Unified Device Architecture) and CTM (Close To the Metal) programming environments respectively, breaking the constraints of graphics languages and turning the GPU into a true parallel data-processing super-accelerator.
In 2008, Apple proposed a general-purpose parallel computing programming platform, OpenCL (Open Computing Language). Unlike CUDA, which is bound to specific graphics cards, OpenCL is not tied to any particular computing device.
▲ Development stages of the GPU chip
At present, the GPU has matured considerably. Google, Facebook, Microsoft, Twitter, and Baidu use GPUs to analyze pictures, videos, and audio files to improve search and image-tagging applications. Many automobile manufacturers also use GPU chips to develop driverless vehicles, and GPUs are used in VR/AR-related industries as well.
But the GPU also has limitations. Deep learning algorithms are divided into two parts, training and inference. The GPU platform is very efficient for algorithm training, but when a single input is processed during inference, the advantages of parallel computing cannot be fully exploited.
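The inference limitation can be illustrated with another toy model (our own, not from the report): think of the GPU as a set of parallel lanes, where a batch of independent inputs keeps the lanes busy but a single input leaves most of them idle.

```python
def utilization(batch_size, n_lanes):
    """Fraction of parallel lanes kept busy when batch_size
    independent inputs are processed at once (toy model)."""
    busy = min(batch_size, n_lanes)
    return busy / n_lanes

# Training with large batches keeps every lane full...
assert utilization(256, 256) == 1.0
# ...but single-input inference leaves most lanes idle.
assert utilization(1, 256) == 1 / 256
```

In practice, utilization also depends on per-layer parallelism inside a single input, but the batch-size effect is the main reason GPUs shine in training yet struggle with low-latency, single-query inference.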
The FPGA is a further development of programmable devices such as the PAL, GAL, and CPLD. Users define the connections among logic gates and memory by burning in an FPGA configuration file. The burning is not one-off: a user can configure the FPGA as a microcontroller (MCU), and later edit the configuration file to reconfigure the same FPGA as an audio codec. The FPGA thus resolves the inflexibility of custom circuits while overcoming the limited gate counts of earlier programmable devices.
Within an FPGA, data parallelism and task parallelism can proceed simultaneously, which markedly improves efficiency for specific applications. For a particular operation that a general-purpose CPU may need several clock cycles to perform, the FPGA can generate a dedicated circuit by reprogramming its fabric and complete the operation in just a few, or even a single, clock cycle.
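The "dedicated circuit in one cycle" idea can be sketched in Python (an analogy of our own, not from the report; the names `burn_in` and `popcount_lut` are hypothetical). FPGA basic cells are lookup tables (LUTs): instead of executing a sequence of instructions, the answer for every possible input is precomputed at configuration time, so evaluation becomes a single table read, analogous to one pass through a fixed circuit.

```python
def burn_in(func, n_bits):
    """'Configure' a LUT: precompute func for all 2**n_bits inputs."""
    return [func(x) for x in range(2 ** n_bits)]

# Configure the fabric to count set bits in a 4-bit value.
popcount_lut = burn_in(lambda x: bin(x).count("1"), 4)

# Each lookup is one step, regardless of how many instructions a
# general-purpose CPU would need to compute the same function.
assert popcount_lut[0b1011] == 3
assert popcount_lut[0b1111] == 4
```

Real FPGAs wire thousands of small LUTs together with registers and routing, but the principle is the same: the computation is baked into the configuration rather than fetched as instructions.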
In addition, because of the FPGA's flexibility, many low-level hardware control operations that are difficult to implement with general-purpose processors or ASICs can be implemented easily, leaving more room for implementing and optimizing algorithms. At the same time, the one-time cost of an FPGA is far lower than that of an ASIC. While demand for a chip is not yet large, and deep learning algorithms remain unstable and in need of continual iteration, using the reconfigurability of the FPGA to build semi-custom AI chips is one of the best choices.
In terms of power consumption, the FPGA also has inherent architectural advantages. In the traditional von Neumann architecture, an execution unit (such as a CPU core) needs instruction memory, a decoder, operators for the various instructions, and branch-jump logic in order to execute arbitrary instructions. In an FPGA, the function of each logic unit is fixed at reprogramming (burn-in) time, so no instructions and no shared instruction memory are needed, which greatly reduces the power consumed per unit of computation and improves the overall energy efficiency.
Because of the FPGA's flexibility and speed, there is a trend toward replacing ASICs with FPGAs in many fields. The application of FPGAs in artificial intelligence is shown in the figure.
▲ Applications of the FPGA in artificial intelligence
At present, AI computing demand, represented by deep learning, is met mainly by accelerating with GPUs, FPGAs, and other general-purpose chips suited to parallel computing. Before industrial applications rise to large scale, using such existing general-purpose chips avoids the high investment and high risk of developing dedicated custom chips (ASICs). However, because these general-purpose chips were not designed specifically for deep learning, they have inherent limitations in performance and power consumption, problems that become increasingly prominent as artificial intelligence applications grow in scale.
As an image processor, the GPU was designed for large-scale parallel computation in image processing, so it has three limitations when applied to deep learning algorithms. First, parallel computing capacity cannot be fully utilized in deployment. Deep learning comprises two computational steps, training and inference; the GPU is very efficient for training, but for inference with a single input the advantage of parallelism cannot be fully exploited. Second, the hardware structure cannot be configured flexibly. The GPU adopts the SIMT computing model and its hardware structure is relatively fixed; deep learning algorithms are not yet completely stable, and if an algorithm changes substantially the GPU cannot reconfigure its hardware the way an FPGA can. Third, its energy efficiency when running deep learning algorithms is lower than that of the FPGA.
Although the FPGA is highly regarded, and even the new generation of Baidu Brain was developed on an FPGA platform, it was not designed specifically for deep learning algorithms and has several limitations in practice. First, the computing power of its basic units is limited. To achieve reconfigurability, the FPGA contains a large number of very fine-grained basic units, but the computing power of each unit (which relies mainly on the LUT, or lookup table) is far below that of the ALU modules in CPUs and GPUs. Second, the proportion of resources devoted to computation is relatively low, because many on-chip resources are consumed by configurable routing and interconnect. Third, there is still a large gap in speed and power consumption relative to dedicated custom chips (ASICs). Fourth, FPGAs are expensive: at scale, the cost of a single FPGA is far higher than that of a dedicated custom chip.
Therefore, with the development of AI algorithms and applications and the maturing of the ASIC industrial environment, fully custom AI ASICs are gradually showing their advantages. Representative companies at home and abroad engaged in the research, development, and application of such chips are shown in the figure.
▲ Survey of research and development of dedicated AI chips (including brain-like chips)
Once deep learning algorithms stabilize, the AI chip can be fully customized using ASIC design methods, so that performance, power consumption, and area are all optimized for the deep learning algorithm.
Brain-like chips do not use the classic von Neumann architecture but are designed on neuromorphic architectures, represented by IBM's TrueNorth. IBM researchers used memory cells as synapses, computing units as neurons, and transmission units as axons to build a neuro-chip prototype. TrueNorth currently uses Samsung's 28 nm low-power process; the chip, composed of 5.4 billion transistors, has 4096 synaptic cores and consumes only 70 mW in real-time operation. Because synapses require variable weights and memory, IBM has experimentally implemented a new type of synapse using phase-change memory (PCM) technology compatible with the CMOS process, speeding up the path to commercialization.
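The neuron model behind such neuromorphic cores can be sketched with a simple leaky integrate-and-fire simulation (an illustrative model of our own, not IBM's actual circuit; the parameter values are arbitrary). The membrane potential leaks each step, accumulates input current, and emits a spike and resets when it crosses a threshold:

```python
def lif_neuron(inputs, threshold=1.0, leak=0.5):
    """Leaky integrate-and-fire neuron: the membrane potential
    decays by `leak` each step, integrates the input current, and
    emits a spike (then resets) when it reaches `threshold`."""
    v = 0.0
    spikes = []
    for current in inputs:
        v = v * leak + current      # leak, then integrate input
        if v >= threshold:
            spikes.append(1)        # fire a spike
            v = 0.0                 # reset membrane potential
        else:
            spikes.append(0)
    return spikes

# A steady sub-threshold input charges the neuron over several
# steps until it fires, then the cycle repeats.
print(lif_neuron([0.6] * 6))
```

Unlike a von Neumann machine, a chip built from such units computes with sparse, event-driven spikes rather than fetched instructions, which is a key source of TrueNorth's very low power consumption.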