Introduction
Tensor Processing Units, or TPUs, are hardware accelerators that Google designed for machine learning workloads. Cloud TPU is the Google Cloud service that makes these accelerators available as scalable computing resources.
In this blog, we will see what Cloud TPU is, how it works, and when to use it.
So, let’s begin.
Working of Cloud TPU
TPUs train your models more efficiently by using hardware designed for the large matrix operations that are common in machine learning algorithms. Thanks to the TPU's on-chip high-bandwidth memory (HBM), larger models and batch sizes are possible. You can scale up your workloads by connecting TPUs in groups called Pods.
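If you have a Cloud TPU VM with JAX installed, you can list the attached TPU cores directly. This is a minimal sketch, assuming the standard `jax` package with TPU support is available; the exact core count depends on the TPU type you provisioned.

```python
# Minimal sketch: list the TPU cores JAX can see on a Cloud TPU VM.
# Assumes jax is installed with TPU support (e.g. pip install "jax[tpu]").
import jax

devices = jax.devices()
print(f"{len(devices)} device(s) found")
for device in devices:
    # Each entry corresponds to one TPU core, e.g. 8 cores on a v3-8 host.
    print(device.platform, device.id)
```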
To understand how TPUs work, we first need to understand how CPUs and GPUs work.
Working of CPU
A CPU is a general-purpose processor based on the von Neumann architecture, which means it works with software and memory. The main advantage of CPUs is their flexibility: you can load any kind of software on a CPU for a wide range of applications.
For every calculation, a CPU loads values from memory, performs an operation on them, and then stores the result back in memory. Because memory access is slow compared to the calculation itself, it limits the overall throughput of CPUs. This is also known as the von Neumann bottleneck.
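The pure-Python loop below is a toy illustration of this cycle: every iteration reads its operands from memory, performs one multiply-add, and writes the running total back, so memory traffic sits on the critical path.

```python
# Toy illustration of the load-compute-store cycle on a CPU.
def multiply_accumulate(data, params):
    total = 0.0
    for x, w in zip(data, params):  # load the operands from memory
        total += x * w              # compute, then store the running sum
    return total

print(multiply_accumulate([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```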
Working of GPU
To increase throughput, a GPU packs thousands of Arithmetic Logic Units (ALUs) into a single processor, typically between 2,500 and 5,000. With that many ALUs, a GPU can execute thousands of multiply and add operations simultaneously, giving it roughly a thousand times the throughput of a CPU on this kind of work.
However, the GPU is still a general-purpose processor that must support a wide range of software and applications. Consequently, GPUs share the same issue as CPUs: for every calculation made by one of the thousands of ALUs, the GPU must access registers or shared memory to read operands and to store intermediate results.
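In practice you hand the hardware one large vectorized operation and let it spread the work across those ALUs. The sketch below uses JAX as an example front end; the matrix sizes are arbitrary, and on a GPU backend the single `jnp.dot` call keeps many ALUs busy at once.

```python
# Minimal sketch: one vectorized matrix multiply that a GPU backend can
# parallelize across its ALUs. Shapes are illustrative only.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
a = jax.random.normal(key, (2048, 2048))
b = jax.random.normal(key, (2048, 2048))

# A single call expresses millions of multiply-add operations, but each
# intermediate value still passes through registers or shared memory.
c = jnp.dot(a, b)
print(c.shape)  # (2048, 2048)
```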
Working of TPU
The primary task of a TPU is matrix processing, which is a combination of multiply and accumulate operations. A TPU contains a large physical matrix of thousands of multiply-accumulators that are directly connected to one another, an arrangement known as a systolic array architecture. Cloud TPU v3 contains two systolic arrays of 128 × 128 ALUs on a single processor.
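The toy simulation below shows the multiply-accumulate flow that a systolic array implements in hardware: each cell multiplies one data value by one parameter and adds it to a partial sum that flows onward to the next cell, so intermediate results never return to memory. It is only a sketch of the idea, not how the MXU is actually programmed.

```python
# Toy simulation of the systolic multiply-accumulate pattern.
# The real MXU does this in hardware for 128 x 128 tiles.
def systolic_matmul(A, B):
    n = len(A)
    # Each cell (i, j) keeps a running partial sum of C[i][j].
    C = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Cell (i, j) receives A[i][k] and B[k][j], multiplies
                # them, and accumulates into its partial sum.
                C[i][j] += A[i][k] * B[k][j]
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19.0, 22.0], [43.0, 50.0]]
```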
The TPU host streams data into an infeed queue, and the TPU loads that data from the infeed queue into HBM memory. When the computation is finished, the TPU places the results in an outfeed queue; the host then reads them from the outfeed queue and stores them in host memory.
To execute the matrix operations, the TPU loads the parameters from HBM memory into the Matrix Multiply Unit (MXU).
The TPU then loads data from HBM memory. As each multiplication is executed, its result is passed on to the next multiply-accumulator, so the output is the sum of all the products of the data and the parameters. No memory access is required during the matrix multiplication process.
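The snippet below is a minimal end-to-end sketch of that flow using JAX, assuming a Cloud TPU backend is attached: the host stages the arrays, `jax.device_put` moves them into device (HBM) memory, the jitted matrix multiply runs on the MXU, and reading the values pulls the result back to host memory.

```python
# Minimal sketch of the host -> HBM -> MXU -> host flow with JAX.
# Assumes a TPU backend; on CPU/GPU the same code runs on that backend.
import jax
import jax.numpy as jnp

@jax.jit
def forward(params, data):
    # A single dense layer: one matrix multiply, executed on the MXU.
    return jnp.dot(data, params)

params = jnp.ones((128, 128))   # model parameters
data = jnp.ones((1024, 128))    # one input batch

# Explicitly place the arrays in device (HBM) memory.
device = jax.devices()[0]
params = jax.device_put(params, device)
data = jax.device_put(data, device)

result = forward(params, data)
# Reading values (e.g. result[0, 0]) copies the outcome back to the host.
print(result.shape)  # (1024, 128)
```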
As a result, TPUs can achieve high computational throughput on neural network calculations.