Simultaneous_and_heterogeneous_multithreading References

Simultaneous and heterogeneous multithreading (SHMT) is a software framework that takes advantage of heterogeneous computing systems that contain a mixture of central processing units (CPUs), graphics processing units (GPUs), and special purpose machine learning hardware, for example Tensor Processing Units (TPUs).^[1]^[2]

Each component processes information differently. Often data has to move among processors, which can create bottlenecks, with one processor starving while waiting on another to finish.^[1]

Architecture

The system defines virtual processors and virtual operations (VOPs). VOPs decompose into one or more high-level operations (HLOPs). It then distributes the operations across the processors. The runtime system then dynamically maps virtual processors to physical processors, assessing resource availability in order to keep all the processors busy. The scheduler employs a light-weight, quality-aware work-stealing (QAWS) policy.^[1]

Conventional runtimes use assign one processor (set) to each subtask, leaving other types of processors idle. In other words, the CPU(s) run (possibly in parallel), then when that subtask completes, the next subtask is handed to the GPU(s). When they finish the next subtask is handed to the TPU(s).^[2]

Adding software pipelining allows the second subtask to run using partial results from the first subtask, which improves resource utilization.^[2]

SHMT takes things a step further, identifying subtasks that can run independently of others to the appropriate processor type, allow even better parallelism. Some subtasks can be performed on multiple processor types. SHMT can divide a single subtask across such processor types. Thus the fundamental breakthrough is to keep more processors working simultaneously, reducing time and energy costs.^[2]

Benchmark

Researchers tested the concept using a typical smartphone configuration tweaked so that it resembled a data center server.^[1]

The hardware was Nvidia's Jetson Nano module containing a quad-core ARM Cortex-A57 processor (CPU) and 128 Maxwell architecture GPU cores. A Google Edge TPU was connected via its M.2 Key E slot. The processors communicated via an onboard PCI Express (PCIe) interface. Shared data was hosted in a 4 GB 64-bit LPDDR4. The Edge TPU adds an 8 MB device memory. Ubuntu Linux 18.04 was the operating system.^[1]

Compared to a conventional system performance increased by 1.95X boost, while energy consumption was reduced by 51%, on a range of benchmarks, including Black–Scholes, DCT8X8, DWT, FFT, Histogram, Hotspot, Laplacian, MF, Sobel, SRAD, and GMEAN.^[1]

References

^ ^a ^b ^c ^d ^e ^f McClure, Paul (February 22, 2024). "Software tweak doubles computer processing speed, halves energy use". New Atlas. Retrieved 2024-02-25.
^ ^a ^b ^c ^d Hsu, Kuan-Chieh; Tseng, Hung-Wei (2023-12-08). "Simultaneous and Heterogenous Multithreading". Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO '23. New York, NY, USA: Association for Computing Machinery: 137–152. doi: 10.1145/3613424.3614285. ISBN 979-8-4007-0329-4.

[:2-1] ^ ^a ^b ^c ^d ^e ^f McClure, Paul (February 22, 2024). "Software tweak doubles computer processing speed, halves energy use". New Atlas. Retrieved 2024-02-25.

[:3-2] Hsu, Kuan-Chieh; Tseng, Hung-Wei (2023-12-08). "Simultaneous and Heterogenous Multithreading". Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO '23. New York, NY, USA: Association for Computing Machinery: 137–152. doi: 10.1145/3613424.3614285. ISBN 979-8-4007-0329-4.

[1]

[2]

v t e Parallel computing
General	Distributed computing Parallel computing Massively parallel Cloud computing High-performance computing Multiprocessing Manycore processor GPGPU Computer network Systolic array
Levels	Bit Instruction Thread Task Data Memory Loop Pipeline
Multithreading	Temporal Simultaneous (SMT) Simultaneous and heterogenous Speculative (SpMT) Preemptive Cooperative Clustered multi-thread (CMT) Hardware scout
Theory	PRAM model PEM model Analysis of parallel algorithms Amdahl's law Gustafson's law Cost efficiency Karp–Flatt metric Slowdown Speedup
Elements	Process Thread Fiber Instruction window Array
Coordination	Multiprocessing Memory coherence Cache coherence Cache invalidation Barrier Synchronization Application checkpointing
Programming	Stream processing Dataflow programming Models Implicit parallelism Explicit parallelism Concurrency Non-blocking algorithm
Hardware	Flynn's taxonomy SISD SIMD Array processing (SIMT) Pipelined processing Associative processing MISD MIMD Dataflow architecture Pipelined processor Superscalar processor Vector processor Multiprocessor symmetric asymmetric Memory shared distributed distributed shared UMA NUMA COMA Massively parallel computer Computer cluster Beowulf cluster Grid computer Hardware acceleration
APIs	Ateji PX Boost Chapel HPX Charm++ Cilk Coarray Fortran CUDA Dryad C++ AMP Global Arrays GPUOpen MPI OpenMP OpenCL OpenHMPP OpenACC Parallel Extensions PVM pthreads RaftLib ROCm UPC TBB ZPL
Problems	Automatic parallelization Deadlock Deterministic algorithm Embarrassingly parallel Parallel slowdown Race condition Software lockout Scalability Starvation
Category: Parallel computing

Architecture

Benchmark

See also

References