Technion - Israel Institute of Technology
Graduate School
Ph.D. Thesis
Ph.D. Student: Morad Amir
Subject: Multicore and Processing-in-Memory Architectures
Department: Electrical Engineering
Supervisor: Professor Ran Ginosar
Full thesis text: English Version


Abstract

Fifty years ago, Gordon Moore forecast a bright future for VLSI technology scaling. However, scaling cannot be sustained indefinitely. The field of computing is already struggling with the delay and bandwidth required to access memory ("the memory wall") and with energy dissipation ("the power wall"). These challenges call for significant research investment in new architectures for next-generation computing systems.

The present work is divided into two parts. The first part is dedicated to the analysis and optimization of parallel and manycore architectures. The second part is devoted to developing a new massively parallel processing architecture, based on integrating a sequential processor with massively parallel SIMD in-memory computing.

In the first part of my thesis, as part of my first manuscript, I describe the development of closed-form analytical optimization frameworks for multicore architectures. I consider a workload comprising a consecutive sequence of program execution segments, where each segment can either be executed on a general-purpose processor or offloaded to a heterogeneous, limited-range accelerator, and investigate the following question: which subset of the accelerators should be integrated, and what is the optimal resource allocation among them?
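To make the flavor of this problem concrete, the following sketch solves a toy instance numerically. The workload numbers, the Pollack's-rule-like sqrt(area) speedup model, and the brute-force grid search are all my illustrative assumptions; the manuscript itself derives the optimum in closed form.

    from itertools import combinations

    # Illustrative workload: runtime fraction of each segment and the
    # accelerator (by index) able to execute it; None marks a CPU-only segment.
    segments = [(0.4, 0), (0.3, 1), (0.3, None)]
    accelerators = (0, 1)
    AREA_BUDGET = 16.0

    def speedup(area):
        # Assumed Pollack's-rule-like model: performance grows as sqrt(area).
        return area ** 0.5

    def exec_time(alloc):
        # alloc maps accelerator index -> area; a segment whose accelerator is
        # absent (or has zero area) falls back to the general-purpose processor.
        return sum(frac / speedup(alloc[acc]) if alloc.get(acc, 0) > 0 else frac
                   for frac, acc in segments)

    def splits(subset, budget, step=0.25):
        # Enumerate coarse divisions of the area budget among chosen accelerators.
        if not subset:
            yield {}
        elif len(subset) == 1:
            yield {subset[0]: budget}
        else:
            for i in range(int(budget / step) + 1):
                for rest in splits(subset[1:], budget - i * step, step):
                    yield {subset[0]: i * step, **rest}

    best_time, best_alloc = float("inf"), None
    for r in range(len(accelerators) + 1):
        for subset in combinations(accelerators, r):
            for alloc in splits(subset, AREA_BUDGET):
                t = exec_time(alloc)
                if t < best_time:
                    best_time, best_alloc = t, alloc

    print(f"minimal execution time {best_time:.3f} with allocation {best_alloc}")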

In my second manuscript I address a complementary research question: given a multicore, a workload consisting of concurrent tasks, and resource constraints, what is the optimal selection of a subset of the available cores, and what is the optimal resource allocation among them?
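One plausible abstraction of this question (my formulation, not necessarily the manuscript's) is a min-max resource allocation: with concurrent tasks, the workload finishes only when the slowest task does, so for decreasing per-task runtime functions T_i one solves

    \min_{a_1,\dots,a_n} \; \max_i \; T_i(a_i)
    \quad \text{s.t.} \quad \sum_i a_i \le A_{\max}, \quad a_i \ge 0

When every task requires a positive share of the budget, the optimum equalizes the finishing times, T_1(a_1*) = ... = T_n(a_n*); equalization conditions of this sort are one way closed-form analyses of such problems can proceed.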

In my third manuscript I investigate the following question: given (a) a multicore architecture consisting of a last-level cache (LLC), processing cores, and a NoC interconnecting the cores and the LLC; (b) workloads consisting of sequential and concurrent tasks; and (c) physical resource constraints (area, power, execution time, off-chip bandwidth), what is the optimal selection of a subset of the available processing cores, and what is the optimal resource allocation among all blocks?
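The twist relative to the previous two questions is that the LLC and NoC now compete with the cores for the same resources. The toy model below splits a fixed die area between cores and LLC under purely assumed constants (sqrt-of-area core performance, a power-law miss rate, a fixed NoC area); it only illustrates why an interior optimum exists and is not the manuscript's calibrated model.

    import numpy as np

    # Toy model: split a fixed die area between cores and LLC (NoC area fixed).
    # All constants and functional forms below are illustrative assumptions.
    A_TOTAL, A_NOC = 100.0, 10.0
    MISS_PENALTY = 200.0   # cycles per LLC miss (assumed)
    ALPHA = 0.5            # miss-rate power-law exponent (assumed)

    def time_per_instruction(a_cores):
        a_llc = A_TOTAL - A_NOC - a_cores
        perf = np.sqrt(a_cores)                      # Pollack's-rule-like cores
        miss_rate = 0.05 * (a_llc / 10.0) ** -ALPHA  # larger LLC -> fewer misses
        cpi = 1.0 + miss_rate * MISS_PENALTY
        return cpi / perf                            # arbitrary time units

    grid = np.linspace(1.0, A_TOTAL - A_NOC - 1.0, 1000)
    best = grid[int(np.argmin([time_per_instruction(a) for a in grid]))]
    print(f"optimal split: cores ~{best:.1f}, LLC ~{A_TOTAL - A_NOC - best:.1f}")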

In the second part of my thesis, as part of my fourth manuscript, I detail GP-SIMD, a novel and promising hybrid general-purpose/SIMD computer architecture that resolves the synchronization problem through in-memory computing, combining data storage and massively parallel processing. Comparative analysis shows that this novel architecture may outperform the Associative Processor as well as a variety of conventional architectures.
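The key mechanism is that every row of the shared memory doubles as a processing unit (PU), so the sequential processor and the SIMD array operate on the same data with no copying. The numpy sketch below imitates this style of computation, bit-serial and row-parallel addition across all PUs at once; the row layout and field widths are mine, chosen only for illustration.

    import numpy as np

    # Toy GP-SIMD-style memory: each of N rows belongs to one PU and stores two
    # m-bit operands, one bit per column (LSB first). Layout is illustrative.
    N, m = 8, 4
    rng = np.random.default_rng(0)
    a = rng.integers(0, 2, size=(N, m), dtype=np.uint8)
    b = rng.integers(0, 2, size=(N, m), dtype=np.uint8)

    # Bit-serial ripple-carry addition, one memory column per cycle: every
    # cycle all N PUs compute their sum and carry bits in lockstep.
    out = np.zeros((N, m + 1), dtype=np.uint8)
    carry = np.zeros(N, dtype=np.uint8)
    for j in range(m):                                  # m cycles for m bits
        out[:, j] = a[:, j] ^ b[:, j] ^ carry
        carry = (a[:, j] & b[:, j]) | (carry & (a[:, j] ^ b[:, j]))
    out[:, m] = carry

    # Verify against ordinary integer addition.
    to_int = lambda bits: (bits * (1 << np.arange(bits.shape[1]))).sum(axis=1)
    assert np.array_equal(to_int(out), to_int(a) + to_int(b))

An m-bit add thus costs on the order of m cycles regardless of N, which is where the massive parallelism comes from.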

In my fifth manuscript I detail an efficient implementation of Dense Matrix Multiplication (DMM) and Sparse Matrix Multiplication (SpMM) on GP-SIMD. GP-SIMD is shown to be more power efficient than the Associative Processor and a variety of conventional architectures.
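A natural way to map DMM onto a row-parallel machine of this kind (my illustrative mapping, not necessarily the manuscript's) is to let PU i hold row i of A and accumulate row i of C, while the sequential processor broadcasts one scalar of B at a time:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 4
    A = rng.integers(0, 10, (n, n))
    B = rng.integers(0, 10, (n, n))
    C = np.zeros((n, n), dtype=A.dtype)

    for k in range(n):          # broadcast loops run on the sequential core
        for j in range(n):
            # one broadcast of the scalar B[k, j]; all n PUs multiply their
            # locally stored A[i, k] and accumulate into C[i, j] in parallel
            C[:, j] += A[:, k] * B[k, j]

    assert np.array_equal(C, A @ B)

Each of the n^2 broadcasts triggers one n-way parallel multiply-accumulate, so the row dimension is fully parallelized.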

In my sixth manuscript, I propose and investigate a resistive implementation of the GP-SIMD architecture. The GP-SIMD SRAM shared memory array occupies over 96% of the die area. This array dissipates leakage power, and further, as CMOS feature scaling slows down, conventional SRAM experiences scalability problems. I therefore searched for an alternative, non-volatile implementation of the shared memory array. In this work I show that resistive memory technology potentially allows scaling GP-SIMD from a few million to a few hundred million processing units on a single silicon die.
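A back-of-envelope density check (my own, using commonly cited cell sizes rather than figures from the thesis) makes the claimed scaling plausible: a 6T SRAM cell occupies roughly 150F², while a resistive crossbar cell can approach 4F², so

    150F² / 4F² ≈ 37×

more rows, and hence processing units, fit in the same array area; combined with multi-level cells or stacked crossbar layers, growth from a few million to a few hundred million PUs becomes conceivable.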