M.Sc Thesis

M.Sc Student: Geller Ishay
Subject: Dataflow Trace Cache (DFTC):
A Dynamic Translation Processor Architecture for
Power-Efficient High-Performance
Department: Department of Electrical and Computers Engineering
Supervisors: Associate Prof. Yitzhak Birk
Prof. Avi Mendelson


Current out-of-order superscalar machines execute instructions in their dataflow order. However, dynamic dependency extraction requires power-hungry structures and is repeated every time the same code is executed. The problem worsens for wider-issue architectures. An alternative architecture, EDGE (Explicit Data Graph Execution), runs sequences of explicit dataflow blocks that communicate through register files (and memory). It is claimed to offer a good performance/power ratio, but it is incompatible with existing instruction sets. Our goal is to overcome this incompatibility, and to do so efficiently.

We propose DFTC, a dynamic translation and optimization architecture intended to serve as an accelerator core within an ACCMP framework. The architecture comprises a novel trace predictor, a special trace cache that stores translated dataflow code, an EDGE back-end, and a simple conventional engine. This work focuses on two critical elements: the trace predictor and the dataflow trace cache.

Our trace predictor identifies hard-to-predict branches and uses them to terminate traces. These branches are either resolved naturally or handed over to a secondary, less accurate branch predictor (not covered in this work). Using this technique, the trace predictor achieves a 96% hit rate and an average trace length of 23.8 PISA instructions (measured on a set of ten SPEC CPU2000 benchmarks). With a 300 KB dataflow trace cache (DFTC) and a novel filtering approach, 99.7% of the instruction stream is executed (in translated form) on the EDGE back-end, with an average (harmonic mean) of one DFTC miss every 4.3K instructions. The DFTC is backed by lower levels of translation storage, so the penalty for such DFTC misses is small.
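
As a brief illustration of why the miss interval is reported as a harmonic mean, the sketch below averages instructions-per-miss figures across benchmarks. The benchmark names and numbers are hypothetical placeholders, not measurements from this work, and the helper names are assumptions introduced only for this example. When every benchmark runs the same number of instructions, the harmonic mean equals total instructions divided by total misses, whereas the arithmetic mean would be dominated by the benchmark that rarely misses.

```python
from statistics import harmonic_mean

# Hypothetical instructions-per-DFTC-miss values for three made-up
# benchmarks. These are NOT results from the thesis.
INSTRUCTIONS_PER_MISS = {
    "bench_a": 2000.0,
    "bench_b": 4000.0,
    "bench_c": 40000.0,
}


def mean_miss_interval(per_benchmark):
    """Harmonic mean of instructions-per-miss across benchmarks."""
    return harmonic_mean(list(per_benchmark.values()))


def aggregate_miss_interval(per_benchmark, instructions_each):
    """Total instructions / total misses, assuming every benchmark
    executes the same number of instructions (an assumption)."""
    total_misses = sum(instructions_each / v for v in per_benchmark.values())
    total_instructions = instructions_each * len(per_benchmark)
    return total_instructions / total_misses


if __name__ == "__main__":
    hm = mean_miss_interval(INSTRUCTIONS_PER_MISS)
    agg = aggregate_miss_interval(INSTRUCTIONS_PER_MISS, 40000.0)
    am = sum(INSTRUCTIONS_PER_MISS.values()) / len(INSTRUCTIONS_PER_MISS)
    print(f"harmonic mean : {hm:.1f} instructions per miss")
    print(f"aggregate     : {agg:.1f} instructions per miss")
    print(f"arithmetic    : {am:.1f} instructions per miss")
```

With these placeholder values, the harmonic mean and the aggregate figure coincide, while the arithmetic mean is several times larger and would overstate how rarely misses occur.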

By combining the front-end simulation results with an analytic performance model, we roughly estimate that DFTC could increase the IPC of SPEC CFP2000 benchmarks by a factor of 1.6 relative to conventional high-end processors. In addition, we expect that at least some clock acceleration can be attained without exceeding the power/thermal envelope. Finally, the area of DFTC's execution core is coarsely estimated at 26 mm² in a 65 nm technology, which is less than that of a conventional high-end core.