Hardware Acceleration for Data Processing (HADP) - Fall 2016


  • 27.09.2016 Talk by Prof. Onur Mutlu: Processing Data Where It Makes Sense: Enabling In-Memory Computation (Abstract below)
  • 04.10.2016 Talk by Prof. Ce Zhang: Machine learning with modern hardware: Overview and open opportunities (Abstract below)
  • 11.10.2016 Talk by Prof. Torsten Hoefler: Progress in automatic GPU compilation and why you want to run MPI on your GPU (Abstract below)




The seminar covers recent results in the increasingly important field of hardware acceleration for data science, both in dedicated machines and in data centers. It is aimed at students interested in the system aspects of data processing who are willing to bridge the gap across traditional disciplines: machine learning, databases, systems, and computer architecture. The seminar should be of special interest to students considering a master's thesis or even a doctoral dissertation on related topics.


The seminar will start on September 20th with an overview of the general topics and the intended format of the seminar. Students are expected to present one paper in a 30-minute talk and complete a 4-page report covering the main idea of the paper, how it relates to the other papers presented at the seminar, and the discussions around those papers. The presentation will be given during the semester in the allocated time slot. The report is due on the last day of the semester (December 23rd).

Attendance at the seminar is mandatory to complete the credit requirements. Active participation is also expected, including reading every paper to be presented in advance and contributing to the questions and discussion of each paper during the seminar.

Course Material


Date: 27th Sept., 2016

Speaker: Prof. Onur Mutlu

Title: Processing Data Where It Makes Sense: Enabling In-Memory Computation



Today's systems are overwhelmingly designed to move data to computation. This design choice goes directly against at least three key trends in systems that cause performance, scalability, and energy bottlenecks: 1) data access from memory is already a key bottleneck as applications become more data-intensive while memory bandwidth and energy do not scale well; 2) energy consumption is a key constraint, especially in mobile and server systems; 3) data movement is very expensive in terms of bandwidth, energy, and latency, much more so than computation. These trends are especially prevalent in the data-intensive server and energy-constrained mobile systems of today.

At the same time, conventional memory technology is facing many scaling challenges in terms of reliability, energy, and performance. New memory technologies that provide new opportunities are also emerging. As a result, memory system architects are open to organizing memory in different ways and making it more intelligent, even if it comes at higher cost. The emergence of 3D-stacked memories as well as several non-volatile memory technologies, which can closely integrate computation and storage units, is evidence of this.

In this talk, we will discuss some recent research that aims to enable computation close to data. After motivating trends in applications as well as technology, we will discuss two promising directions for enabling in-memory computation: 1) performing bulk data operations in memory by exploiting the properties of DRAM operation with low-cost changes, 2) exploiting the control logic layer in 3D-stacked memory technology to accelerate important data-intensive applications. In both approaches, we will discuss relevant cross-layer research and design challenges in architecture, systems, and programming models. We will also briefly discuss the promise of emerging non-volatile memory technologies and their potential role in system design and in-memory computation.


Date: 4th October, 2016

Speaker: Prof. Ce Zhang

Title: Machine learning with modern hardware: Overview and open opportunities



In this talk, I will first give a self-contained introduction to machine learning that will serve as background for the machine-learning-related papers in the reading list for this seminar. Then, I will describe two of our previous systems for running machine learning algorithms on modern hardware such as NUMA machines and GPUs. The first system contains a database-like optimizer that navigates the tradeoff space of running first-order methods on NUMA machines. The second system is a fully compatible end-to-end version of the popular deep learning framework Caffe with rebuilt internals. We built this system to examine the performance characteristics of training and deploying general-purpose convolutional neural networks across different hardware architectures (CPUs vs. GPUs).

Lastly, I will talk about a subset of related topics my group is working on, including distributed machine learning over heterogeneous clusters, first-order methods on embedded GPUs, and limited-precision networks for distributed deep learning.


Date: 11th October, 2016

Speaker: Prof. Torsten Hoefler

Title: Progress in automatic GPU compilation and why you want to run MPI on your GPU



Auto-parallelization of programs that were not developed with parallelism in mind is one of the holy grails of computer science. It requires understanding the source code's data flow in order to automatically distribute the data, parallelize the computations, and infer synchronization where necessary. We will discuss our new LLVM-based research compiler Polly-ACC, which enables automatic compilation to accelerator devices such as GPUs. Unfortunately, its applicability is limited to codes in which the iteration space and all accesses can be described as affine functions.

In the second part of the talk, we will discuss dCUDA, a way to express parallel codes in MPI-RMA, a well-known communication interface, and map them automatically to GPU clusters. The dCUDA approach enables simple and portable programming across heterogeneous devices thanks to programmer-specified locality. Furthermore, dCUDA enables hardware-supported overlap of computation and communication and is applicable to next-generation technologies such as NVLink. We will demonstrate encouraging initial results and show limitations of current devices in order to start a discussion.


Student | Paper | Date | Supervisor
Rodríguez Gonzalo Manuel | "BLAS Comparison on FPGA, CPU and GPU", IEEE Computer Society Symposium on VLSI, 2010 (https://www.microsoft.com/en-us/research/publication/blas-comparison-on-...) | 18.10.16 | Ingo Müller
Neunert Michael | Liang et al.: "Floating point unit generation and evaluation for FPGAs", FCCM'03 (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1227254&tag=1) | 25.10.16 | Torsten Hoefler
Radler Andreas | DeHon: "Fundamental Underpinnings of Reconfigurable Computing Architectures", Proc. IEEE (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7086421) | 25.10.16 | Gustavo Alonso
Kurmann Nico | Shaw et al.: "Anton, a special-purpose machine for molecular dynamics simulation", ISCA 2007 (http://dl.acm.org/citation.cfm?doid=1250662.1250664) | 1.11.16 | Torsten Hoefler
Farshidian Farbod | Trimberger: "Three Ages of FPGAs: A Retrospective on the First Thirty Years of FPGA Technology", Proc. IEEE (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=7...) | 8.11.16 | Gustavo Alonso
Margomenos Spyridon | Kobori et al.: "A Cellular Automata System with FPGA", FCCM'01 (http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1420908) | 8.11.16 | Gustavo Alonso
Mansourighiasi Nika | Ahn et al.: "A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing", ISCA 2015 (https://users.ece.cmu.edu/~omutlu/pub/tesseract-pim-architecture-for-gra...) | 15.11.16 | Arash Tavakkol
Farrukh Waleed | "Fast Support Vector Machine Training and Classification on Graphics Processors", ICML 2008 (https://www2.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-11.pdf) | 15.11.16 | Gustavo Alonso
Kipfer Kevin | "DaDianNao: A Machine-Learning Supercomputer", MICRO 2014 (http://ieeexplore.ieee.org/document/7011421/) | 22.11.16 | Onur Mutlu
Taubner Tim | Shafiee et al.: "ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars", ISCA 2016 (https://www.cs.utah.edu/~rajeev/pubs/isca16.pdf) | 22.11.16 | Onur Mutlu
Romero Julien | "Deep Learning with Limited Numerical Precision", ICML 2015 (https://arxiv.org/pdf/1502.02551.pdf) | 29.11.16 | Ce Zhang
Fischer Marc | "cuDNN: Efficient Primitives for Deep Learning", arXiv (http://arxiv.org/abs/1410.0759) | 29.11.16 | Ce Zhang
Pesic Igor | Canis et al.: "LegUp: high-level synthesis for FPGA-based processor/accelerator systems" (http://dl.acm.org/citation.cfm?id=1950423) | 6.12.16 | Mohsen Ewaida
Striebel Lukas | Zhu and Janapa Reddi: "WebCore: Architectural Support for Mobile Web Browsing", ISCA 2014 (http://3nity.io/~vj/downloads/publications/zhu14webcore.pdf) | 13.12.16 | Ce Zhang


Seminar Hours

Tuesdays, 13:00-15:00 in ML J 34.1