
Data Processing on Modern Hardware

Machine Learning on FPGAs

Machine learning algorithms benefit from the inherent parallelism, deep pipelining, and custom-precision arithmetic that an FPGA provides.


Linear Model Training with ZipML

In our ZipML framework (http://zip.ml), we are exploring these ideas to accelerate training and inference.

Our FCCM'17 paper (FPGA-accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-off) studies how to accelerate the training of dense linear models by using low-precision data and performing the training on an FPGA. We present FPGA-based trainers supporting both single-precision floating-point and low-precision integer (8-, 4-, 2-, and 1-bit) data, and study the trade-offs that affect the end-to-end performance of dense linear model training. Both the software and the FPGA designs presented in this work are available in our repositories:

github.com/fpgasystems/ZipML-XeonFPGA

github.com/fpgasystems/ZipML-PYNQ
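The central trick, stochastic rounding, is easy to illustrate in software. The sketch below is a minimal Python version under our own simplifying assumptions (a uniform quantizer over [-1, 1]; the function and variable names are ours, not the framework's): rounding up or down with probability proportional to the distance keeps each quantized value unbiased in expectation, which is what lets training on low-precision data still converge close to the full-precision model.

```python
import numpy as np

def stochastic_quantize(x, bits, lo=-1.0, hi=1.0):
    """Quantize x onto a uniform grid of 2**bits levels in [lo, hi],
    rounding up with probability equal to the fractional distance.
    This makes the quantized value unbiased: E[q(x)] = x."""
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    pos = (np.clip(x, lo, hi) - lo) / scale      # grid position in [0, levels]
    floor = np.floor(pos)
    up = np.random.random(np.shape(x)) < (pos - floor)
    return lo + (floor + up) * scale

# Toy dense linear model trained by mini-batch SGD on 4-bit data.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 8))
w_true = rng.normal(size=8)
y = X @ w_true

w = np.zeros(8)
for epoch in range(200):
    for i in range(0, len(X), 32):
        Xq = stochastic_quantize(X[i:i + 32], bits=4)   # low-precision input
        grad = Xq.T @ (Xq @ w - y[i:i + 32]) / len(Xq)
        w -= 0.05 * grad
```

Even at 4 bits the model ends up close to the full-precision solution on this toy data; the paper quantifies precisely how fewer bits trade convergence behavior against per-iteration throughput on the FPGA.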


Inference of Decision Tree Ensemble on CPU-FPGA Platforms

In this work we developed an inference system for decision tree ensembles on Intel's Xeon+FPGA platform. In our FPL'17 paper we explore the design of a flexible and scalable FPGA architecture. In addition, we combine CPU and FPGA processing to scale to large tree ensembles with millions of nodes. The developed system targets XGBoost, one of the most successful boosted-tree algorithms in machine learning. As a next step, we want to explore using low precision to represent either the data or the tree nodes' threshold values, which would enable processing even larger ensembles on the FPGA at higher performance.
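To see why tree-ensemble inference maps well onto hardware, it helps to view each tree as a set of flat node arrays walked by a short, simple loop: nodes become memory words and the traversal becomes a small, pipeline-friendly state machine. The Python sketch below is illustrative only (the class and field names are ours, not the memory layout of the paper's architecture):

```python
import numpy as np

class FlatTree:
    """A decision tree stored as flat, fixed-size arrays with one entry
    per node: the feature to test, the threshold, the two child indices,
    and the leaf value.  This pointer-free layout is the kind that maps
    naturally onto an FPGA's on-chip memories."""
    def __init__(self, feature, threshold, left, right, value):
        self.feature = np.asarray(feature)
        self.threshold = np.asarray(threshold)
        self.left = np.asarray(left)      # child index; -1 marks a leaf
        self.right = np.asarray(right)
        self.value = np.asarray(value)

    def predict_one(self, x):
        i = 0
        while self.left[i] != -1:         # walk from the root to a leaf
            if x[self.feature[i]] < self.threshold[i]:
                i = self.left[i]
            else:
                i = self.right[i]
        return float(self.value[i])

def ensemble_predict(trees, x):
    # A boosted ensemble (e.g., XGBoost) sums one leaf score per tree,
    # so every tree can be evaluated independently and in parallel.
    return sum(t.predict_one(x) for t in trees)

# A tiny two-tree ensemble, each tree a single split with two leaves.
t0 = FlatTree(feature=[0, 0, 0], threshold=[0.5, 0.0, 0.0],
              left=[1, -1, -1], right=[2, -1, -1], value=[0.0, -1.0, 1.0])
t1 = FlatTree(feature=[1, 0, 0], threshold=[0.2, 0.0, 0.0],
              left=[1, -1, -1], right=[2, -1, -1], value=[0.0, 0.5, -0.5])
```

Because each tree contributes one independent leaf score, the trees of an ensemble can be evaluated concurrently and their scores summed at the end, which is exactly the kind of parallelism an FPGA exploits.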


People


Publications

Kaan Kara, Dan Alistarh, Gustavo Alonso, Onur Mutlu, Ce Zhang. FPGA-accelerated Dense Linear Machine Learning: A Precision-Convergence Trade-off.
IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), May 2017.

Muhsen Owaida, Hantian Zhang, Ce Zhang, Gustavo Alonso.
IEEE 27th International Conference on Field-Programmable Logic and Applications (FPL), September 2017, Ghent, Belgium.


MULTICORES

During the last few years, there have been radical changes in the hardware and processor-architecture landscape. Intel co-founder Gordon E. Moore's famous conjecture that the number of transistors on integrated circuits doubles roughly every two years still holds today. However, multi-core processors have caused a paradigm shift in which an increasing share of those transistors is invested in on-chip parallelism and advanced features. In addition to growing core counts, newer processor designs introduce sophisticated features such as extended instruction sets, SIMD instructions, simultaneous multi-threading (SMT), and larger, multi-level cache hierarchies. These technological trends drive how new data processing systems and algorithms should be designed and tailored today, while also making that design problem harder. Moreover, since there is no single established design among contemporary multicores, there is also no clear consensus on how to design data processing algorithms that exploit the available features, or on which new features to suggest to hardware architects. In this research area, we design novel hardware-conscious algorithms and systems that target recent hardware features, in order to realize the performance promise of modern multi-core machines and to provide insights for upcoming generations of multicores.

In the "Parallel Joins on Multi-Core" project we investigated an extensive set of join approaches, both in terms of algorithms and of implementations, on a broad range of recent multi-core platforms. Our investigations provide conclusive answers to existing controversies in the literature. One ongoing debate is whether hash joins should be hardware-conscious or hardware-oblivious, given hardware advances such as simultaneous multi-threading and automatic prefetching. Our results indicate that hardware-conscious hash join algorithms generally perform better than their hardware-oblivious counterparts. On the algorithmic side, there has also been controversy over whether sort-merge joins are now a better option than hash joins, thanks to features of new hardware such as wide SIMD instructions. Using the most optimized versions of both, we have shown that hash joins still maintain an edge over sort-merge joins despite the advances in SIMD. Further details can be found in the Parallel & Distributed Joins project description below.

We have also looked at several data stream algorithms and proposed new multi-core solutions for approximate frequent item counting and for stream joins that circumvent the limitations of existing approaches.
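For context on the counting problem, a standard single-threaded software baseline is the Space-Saving algorithm of Metwally et al., sketched below in Python. It is shown only to make the problem concrete; it is not the multi-core solution proposed in our work.

```python
def space_saving(stream, k):
    """Approximate frequent-item counting with at most k counters
    (the Space-Saving algorithm).  When the stream contains more than
    k distinct items, tracked counts may overestimate true frequencies,
    but never by more than the smallest tracked count."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Evict the item with the minimum count; the newcomer
            # inherits that count plus one.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters

# 'a' occurs 8 times, 'b' 4, 'c' 2, 'd' 1; only 3 counters are kept.
counts = space_saving("abacabadabacaba", k=3)
```

The challenge our work addresses is that such shared-counter structures serialize badly under concurrent updates, which is what motivates redesigning them for multi-cores.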

We have many open projects suitable for a master's thesis or a lab project. Please contact us if you are interested.

Current Project Members:

Past Project Members:

  • Cagri Balkesen (PhD graduate, now with Oracle Labs)
  • Daniela Dorneanu (PhD Student)
  • Jens Teubner (Senior researcher, now Professor at TU Dortmund)
  • Pratanu Roy (PhD graduate, now with Oracle Labs)

Publications

Conference Papers

  • Pratanu Roy, Jens Teubner, and Rainer Gemulla. Low-Latency Handshake Join. Proc. of the VLDB Endowment (PVLDB), Volume 7 (9), Hangzhou, China, 2014.

Journal Papers

Demos, Tutorials, Workshops and Technical Reports

  • Cagri Balkesen, Louis Woods, and Jens Teubner. New Hardware Architectures for Data Management. 15. GI-Fachtagung für Business, Technologie und Web (BTW 2013), Tutorial, Magdeburg, Germany, March 2013.


FPGA

FPGAs in the Datacenter

The limitations of today's computing architectures are well known: high power consumption, heat dissipation, network and I/O bottlenecks, and the memory wall. Field-programmable gate arrays (FPGAs), user-configurable hardware chips, are promising candidates to overcome these limitations. With tailor-made, software-configured hardware circuits it is possible to process data at very high throughput rates and with extremely low latency, while consuming orders of magnitude less power than conventional systems. Thanks to their high configurability, FPGAs can be used as co-processors in heterogeneous multi-core architectures, or placed directly in critical datapaths to reduce the load that hits the system CPU.

Current projects:

Past projects:

Current project members:

Former members:

  • René Müller (PhD graduate, now with FH Bern, Switzerland)
  • Jens Teubner (Postdoc, now Professor at TU Dortmund, Germany)
  • Louis Woods (PhD graduate, now with Apcera, USA)

Publications

(For a complete list, please see the individual project pages.)

Conference papers

Journal publications

Demos, Tutorials and Workshops

Acknowledgements

Our FPGA work is funded in part by grants from Xilinx, as part of the Enterprise Computing Center (www.ecc.ethz.ch), and Microsoft Research, as part of the Joint Research Center MSR-ETHZ-EPFL. We use FPGA equipment generously donated by Xilinx, by Intel through the Intel Altera Heterogeneous Research Platform, and acquired under the Maxeler University Program.

Parallel & Distributed Joins

Project Description

The architectural changes introduced with multi-core processors have triggered a redesign of main-memory join algorithms for relational database systems. Join processing is a complex and demanding operation. Traditionally, joins have been implemented with either sorting-based or hashing-based approaches. In the last few years, however, several diverging views have appeared regarding how to design and implement main-memory joins on multi-core processors.

In this project we have investigated an extensive set of join approaches, both in terms of algorithms and implementations on a broad range of recent multi-core platforms. Our investigations have provided conclusive answers to the existing controversies in the literature.

Hash Joins

On the one hand, there are two camps on how to implement main-memory hash joins on multi-core. Hardware-oblivious hash join variants do not depend on hardware-specific parameters. Rather, they consider qualitative characteristics of modern hardware and are expected to achieve good performance on any technologically similar platform. These algorithms assume that hardware has become good enough at hiding its own limitations (through automatic hardware prefetching, out-of-order execution, or simultaneous multi-threading (SMT)) to make hardware-oblivious algorithms competitive without the overhead of careful tuning to the underlying hardware. Hardware-conscious implementations, such as (parallel) radix join, aim to maximally exploit a given architecture by tuning the algorithm parameters (e.g., hash table sizes) to the particular features of that architecture. The assumption here is that explicit parameter tuning yields enough of a performance advantage to warrant the effort required.

In this project, we compare the two approaches for hash joins under a wide range of workloads (relative table sizes, tuple sizes, effects of sorted data, etc.) and configuration parameters (VM page sizes, number of threads, number of cores, SMT, SIMD, prefetching, etc.). The results show that hardware-conscious algorithms generally outperform hardware-oblivious ones. However, on specific workloads and special architectures with aggressive simultaneous multi-threading, hardware-oblivious algorithms become competitive. As an answer to the controversy on how to implement hash joins on existing multi-core architectures, the project shows that it is still important to carefully tailor algorithms to the underlying hardware to get the necessary performance.
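The contrast between the two camps can be sketched in a few lines of illustrative (not performance-faithful) Python: a hardware-oblivious join builds one global hash table over the whole build input, while a hardware-conscious radix join first partitions both inputs on the low bits of the key so that each per-partition hash table is small enough to stay cache-resident. The partition count (`bits`) is exactly the kind of parameter that gets tuned to the target architecture.

```python
def hash_join(R, S):
    # Hardware-oblivious flavor: one global hash table over R,
    # probed by every tuple of S.  R and S are lists of
    # (key, payload) tuples.
    table = {}
    for key, payload in R:
        table.setdefault(key, []).append(payload)
    return [(key, r, s) for key, s in S for r in table.get(key, [])]

def radix_join(R, S, bits=4):
    # Hardware-conscious flavor: partition both inputs on the low
    # `bits` of the key hash first, so each per-partition hash table
    # fits in cache; then join matching partition pairs independently
    # (which is also where the parallelism comes from).
    fanout = 1 << bits
    parts_r = [[] for _ in range(fanout)]
    parts_s = [[] for _ in range(fanout)]
    for t in R:
        parts_r[hash(t[0]) & (fanout - 1)].append(t)
    for t in S:
        parts_s[hash(t[0]) & (fanout - 1)].append(t)
    out = []
    for pr, ps in zip(parts_r, parts_s):
        out.extend(hash_join(pr, ps))
    return out
```

Both functions produce the same join result; the difference that matters in practice is memory-access behavior, which Python deliberately does not model here.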

Sort-Merge Joins & Sort vs. Hash Revisited

With the advent of modern multi-core architectures, it has been argued that sort-merge join is now a better choice than radix-hash join. This claim is justified by the width of SIMD instructions (sort-merge outperforms radix-hash join once SIMD is sufficiently wide) and by NUMA awareness (sort-merge is said to be superior to hash join on NUMA architectures). Through extensive experiments on the original and optimized versions of these algorithms, we show that, contrary to these claims, radix-hash join is still clearly superior, and that sort-merge approaches the performance of radix-hash only when very large amounts of data are involved. Thus, this project resolves another controversy in the literature regarding the relative performance of sort- and hash-based join algorithms. The project provides the fastest implementations of these algorithms and covers many aspects of modern hardware architectures that are relevant not only for joins but for any parallel data processing operator.

Project Members

Past Project Members

Publications

Source Code

Contacts

Claude Barthels

Cagri Balkesen (now with Oracle Labs)


Data Cyclotron - Past Project

Project Description

The transport mechanisms offered by modern network cards that support remote direct memory access (RDMA) significantly shift the priorities in distributed systems. Complex and sophisticated machinery designed only to avoid network traffic can now be replaced by schemes that can use the available bandwidth to their advantage. One such scheme is Data Cyclotron, a research effort that we pursue jointly with the database group at CWI Amsterdam. Based on a simple ring-shaped topology, Data Cyclotron offers ad-hoc querying over data of arbitrary shape and arbitrary size.

One example of how database queries can be processed efficiently in a Data Cyclotron ring is cyclo-join. Using the ring topology, cyclo-join leverages distributed main-memory and CPU resources to process arbitrary database joins. Unlike existing schemes, cyclo-join does not depend on pre-established data distributions or on specific join predicates (see our DaMoN 2009 paper for details on cyclo-join).
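The rotation idea behind cyclo-join can be simulated in a few lines of Python. This is a functional sketch only, under our own simplifications: RDMA transfers and the overlap of communication with computation are ignored, and the function names are ours.

```python
def cyclo_join(r_partitions, s_fragments, local_join):
    """Simulate cyclo-join on a ring of n nodes: node i keeps
    R-partition i pinned in its local main memory, while the fragments
    of S rotate around the ring.  After n hops every S fragment has
    visited every node, so the complete join of R and S has been
    produced without any pre-partitioning on the join key."""
    n = len(r_partitions)
    out = []
    for hop in range(n):
        for node in range(n):
            # The S fragment resident at this node during this hop.
            frag = s_fragments[(node + hop) % n]
            out.extend(local_join(r_partitions[node], frag))
    return out

# Any local join algorithm can be plugged in; here, a nested-loop join
# over (key, payload) tuples.
nested_loop = lambda R, S: [(k, r, s) for k, r in R for k2, s in S if k == k2]
```

Because every node is busy joining a different S fragment on every hop, the available main memory and CPU resources of the whole ring are used simultaneously.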

Project Members

Project members (in alphabetical order):

  • Philip Frey
  • Romulo Gonçalves (CWI)
  • Martin Kersten (CWI)
  • Jens Teubner

Publications


Avalanche - Past Project

Avalanche: Data Processing on Bare Metal

Project Description

The limitations of today's computing architectures are well known: high power consumption, heat dissipation, network and I/O bottlenecks, and the memory wall. Field-programmable gate arrays (FPGAs), user-configurable hardware chips, are promising candidates to overcome these limitations. With tailor-made, software-configured hardware circuits it is possible to process data at very high throughput rates and with extremely low latency, while consuming orders of magnitude less power than conventional systems. Thanks to their high configurability, FPGAs can be used as co-processors in heterogeneous multi-core architectures, or placed directly in critical datapaths to reduce the load that hits the system CPU.

In Avalanche we work on realizing these promises. To this end, we are building a processing stack for FPGA-based database processing. Our Glacier compiler translates a useful subset of a stream query language into hardware circuits, e.g., to make the latency and throughput advantages of FPGAs accessible to financial trading applications. Our recent hardware solution to the frequent item problem significantly outperforms known software solutions, at a fraction of the resource consumption.

Project members
  • Gustavo Alonso (professor)
  • Cagri Balkesen (PhD student)
  • Pratanu Roy (PhD student)
  • Jens Teubner (postdoc)
  • Louis Woods (PhD student)

Former members:

Publications

Acknowledgements

Avalanche was supported by the Swiss National Science Foundation.