Shark: SQL and Rich Analytics at Scale

Title: Shark: SQL and Rich Analytics at Scale
Publication Type: Conference Paper
Year of Publication: 2013
Authors: Xin, R., Rosen J., Zaharia M., Franklin M. J., Shenker S., & Stoica I.
Page(s): 13-24
Other Numbers: 3433
Abstract

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100× faster than Apache Hive, and machine learning programs more than 100× faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such engines provide. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.

Acknowledgment

We thank Cliff Engle, Harvey Feng, Shivaram Venkataraman, Ram Sriharsha, Tim Tully, Denny Britz, Antonio Lupher, Patrick Wendell, Paul Ruan, Jason Dai, Shane Huang, and other colleagues in the AMPLab for their work on Shark. We also thank Andy Pavlo and his colleagues for making their benchmark dataset and queries available. This research is supported in part by NSF CISE Expeditions award CCF-1139158 and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, Blue Goji, Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, General Electric, Hortonworks, Huawei, Intel, Microsoft, NetApp, Oracle, Quanta, Samsung, Splunk, VMware and Yahoo!, and by a Google PhD Fellowship.

URL: https://www.icsi.berkeley.edu/pubs/networking/ICSI_sharksql13.pdf
Bibliographic Notes

Proceedings of 2013 ACM SIGMOD/PODS Conference, New York, New York, pp. 13-24

Abbreviated Authors

R. Xin, J. Rosen, M. Zaharia, M. Franklin, S. Shenker, and I. Stoica

ICSI Research Group

Networking and Security

ICSI Publication Type

Article in conference proceedings