Spark: Cluster Computing with Working Sets

Title: Spark: Cluster Computing with Working Sets
Publication Type: Conference Paper
Year of Publication: 2010
Authors: Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I.
Other Numbers: 3385

MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
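The key idea in the abstract is that an RDD achieves fault tolerance through lineage rather than replication: a lost partition is recomputed from the transformation that produced it. The following toy sketch illustrates that idea in plain Python; the class and method names (`ToyRDD`, `rebuild_partition`) are illustrative assumptions and not Spark's actual API.

```python
# A minimal sketch of the RDD concept described in the abstract: a read-only,
# partitioned dataset that records its lineage (parent dataset plus the
# transformation applied), so a lost partition can be rebuilt on demand.
# All names are hypothetical; this is not Spark's real interface.

class ToyRDD:
    def __init__(self, partitions, lineage=None):
        self.partitions = list(partitions)  # tuples, treated as read-only
        self.lineage = lineage              # (parent, fn), or None for source data

    def map(self, fn):
        # A transformation builds a new RDD and remembers how it was derived.
        return ToyRDD([tuple(fn(x) for x in p) for p in self.partitions],
                      lineage=(self, fn))

    def rebuild_partition(self, i):
        # Recompute partition i from the parent via the recorded lineage.
        parent, fn = self.lineage
        self.partitions[i] = tuple(fn(x) for x in parent.partitions[i])

    def collect(self):
        return [x for p in self.partitions for x in p]

source = ToyRDD([(1, 2), (3, 4)])
doubled = source.map(lambda x: 2 * x)
doubled.partitions[1] = None   # simulate losing a partition (e.g. a machine failure)
doubled.rebuild_partition(1)   # recovered from lineage, not from a stored replica
print(doubled.collect())       # [2, 4, 6, 8]
```

This recompute-from-lineage strategy is what lets Spark keep working sets in memory across iterations without paying the cost of replicating them for fault tolerance.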


We thank Ali Ghodsi for his feedback on this paper. This research was supported by California MICRO, California Discovery, the Natural Sciences and Engineering Research Council of Canada, as well as the following Berkeley RAD Lab sponsors: Sun Microsystems, Google, Microsoft, Amazon, Cisco, Cloudera, eBay, Facebook, Fujitsu, HP, Intel, NetApp, SAP, VMware, and Yahoo!.

Bibliographic Notes

Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud '10), pp. 1-7, Boston, Massachusetts

Abbreviated Authors

M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker and I. Stoica

ICSI Research Group

Networking and Security

ICSI Publication Type

Article in conference proceedings