Advanced Apache Spark Meetup @ Bigcommerce

Last Thursday, we at Bigcommerce hosted our first Advanced Apache Spark meetup at our SF office, on the topic "How Spark Beat Hadoop @ 100 TB Sort". Thanks to Chris for coming up with such great topics, with an emphasis on deep dives into Spark internals. We had a great turnout and look forward to hosting many more meetups like this.

The premise of the topic, sorting in Spark, is explained in the paper by Reynold Xin and team: http://sortbenchmark.org/ApacheSpark2014.pdf

Theme of the talk:

  • Seek once, scan sequentially
  • Focusing on CPU Cache Locality and Memory Hierarchy
  • Go off-heap - Further explained in Project Tungsten, topic for a future meetup
  • Customized data structures
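The cache-locality and custom-data-structure principles can be illustrated with a toy sketch of the {key, pointer} sort idea covered later in the talk. This is a simplified, hypothetical Python illustration, not Spark's actual implementation: rather than comparing and moving whole 100-byte records during the sort, it sorts small (key-prefix, index) entries, so far more of the working set fits in CPU cache, and the full records are gathered only once at the end.

```python
# Sketch of the "{key, pointer}" sort idea (illustrative only; Spark's
# real implementation is in JVM/off-heap code, not Python).
import random

def key_pointer_sort(records, key_len=10):
    # One small entry per record: the 10-byte sort key prefix plus the
    # record's position. Only these entries are compared and moved.
    entries = sorted((rec[:key_len], i) for i, rec in enumerate(records))
    # Gather the full 100-byte records once, in sorted order.
    return [records[i] for _, i in entries]

random.seed(42)
records = [bytes(random.randrange(256) for _ in range(100)) for _ in range(1000)]
assert key_pointer_sort(records) == sorted(records, key=lambda r: r[:10])
```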

Excerpts from the meetup:

Key metric measured: throughput of sorting 100 TB of data (100-byte records with 10-byte keys). Total time includes launching the application and writing the output file.

Resources: the sort generated 500 TB of disk I/O and 200 TB of network I/O on commercially available hardware.

Winning Results:

                  Hadoop World Record    Spark 100 TB         Spark 1 PB
Data Size         102.5 TB               100 TB               1000 TB
Elapsed Time      72 mins                23 mins              234 mins
# Nodes           2100                   206                  190
# Cores           50400 (physical)       6592 (virtualized)   6080 (virtualized)
# Reducers        10,000                 29,000               250,000
Rate              1.42 TB/min            4.27 TB/min          4.27 TB/min
Rate/Node         0.67 GB/min            20.7 GB/min          22.5 GB/min
Daytona Rules     Yes                    Yes                  No
Environment       Dedicated data center  EC2 (i2.8xlarge)     EC2 (i2.8xlarge)
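As a sanity check, the Rate and Rate/Node rows follow directly from the data size, elapsed time, and node count. A quick back-of-the-envelope sketch (the 23.4-minute figure is the published elapsed time for the 100 TB run before rounding to 23 mins):

```python
# Derive the table's throughput rows from data size, elapsed time,
# and node count. 23.4 min is the unrounded 100 TB elapsed time.

def rates(tb, minutes, nodes):
    rate = tb / minutes               # overall rate, TB/min
    per_node = rate * 1000 / nodes    # per-node rate, GB/min
    return round(rate, 2), round(per_node, 1)

print(rates(102.5, 72, 2100))   # Hadoop record -> (1.42, 0.7)
print(rates(100, 23.4, 206))    # Spark 100 TB  -> (4.27, 20.7)
print(rates(1000, 234, 190))    # Spark 1 PB    -> (4.27, 22.5)
```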

Key takeaway: Spark sorted the same data 3X faster using 10X fewer machines. All sorting took place on disk (HDFS), without using Spark's in-memory cache. The 1 PB time also beats previously reported results based on Hadoop MapReduce (16 hours on 3800 machines).

Slides 23 through 46 are worth noting: shuffle overview, configuration attributes, operational pitfalls, the {key, pointer} shuffle-sort tip, and Spark CPU and memory optimization principles.

By winning the Daytona GraySort contest, Spark further substantiated the case for its widespread adoption as the successor to Hadoop MapReduce.

References:

  1. https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
  2. http://sortbenchmark.org/ApacheSpark2014.pdf
  3. http://sortbenchmark.org/Yahoo2013Sort.pdf
  4. http://0x0fff.com/spark-architecture-shuffle/
  5. https://sparkhub.databricks.com/event/winning-daytona-graysort-shuffle-network-cpu-cache-and-perf-optimizations/
  6. http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf

Slides: http://www.slideshare.net/cfregly/advanced-apache-spark-meetup-how-spark-beat-hadoop-100-tb-daytona-graysort-challenge

Video: https://www.youtube.com/watch?v=yUPsit8I_7E

Stay tuned for more meetups hosted by Bigcommerce. If you are interested in joining the Data Engineering team at Bigcommerce, please review the Lead Data Engineer profile, or our careers page for all other open positions.