Advanced Apache Spark Meetup @ Bigcommerce

Last Thursday, we at Bigcommerce hosted our first Advanced Apache Spark meetup at our SF office, on the topic "How Spark Beat Hadoop @ 100 TB Sort". Thanks to Chris for coming up with such great topics, with an emphasis on deep dives into Spark internals. We had a great turnout and look forward to hosting many more meetups like this.

The premise of the topic, sorting in Spark, is explained in the paper by Reynold Xin and team: http://sortbenchmark.org/ApacheSpark2014.pdf

Theme of the talk:

  • Seek once, scan sequentially
  • Focusing on CPU Cache Locality and Memory Hierarchy
  • Go off-heap - Further explained in Project Tungsten, topic for a future meetup
  • Customized data structures
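The cache-locality and custom-data-structure principles can be illustrated with a toy sketch of the {key, pointer} sort idea covered later in the talk. This is a simplified, hypothetical Python illustration, not Spark's actual implementation: rather than comparing and moving whole 100-byte records during the sort, it sorts small (key-prefix, index) entries, so far more of the working set fits in CPU cache, and the full records are gathered only once at the end.

```python
# Sketch of the "{key, pointer}" sort idea (illustrative only; Spark's
# real implementation is in JVM/off-heap code, not Python).
import random

def key_pointer_sort(records, key_len=10):
    # One small entry per record: the 10-byte sort key prefix plus the
    # record's position. Only these entries are compared and moved.
    entries = sorted((rec[:key_len], i) for i, rec in enumerate(records))
    # Gather the full 100-byte records once, in sorted order.
    return [records[i] for _, i in entries]

random.seed(42)
records = [bytes(random.randrange(256) for _ in range(100)) for _ in range(1000)]
assert key_pointer_sort(records) == sorted(records, key=lambda r: r[:10])
```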

Excerpts from the meetup:

Key metric measured: throughput of sorting 100 TB of data (100-byte records with 10-byte keys). Total time includes launching the application and writing the output file.

Resources: the sort generated 500 TB of disk I/O and 200 TB of network I/O on commercially available hardware.

Winning Results:

                  Hadoop World Record    Spark 100 TB         Spark 1 PB
Data Size         102.5 TB               100 TB               1000 TB
Elapsed Time      72 mins                23 mins              234 mins
# Nodes           2100                   206                  190
# Cores           50400 (physical)       6592 (virtualized)   6080 (virtualized)
# Reducers        10,000                 29,000               250,000
Rate              1.42 TB/min            4.27 TB/min          4.27 TB/min
Rate/Node         0.67 GB/min            20.7 GB/min          22.5 GB/min
Daytona Rules     Yes                    Yes                  No
Environment       Dedicated data center  EC2 (i2.8xlarge)     EC2 (i2.8xlarge)
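As a sanity check, the Rate and Rate/Node rows follow directly from the data size, elapsed time, and node count. A quick back-of-the-envelope sketch (the 23.4-minute figure is the published elapsed time for the 100 TB run before rounding to 23 mins):

```python
# Derive the table's throughput rows from data size, elapsed time,
# and node count. 23.4 min is the unrounded 100 TB elapsed time.

def rates(tb, minutes, nodes):
    rate = tb / minutes               # overall rate, TB/min
    per_node = rate * 1000 / nodes    # per-node rate, GB/min
    return round(rate, 2), round(per_node, 1)

print(rates(102.5, 72, 2100))   # Hadoop record -> (1.42, 0.7)
print(rates(100, 23.4, 206))    # Spark 100 TB  -> (4.27, 20.7)
print(rates(1000, 234, 190))    # Spark 1 PB    -> (4.27, 22.5)
```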

Key takeaway: Spark sorted the same data 3X faster using 10X fewer machines. All sorting took place on disk (HDFS), without using Spark's in-memory cache. The 1 PB time also beats previously reported results based on Hadoop MapReduce (16 hours on 3800 machines).

Slides 23 through 46 are worth noting: shuffle overview, configuration attributes, operational pitfalls, the {key, pointer} shuffle-sort tip, and Spark CPU and memory optimization principles.

By winning the Daytona GraySort contest, Spark further substantiated the case for its widespread adoption as the successor to Hadoop MapReduce.

References:

  1. https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
  2. http://sortbenchmark.org/ApacheSpark2014.pdf
  3. http://sortbenchmark.org/Yahoo2013Sort.pdf
  4. http://0x0fff.com/spark-architecture-shuffle/
  5. https://sparkhub.databricks.com/event/winning-daytona-graysort-shuffle-network-cpu-cache-and-perf-optimizations/
  6. http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf

Slides: http://www.slideshare.net/cfregly/advanced-apache-spark-meetup-how-spark-beat-hadoop-100-tb-daytona-graysort-challenge

Video: https://www.youtube.com/watch?v=yUPsit8I_7E

Stay tuned for more meetups hosted by Bigcommerce. If you are interested in joining the Data Engineering team at Bigcommerce, please review the Lead Data Engineer profile, or our careers page for all other open positions.