Advanced Apache Spark Meetup @ Bigcommerce
Last Thursday, we at Bigcommerce hosted our first Advanced Apache Spark meetup at our SF office, on the topic "How Spark Beat Hadoop @ 100 TB Sort". Thanks to Chris for putting together such a great topic, with an emphasis on deep dives into Spark internals. We had a great turnout and look forward to hosting many more meetups like this.
The premise of the talk, sorting in Spark, is laid out in the benchmark paper by Reynold Xin and team: http://sortbenchmark.org/ApacheSpark2014.pdf
Theme of the talk:
- Seek once, scan sequentially
- Focus on CPU cache locality and the memory hierarchy
- Go off-heap (covered further in Project Tungsten, a topic for a future meetup)
- Customized data structures
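The "customized data structures" point refers to sorting compact {key, pointer} pairs instead of full records, so the sort's comparison loop touches a small contiguous array that fits well in CPU cache. Here is a minimal illustrative sketch of that idea in Python (not Spark's actual implementation; record contents and the 10-byte key prefix are made up for the example):

```python
# Illustrative sketch of {key-prefix, pointer} sorting (not Spark's
# actual code): comparisons only touch a compact array of small pairs,
# improving cache locality; full records are read once at the end.
records = [b"banana....rec3", b"apple.....rec1", b"cherry....rec2"]

# Build a compact array of (10-byte key prefix, record index) pairs.
key_ptr = [(rec[:10], i) for i, rec in enumerate(records)]

# The sort touches only the small key/pointer pairs, not full records.
key_ptr.sort(key=lambda kp: kp[0])

# Materialize the output by following the pointers once, sequentially.
sorted_records = [records[i] for _, i in key_ptr]
```

This matches the "seek once, scan sequentially" theme: the expensive random access into records happens only in the final sequential pass.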
Excerpts from the meetup:
Key metric measured: throughput of sorting 100 TB of data (100-byte records, 10-byte keys). Total time includes launching the application and writing the output files.
Resources: the sort generated 500 TB of disk I/O and 200 TB of network I/O, on commercially available hardware.
Winning Results:
| | Hadoop World Record | Spark 100 TB | Spark 1 PB |
|---|---|---|---|
| Data Size | 102.5 TB | 100 TB | 1000 TB |
| Elapsed Time | 72 mins | 23 mins | 234 mins |
| # Nodes | 2100 | 206 | 190 |
| # Cores | 50400 (physical) | 6592 (virtualized) | 6080 (virtualized) |
| # Reducers | 10,000 | 29,000 | 250,000 |
| Rate | 1.42 TB/min | 4.27 TB/min | 4.27 TB/min |
| Rate/Node | 0.67 GB/min | 20.7 GB/min | 22.5 GB/min |
| Sort Benchmark Daytona Rules | Yes | Yes | No |
| Environment | Dedicated data center | EC2 (i2.8xlarge) | EC2 (i2.8xlarge) |
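The derived rates in the table follow directly from the raw numbers, and can be sanity-checked with a few lines of arithmetic (small differences come from rounding; the published 4.27 TB/min for Spark implies an elapsed time of roughly 23.4 minutes, consistent with the 23-minute figure above):

```python
# Sanity-check the derived Rate and Rate/Node figures (1 TB = 1000 GB).
hadoop_rate = 102.5 / 72                     # TB/min -> ~1.42
hadoop_per_node = hadoop_rate * 1000 / 2100  # GB/min -> ~0.67-0.68
spark_rate = 4.27                            # TB/min (published figure)
spark_per_node = spark_rate * 1000 / 206     # GB/min -> ~20.7

print(round(hadoop_rate, 2), round(spark_per_node, 1))
```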
Key takeaway: Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark's in-memory cache. The 1 PB time beats previously reported results based on Hadoop MapReduce (16 hours on 3800 machines).
Slides 23 through 46 are worth noting: shuffle overview, configuration attributes, operational pitfalls, the {key, pointer} shuffle-sort tip, and Spark CPU and memory optimization principles.
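For readers new to the shuffle material in those slides, the core idea is that each map task partitions its output by destination reducer, and each reducer then collects its partition from every map output. A minimal conceptual sketch (deliberately simplified; not Spark's actual shuffle implementation, and the function names here are made up for illustration):

```python
# Conceptual shuffle sketch: map side buckets records by hash(key),
# reduce side gathers its bucket from every map output and sorts it.
from collections import defaultdict


def map_side_partition(records, num_reducers):
    """Bucket (key, value) pairs by target reducer id."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % num_reducers].append((key, value))
    return buckets


def reduce_side_merge(all_map_outputs, reducer_id):
    """Collect this reducer's partition from every map output, then sort."""
    merged = []
    for buckets in all_map_outputs:
        merged.extend(buckets.get(reducer_id, []))
    merged.sort()
    return merged


map_outputs = [
    map_side_partition([("a", 1), ("b", 2)], 2),
    map_side_partition([("a", 3), ("c", 4)], 2),
]
partitions = [reduce_side_merge(map_outputs, r) for r in range(2)]
```

Because partitioning is by key hash, all records sharing a key land on the same reducer; the per-reducer sort is where the {key, pointer} trick and cache-locality work from the talk pay off.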
By winning the Daytona GraySort contest, Spark further substantiated the case for its widespread adoption as the successor to Hadoop MapReduce.
References:
- https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
- http://sortbenchmark.org/ApacheSpark2014.pdf
- http://sortbenchmark.org/Yahoo2013Sort.pdf
- http://0x0fff.com/spark-architecture-shuffle/
- https://sparkhub.databricks.com/event/winning-daytona-graysort-shuffle-network-cpu-cache-and-perf-optimizations/
- http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
Stay tuned for more meetups hosted by Bigcommerce. If you are interested in joining the Data Engineering team at Bigcommerce, please review the Lead Data Engineer profile, or our careers page for all other open positions.