Spark for Big Data Analytics [Part 1]

In this Oceans of Data series, I will begin diving into Apache Spark. Spark is the most active open source project in big data according to Databricks' latest survey, with 1,000+ contributors from 250+ organizations.

Spark is growing faster than Hadoop and has already moved beyond the early-adopter phase. Analytics vendors have added, and continue to add, Spark connectors and capabilities, and we are seeing migrations from Hadoop MapReduce to Spark today.

Looking at job trends, demand for Spark skills is growing strongly. So if you haven't done so already, it is time to learn a little Spark.

[Figure: Spark job trends. Source: Indeed Job Trends]

I began learning about Spark after purchasing O'Reilly's Advanced Analytics with Spark and working hands-on through the Databricks Community Edition tutorials.


Databricks training includes introductions to the Spark runtime, Spark SQL, and PySpark, the Python programming interface to Spark. Databricks also maintains a wonderful library of other resources: MOOCs, books, and community links.


Overview of Apache Spark

Apache Spark is an open source big data processing framework built for scalable, efficient analysis of big data.

[Figure: Spark architecture. Source: Databricks]

It was developed in 2009 at UC Berkeley's AMPLab, open sourced in 2010, and later donated to the Apache Software Foundation. Unlike many open source projects, Spark is really a family of components: the core Spark runtime plus Spark SQL, Spark Streaming, MLlib, ML, and GraphX, among others.

  • Spark SQL is Apache Spark’s module for working with structured data. It provides JDBC and ODBC connectivity, so Spark datasets can be queried with SQL-like queries from traditional analytics, data discovery, BI, and visualization tools. Spark SQL also lets users bring in data from different formats such as JSON, Parquet, or a relational database, transform it, and expose it via Spark for ad-hoc querying (a minimal example follows this list).
  • Spark Streaming processes real-time streaming data using a micro-batch style of computation.
  • Spark MLlib is a scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives.
  • Spark GraphX is an API for graphs and graph-parallel computation. It includes a collection of graph algorithms and builders to simplify graph analytics tasks.
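
To make the Spark SQL bullet concrete, here is a minimal PySpark sketch: it loads a JSON file, registers it as a temporary view, and queries it with ordinary SQL. The file name and column names are hypothetical, and the sketch assumes Spark 2.x's SparkSession API.

```python
# Minimal Spark SQL sketch: load JSON, register a temp view, query with SQL.
# Assumes Spark 2.x+; "events.json" and its columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

events = spark.read.json("events.json")    # schema is inferred from the JSON records
events.createOrReplaceTempView("events")   # expose the DataFrame as a SQL-queryable view

top_users = spark.sql("""
    SELECT user_id, COUNT(*) AS event_count
    FROM events
    GROUP BY user_id
    ORDER BY event_count DESC
    LIMIT 10
""")
top_users.show()
```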

Beyond the foundational Spark libraries, there is an expanding ecosystem of related projects, including Zeppelin, SparkR, SnappyData, CaffeOnSpark, the Spark Cassandra Connector, BlinkDB, and Tachyon.

Spark’s growth and adoption are fueled by easy-to-learn programming interfaces, an integrated compute engine, and phenomenal performance at solving complex data problems at scale.

Spark apps on Hadoop clusters can run up to 100 times faster than Hadoop MapReduce in memory, and 10 times faster on disk.

It also helps that Spark DataFrames, which are conceptually like tables, are not dependent on the Hadoop Distributed File System (HDFS). Spark integrates and coexists with a wide range of commonly used commercial and open source third-party data storage systems.
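
As a hedged illustration of that storage independence, the sketch below uses the same DataFrame API against two very different backends: Parquet files on S3 and a relational database over JDBC. All paths, table names, columns, and credentials here are hypothetical.

```python
# Same DataFrame API, different storage systems; details below are invented.
parquet_df = spark.read.parquet("s3a://my-bucket/events/")   # columnar files on S3

jdbc_df = (spark.read.format("jdbc")                         # a relational database
    .option("url", "jdbc:postgresql://dbhost:5432/analytics")
    .option("dbtable", "public.orders")
    .option("user", "analyst")
    .option("password", "secret")                            # requires the JDBC driver on the classpath
    .load())

# Both are ordinary DataFrames, so they can be joined regardless of where the data lives.
joined = parquet_df.join(jdbc_df, "order_id")
```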

[Figure: Spark SQL. Source: Databricks]

Spark provides a comprehensive framework for managing big data processing across a variety of data set types, including text and graph data, and it handles both batch pipelines and real-time streaming data. Using Spark libraries, you can create big data analytics apps in Java, Scala, and the popular Python and R languages.

Spark gives analytics pros an improved MapReduce-style query capability with faster data processing in memory or on disk, and it can be used with datasets larger than the aggregate memory of a cluster. Spark also evaluates big data queries lazily, which helps with workflow optimization and the reuse of intermediate results in memory. The Spark API is easy to learn.
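
Here is a small sketch of that lazy evaluation in practice, assuming the `spark` session from the earlier sketch and a hypothetical log file: transformations like `filter` only build an execution plan, and nothing runs until an action like `count` is called.

```python
# Lazy evaluation sketch; "access.log" is a hypothetical input file.
lines = spark.sparkContext.textFile("access.log")
errors = lines.filter(lambda l: "ERROR" in l)   # transformation: no work happens yet
errors.cache()                                  # mark the intermediate result for in-memory reuse

print(errors.count())                           # action: triggers the actual computation
print(errors.filter(lambda l: "timeout" in l).count())  # reuses the cached result
```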

Spark provides powerful pipeline data processing capabilities, combining different techniques and processes into a single, unified workflow. Without Spark, a data pro might need to stitch together a series of separate big data processing frameworks with a tool such as Apache Oozie.
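
As one hedged example of that unified style, the sketch below chains feature extraction and model fitting into a single MLlib Pipeline instead of separate frameworks. It assumes the `spark` session from the earlier sketches, and the tiny inline dataset is invented purely for illustration.

```python
# Unified pipeline sketch: tokenize, featurize, and fit a model in one object.
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

# Toy training data, invented for illustration.
training = spark.createDataFrame(
    [("spark is fast", 1.0), ("hadoop mapreduce", 0.0)],
    ["text", "label"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),       # split text into tokens
    HashingTF(inputCol="words", outputCol="features"),   # hash tokens into feature vectors
    LogisticRegression(maxIter=10),                      # fit a classifier on the features
])
model = pipeline.fit(training)   # one call runs every stage in order
```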

Upcoming Spark Series Articles

In the next article in this series, I will share how to spin up a Spark cluster and get started analyzing data hands-on with Spark SQL. After that, we will look at SparkR, PySpark, streaming, and the data science libraries.

In the meantime, here are a few other resources to jump-start your analytics skills with Spark: