Spark 1.4 adds support for R, Python 3, cluster management

Spark's latest release offers improved integration with containers and cluster management tools


Apache Spark, the big data processing framework that is a fixture of many Hadoop installs, has reached its 1.4 incarnation. With it comes support for R and Python 3 -- two languages in wide use by data crunchers -- as well as better use of the containers and cluster management tools that handle distributed work.

The R programming language, mainly used for statistical analysis and data science, is a perfect fit for driving a data-processing framework like Spark. SparkR, the Spark 1.4 package that adds R support, allows R programmers to write code that scales out across multiple cores or Spark nodes, and to read and write all the data formats supported in Spark. (Spark SQL is also exposed to R, allowing SQL queries against data held in Spark.)

Support for Python 3 is another key addition in 1.4. Python remains one of the go-to languages for scientific data work, both because of its ease of use and its rich array of packages for math, statistics, and machine learning. Support for Python was first added to Spark in 2012, but was limited to the Python 2.x branch. As Python 3 came into wider use (after becoming the default Python interpreter in Fedora), pressure mounted to add Python 3 support as well.
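To give a sense of how little changes on the application side, here is a minimal sketch of a PySpark job that runs identically under Python 3; the application name and input strings are invented for illustration.

    from pyspark import SparkContext

    sc = SparkContext(appName="python3-wordcount")  # hypothetical app name

    # A small word count; note print() used as a function, as Python 3 requires
    lines = sc.parallelize(["spark now speaks python three",
                            "python three support lands in spark 1.4"])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.collect())
    sc.stop()

Which interpreter actually runs the job is controlled on the cluster side, via the PYSPARK_PYTHON environment variable.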

Since Spark is meant to run across multiple nodes, some of the other improvements in 1.4 revolve around better support for current clustering technologies. Spark on Mesos, for instance, can now launch its executors from a Docker image and supports Mesos cluster mode for job submission, as sketched below.
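As a rough sketch of what the Docker integration looks like from application code, the configuration below points Spark at a Mesos master and names a Docker image for its executors. The master URL and image name are placeholders, and spark.mesos.executor.docker.image is the property Spark's Mesos documentation describes for this feature.

    from pyspark import SparkConf, SparkContext

    # Placeholder Mesos master URL and Docker image name, for illustration only
    conf = (SparkConf()
            .setAppName("spark-on-mesos-docker")
            .setMaster("mesos://zk://zk1:2181,zk2:2181/mesos")
            .set("spark.mesos.executor.docker.image", "example/spark:1.4.0"))

    sc = SparkContext(conf=conf)
    # Jobs submitted through this context run their executors inside the named container
    sc.stop()

Cluster mode, by contrast, comes into play when jobs are submitted through spark-submit with the --deploy-mode cluster flag, letting the driver itself run on the Mesos cluster rather than on the machine that submitted the job.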

According to Databricks, one of the corporate supporters of the Spark project, 1.4 also sets the stage for Project Tungsten, a major future initiative that is nothing short of an inside-out reworking of Spark to better leverage the capabilities of the hardware it runs on. Among other things, this will involve circumventing the JVM's object model and garbage collection and making more direct use of L1 through L3 caches on the CPU -- concepts that could conceivably be applied to other Java projects dealing in big data that need a performance boost.

None of this work is likely to surface in a useful form until Spark 1.5, but the goal is to make Spark run faster without requiring Spark applications to be rewritten. Beyond 1.5, Databricks is mulling further hardware-oriented optimizations -- for instance, to "leverage SSE/SIMD instructions out of modern CPUs and the wide parallelism in GPUs to speed up operations in machine learning and graph computation."

This story, "Spark 1.4 adds support for R, Python 3, cluster management," was originally published by InfoWorld.