New Trends in Big Data & AI - Highlights from the 2018 Spark + AI Summit
Contributed By - Hari Gottipati
More than 4,000 people attended the 2018 Spark + AI Summit, organized by Databricks at Moscone Center, San Francisco, from June 4th to June 6th. Though the summit is focused on Spark, many of the topics covered the entire big data and AI ecosystem. Close to 200 sessions were delivered by 225 speakers across 11 different tracks. The speakers came from various big companies (Google, Facebook, Apple, Microsoft, eBay, LinkedIn, Overstock, Tesla, Uber, etc.) and many startups.
Separate compute from storage for cost efficiency
While Hadoop promotes keeping compute and storage together, the rise of AI and ML workloads, which are unpredictable in nature, is driving their separation for cost efficiency. Predictable workloads justify permanent compute, but it’s a waste for unpredictable workloads, which are more on-demand and only require computing power once in a while. If you separate compute from storage, you can spin up compute nodes as needed instead of keeping them available all the time.
Also, Spark doesn’t get much advantage from keeping data and compute together, because it performs far less disk I/O; MapReduce, which writes intermediate data to disk, can definitely take advantage of data locality. The rise of Spark is thus also pushing the shift toward separating compute and storage.
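As an illustration, separated storage means a Spark job running on short-lived compute can read directly from remote object storage such as S3; a minimal sketch, where the bucket, path, and column names are all placeholders:

```python
from pyspark.sql import SparkSession

# Minimal sketch: Spark on ephemeral compute reading from remote object
# storage (S3 via the s3a connector). Bucket, path, and column names are
# hypothetical; credentials typically come from the environment.
spark = (
    SparkSession.builder
    .appName("separated-compute")
    .getOrCreate()
)

# Data lives in S3, not on the cluster's local disks, so the compute
# nodes can be torn down after the job without losing any data.
df = spark.read.parquet("s3a://example-bucket/events/")
df.groupBy("event_type").count().show()
spark.stop()
```

Because the data survives independently of the cluster, the same job can run tomorrow on a freshly provisioned (and differently sized) set of compute nodes.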
Solving analytics performance challenges
Lots of data is captured in the data lake, but we all know that data is not ready for analytics without an acceleration engine in between. SQL optimization engines (Impala, Drill, etc.) can enhance performance, but they are not ready for real-time analytics. Databricks announced Delta to address the performance issues we have been facing. It sits on top of your data lake and, by leveraging indexing and caching, promises to be 10-100X faster.
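A rough sketch of the "sits on top of your data lake" idea: raw data is written once into a Delta table, and subsequent reads go through Delta's indexing and caching layers. The paths below are hypothetical and this assumes a Databricks runtime with Delta available:

```python
from pyspark.sql import SparkSession

# Hypothetical sketch of Databricks Delta usage; paths are placeholders
# and a Databricks runtime with Delta support is assumed.
spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

# Raw data captured in the data lake.
events = spark.read.json("s3a://example-bucket/raw-events/")

# Write it into a Delta table that sits on top of the lake.
events.write.format("delta").save("/delta/events")

# Reads now go through Delta, which can apply indexing and caching.
clicks = (
    spark.read.format("delta").load("/delta/events")
    .where("event_type = 'click'")
)
clicks.count()
```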
Productionize ML workloads
Every developer wants to use the best ML library out there to improve their models. A 2% accuracy improvement could save your company millions of dollars. The problem, however, is that there is a myriad of frameworks, and new releases seem to arrive every month. In ML, you usually want to try every available algorithm to check for accuracy improvements, so you end up using a lot of libraries in the production environment. To address this, we need a framework/runtime environment that integrates and optimizes the popular ML libraries out there, acting as an out-of-the-box environment to deploy your ML models.
Another challenge is the ML life cycle. It is manual, cumbersome, inconsistent and disconnected. Often data prep is handled by data engineers (IT), model development by data scientists (IT or LOB), and deployment by the IT team. Data scientists build models and iterate through different variations before moving into production, and often the final model is re-implemented by the IT team for production. Also, in ML it is difficult to track which parameters, code, and data went into each iteration to produce a model; without tracking, it is very hard to make the code work again. To address all these concerns, Databricks announced an open source project called MLflow.
MLflow
MLflow is an open source platform for the machine learning lifecycle. It works with any ML library or language, and you can even use it with your existing code. Currently, it has 3 components:
MLflow Tracking - log and query experiments (code, data, config, results) using Python or REST.
MLflow Projects - packaging format for ML code in a reusable and reproducible way to run on any platform.
MLflow Models - a standard format for sending models to diverse deployment tools.
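As a sketch of the Tracking component, MLflow's Python API lets each training run record the parameters and metrics that produced it; the hyperparameter names and metric values below are purely illustrative:

```python
import mlflow

# Minimal MLflow Tracking sketch: each run logs the parameters and
# metrics that went into a model, so experiments become reproducible.
# All names and values here are illustrative, not from a real project.
with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)   # hypothetical hyperparameter
    mlflow.log_param("num_trees", 100)
    mlflow.log_metric("accuracy", 0.92)       # hypothetical result
    # mlflow.log_artifact("model.pkl")        # attach the trained model file
```

Logged runs can then be queried later through the same Tracking API or its REST interface, which addresses the "which parameters, code, and data produced this model?" problem described above.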
Run Spark workloads natively on a Kubernetes cluster
The latest version of Spark (2.3) has a lot of features, and one of the notable ones is Kubernetes support. Two popular open source projects, Apache Spark and Kubernetes, combine their functionality to launch Spark workloads natively on a Kubernetes cluster, leveraging the Kubernetes scheduler backend.
Traditionally, big data processing workloads have been run on the YARN/Hadoop stack. However, unifying the control plane on Kubernetes simplifies cluster management and allows you to leverage Kubernetes extensibility features, such as custom resources and custom controllers.
When you submit a Spark application to a Kubernetes cluster, Spark creates the Spark driver within a Kubernetes pod. The driver then creates the executors, which also run within Kubernetes pods and execute the application code. When the application completes, the executor pods terminate and are cleaned up, but the driver pod remains in “completed” status until it is garbage-collected or manually cleaned up. This model is very suitable for unpredictable workloads, where you create resources on demand and release them once processing completes.
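A submission against a Kubernetes master looks roughly like the following (the API server address and container image are placeholders, and the bundled SparkPi example stands in for a real application):

```shell
# Sketch of submitting Spark's bundled SparkPi example to a Kubernetes
# cluster (Spark 2.3+). The master URL and container image are placeholders.
bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar
```

The `k8s://` master URL is what routes the submission to the Kubernetes scheduler backend, which then creates the driver and executor pods described above.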
Stream processing enhancements in Spark
Continuous Processing - a new execution mode introduced in Spark 2.3 that executes streaming queries with millisecond-level low latency by changing only a single line of user code.
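The "single line" is the trigger: switching an existing micro-batch query to continuous mode only requires adding `trigger(continuous=...)`. A sketch with a Kafka source and sink, where servers, topics, and the checkpoint path are placeholders:

```python
from pyspark.sql import SparkSession

# Sketch of Spark 2.3 continuous processing. Kafka bootstrap servers,
# topic names, and the checkpoint path below are all placeholders.
spark = SparkSession.builder.appName("continuous-sketch").getOrCreate()

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")
    .option("subscribe", "input-topic")
    .load()
)

query = (
    stream.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092")
    .option("topic", "output-topic")
    .option("checkpointLocation", "/tmp/checkpoint")
    .trigger(continuous="1 second")   # the single-line change
    .start()
)
```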
Stream-Stream Joins - structured streaming in Spark 2.0 supported joins between a streaming DataFrame/Dataset and a static one, but 2.3 introduces stream-to-stream joins. This allows joining two streams of data, buffering messages until the matching condition between the two streams is met. For example, an ad-impression stream and an ad-click stream share a common id, and you would like to run streaming analytics such as which impression led to a click.
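The buffering idea can be illustrated with a small pure-Python sketch (this is not Spark's implementation, just the concept): events from both streams arrive interleaved, and each side is buffered by id until the matching event from the other side shows up.

```python
def join_streams(events):
    """Toy stream-stream join. Events arrive interleaved as
    (kind, ad_id, payload) tuples, where kind is "impression" or "click".
    Each side is buffered by id until its match arrives; matches are
    emitted as (ad_id, impression_payload, click_payload)."""
    impressions, clicks = {}, {}
    matches = []
    for kind, ad_id, payload in events:
        if kind == "impression":
            if ad_id in clicks:                   # the click arrived first
                matches.append((ad_id, payload, clicks.pop(ad_id)))
            else:
                impressions[ad_id] = payload      # buffer until a click
        elif kind == "click":
            if ad_id in impressions:              # the impression arrived first
                matches.append((ad_id, impressions.pop(ad_id), payload))
            else:
                clicks[ad_id] = payload           # buffer until an impression
    return matches

events = [
    ("impression", "ad1", "banner"),
    ("click", "ad2", "user7"),        # click arrives before its impression
    ("click", "ad1", "user3"),
    ("impression", "ad2", "sidebar"),
]
print(join_streams(events))
# -> [('ad1', 'banner', 'user3'), ('ad2', 'sidebar', 'user7')]
```

Spark's real implementation additionally lets you bound these buffers with watermarks and time-range conditions so that unmatched state does not grow forever.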