Spark Summit 2016 – Key Highlights

I recently had the chance to meet many interesting people at Spark Summit 2016 and learned quite a bit about technology updates and where the industry is heading. In this post, I would like to summarize my key takeaways from the event.

The Spark Summit 2016 keynote focused heavily on Deep Learning (DL). Jeff Dean of Google's TensorFlow project showcased how Google uses DL in most of its products: instant replies in the Inbox app, Google Photos suggesting text related to photos, real-time language translation from images, and solar panel suggestions for your home based on an analysis of your rooftop. Google has also provided APIs so that the community can use its DL models to solve critical business problems without spending time reinventing the wheel.

Here are some links if you would like to delve deeper into some of Google's DL products:

Project Sunroof

Vision API – Image Content Analysis

Cloud Machine Learning – Predictive Analytics

DL keeps showing up in more day-to-day products, and those products keep getting better. Jeff even claimed that 60% of the replies in the Inbox mail app currently go through Smart Reply, which relies extensively on Deep Learning. Isn't that amazing?

It is not just Google. Andrew Ng, Chief Scientist at Baidu and co-founder of Coursera, also shared many of the data products he is building that make extensive use of DL. Needless to say, AI is going to revolutionize many industries, including healthcare, industrial, manufacturing, and transportation.

New features in Spark 2.0 & MLlib 2.0

Structured Streaming, which combines streaming and interactive analysis

Tungsten phase 2, with speedups of 5-20x

Unification of Datasets and DataFrames

The DataFrame-based API will become primary, but the RDD-based API will still exist in maintenance mode

Expansion of the Python and R APIs

Model persistence

MLlib for exploratory data analysis

The following new algorithms have made it into 2.0:
- Generalized Linear Models
- Approximate counting of distinct elements
- Approximate quantiles
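To give a feel for how approximate counting of distinct elements can work, here is a minimal HyperLogLog sketch in plain Python (HyperLogLog is the algorithm named in the linked post below). This is a toy illustration, not Spark's implementation: the class name, the choice of SHA-1 as the hash function, and the default of 12 index bits are all assumptions made for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog: estimates a distinct count in fixed memory.

    Hypothetical illustration only; not the code Spark uses.
    """

    def __init__(self, p=12):
        self.p = p                        # number of bits used to pick a register
        self.m = 1 << p                   # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant (this formula is valid for m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Derive a 64-bit hash from the first 8 bytes of SHA-1.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(100000):
        hll.add(i % 20000)    # 20,000 distinct values, each seen five times
    print(round(hll.count()))
```

With 4,096 registers the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%, and memory use stays constant no matter how many elements are added.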

Customizing ML pipelines:
- 29 feature transformers (Tokenizer, Word2Vec)
- 21 models (for classification, regression, clustering)
- Model tuning & evaluation
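The pipeline idea above (stateless transformers, estimators that are fitted to data, and a pipeline that chains them) can be sketched in a few lines of plain Python. This is a conceptual toy, not the Spark ML API: the classes below only borrow familiar names, operate on lists of dicts rather than DataFrames, and their constructors and column arguments are assumptions made for the illustration.

```python
class Tokenizer:
    """Transformer: stateless, adds a column of lowercase tokens."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def transform(self, rows):
        return [{**r, self.output_col: r[self.input_col].lower().split()}
                for r in rows]


class CountVectorizer:
    """Estimator: learns a vocabulary in fit() and returns a fitted model."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        vocab = sorted({tok for r in rows for tok in r[self.input_col]})
        return CountVectorizerModel(vocab, self.input_col, self.output_col)


class CountVectorizerModel:
    """Fitted transformer: maps token lists to count vectors."""

    def __init__(self, vocab, input_col, output_col):
        self.vocab = vocab
        self.input_col, self.output_col = input_col, output_col

    def transform(self, rows):
        return [{**r, self.output_col: [r[self.input_col].count(w)
                                        for w in self.vocab]}
                for r in rows]


class PipelineModel:
    """A fitted pipeline: applies each fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Chains stages; fits estimators on the output of earlier stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):       # estimator: fit it first
                stage = stage.fit(rows)
            rows = stage.transform(rows)    # feed the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


if __name__ == "__main__":
    train = [{"text": "Spark is fast"}, {"text": "spark spark summit"}]
    pipeline = Pipeline([Tokenizer("text", "words"),
                         CountVectorizer("words", "features")])
    model = pipeline.fit(train)
    print(model.transform([{"text": "Spark summit spark"}]))
```

The point of the design is that the fitted pipeline is itself a single transformer, so the same chain of steps is applied identically at training and at prediction time.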

Other interesting talks related to Data Science

Huohua: distributed time series analysis by Two Sigma
- Time series RDD in Huohua
- Temporal joins
- Group function on time series data
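One common form of temporal join is the "as-of" join, where each row on the left is matched with the most recent row on the right at or before its timestamp. The sketch below shows that idea in plain Python on sorted lists. Whether this matches Huohua's exact join semantics is an assumption, and the function name, the tuple layout, and the `tolerance` parameter are hypothetical.

```python
from bisect import bisect_right


def asof_join(left, right, tolerance=None):
    """As-of join of two time series.

    left, right: lists of (timestamp, value) tuples sorted by timestamp.
    For each left row, attach the value of the latest right row whose
    timestamp is <= the left timestamp (and within `tolerance`, if given).
    Rows without a match get None.
    """
    right_ts = [ts for ts, _ in right]
    joined = []
    for ts, left_value in left:
        # Index of the last right row with timestamp <= ts, or -1 if none.
        i = bisect_right(right_ts, ts) - 1
        if i >= 0 and (tolerance is None or ts - right_ts[i] <= tolerance):
            joined.append((ts, left_value, right[i][1]))
        else:
            joined.append((ts, left_value, None))
    return joined


if __name__ == "__main__":
    trades = [(1, "a"), (5, "b"), (10, "c")]
    quotes = [(2, 100), (4, 101), (9, 102)]
    for row in asof_join(trades, quotes):
        print(row)
```

Because both inputs are sorted by time, each lookup is a binary search. A distributed version would additionally need to partition both series by time range so that matching rows land on the same worker.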

Elasticsearch-hadoop project

Apache SystemML project is going strong

Baidu has built the Parallel Asynchronous Distributed Deep Learning Engine (PADDLE), with CPU and GPU support, to run vision, speech, and NLP workloads at scale

Automatic feature generation and model training on Spark using a Bayesian approach showed many interesting optimization opportunities in hyperparameter tuning

The Red Hat team showed how they analyze log data to find anomalies and reduce false alarms using techniques such as ensembles of decision trees and self-organizing maps

Overall, the summit offered great exposure to the diverse initiatives under way in Spark and to how companies are positioning their needs amid the Industrial Internet boom. The coming months will be interesting to watch as data science opens up new use cases for users.

Important links

Apache Spark MLlib 2.0 Preview: Data Science and Production

Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles

Spark 2.0 will offer Interactive Querying of Live data

Bayesian optimization for Hyperparameter tuning

#Spark #Spark-summit #scala #R #Python #DiggBigg #big-data