Apache Spark and Machine Learning

by Andy Petrella

Description

Big Data has been around for many years as the answer to the challenges posed by today's massive datasets. The initial technologies were disruptive compared to legacy stacks, but they are now showing their age; in particular, their poor usability is slowing their adoption in the global market. Furthermore, Data Science has come to be understood as the foundation needed to get value from good data management and processing. This, however, brings more problems to the table by shifting the needs from ETL to recurrent or stream processing. Apache Spark is emerging with a new disruptive model that allows all kinds of businesses to work easily with distributed technologies and process their Big or Fast Data.

This seminar therefore covers in depth the underlying concepts behind the Apache Spark project. Although the model is simpler than that of other technologies, it is still essential to grasp the ideas and features of Apache Spark that allow a business to unleash the power of its infrastructure and/or data. The seminar explains them through concrete, reproducible examples run interactively from the Spark Notebook. Not only Spark Core will be covered extensively, but also the streaming and machine learning layers that are part of the overall project. Spark is an important piece of a modern architecture, but it cannot cover the whole pipeline alone; that is why this seminar also tackles the Spark ecosystem, including its integration with the Apache projects Kafka, Cassandra and Mesos.

What you will learn

Learn about distributed computing tools and paradigms
Learn Apache Spark core concepts
Learn Apache Spark SQL components (incl. DataFrame, Dataset and Tungsten)
Learn about Distributed Machine Learning in Spark using MLlib and H2O
Learn to use the Spark Notebook for fast, interactive and reactive Spark development
Learn how to build a complete Distributed Data Science Pipeline

Main Topics

Introduction to distributed storage
Concepts of distributed computing, MapReduce
In-depth explanation of Spark Core
Developing Spark applications using the DataFrame and Dataset APIs
Handling Fast Data using Spark Streaming
Adding Apache Kafka as the source of Spark Streaming
Saving the processed views in Apache Cassandra
Machine learning principles using MLlib
Extending the machine learning capabilities using H2O