Machine Learning and Apache Spark Workshop
by Vincent Van Steenbergen
Description
Big Data has been around for many years as the answer to the challenges posed by today's massive datasets. The initial technologies were disruptive compared to legacy stacks, but they are now showing their age; in particular, their poor usability is slowing their adoption in the global market. Furthermore, Data Science is now understood to be the discipline required to get real value out of good data management and processing. This, however, brings new problems to the table, shifting the need from ETL to recurrent or stream processing. Apache Spark has emerged with a new, disruptive model that allows businesses of all kinds to work easily with distributed technologies and process their Big or Fast Data.
Hence, this course covers in depth the concepts underlying the Apache Spark project. Although its model is simpler than that of other technologies, it is still essential to grasp the ideas and features in Apache Spark that allow any business to unleash the power of its infrastructure and/or data.
This course focuses on concrete, reproducible examples run interactively from the Spark Notebook. Not only is Spark Core extensively covered, but also the streaming and machine learning layers that are part of the overall project. Spark is an important piece of a modern architecture, but it cannot cover the whole pipeline on its own; that is why this seminar also tackles the Spark ecosystem, including its integration with the Apache projects Kafka, Cassandra and Mesos.
What you will learn
- Learn about distributed computing tools and paradigms
- Learn Apache Spark core concepts
- Learn Apache Spark SQL components (incl. DataFrame, Dataset and Tungsten)
- Learn about Distributed Machine Learning in Spark using MLlib and H2O
- Learn to use the Spark Notebook for fast, interactive and reactive Spark development
- Learn how to build a complete Distributed Data Science Pipeline
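The distributed computing paradigm behind these objectives, MapReduce, can be illustrated without a cluster. Below is a minimal single-machine sketch in plain Python; the function names `map_phase`, `shuffle` and `reduce_phase` are illustrative only and not part of any Spark API:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["spark makes big data simple", "big data big results"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # 3
```

In Spark the same computation is a one-liner over an RDD or Dataset, with the shuffle handled transparently across the cluster.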
Main Topics
- Introduction to distributed storage
- Concepts of distributed computing and MapReduce
- In-depth explanation of Spark Core
- Developing Spark applications using the DataFrame and Dataset APIs
- Handling Fast Data using Spark Streaming
- Adding Apache Kafka as a source for Spark Streaming
- Saving the processed views in Apache Cassandra
- Machine learning principles using MLlib
- Extending the machine learning capabilities using H2O
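The machine learning principles that MLlib applies at scale can be shown on a single machine. The sketch below fits a linear regression by gradient descent in plain Python, assuming noiseless data generated from y = 2x + 1; MLlib distributes the same gradient computation across the partitions of a dataset (the helper `fit_linear` is illustrative, not a Spark API):

```python
def fit_linear(xs, ys, lr=0.01, steps=2000):
    """Fit y = w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of MSE with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Noiseless data from y = 2x + 1: the fit should recover w ~ 2, b ~ 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear(xs, ys)
print(round(w, 2), round(b, 2))
```

On a cluster, each partition computes a partial gradient over its share of the data and the driver aggregates them, which is why this family of algorithms parallelizes so naturally in Spark.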