Introduction to Building Distributed Neural Networks on Apache Spark with BigDL and Analytics Zoo

Abstract: In this training session you will get hands-on experience developing neural networks with Intel BigDL and Analytics Zoo on Apache Spark. You will learn how to use Spark DataFrames and build deep learning pipelines by working through practical examples.
Target Audience: AI developers and aspiring data scientists experienced in Python and Spark, as well as big data and analytics professionals interested in neural networks.

Prerequisites:
• Experience in Python programming
• Entry level knowledge of Apache Spark
• Basic knowledge of deep learning concepts and techniques

Training outline:

Introduction to Deep Learning on Spark, BigDL and Analytics Zoo - 25 minutes
We will begin with a brief introduction to Apache Spark and the machine learning/deep learning ecosystem around it. Then we will introduce Intel BigDL and Analytics Zoo, two deep learning libraries for Apache Spark, and go into the architectural details of how distributed training happens in BigDL. We will cover the model training process, including how the model, weights, and gradients are distributed, calculated, updated, and shared across the Spark cluster.
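To make the training scheme concrete before the session, here is a minimal pure-Python sketch of synchronous data-parallel gradient descent: each data shard computes a local gradient, the gradients are aggregated by averaging, and every worker applies the same update. This is an illustration of the general pattern only, not BigDL's actual implementation or API, and all names are hypothetical.

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of mean-squared-error loss on one data shard,
    analogous to what each Spark partition computes locally."""
    return X.T @ (X @ w - y) / len(y)

def distributed_step(w, shards, lr):
    """One synchronous step: every shard computes its gradient,
    the gradients are averaged (the aggregation step), and the
    shared weights are updated identically everywhere."""
    grads = [local_gradient(w, X, y) for X, y in shards]
    return w - lr * np.mean(grads, axis=0)

# Toy linear-regression data, split into 4 "partitions".
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
y = X @ true_w
shards = [(X[i::4], y[i::4]) for i in range(4)]

w = np.zeros(2)
for _ in range(100):
    w = distributed_step(w, shards, lr=0.1)
```

In BigDL the aggregation happens inside the Spark cluster on every mini-batch; the sketch only captures the compute-aggregate-update rhythm.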

Setting Up Sample Environment - 10 minutes
The instructors will highlight the major components of the demonstration environment, including the dataset, Docker container, and example code, along with the public location of these resources and how to set them up.

Exercise 1 - Quick and simple image recognition use case with BigDL - 45 minutes
We will work through a simple image recognition use case that trains a convolutional neural network (CNN). The goal of this exercise is a gentle introduction to using BigDL with image datasets. Participants will get exposure to:
• How to read images into Spark DataFrames
• Building transformation pipelines for images with Spark
• How to train a deep learning model using estimators
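To preview what an image transformation pipeline looks like conceptually, here is a pure-NumPy sketch that chains preprocessing steps (crop, normalize, convert to channel-first tensor layout). The function names are hypothetical illustrations of the idea, not the BigDL/Analytics Zoo preprocessing API.

```python
import numpy as np

def center_crop(img, size):
    """Crop an (H, W, C) image to (size, size, C) around the center."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def normalize(img, mean, std):
    """Scale pixel values, as is typically done before training."""
    return (img - mean) / std

def to_chw(img):
    """Reorder (H, W, C) to the (C, H, W) layout most frameworks expect."""
    return img.transpose(2, 0, 1)

def chain(*transforms):
    """Compose transforms left to right, mirroring how deep learning
    libraries let you chain image preprocessing steps."""
    def pipeline(x):
        for t in transforms:
            x = t(x)
        return x
    return pipeline

pipeline = chain(lambda im: center_crop(im, 4),
                 lambda im: normalize(im, 127.5, 127.5),
                 to_chw)
img = np.arange(6 * 6 * 3, dtype=float).reshape(6, 6, 3)
tensor = pipeline(img)   # shape (3, 4, 4)
```

In the exercise, the same chaining idea is applied to whole DataFrames of images rather than single arrays.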

Exercise 2 - Transfer Learning for Image Classification Models - 45 minutes
Participants will get exposure to:
• How to build a pipeline in Spark to preprocess images
• How to import a trained model from other frameworks such as TensorFlow
• How to implement transfer learning on the imported model with the preprocessed images
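Conceptually, transfer learning freezes an imported feature extractor and trains only a new task-specific head on top of it. The following pure-Python sketch illustrates that split (a fixed "pretrained" feature map plus a trainable linear head); it illustrates the idea only and is not the Analytics Zoo API.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Pretrained" feature extractor: frozen, its weights are never updated.
W_frozen = rng.normal(size=(10, 4))
def features(X):
    return np.tanh(X @ W_frozen)

# Task data: labels depend on the frozen features through some true head.
X = rng.normal(size=(300, 10))
true_head = np.array([1.0, -2.0, 0.5, 3.0])
y = features(X) @ true_head

# Train only the new head; the extractor stays fixed (the "freeze").
head = np.zeros(4)
F = features(X)
for _ in range(200):
    grad = F.T @ (F @ head - y) / len(y)
    head -= 0.3 * grad

loss = float(np.mean((F @ head - y) ** 2))
```

The exercise applies the same pattern to a real imported network: most layers stay frozen, and only the final layers are retrained on the preprocessed images.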

Quick break: Answer questions or help out anyone who is having trouble - 10 minutes

Exercise 3 - Anomaly Detection or Recommendation system with Intel Analytics Zoo - 30 minutes
In this exercise we will show participants:
• How to build an initial pipeline for feature transformation
• How to build a recommendation model in BigDL/Analytics Zoo
• How to perform training and inference for this use case
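The flavor of this exercise can be previewed with a tiny pure-NumPy recommender: raw categorical user/item IDs are transformed into indices (the feature transformation step), embeddings are learned from observed ratings, and inference scores user-item pairs. This is a conceptual sketch, not the Analytics Zoo recommendation API, and all data is made up.

```python
import numpy as np

# Feature transformation: map raw categorical IDs to integer indices.
ratings = [("alice", "star_wars", 5.0), ("alice", "titanic", 1.0),
           ("bob", "star_wars", 4.0), ("bob", "alien", 5.0),
           ("carol", "titanic", 5.0), ("carol", "alien", 1.0)]
users = {u: i for i, u in enumerate(sorted({r[0] for r in ratings}))}
items = {m: i for i, m in enumerate(sorted({r[1] for r in ratings}))}

# Tiny matrix-factorization model: score = user embedding . item embedding.
rng = np.random.default_rng(2)
dim = 4
U = rng.normal(scale=0.1, size=(len(users), dim))
V = rng.normal(scale=0.1, size=(len(items), dim))

# Training: SGD on squared error over the observed ratings.
for _ in range(1000):
    for name, movie, r in ratings:
        u, v = users[name], items[movie]
        err = U[u] @ V[v] - r
        U[u], V[v] = U[u] - 0.05 * err * V[v], V[v] - 0.05 * err * U[u]

# Inference: score a user-item pair with the learned embeddings.
pred = float(U[users["alice"]] @ V[items["star_wars"]])
```

The session's version builds the transformation pipeline on Spark DataFrames and uses a proper neural recommendation model, but the transform-train-infer flow is the same.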

Exercise 4 - Model Serving - 15 minutes
In this exercise we will show participants how to build an end-to-end pipeline and put their model into production. They will get exposure to:
• Model serving using POJO API
• Integration into web services and streaming services like Kafka for model inference
• Distributed model inference
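To convey the shape of the serving step, here is a minimal pure-Python sketch of the pattern a POJO-style serving API enables: load the model once, then score individual records as they arrive from a web request or a Kafka stream. All class and method names are hypothetical; this is not the actual Analytics Zoo serving API.

```python
import numpy as np

class LocalPredictor:
    """Loads trained weights once, then serves single records locally,
    mimicking the load-once / predict-many pattern of POJO-style serving."""

    def __init__(self, weights):
        self.weights = np.asarray(weights, dtype=float)

    def preprocess(self, record):
        # In a real pipeline this mirrors the training-time transforms.
        return np.asarray(record, dtype=float)

    def predict(self, record):
        return float(self.preprocess(record) @ self.weights)

# Simulate consuming a stream (e.g. messages off a Kafka topic) and
# scoring each record as it arrives.
predictor = LocalPredictor(weights=[0.5, -1.0, 2.0])
stream = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [4.0, 0.0, 1.0]]
scores = [predictor.predict(record) for record in stream]
```

Distributed inference follows the same idea, with the predictor replicated across workers so each partition of the stream is scored in parallel.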

Practical Knowledge - Discussion of practical experience using Spark and Hadoop for machine learning and deep learning projects - 15 minutes
We will have a discussion on the following topics:
• Spark parameters and how to set them: how to allocate the right number of executors, cores, and memory
• Performance Monitoring
• TensorBoard with BigDL
• Collaboration and reproducing experiments with a data science workbench tool.
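As a starting point for the Spark-parameter discussion, a typical spark-submit launch for a training job looks like the following. The specific values are illustrative placeholders to tune per cluster, not recommendations, and train_model.py is a hypothetical script name.

```shell
# Illustrative resource sizing; the right numbers depend on the cluster,
# dataset size, and model. Total cores = num-executors * executor-cores.
spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 8 \
  --executor-memory 32g \
  --driver-memory 8g \
  train_model.py
```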

Wrapping up / Questions - 15 minutes

Bio: Yuhao Yang is a senior software engineer on the Intel Big Data team, focusing on deep learning algorithms and applications. His area of focus is distributed deep learning/machine learning, and he has accumulated extensive solution experience in areas including fraud detection, recommendation, speech recognition, and visual perception. He is also an active contributor to Apache Spark MLlib (GitHub: hhbyyh).