Engineering a Performant Machine Learning Pipeline: From Dask to Kubeflow

Abstract: The lifecycle of any machine learning model, classical or deep, consists of (a) pre-processing, transforming, and augmenting the data, (b) training the model with different hyperparameter values and learning rates, and (c) computing results on new data or test sets. Whether you use transfer learning or a from-scratch model, this process requires a large amount of computation, management of your experimental process, and a quick way to review the results of your experiments. In this workshop, we will learn how to combine off-the-shelf clustering software such as Kubernetes and Dask with learning systems such as TensorFlow, PyTorch, and scikit-learn, on cloud infrastructure such as AWS, Google Cloud, or Azure, to construct a machine learning system for your data science team. We'll start with an understanding of Kubernetes, move on to analysis pipelines in scikit-learn and Dask, and finally arrive at Kubeflow. Participants should install minikube on their laptops (https://kubernetes.io/docs/tasks/tools/install-minikube/) and create Google Cloud accounts.

Bio: Richard Kim is the founder and CEO of Markov Lab, an AI startup that explores the application of probabilistic modeling and deep learning to the analysis and prediction of financial data. Richard is a Chartered Financial Analyst (CFA) with years of experience in fundamental equity research and in academic artificial intelligence research at MIT. He earned his Master of Science from the Massachusetts Institute of Technology, where he authored several papers on computational cognitive models of ethical decision-making for autonomous vehicles, one of which, “The Moral Machine experiment,” was published in the October 2018 issue of Nature.