24/7 Data Pipeline Without Pager Duty

Abstract: Data Engineers: This talk is for you! It is commonly reported that data scientists spend only 20% of their time doing analysis, and 80% of their time finding, modeling, and cleaning data. These tasks are critical to enable accurate feature engineering and avoid a garbage-in, garbage-out situation. Even more time consuming is building a stable data pipeline (ETL) to automate the delivery of valid data for continuous retraining and writing new predictions.

In this session, we’ll discuss how to build a reliable, highly available ETL system that doesn’t require babysitting. Combining modern DevOps practices and cloud technologies, we’ve built a robust late-binding ETL system that scales with our growing business without sacrificing reliability or availability. We’ll talk about ingesting unstructured, semi-structured, and relational data, and the inherent challenges of automating this. As a use case, attendees will learn how our system consumes data from a variety of sources (raw files, databases, third-party APIs, etc.) and deposits it in our massively parallel processing (MPP) database, Snowflake.

As an analytics professional, you know that trust and maintenance are important challenges to handle. By the end of this session, the audience will learn what it means to have high confidence in their data, because the pipeline is self-healing: failing jobs are retried and schema changes are propagated automatically from their source. At CarGurus, we've worked to create active process monitoring and alerting that everyone can see, even our stakeholders. That helps us build trust while making sure that those inevitable occasional off-hours problems are visible and easy to triage. We combine query tagging, logging, and Slack alerts to have a clear picture of exactly which component of our process is responsible for any failure, and why. System changes, bug fixes, and enhancements can be built and deployed automatically, usually in under 5 minutes.
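As a taste of the self-healing behavior described above, here is a minimal sketch of retrying a failing job with exponential backoff. The decorator name, retry policy, and the `load_raw_files` job are illustrative assumptions, not the actual CarGurus implementation:

```python
import functools
import logging
import time

def self_healing(max_attempts=3, base_delay=1.0):
    """Retry a failing pipeline job with exponential backoff.

    Illustrative sketch: real systems would also distinguish
    transient errors from permanent ones before retrying.
    """
    def decorator(job):
        @functools.wraps(job)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return job(*args, **kwargs)
                except Exception as exc:
                    logging.warning("job %s failed (attempt %d/%d): %s",
                                    job.__name__, attempt, max_attempts, exc)
                    if attempt == max_attempts:
                        raise  # surface the failure for alerting
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator

@self_healing(max_attempts=3, base_delay=0.1)
def load_raw_files():
    # placeholder for an ingestion step that may fail transiently
    ...
```

A failed run is retried with growing delays, and only the final failure propagates, so monitoring alerts fire once rather than on every transient hiccup.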
That means that when data scientists want to look at additional data, they can grab it immediately from a production source. We’ll discuss how to build this deployment system using off-the-shelf components in AWS, Docker, Jenkins, and Terraform.
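For readers unfamiliar with the query tagging mentioned in the abstract: Snowflake supports a session-level QUERY_TAG parameter, and one common approach is to set it to a small JSON blob before each job runs, so every query in the warehouse is traceable to a pipeline component. A hedged sketch follows; the tag fields and helper name are assumptions for illustration:

```python
import json

def query_tag_statement(pipeline, job, run_id):
    """Build the ALTER SESSION statement that tags all subsequent
    queries in this Snowflake session with pipeline metadata."""
    tag = json.dumps({"pipeline": pipeline, "job": job, "run_id": run_id})
    # Escape single quotes so the tag is safe inside the SQL string literal.
    return "ALTER SESSION SET QUERY_TAG = '{}'".format(tag.replace("'", "''"))

# Executed on the connection before a job's queries, e.g.:
# cursor.execute(query_tag_statement("etl", "load_orders", "run-42"))
```

With tags like this in place, Snowflake's query history can be filtered by pipeline and job, which is what makes it possible to pin a failure on a specific component.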

Bio: As Vice President of Analytics and Process Innovation, Greg oversees data engineering, business intelligence, and enterprise applications at CarGurus. Prior to joining CarGurus, Greg was head of analytics platform engineering at AthenaHealth, where he and his team built a platform to support large-scale machine learning applications, insight generation, and data-driven decision making. Greg holds a BA in economics from Brandeis University.