Data Pipeline Frameworks: The Dream and the Reality | Beeswax

Download the slides: https://www.datacouncil.ai/talks/data-pipeline-frameworks-the-dream-and-the-reality

ABOUT THE CONVERSATION:

There are several commercial, managed-service, and open source data pipeline frameworks on the market. In this talk, we’ll discuss two: the AWS Data Pipeline managed service and the open source framework Airflow. These frameworks have very different feature sets and operational models, but they’ve both helped and failed us in similar ways.

To understand the reasons, we analyze our experience of first building a data processing platform on Data Pipeline and then developing the next-generation platform on Airflow. We find that both the managed service and the open source framework are leaky abstractions, and therefore both required us to understand and build primitives to support deployment and operations.

Similarly, we discuss the need to implement cross-cutting aspects such as logging, monitoring, security, and configuration, which arises from the shortcomings of existing, pre-implemented components. Generalizing from specific pain points and solutions, we argue that almost every organization building a data platform using a pipeline framework or service will run into many of the same issues, as idiosyncratic framework/service implementations will conflict with an organization’s existing code, preferences, and procedures.
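
As a rough illustration of what implementing cross-cutting aspects yourself can look like, here is a minimal, framework-agnostic sketch in Python. It is not taken from the talk and uses neither the Airflow nor the AWS Data Pipeline API; the names `with_cross_cutting` and `load_events` and the `batch_size` setting are hypothetical. It wraps a plain task function so that logging, timing, and configuration are applied the same way whichever framework ends up calling it.

```python
import functools
import logging
import time

logger = logging.getLogger("pipeline")


def with_cross_cutting(task_name, config):
    """Hypothetical helper: add logging, timing, and config to a task callable."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            logger.info("task=%s status=started", task_name)
            try:
                # Configuration is injected here instead of being read
                # from framework-specific settings inside the task.
                result = func(*args, config=config, **kwargs)
            except Exception:
                # Log the traceback, then re-raise so the caller still sees the failure.
                logger.exception("task=%s status=failed", task_name)
                raise
            elapsed = time.monotonic() - start
            logger.info("task=%s status=succeeded seconds=%.3f", task_name, elapsed)
            return result
        return wrapper
    return decorator


@with_cross_cutting("load_events", config={"batch_size": 2})
def load_events(rows, config):
    """Example task: split rows into batches of the configured size."""
    size = config["batch_size"]
    return [rows[i:i + size] for i in range(0, len(rows), size)]
```

Because the wrapped function is an ordinary callable, it could in principle be handed to whatever task mechanism a framework provides, which keeps the organization's logging and configuration conventions in one place.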

So where do you draw the line? What value can you expect from a data pipeline framework or service? What do you need to package, integrate, or fully implement yourself? To build a robust data pipeline platform for your organization, you need to bridge the gap between the dream of the framework and the reality of production. This talk will help you do just that.

ABOUT THE SPEAKER:

Mark Weiss is a Senior Software Engineer at Beeswax, the online advertising industry's first extensible programmatic buying platform, where he focuses on designing and building data processing infrastructure and applications to support reporting and machine learning. He has previously held various individual contributor and leadership roles and has spent much of his career working on ETL systems and data-driven distributed platforms. Mark has previously spoken at DataEngConf NYC and is a frequent speaker and moderator at the NYC Python Meetup. He also blogs and hosts the podcast "Using Reflection" at http://www.usingreflection.com and can be found on GitHub, Twitter, and LinkedIn at @marksweiss. He lives in Brooklyn, NY.

ABOUT DATA COUNCIL:
Data Council (https://www.datacouncil.ai/) is a community and conference series that provides data professionals with the learning and networking opportunities they need to grow their careers. Be sure to subscribe to our channel for more videos, including DC_THURS, our series of live online interviews with leading data professionals from top open source projects and startups.

FOLLOW DATA COUNCIL:
Twitter: https://twitter.com/DataCouncilAI
LinkedIn: https://www.linkedin.com/company/datacouncil-ai

Please feel free to share this video with your friends and family if you found it useful.