By Erik Gfesser in case study — Oct 13, 2020

Building a Data Platform on AWS from Scratch: Part 2

This article is the second in a multi-part case study series discussing the data platform my team and I built on AWS for the CIO of a Chicago-based global consultancy. As explained in the first part, one goal for this platform was to centralize data assets and data analysis across the firm using a common architecture, building on top of it to meet the use case needs of each organization. Additionally, business units and consultancy practices had become siloed, each making use of disparate processes and tooling, so in addition to centralizing data assets and data analysis, another goal was to implement the concept of data ownership, enabling the sharing of data across organizations in a secure, consistent manner.

The conceptual architecture comprising data pipelines, data lake, data models, visualizations, and machine learning models is laid out in this previous piece. Alongside this discussion, key guidelines from the CIO are considered, including organization by something called an "insight zone", as well as industry terms such as bronze, silver, and gold data as these relate to the data pipeline stages built for the platform. Finally, a time-sensitive look at the tech stack selection process is provided, alongside the technologies we ended up implementing for the platform minimum viable product (MVP). So in order to get the most out of this article, it might make sense to read through these discussions first.

Rather than walking through details of the implementation architecture, the goal of this second article is to walk through some architectural discussion points relating to what was implemented for the remaining build effort. These points are presented here in somewhat of a piecemeal fashion, albeit with ties back to the first article, by elaborating on some of the initial architectural decisions that were made, how these decisions held up during the course of the project, and aspects of the build-out one might want to consider as they tackle their own data platform challenges. As before, in order to cater this article to a more diverse audience an attempt has been made here to explain from the context of both data and software development work, as many traditional practitioners in these areas do not have significant experience in both.

This post is for subscribers only

Already have an account? Sign in.

This post is for subscribers only

Subscribe to Erik on Software