My takes on Evolving Data Pipelines/ETLs in 2024
I work on a product that provides analytics over the customer's data, and I have done a lot of massaging customer's data to fit it into the database of the product. Combined with what I experienced/observed while at my previous job which was also a data intensive startup, I found myself opinionated on what works/what scales/what is sustainable in the realms of Data Pipelines and ETL processes. I'm dropping a brain dump as a 1) writing practice 2) something to reference back in a few years for fun.
How it starts
At the earliest stage of a data analytics product, there probably is only a few files that is shared by the customer over email and a script running on a laptop to process the file. The script loops over records in the file, applying some logic and inserting into the primary db. The product provides value over customer's data, and everyone is happy. This gets repeated to accomodate the next few customers, writing a different script that caters to the customr's data every time. This is 100% the correct way to do things at an early stage because automating something that won't pan out has a lot of waste. Subscribing to the phrases like "premature optimizations is evil" and "do the things that don't scale" https://gyazo.com/4bfa1894f4e1d1a099c678ef1504dbc0
Automations
Then comes a time where manually running the script becomes unsustanable.
Reasons could be:
# of customers grow
Frequency of authoring file grows
more data processing needed for product feature
e.g. On day 1 product only was about data A, but now product supports analytics on data B.
More logic to run for each customer
This necessitates automation where the submission of file triggers some process that runs the processing logic that once used to be running on the laptop. It can look like a file arriving in a s3 bucket triggering a lambda, which extracts the data, applies logic and inserts data into the db. At this stage distinct logic would be running for each customer; it is just replacing scripts on laptops with stuff on the server/cloud.
https://gyazo.com/0bfe56131436f9aad1849182ad106108
Format/logic variety
Then comes a time time where writing said scripts for each customer becomes hard. The difficulty arises from different customers submiting data in different formats and schema, and the variety outnumbers the dev's capacity. The dev won't have sufficient context to interpret the customer's data to write the script correctly, and any clarification might turn into a few day turnaround to get answers (i.e. maybe an expert knows, but also maybe the expert has to ask the customer to get an answer). Frustration builds up across the board because things are not moving as fast as possible.
https://gyazo.com/5fb87cda64d0f5048155c2f999b9fbec
Here! This is where there are choices, and where I have an opinion.
Keep producing scripts
Maybe dev becomes an expert, and writes scripts faster
Or maybe the expert becomes the dev and starts writing scripts
Or build something that empower/enable the expert/customer to get data into the db without the dev
My opinion is at this point is, something should be built. OK. What to build?
Instead of a script, build decoupled components that would do the steps of ETL separately, that can be operated/configured by the expert/customer. And build it in the order of L->T->E because that is the order of proximity to value, and later as I discuss in later sections later component can immediately take advantage of the component built prior.
The system
Buiding the loading system
Builind a user operable (not a dev) loading system first is beneficial beause it is immediately valuable.
The expert can author a file(with a pre-determined format) on their laptop and immediately get it in the product.
Definitely not the only way, but think a system that pretty much psql \copy the given data into the target table.
To empower a non dev to author said file, I found duckdb to be an extremely exciting solution if the expert is willing to work some sql. It can query excel files and write results in another format that is db friendly like csv/parquet. Builing the transform system
The transform system works on a set of files in a known format, applies the transform and produces input that the loading system can take. The transform should be fully user defined.
To achieve that either:
have a catalogue of incoming data fully defining the data types/semantics of the incoming data.
some configuration slot to supply the transformation logic
sql is great choice for this:
by nature it is a language for transforming a set of rows into another
vs. accomodating programming language scripts requires sanitization/learning curve for a non dev
Building the extract system
The extract system absorbs difference between file formats(csv/xlsx/xml...) and how files are received(sftp, email...), and produces inputs to the transform system. If the transform system works on parquet files, the extract system's job is to map whatever comes into a parquet file, be it xmls in a zip, csv, delivered via sftp or email.
The permutations of file formats and deliveries are actually manageable because after all there are only so many file formats, and often converting one to another is a solved problem. Also by definition of sharing files, the mechanism needs to be widely adopted - a file sharing method is not suitable for sharing file if there aren't enough participants. All of this is to say, the extraction specification should be possible to be reduced to a handful of options.
Beyond ETL
Data augmentation
Sanitization