Data Engineering
01.
Getting started as a Data Engineer
Even though I graduated from the FIAP Data Engineering MBA course at the end of 2023, this is definitely a field I still want to learn much more about.
​
In the Data Analytics section, I talk about why knowing which questions to ask is one of the key skills of a good Data Analyst. I started my data career by thinking about these good questions, carrying out one-off analyses, and creating dashboards in Excel and existing Business Intelligence solutions.
Later down the road, when working with products that generated a lot of data, I often had a specialized team to help me with general data asks. Although it was nice to have someone doing the data work for me, it was a double-edged sword, as I would often find myself at the mercy of someone else who had thousands of other data requests to work on.
This was when my interest in SQL was sparked. My goal was to become independent from others, so I could get the information I needed myself if necessary, or double-check information provided by others. I can't express in words how useful this skill has been in my career.
​
When I joined Red Ventures in 2020, there was no Data Team, and Product Managers were responsible for everything. It was my opportunity to go one level deeper. There, I learned how to modify and create my own data models and investigate information in databases for different purposes.
​
Then, I found that sometimes I would still need to go one level deeper. What if the information is not available in the data source? How do you add it there? How do you make information available faster? These questions and needs only made my interest in the Data Engineering field grow.
My first interactions with Data Engineering tools and concepts such as Databricks, ETL pipelines, and CDPs like Segment also happened during my time at Red Ventures. However, Product Managers there were busy with multiple things, and I would only have the opportunity to deepen that knowledge in my next experience, leading a Data Team at BairesDev.
​
There, I was responsible for creating and leading a Data Analytics team from scratch. One of my first missions, even before the team was officially created, was to save a highly critical dashboard from exploding. The fun fact is, this dashboard was destined to stop working on the CEO's birthday. It was also the single source of truth for the 100+ customers we had at that point and for most of the C-Level team.

Our internal team responsible for the clients' marketing campaigns relied heavily on it, as for most of the day they were in live discussions about numbers from that dashboard with C-levels from our partner companies. If the dashboard stopped working, it would definitely be one of the worst birthday gifts the CEO could get. And the worst thing is, my name would be on the gift's label.
​
But... how did I know the dashboard was destined to stop on the CEO's birthday?
​
Well, let's take a step back and start from the beginning. As I mentioned in the Data Analytics section, I was working in a completely new operation: there was no data team before, and we grew very fast from 1 to 500 clients. All that existed before was a hacky dashboard made by someone in a hurry to display basic information from the marketing campaigns we were running for customers.
​
It was a good MVP, but it started breaking all the time. We were sending thousands of emails per client every day, so as we advanced from 1 to 100 clients and from 1 to 6 months of operation, the amount of data we needed to process was growing exponentially.
The ETL process in place dropped the tables behind the dashboard every morning, then queried data from a MySQL database and reloaded it in an aggregated fashion. This was not an issue at first, but as we had more data to load, it started to finish later and later every day.
Note that the tables were dropped at the start of the run and only became available again once the reload finished, which meant there was a window with no information available.
Again, this was not an issue in the beginning, because the process ran very early in the morning. However, day after day it finished later: 7 am, 7:10 am, 7:30 am, 8 am...
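To make the problem concrete, here is a minimal sketch of that drop-and-reload pattern, assuming Python with a MySQL connection; the table and column names are hypothetical, not the real schema. The point is that between the DROP and the end of the rebuild, the dashboard has nothing to read:

```python
# Minimal sketch of a drop-and-reload ETL step (hypothetical names).
import mysql.connector  # assumed client library for the MySQL source

conn = mysql.connector.connect(
    host="warehouse-host",      # placeholder connection details
    user="etl_user",
    password="placeholder",
    database="dashboards",
)
cur = conn.cursor()

# 1. Drop the aggregated table the dashboard reads from.
cur.execute("DROP TABLE IF EXISTS dashboard_campaign_metrics")

# 2. Rebuild it from scratch by re-aggregating ALL the raw campaign data.
#    While this runs, the dashboard has no table to query.
cur.execute("""
    CREATE TABLE dashboard_campaign_metrics AS
    SELECT client_id, campaign_id, DATE(sent_at) AS day, COUNT(*) AS emails_sent
    FROM raw_email_events
    GROUP BY client_id, campaign_id, DATE(sent_at)
""")
conn.commit()
```

The more raw data there was, the longer the rebuild step took, which is exactly why the process finishing later every day was a ticking clock.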
​
As the person ultimately responsible for it, my first instinct was to understand how long we had until the situation became unsustainable, given that we also had multiple other requests coming from the C-Level and from the team that relied on that dashboard.
In order to negotiate the roadmap with them, it was vital to communicate the dashboard's exact situation, especially because they were not fully aware of the time bomb we had in the house.
For this, we added a Slack notification that reported the time the process finished every day, giving us an accurate view of the issue and a way to monitor it closely.
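As a rough idea of how simple that monitoring can be, here is a minimal sketch assuming a Slack incoming webhook; the environment variable name and message text are hypothetical:

```python
# Minimal sketch: post the ETL completion time to a Slack channel.
import os
from datetime import datetime

import requests  # assumed to be available in the ETL environment


def notify_etl_finished() -> None:
    finished_at = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    requests.post(
        os.environ["SLACK_WEBHOOK_URL"],  # hypothetical incoming-webhook URL
        json={"text": f"Dashboard ETL finished at {finished_at}"},
        timeout=10,
    )


# Called as the very last step of the daily ETL job, so the completion time
# could be tracked and compared day after day.
notify_etl_finished()
```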
​
It was very clear things were getting bad fast, but some of our stakeholders didn't seem to get the message. They only cared about pushing their requests and making them happen, but I knew they would be pretty upset if the dashboard stopped working.
​
A classic production management situation: too much to do and only a few hands to help execute. So, I did my best to prioritize and parallelize work, showing frequent progress on the main requests so stakeholders could feel their asks were being taken care of, while also finding room for tech debt.
Together with the team, we decided to move the tables "behind the dashboard" to Databricks and connect Superset (our BI tool) directly to Databricks as a data source.
With a powerful enough cluster and an organized ETL process running every night, we no longer needed "offline periods" without information available, and we no longer had to drop and reload ALL the information every day. Databricks loaded data incrementally using a basic CDC (Change Data Capture) process: by looking at the new "created_at" and "updated_at" columns I had created across multiple tables, it could pick up only the new or changed records.
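For illustration, here is a minimal sketch of what such an incremental, CDC-style load can look like on Databricks with PySpark and Delta tables. The table names, the "id" join key, and the JDBC details are hypothetical, and the very first run (when the target table is still empty) would need special handling:

```python
# Minimal sketch of an incremental load driven by created_at/updated_at.
from pyspark.sql import SparkSession, functions as F
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

TARGET = "analytics.email_events"                        # hypothetical Delta table
SOURCE_JDBC_URL = "jdbc:mysql://source-host/campaigns"   # hypothetical MySQL source

# 1. Find the latest timestamp already ingested (the watermark).
watermark = (
    spark.table(TARGET)
    .agg(F.max("updated_at").alias("wm"))
    .collect()[0]["wm"]
)

# 2. Pull only rows created or updated after the watermark.
incremental = (
    spark.read.format("jdbc")
    .option("url", SOURCE_JDBC_URL)
    .option("query",
            f"SELECT * FROM email_events "
            f"WHERE created_at > '{watermark}' OR updated_at > '{watermark}'")
    .load()
)

# 3. Merge the new/changed rows into the Delta table instead of dropping it,
#    so the dashboard keeps serving yesterday's data while the load runs.
(
    DeltaTable.forName(spark, TARGET)
    .alias("t")
    .merge(incremental.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

Because the merge only touches new or changed rows, each nightly run stays roughly proportional to that day's activity rather than the whole history, and the tables never need to be dropped.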
​
It took us quite some time, as we needed to do most of the initial set-up for Databricks, and also recreate some configurations in the BI tool, but everything went well.
The day we had it working, though, was only a couple of days before the old process completely exploded and the Slack notifications stopped coming. If you haven't connected the dots yet, the day it stopped working was the same day as the CEO's birthday. Phew, that was close.
From that point onwards, the old ETL would take more than 24 hours to load all the information it needed, so it would never finish: every morning, the tables were dropped again before they had been fully loaded.
​
This made me realize that a Data Engineer's job is quite similar to a football referee's: the referee has the power to ruin the match for thousands of people if he does something wrong, while a very good referee goes unnoticed, as if he were not there at all.
Nevertheless, it is also rewarding to remember that both football and dashboards can amuse big crowds and generate lots of revenue and happiness when working properly.
