Aaron Mak

Software engineering feedback loop

Aaron Mak — Tue, 04 Jan 2022 09:46:41 GMT

Too often, software engineers, myself included, write code before thinking enough about it. This usually results in time wasted on rewrites, refactors or, worse, removal.

I've been thinking a bit about when we should give feedback and the consensus is early, frequent and actionable. Yet, we also need to balance it with the cost of giving feedback because it takes effort and time.

Everyone should first ask this question, "Do I need to write code?" Sounds like a stupid question but when your job is writing code, it's easy to jump straight to it. If we don't ask ourselves that question, we likely have not fully understood the problem we are trying to solve. Writing code has to be the best way to solve the problem. Once the problem is understood, and the tradeoffs of writing and maintaining the code we write are worth it, then it's time to get some feedback.

Now, down to the nitty-gritty of it. There are many ways to get feedback on software engineering tasks. A non-exhaustive list includes meetings, code reviews, code pairing and RFCs. How should we decide when to use which form of feedback?

Going back to balancing cost vs benefit, it depends on how much we already know about the problem we need to solve, and what kind of feedback we are looking to get. Questions that we ask should help us determine how clear we are on the scope and how concrete this scope is. Some examples of questions that we should ask are as follows.

What is the overall direction and strategy?
Who are our stakeholders?
How do we measure success or failure?
Are we the best team or person for the task?
Have we as a team done something similar before?
How familiar is the task to the one executing?

Explore

If this problem hasn't been solved before and it's entirely new, meetings or RFCs are likely worth doing. Code shouldn't be written before the team is clear on the direction it's heading. Timelines need to be flexible and milestones shorter, because all of them could change once execution starts and we learn new information. You might not have all the details sorted out; what you're looking for is early feedback and agreement between stakeholders on the direction you are heading. If stakeholders are not technical, you'll need to simplify your explanations or split up the meetings and RFCs for different audiences.

Imagine having to change scope or direction once code is already written. From experience, it's costly and you would want to avoid it at all costs.

The length of this phase depends on the entire scope of the project. There might be various stakeholders from the C-suite down to the engineers in your team. Depending on your role and the size of the project, you might want to also split up meetings for discussions and meetings for decisions. This helps increase the quality of discussions and you can have more information on the table before proposing or making a decision.

Execute

Once everyone is clear on the direction of the project, it's down to the next most important bit – execution. I won't go into details about what good software engineering practices are because that would take up a lot more than one post but I'll assume that your team is aligned on those. If they aren't, no one can help you here.

If you or your team has not done this task before, it'd be worthwhile to start with code pairing. Feedback would be real-time and any decision would have to be taken by at least two people, which in software engineering, is usually already a good proportion of your team. If at least one person in the team has done a similar task before, it'll be worth having this person pair with someone that hasn't. A good team shares information to make better decisions and pairing helps in transferring this informal knowledge that might be difficult to write down. Every line of code written influences other lines of code so inconsistency is obvious and confusing.

The cost of code pairing is undeniably time, and probably more exhausting as it requires constant communication between two people. That said, most of the time, it's worth it as code quality is higher and the team communicates more effectively. When the pull request is finally created, it's just a formality.

If a task is very familiar to the team and the one writing the code is very familiar with it, it could be worth just jumping straight to creating the pull request. Feedback in the pull request is likely slower and less frequent but that's okay if there are only one or two minor comments.

Warning signs of creating the pull request too early are a long list of comments or one major comment that would cause a huge rewrite. If we observe that, to save everyone's time, it's better to take a step back and go back to the earlier stages of feedback.

TLDR

Explore, Execute, Repeat

How much we explore depends on how many unknown unknowns we think there are. And once we are clearer on what we don't know and we are confident enough, we start writing code, learning more as we start executing. If we discover more unknowns along the way, we may need to take one step back and explore again. It's seldom a one-way street but more like a cycle of exploring and executing because there will always be something that we didn't realize we don't know.

For a software engineer, feedback comes in various forms, and we need to know the tradeoffs of each form before deciding which is best. Sometimes ego gets in the way, and we'll need to learn how to navigate others' egos and our own to gather feedback and respond to it well. No one is perfect at it but like anything else, it gets easier with practice.

Some Expectations at Work

Aaron Mak — Fri, 27 Nov 2020 13:14:51 GMT

Since I'm about to lead a team, thought it'd be helpful to have these things written down. They might change over time but hopefully not too much. I'd also want to be transparent and clear about what I'd expect from my team.

Work isn't the most important thing

Somewhat contrarian, especially in Singapore. No one should prioritise work above all else. It's inhumane to expect one to do so. Life is so much more than just work. Even if you love your job, sometimes you'll be bored, tired or just think it isn't worth it. And that's okay, it's normal. Sometimes we feel that way because of people we work with or a project isn't completed the way we'd like it to be.

If you're someone who thinks work is the most important thing, have a read of this excerpt from a book in the Bible, written by a king around 3000 years ago who was richer than Bill Gates.

I hated all my toil in which I toil under the sun, seeing that I must leave it to the man who will come after me, and who knows whether he will be wise or a fool? Yet he will be master of all for which I toiled and used my wisdom under the sun. This also is vanity. So I turned about and gave my heart up to despair over all the toil of my labors under the sun, because sometimes a person who has toiled with wisdom and knowledge and skill must leave everything to be enjoyed by someone who did not toil for it. This also is vanity and a great evil. What has a man from all the toil and striving of heart with which he toils beneath the sun? For all his days are full of sorrow, and his work is a vexation. Even in the night his heart does not rest. This also is vanity.
- Ecclesiastes 2:18-23

To clarify, this book doesn't say work (toil) isn't important at all, though it might make you wonder what's important then? You'll have to read on to get the full picture or you can ask me for a fuller explanation.

And for most of us, we'd naturally think there are a lot of things more important, like our religion, family, friends, hobbies. That's normal and I don't expect otherwise. I don't mean you should do a bad job at work of course. I'd expect good work but not at the expense of what you think is more important. And I'd argue that we need to prioritise things we believe are more important so that we can do good work. We are made to work but it's not the most important thing. Even rest at times could be more important than work.

Be an owner

See a bug, fix it. See a spelling mistake, fix it. See bad documentation, write good ones. If you own something, you'd take good care of it. I like the idea of a Maintainer coined in software engineering. You are responsible for maintaining what you write, especially in software where every commit is signed by you. Don't write sloppy code and say you have to merge it because of a deadline. Sloppy code is not tech debt. Made bad decisions or merged a bug? Take responsibility, own up and do something about it. And be proud of your work.

We are a team first

Since we are working together towards an arbitrary goal, don't be afraid to say what's on your mind. Don't be afraid to say hard truths we need to hear, even to your manager. Sometimes our opinions turn out to be wrong, and that's okay. As a team, there is room for mistakes, as long as we own up, fix them, learn and move on.

If you are having personal issues, or are unable to deliver something in the promised time, let us know and trust that we will fill in the gaps for you.

Of course, all that assumes that we trust each other enough over time.

Learn how to learn

It's easier to teach people what they should do than teach people to figure that out for themselves. Knowledge is everywhere but to discern what knowledge is worth learning is a difficult skill.

The smartest people aren't those who know the most, but those who can figure out how to best leverage the knowledge of others.

I think it boils down to three things. How to ask good questions, how to think critically, and to be comfortable being wrong. And that's what I'd expect. Not writing the most efficient algorithm, writing the most elegant code or even scoping the most impactful project. But I'd think you'll get there if you ask good questions and think critically in a way that doesn't assume you're always right.

In software engineering, most people think that you need to be the most logical person, or the smartest one to be good at it. On the contrary, I found that the people I most enjoy working with, are humble, curious and communicate well.

If you are in my team, let me know what you think. If you have any feedback regarding any of my expectations that were mentioned, you're more than welcome to talk to me about it.

Don't be afraid of the command line

Aaron Mak — Mon, 28 Sep 2020 10:05:15 GMT

We are so used to graphical user interfaces that we forget computers started off without them. As a software engineer, there are many reasons why using the command line can help with development.

Less context switching -> more focus
Lightweight -> faster development
Deeper understanding of tools and the computer
Higher interoperability between computers

As an aside, the recent trend of containerization requires you to be familiar with the command line. Most containers run Linux and have no graphical user interface once you enter them, so you have to depend on the good old-fashioned command line.

CLI Basics

I'll assume you're using either a Mac or Linux because things might work differently on Windows. And most development machines are based on UNIX anyway.

With that, I'll explain some commands that I commonly use and that are present on most computers. They won't be exhaustive but I encourage you to read more about them in the Linux Pocket Guide.

cd

It allows you to go to a relative or absolute path.

cd child_directory  # relative path to child_directory 
cd /usr/foo/directory  # absolute path
cd ../sibling_directory  # go to parent directory then into sibling directory
cd  # without arguments returns you to the home directory

export

Creates environment variables. Note that you'll need to define these environment variables in a .zshrc, .bashrc or any file that loads when the shell is created if you want them to always be present. Otherwise, the environment variable would not persist beyond its current shell.

export FOO=bar
echo ${FOO}  # would show `bar`
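
To see what exporting changes, here is a minimal sketch that compares an exported variable with a plain shell variable. The names DEMO_PLAIN and DEMO_EXPORTED are made up for this example, and `sh -c` is used only as a convenient way to start a child process.

```shell
# DEMO_PLAIN and DEMO_EXPORTED are made-up names for this sketch.
DEMO_PLAIN=only_in_this_shell      # a plain shell variable, not exported
export DEMO_EXPORTED=bar           # exported, so child processes inherit it

# sh -c starts a child process. The single quotes stop the parent shell
# from expanding the variables before the child runs.
child_exported=$(sh -c 'echo "${DEMO_EXPORTED}"')
child_plain=$(sh -c 'echo "${DEMO_PLAIN}"')

echo "child sees DEMO_EXPORTED=${child_exported}"
echo "child sees DEMO_PLAIN=${child_plain}"
```

The child process only sees the exported variable, which is why tools started from your shell can read what you export but not your other shell variables.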

printenv

Helpful for debugging to figure out what are the current environment variables.

history

Prints the commands you have used previously. Helpful when used with grep.

alias

Does what it says: creates a shorter name (an alias) for a longer command.

alias gp='git push'
gp  # would be equivalent to typing `git push`

ls

Short for "list". Lists attributes of files and directories.

ls  # lists files in current directory
ls directory  # list files in that directory
ls -a  # lists all files, including hidden ones

mv

Short for "move". Move / Rename file(s) or directories.

mv foo.txt bar.txt  # renames foo.txt to bar.txt
mv blah.txt foo.txt destination_directory  # moves blah.txt and foo.txt to destination_directory

cp

Short for "copy".

cp original.txt copy.txt  # creates copy.txt which is a copy of original.txt
cp original.txt directory  # creates a copy of original.txt into `directory`

rm

Short for "remove".

rm foo.txt  # removes foo.txt
rm -r directory  # removes the directory and everything in it

pwd

Prints the absolute path of your current working directory.

mkdir

Short for "make directory".

mkdir foo  # creates foo directory
mkdir -p foo/bar/bar  # creates any parent directories if necessary

echo

Prints arguments passed into it.

echo foo  # prints `foo`

cat

Concatenates and prints files. Basically just means printing the whole file.

less

Something like cat but good for larger text and if you want to browse by pages.

grep

Given some files, print lines which match a regular expression pattern. Pretty powerful stuff if you're well versed in regular expressions.

Given this file a_text_file.txt.

80 days around the world, we’ll find a pot of gold just sitting where the rainbow’s ending.
Time — we’ll fight against the time, and we’ll fly on the white wings of the wind. 
80 days around the world, no we won’t say a word before the ship is really back.
Round, round, all around the world. Round, all around the world. Round, all around the world.
Round, all around the world.

grep time a_text_file.txt  # Time — we’ll fight against the time, and we’ll fly on the white wings of the wind.
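
If you'd like to try this without creating the file by hand, here is a small sketch. It writes an abbreviated, ASCII-only copy of the lyrics to a temporary directory (so the text is not identical to the file above) and passes the file name as the last argument to grep. The -c and -i flags are standard grep options for counting matching lines and ignoring case.

```shell
# Abbreviated, ASCII-only copy of the sample file, written to a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/a_text_file.txt" <<'EOF'
80 days around the world, we'll find a pot of gold
Time, we'll fight against the time
80 days around the world, no we won't say a word
Round, round, all around the world
Round, all around the world
EOF

# grep is case-sensitive by default, so this matches the lowercase "time".
match=$(grep time "$workdir/a_text_file.txt")
# -c prints the number of matching lines instead of the lines themselves.
capital_round=$(grep -c Round "$workdir/a_text_file.txt")
# -i ignores case, so "around" and "Round" both count.
any_round=$(grep -i -c round "$workdir/a_text_file.txt")

echo "$match"
echo "lines with 'Round': $capital_round"
echo "lines with 'round' in any case: $any_round"

rm -r "$workdir"
```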

touch

Usually used to create empty files since it creates the file if it doesn't exist. It also updates the modification timestamp and access timestamp of the file.
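
To tie the file commands together, here is a short walkthrough you can run safely. It works inside a throwaway directory created with mktemp, and the directory and file names (project, notes, todo.txt and so on) are made up for this sketch.

```shell
start_dir=$(pwd)                     # remember where we started
sandbox=$(mktemp -d)                 # throwaway directory so nothing real is touched
cd "$sandbox"

mkdir -p project/notes               # creates project and project/notes
touch project/notes/todo.txt         # creates an empty file
echo "buy diodes" > project/notes/todo.txt
cp project/notes/todo.txt project/backup.txt   # copy the file
mv project/backup.txt project/done.txt         # rename the copy
ls project                           # lists done.txt and notes

cd project/notes
pwd                                  # absolute path ending in project/notes
cd ../..                             # back up to the sandbox directory

rm -r project/notes                  # removes notes and everything inside it
remaining=$(ls project)
echo "left in project: $remaining"

cd "$start_dir"                      # go back and clean up
rm -r "$sandbox"
```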

| (pipe)

Passes the output of the previous command to the next. This is the key to mastering the command line. Commands on their own don't seem that powerful but combine them with each other and the possibilities are endless. And that's what the unassuming | (pipe) does.

cat file.txt | grep text  # finds `text` in every line in file.txt
history | grep git  # prints git commands you have used previously
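
Pipes can chain more than two commands. As a rough sketch, the pipeline below finds the most frequent entry in a list. The list is a made-up stand-in for the output of `history`, since `history` itself only works in an interactive shell.

```shell
# A made-up list of commands standing in for the output of `history`.
# sort groups equal lines, uniq -c counts each group, sort -rn puts the
# biggest count first, head keeps the first line, awk prints its second field.
top_command=$(printf '%s\n' git ls git cd git ls \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -n 1 \
  | awk '{print $2}')

echo "most used: $top_command"
```

Each command does one small job, and the pipe is what turns them into something useful.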

xargs

Another powerful command. It reads items from its input (separated by whitespace or newlines) and passes them as arguments to the next command, which it then executes. It'll be clearer when you try out the examples.

ls | xargs ls  # runs ls on all child directories and prints the result
grep foo --files-with-matches * | xargs rm  # removes all files in the current directory that contain `foo`
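
Since rm is unforgiving, here is a minimal sketch of the grep and xargs combination that runs in a temporary directory, with made-up file names. It lists the matching files first as a dry run, then removes them. It assumes file names without spaces; names containing spaces need extra care with xargs.

```shell
workdir=$(mktemp -d)
echo "foo here" > "$workdir/a.txt"
echo "only bar" > "$workdir/b.txt"
echo "more foo" > "$workdir/c.txt"

# Dry run first: list the files that would be removed.
grep --files-with-matches foo "$workdir"/*.txt

# Then pass those file names to rm as arguments.
grep --files-with-matches foo "$workdir"/*.txt | xargs rm

survivors=$(ls "$workdir")
echo "still there: $survivors"
rm -r "$workdir"
```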

pbcopy (MacOS)

MacOS clipboard. Helpful when you want to pipe something to the clipboard.

cat file.txt | pbcopy  # copies file.txt contents to clipboard
grep foo file.txt | pbcopy  # copies any line in file.txt with `foo` to your clipboard

Exploring tools

Now you know the basics, but you'll need to figure out how to use these commands on your own. Most commands have a help option.

ls --help
grep --help

If not, they would likely have a man page, which is short for manual.

man ls
man cat
man echo
man grep

You might find the manual too verbose but I found this helpful tool called tldr (https://github.com/tldr-pages/tldr) which summarizes the manual to useful cheatsheets for me.

Other useful commands

These are not installed by default and need a separate installation. To me, they are better versions of the default commands, but they require some understanding to use well.

z - a faster cd based on your history
tldr - a quicker manual
rg - a faster grep written in rust
peco - a more interactive filtering tool
entr - runs a command when input changes. I mainly use this for TDD to run tests when any file changes.
tree - lists the contents of a directory in a tree-like format. Comes with some Linux distributions, but you'll have to install it on MacOS.

These commands are just scratching the surface of what you can do, so the best way to try them out is to use them in your daily workflow.

I've also created a small practice file at https://gist.github.com/aaronmak/dcc479d30ec9e844d6756933f0f6886e if you'd like to get more familiar with your command line.

References

Linux Pocket Guide - similar commands to MacOS but there are slight differences.

Building a Dactyl Manuform keyboard with hot-swappable sockets

Aaron Mak — Sun, 17 May 2020 15:09:43 GMT

Thinking it was similar to building one with normal switches, I tried to build my first keyboard with hot swappable sockets. I was so wrong. Not only could I not find build logs of anyone who has built one, but soldering wires to the Kailh hot swappable sockets was so much harder. So please read on if you're as clueless as I was.

There are many build logs out there so I'll distill what I thought was helpful and share some mistakes I've made to prevent you from repeating them before continuing with the build log.

Helpful but not obvious

Soldering the copper tape first

Start with soldering the copper tape and not the wires, but leave the topmost switch unsoldered. The topmost switch should be soldered together with the wire.

I made a mistake of soldering the diodes first because most guides suggested it. To my horror, I found out later that the copper wire cannot be wrapped around the hot swappable sockets. So I had to sneak the copper tape under the existing wires, which probably added at least an hour or two to my build time.

Solder the diode as close to the socket as possible

None of the guides I read told me to do this even though some images I've seen hinted at it. Or maybe everyone already knew that except me, a soldering noob.

Soldering the diode close to the socket (or switch) ensures that the diode doesn't move around too much. There's also less of a chance that the row wire would touch another.

Understand what you're building

There are many guides out there and some have subtle differences in how they wire the thumb cluster or the materials they use. Since it's my first time soldering a keyboard, I just followed a mishmash of the popular guides. I'm glad it worked out but I just didn't know what the options were and the tradeoffs I was making.

Took me a while to understand how the wiring allows the controller to know which key was pressed. So even though I followed the popular wiring guide at https://github.com/abstracthat/dactyl-manuform, I might not need to in the future. The QMK guide to split keyboards also suggested an alternative wiring to connect the 2 controllers which I didn't know existed until I flashed the controller.
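
For anyone else puzzling over the same thing, here is a toy model of how a controller can find pressed keys in a grid of rows and columns. This is not QMK code and it skips the electronics entirely. The 5x6 grid matches this build, but the pressed keys are made up, and which side is driven and which is read depends on the direction of your diodes.

```shell
# Toy model of a keyboard matrix scan. This is not QMK code.
# "pressed" holds made-up row,column pairs for keys currently held down.
pressed="1,2 3,0"
rows=5
cols=6
detected=""

row=0
while [ "$row" -lt "$rows" ]; do
  # The controller activates one row at a time...
  col=0
  while [ "$col" -lt "$cols" ]; do
    # ...and checks every column. A pressed switch connects its row to its column.
    case " $pressed " in
      *" $row,$col "*) detected="$detected$row,$col " ;;
    esac
    col=$((col + 1))
  done
  row=$((row + 1))
done

echo "keys detected: $detected"
```

With 5 rows and 6 columns the controller needs 11 pins instead of one per key, which is the point of wiring the switches as a matrix.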

Wiring Diagram

Not drawn to scale but hopefully it's clear. It's an svg file so you'll be able to zoom in as needed. I found that most diagrams don't have the reset switch and they only show one side at a time.

To understand this diagram, it is drawn from the perspective of the underside of the keyboard. The brown and black rectangles represent the diodes and the direction they need to be in for the current to flow the right way. I used elite-c's because I've read that pro micros are fragile but they work as well.

Wiring Diagram

Building the keyboard

Materials and Tools

3D printed case with hot swappable Kailh sockets, base, wrist rests and elite-c holder printed by reddit user crystalhand
2 Elite C controllers
2 3.5mm TRRS sockets
Reset button switch
64 Diodes (a few spare would be handy just in case)
Solder core
Solid core copper wires (from ethernet cables)
5mm width copper tape
Soldering Iron
Soldering Iron Tip Cleaner (An old wet cloth or sponge will do as well)
Pliers
Tweezers
Digital Multimeter (optional, but useful for debugging and troubleshooting if something goes wrong)
Screws, nuts and screwdrivers (optional, only needed if you want to cover the bottom or to attach the printed wrist rests)

what have I gotten myself into

Soldering the connections

I started with the diodes because I wasn't expecting to use copper tape later. So don't repeat my mistake and start with the copper tape instead. I was building the 5x6 version of the dactyl manuform so it wasn't too difficult finding wiring diagrams for it. As said in other build logs, winding the diode wire around the node would help. However, it'll be trickier to wind it around the hot swappable socket.

winding the diode around the socket

When you get to the corners, you'll likely have to use the tweezers to finish the job.

After what seemed like eternity

As you can see, some diodes are pretty far away from the sockets, which I'll live to regret later on.

Since I've started on the diodes, I continued by soldering the row wires. After removing the individual wires from the ethernet cable, I got to work. The wire stripper I got was too big so I had to return it, leaving the only other option which was to use the soldering iron to melt the insulation of the wires. That didn't leave me with the cleanest cuts but it worked.

Now for the columns! I assumed I could wind the wire around the socket like all the guides for the switches mentioned but I was clearly mistaken. The copper wires were a little thicker than the diode wires, which resulted in me being unable to wind them around the sockets. I saw some guides that used copper tape so I ordered them online.

After waiting for a couple of days for the copper tape to arrive from a local hardware store, I realised that it'll be difficult to get the tape under the wires. After much patience and perseverance, it was complete.

I found the thumb cluster to be insanely difficult with the copper tape so a tip is to plan where you're going to place the tape.

Soldering the wire to the copper tape was also tricky because don't forget, I couldn't wind it around the socket. What I found helpful was to tape one side of the wire such that pressure was applied to the end of the wire in contact with the tape. I know it sounds vague so hopefully this next photo would help.

Masking tape to hold the wire in place while soldering

The exposed end of the wire was bent to be in contact with the socket and copper tape at the same time. With the masking tape helping it stay in place, soldering was more manageable.

Soldering the components

Once the rows and columns were complete, it was time to solder the components. Before soldering to the elite-c, you'll need to wind the exposed end of the wire around the side. It's not that clear but the wire needs to enter from the top and curl at the bottom because the soldering will be done below the controller. Ensure that this is done for all wires so it's consistent.

One mistake I made here was to not give enough slack for the wires so I had to solder 2 wires together because I realised one was too short. If you have the holder for the elite c, ensure that you have enough slack so that elite c can peek out through the cavity for the holder to fit.

pardon the messy wiring

Soldering below the controller, with the wires entering from the top

Quite a few guides would suggest a specific GND position for the TRRS wire but any one of them would do as long as it's labeled GND. Since I was going to solder the reset switch to GND and RST, I used the GND adjacent to RST for the reset switch.

Soldering wires to the reset switch

Soldering wires to the TRRS socket

And before I knew it, I was finally done! Now came the easiest part, adding the switches and keycaps.

Tealios Switches

SA Vilebloom from MechSupply (originally for Ergodox)

Flashing the controller

The open source program QMK is what you're looking for to configure it. For the first flash, I just got the defaults working so I could test out if any keys weren't working. Following the instructions, I used homebrew to install it on my mac. It was surprisingly easy to install.

brew tap qmk/qmk
brew install qmk

QMK is well documented and I'd advise you to read the documentation before continuing. After reading the documentation, I flashed the left side first and then connected the right side, using the default settings. And it seemed everything was going well.

It turned out I celebrated too early. After testing the keyboard at qmk, I found 4 keys that weren't working. Immediately, I rushed to get the multimeter to diagnose the issue. Finding out that the circuits were working fine which meant no more soldering, I heaved a sigh of relief. It was the switches' pins that were bent. After straightening them out, it was perfect.

Just another shot of the completed keyboard 😎

After spending some time tinkering with the online configurator, I arrived at a layout which mimicked that of my ergodox. Don't worry if you don't know how to use a text editor or read C. The online QMK configurator and toolbox don't require you to use any text editor.

And after pressing the reset button followed by one final flash, that's it!

A few days in and I'm pretty sure I'll be using this keyboard for a long time to come.

Learning Vim in 30 days

Aaron Mak — Sun, 03 May 2020 01:00:00 GMT

If you'd like to learn vim, I'll be sharing what I found helpful while picking it up. Vim seems ancient with all the fancy new IDEs like Visual Studio Code. However, I've found vim to greatly improve my productivity. Within a month, I started to use vim full time and never looked back.

The timelines are just a guideline so feel free to skip ahead or slow down.

Why vim

I saw a few colleagues using vim and I was intrigued. I wondered how a text-only interface can work better than an IDE. At first, vim seemed difficult to use and I thought it was only for the nerdiest. The bar was too high, or so I thought.

I can't find the article but I remember it talking about how programming is like surgery. In a sense, most of the time, we make minor tweaks just like surgery. But we might also repeat the same change throughout the repository. And after learning that vim is built to do those things without any fancy tooling, I was convinced to give it a shot.

Week 1 - Vim Tutor

Vim tutor comes with most UNIX-like computers that have vim installed. To figure out whether you have it, simply go to your terminal and enter vimtutor. If you're using a windows machine, refer to https://superuser.com/questions/270938/how-to-run-vimtutor-on-windows.

The beginning of vimtutor

The beauty of the file is that it teaches you through practice. Getting used to the keybindings in vim requires some unlearning of what you're used to. And there's no better way than to simply practice. Just like learning a musical instrument, if you are struggling with any chapters, repeat them until you are confident.

Spend a couple of hours in the first week to play around with vim tutor. And once you're comfortable with the basic modes and keys, you're ready for vim.

Week 2 - Vanilla Vim

At this point, you should be able to use vim to edit files but you might find the vim defaults a bit of a hassle. I'd recommend tpope's vim-sensible plugin to help start you off with some defaults that make sense while still allowing you to explore and use vim.

Learn how to use buffers or tabs and how to navigate using vim commands and motions.

Week 3 - CLI tools

If you are familiar with the command line and Linux, you can probably skip this. Vim and the terminal work hand in hand so it'd feel like typing with one hand rather than two if you didn't know how to use command-line tools.

A place to start would simply be your terminal. Basics include ls, mkdir, cp, mv, touch, cat, rm and many more. A book I'd recommend would be Linux Pocket Guide.

Week 4 - Customise

Vim is built to be infinitely customisable. Every software engineer's vim is different. Some people like to use vanilla vim so it's easy to be as efficient on remote machines. Others might install so many plugins that their vim might be mistaken for an IDE.

Whatever the case, you would still require a .vimrc file to add your configs. I'd recommend slowly changing that file and only adding things when you've gotten used to your current setup. A mistake I've made was to add too many things at the beginning without fully understanding what I was adding. That caused some issues later down the road when things broke.

Practice makes perfect

If you have come this far and aren't convinced, you could always return to your IDE. But if you are convinced, practice makes perfect. There's no better way to be more proficient than using it for your daily projects.

There are endless things to learn about vim so it's quite likely I've only scratched the surface. I'd also recommend using vim with tmux once you're confident and try out neovim as well.

Maintainable SQL in Data Warehousing

Aaron Mak — Sat, 25 Apr 2020 03:26:46 GMT

SQL isn't a general-purpose programming language. Say you don't have the option of using a programming language to generate SQL either. How do you go about making it maintainable?

If I had the opportunity to build a new data warehouse from scratch, I'd probably tell myself these things to prevent past mistakes I've made along the way.

Strict style guide
Clear naming conventions
Write tests
Use table views
Arrange the repository
Use version control
Describe your SQL

Strict style guide

Enforcing a style guide early on is a no brainer. The earlier you do it, the better. If you're lucky, there's an open-source formatter that you can use. But you'd still need to decide on some things that would help your SQL be more maintainable. To my dismay, there wasn't one for BigQuery SQL so the style guide had to be enforced manually.

Some things to consider could be the level of nested subqueries, or whether to have commas at the end, the number of spaces and many more. It sounds trivial but it'll improve maintenance if there's consistency in SQL styles. No one likes to point out a missing space or comma in code reviews. But without a formatter, these comments help in the long run.

A good example of what a style guide might look like can be found at https://www.sqlstyle.guide/.

Clear naming conventions

Part of a style guide could include naming conventions. Suffixes and prefixes could help describe columns better. For example, order_count has a count suffix. If standardised, it would provide additional information for the reader even before the reader explores the data.

How tables and schemas (or datasets in BigQuery) are named is equally important. Are tables grouped by end-users or products? Would you be using the fact and dim prefixes for fact and dimension tables? Would reporting tables be in a separate schema? In the end, you should have some rules and the reasons behind them.

A caveat is that conventions have to be followed or else it would have the opposite effect and make it more unmaintainable and more confusing for everyone.

Write tests

Validity

The SQL needs to be able to run, to begin with. You can start with a static parser and some data warehouses like BigQuery might even allow you to call their API to check if your SQL is valid. This would easily prevent major issues in your pipeline.

Data Integrity

Monitoring the quality of the data is essential so errors can be caught as early as possible. Trust me, the last thing you'd want is your stakeholders reporting data quality issues.

Expected row count

This prevents bad joins for downstream tables. You won't have that issue with SQL databases because you'll likely define the relationships between objects using an object-relational mapping (ORM) library. But when it comes to data warehouses, there aren't any restrictions. You could expect a one-to-many relationship, but the join revealed a many-to-many relationship. These issues are made worse when data is unclean or duplicates are hidden in distributed microservices.

Expected columns

Certain columns contain the primary key or the foreign keys which you'd expect to be in the table. Or even for reporting tables, you might expect the date to be there. Ensuring such columns exist will prevent careless mistakes. A static parser would help with this or the output table can be monitored instead.
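Monitoring the output table for expected columns could look like the sketch below. It uses sqlite3 and its PRAGMA table_info as a stand-in for the warehouse's own metadata tables; the table and column names are hypothetical.

```python
import sqlite3


def missing_columns(conn, table, expected):
    """Return the expected column names that are absent from the table."""
    # PRAGMA table_info returns one row per column; the name is at index 1.
    actual = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    return sorted(set(expected) - actual)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_report (report_date TEXT, order_count INTEGER)")

print(missing_columns(conn, "daily_report", ["report_date", "user_id"]))
# ['user_id']
```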

Uniqueness

Ensure that columns which contain the primary key are unique. Duplicate rows are common in streaming pipelines or after wrong joins.
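One common way to test uniqueness is to group by the key and look for groups with more than one row. A small sketch, again with sqlite3 standing in for the warehouse and with made-up table and column names:

```python
import sqlite3


def duplicate_keys(conn, table, key):
    """Return (key, occurrences) for every key value that appears more than once."""
    sql = (
        f"SELECT {key}, COUNT(*) FROM {table} "
        f"GROUP BY {key} HAVING COUNT(*) > 1 ORDER BY {key}"
    )
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id INTEGER, payload TEXT);
    INSERT INTO events VALUES (1, 'a'), (2, 'b'), (2, 'b'), (3, 'c');
""")

print(duplicate_keys(conn, "events", "event_id"))
# [(2, 2)]
```

An empty result means the key is unique, so the test can simply assert that the list is empty.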

Recency

A recency check ensures the tables are up to date. It's more of a failsafe than anything. An external monitoring tool would be great for this.
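If no monitoring tool is available, a recency check can be sketched in a few lines. The example assumes a hypothetical loaded_at column that stores ISO 8601 timestamps with a UTC offset, and uses sqlite3 as a stand-in warehouse.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def is_fresh(conn, table, ts_column, max_age, now=None):
    """True when the newest timestamp in the table is at most max_age old."""
    now = now or datetime.now(timezone.utc)
    latest = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()[0]
    if latest is None:  # an empty table is never fresh
        return False
    return now - datetime.fromisoformat(latest) <= max_age


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, loaded_at TEXT);
    INSERT INTO orders VALUES
        (1, '2022-01-03T08:00:00+00:00'),
        (2, '2022-01-04T08:00:00+00:00');
""")

# A fixed "now" keeps the example deterministic: one hour after the last load.
now = datetime(2022, 1, 4, 9, 0, tzinfo=timezone.utc)
print(is_fresh(conn, "orders", "loaded_at", timedelta(hours=2), now=now))     # True
print(is_fresh(conn, "orders", "loaded_at", timedelta(minutes=30), now=now))  # False
```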

Use table views

Views allow you to break down a large SQL query into smaller chunks. Views can also be shared by downstream tables, reducing the SQL you need to write. Long SQL scripts are likely to be more difficult to maintain.
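To make the idea concrete, the sketch below defines one shared view and two downstream views that reuse it. It runs on sqlite3 as a stand-in for a warehouse, and all table and view names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount REAL, status TEXT);
    INSERT INTO orders VALUES
        (1, 10, 20.0, 'paid'), (2, 10, 5.0, 'cancelled'), (3, 20, 7.5, 'paid');

    -- Shared building block: the filtering logic lives in one place.
    CREATE VIEW paid_orders AS
        SELECT order_id, user_id, amount FROM orders WHERE status = 'paid';

    -- Two downstream views reuse it instead of repeating the filter.
    CREATE VIEW revenue_by_user AS
        SELECT user_id, SUM(amount) AS revenue FROM paid_orders GROUP BY user_id;
    CREATE VIEW paid_order_count AS
        SELECT COUNT(*) AS order_count FROM paid_orders;
""")

revenue = conn.execute("SELECT * FROM revenue_by_user ORDER BY user_id").fetchall()
order_count = conn.execute("SELECT order_count FROM paid_order_count").fetchone()[0]
print(revenue)      # [(10, 20.0), (20, 7.5)]
print(order_count)  # 2
```

If the definition of a paid order changes, only paid_orders needs to be edited.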

Arrange the repository

If you're using a repository to store the SQL, a structure is necessary so people know where to find their pipelines. Grouping them by their data pipelines or domain is a good start. It'd be helpful as well if there's a document to show how the repository is organised.

For example, I might want to group the first layer transformations in one directory and the reporting transformations in another directory.

Use version control

I'm going to go out on a limb and say that analysts need to use git. If you can learn SQL, you can learn git. An IDE would help if the terminal gives you nightmares, though I'd still recommend the terminal. If git is used well, version control not only helps you figure out what happened historically but also why things were done a certain way, which in my opinion is more important.

Describe your SQL

Most data warehouses allow you to add a description to your columns so use it! There are times when the logic for a column is complicated and cannot be described with just the column name, especially for reporting tables. But maintenance of the descriptions is key so the description is up to date with the logic of the column. Looking at SQL is no fun for users of your table that don't know SQL.

Continuous Integration and Deployment for Data Pipelines

Aaron Mak — Tue, 03 Dec 2019 07:44:00 GMT

Continuous integration and continuous deployment (CI/CD) is a practice that enables an organization to rapidly iterate on software changes while maintaining stability, performance and security.

CI/CD Pipelines interface on GitLab

Although continuous integration and deployment (CI/CD) are not new concepts, new tools such as Docker, Kubernetes, and Jenkins have allowed CI/CD to be more easily implemented in recent years. However, there are interesting challenges when applying CI/CD to data engineering.

Challenges with CI/CD in Data

Replication of environments

Credit: CommitStrip.com

In modern software engineering workflows, there are usually development, staging and production environments. Development and staging databases usually mimic a fraction of the production database, on the assumption that the differences in data would not matter. For most applications, this is usually the case.

In data engineering, we are unable to create a staging environment for some third-party data sources. Yet, our development and staging environments should try to mimic our production environment to ensure consistency in environments. But replicating data environments can be difficult, depending on the size of your data and existing infrastructure.

A typical set of data pipelines in 90 Seconds

Tests take too long

Tests are designed to give quick feedback in CI/CD, ensuring as many iterations as possible so that bugs can be caught and fixed quickly. However, testing data pipelines might require testing queries that take more time than an average integration test.

Legacy

Existing data pipelines might have been in place before CI/CD was even considered. This makes things complicated because a lot of code would need to be redesigned to consider different environments. Other than technical legacy, it might also come in the form of organisational legacy where people are not used to having multiple environments. Educating different stakeholders and engineering teams to get used to a CI/CD workflow would not be easy. Besides, it would also require significant changes in processes.

A typical data architecture

Instead of hypothetical examples, let’s take a look at a simplified data architecture which follows the ELT principle. For this example, we will be using Google Cloud Platform, which is similar to other cloud providers.

Looking at this architecture, we notice some challenges if we were to fully adopt the modern development workflow. For one, it’d be impossible to create a BigQuery instance for each data engineer in their development environment since this is likely a local setup. And if you were to create one BigQuery instance on the cloud for each engineer, it would likely be too costly.

Say you wanted to test your pipeline. How would you go about doing an integration test? With any modern application, there’s usually a library for that but how do you verify the data integrity in development and pinpoint the source of the error without a complete copy of the production environment?

If you have thought about these issues, you’d be surprised that some companies never have. Some teams might not be experienced or have an engineering mindset. So they started building things from the get-go, without considering how best to control technical debt. Once enough things are built and the process is settled, it’ll be difficult to make any significant changes.

Some of these issues convince some companies to simply go with one environment — production. But there can be a compromise and I’d argue that multiple environments encourage higher code quality and data integrity.

A better workflow

Since we are unable to strictly adopt a software development workflow, we will have to make some compromises.

Instead of having a dedicated BigQuery instance for each data engineer in the development environment, we will have a shared one. For this to work, all tables have to be easily torn down and rebuilt, following a functional approach to data engineering. The BigQuery API integration with Airflow also helps us with that.
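The "torn down and rebuilt" property can be illustrated with a small sketch: each table is defined purely by a query over its inputs, so rebuilding it any number of times gives the same result. This uses sqlite3 as a stand-in for the shared warehouse; rebuild_table and the table names are hypothetical, and this is not the Airflow setup described here.

```python
import sqlite3


def rebuild_table(conn, table, select_sql):
    """Drop the table if it exists and recreate it from its defining query."""
    conn.execute(f"DROP TABLE IF EXISTS {table}")
    conn.execute(f"CREATE TABLE {table} AS {select_sql}")


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, user_id INTEGER);
    INSERT INTO raw_orders VALUES (1, 10), (2, 10), (3, 20);
""")

query = "SELECT user_id, COUNT(*) AS order_count FROM raw_orders GROUP BY user_id"
rebuild_table(conn, "user_orders", query)
rebuild_table(conn, "user_orders", query)  # running it again changes nothing

rows = conn.execute("SELECT * FROM user_orders ORDER BY user_id").fetchall()
print(rows)  # [(10, 2), (20, 1)]
```

Because the table never depends on its own previous state, anyone sharing the environment can reset it without coordinating with others.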

With every merge request, a series of automated tests will be run. BigQuery gives us some cool API calls to check the validity and cost of our queries, which helps us with testing. Though not a strict unit test, it gives us quick feedback from automated testing so we can catch errors before they are pushed to production.
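One way such a check might look is a BigQuery dry run, which validates a query and reports the bytes it would scan without executing it. The sketch below assumes the google-cloud-bigquery client library and valid credentials. The helper names and the byte budget are made up for illustration, and this is not necessarily how the tests described here are implemented.

```python
def dry_run_bytes(client, sql):
    """Validate a query with a BigQuery dry run and return the estimated bytes scanned."""
    # Imported lazily so the rest of the module loads without the library installed.
    from google.cloud import bigquery

    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    # The query is not executed; an exception is raised if BigQuery rejects it.
    job = client.query(sql, job_config=job_config)
    return job.total_bytes_processed


def within_budget(bytes_processed, max_bytes):
    """True when the estimated scan size stays within the allowed budget."""
    return bytes_processed <= max_bytes


# Hypothetical budget for a single query in CI: 10 GiB.
MAX_BYTES = 10 * 1024**3

# Usage in a CI job (needs credentials), with sql holding the query under test:
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   assert within_budget(dry_run_bytes(client, sql), MAX_BYTES)
```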

A shared BigQuery instance in development and test environments will inevitably result in unexpected errors, especially when the state of the shared BigQuery instance is dependent on the last executed Airflow DAG. But since we have followed a functional approach to data engineering, resetting the state of the data warehouse is quick and easy.

Applying CI/CD as best as we can gives us a nice workflow to test our pipelines and logic before pushing them to production.

Though this is similar to our current setup at 90 Seconds, every company has a different data architecture and constraints. I’d expect our setup to evolve as data needs change and better tools are developed.

Data Engineers need to think like Software Engineers

In a lot of ways, data engineers are software engineers, though with different constraints. Concepts such as functional programming, containerization and TDD might not be applicable all the time but it’s a good place to start. Let’s start with first principles and maybe, just maybe, there’ll be a new set of conventions created by the data engineers built upon the shoulders of software engineering.