Maintainable SQL in Data Warehousing

Since SQL isn't a programming language, and say you don't have the option of using a programming language to generate SQL, how do you go about making it maintainable?

If I had the opportunity to build a new data warehouse from scratch, I'd probably tell myself these things to prevent past mistakes I've made along the way.

Strict style guide

Enforcing a style guide early on is a no brainer. The earlier you do it, the better. If you're lucky, there's an open-source formatter that you can use. But you'd still need to decide on some things that would help your SQL be more maintainable. To my dismay, there wasn't one for BigQuery SQL so the style guide had to be enforced manually.

Some things to consider could be the level of nested subqueries, or whether to have commas at the end, the number of spaces and many more. It sounds trivial but it'll improve maintenance if there's consistency in SQL styles. No one likes to point out a missing space or comma in code reviews. But without a formatter, these comments help in the long run.

A good example of what a style guide might look like can be found at https://www.sqlstyle.guide/.

Clear naming conventions

Part of a style guide could include naming conventions. Suffixes and prefixes could help describe columns better. For example, order_count has a count prefix. If standardised, it would provide additional information for the reader even before the reader explores the data.

How tables and schemas (or datasets in BigQuery) are named are equally important. Are tables grouped by end-users or products? Would you be using the fact and dim prefixes for fact and dimension tables? Would reporting tables be in a separate schema? In the end, you should have some rules and the reasons behind them.

A caveat is that conventions have to be followed or else it would have the opposite effect and make it more unmaintainable and more confusing for everyone.

Write tests

Validity

The SQL needs to be able to run, to begin with. You can start with a static parser and some data warehouses like BigQuery might even allow you to call their API to check if your SQL is valid. This would easily prevent major issues in your pipeline.

Data Integrity

Monitor the quality of the data is essential so errors can be caught as early as possible. Trust me, the last thing you'd want is your stakeholders reporting data quality issues.

Expected row count

This prevents bad joins for downstream tables. You won't have that issue with SQL databases because you'll likely define the relationships between objects using an object-relational mapping (ORM) library. But when it comes to data warehouses, there aren't any restrictions. You could expect a one-to-many relationship, but the join revealed a many-to-many relationship. These issues are made worse when data is unclean or duplicates are hidden in distributed microservices.

Expected columns

Certain columns contain the primary key or the foreign keys which you'd expect to be in the table. Or even for reporting tables, you might expect the date to be there. Ensuring such columns exist will prevent careless mistakes. A static parser would help with this or the output table can be monitored instead.

Uniqueness

Ensure columns which contain the primary key to be unique. It's common to have duplicate rows in streaming pipelines or wrong joins.

Recency

Ensures the tables are up to date. It's more like a failsafe than anything. An external monitoring tool would be great for this.

Use table views

Views allow you to breakdown a large SQL into smaller chunks. Views can also be shared by downstream tables, reducing the SQL you need to write. Long SQL scripts are more likely more difficult to maintain.

Arrange the repository

If you're using a repository to store the SQL, a structure is necessary so people know where to find their pipelines. Grouping them by their data pipelines or domain is a good start. It'd be helpful as well if there's a document to show how the repository is organised.

For example, I might want to group the first layer transformations in one directory and the reporting transformations in another directory.

Use version control

I'm going to go out on a limb and say that analysts need to use git. If you could learn SQL, you can learn git. An IDE would help if the terminal gives you nightmares, though I'd still recommend the terminal. Version control not just helps you to figure out what happened historically but also why things are done a certain way if git is used well, which in my opinion is more important.

Describe your SQL

Most data warehouses allow you to add a description to your columns so use it! There are times when the logic for a column is complicated and cannot be described with just the column name, especially for reporting tables. But maintenance of the descriptions is key so the description is up to date with the logic of the column. Looking at SQL is no fun for users of your table that don't know SQL.

Show Comments