Good Data Warehouse uses its own surrogate keys for dimension tables instead of natural key coming from a source. This way you can e.g. implement slowly changing dimensions later in the process. This time I will demonstrate how to generate surrogate keys using Databricks with Azure Synapse Analytics (f.k.a. Azure Data Warehouse).
I also demonstrate how to use inserts instead of updates when already existing row requires an update. Microsoft recommends using inserts instead of updates with Synapse. Basically this is the same with every Data Warehouse technology vendor.
I was pondering with three approaches how to generate surrogate keys when using Databricks and Azure Synapse:
- Use IDENTITY-column in Azure Synapse. Downside with this one is that values sometimes are not subsequent and even in some extraordinary cases might not be unique.
- Use Databricks’ MONOTONICALLY_INCREASING_ID-function.
- Use ROW_NUMBER functionality in Databricks’ SQL block.