Azure Automated Machine Learning

I learned just recently about new feature what Azure Machine Learning holds – Auto Machine Learning (AutoML). According to Microsoft’s documentation:

Traditional machine learning model development is resource-intensive, requiring significant domain knowledge and time to produce and compare dozens of models. Apply automated ML when you want Azure Machine Learning to train and tune a model for you using the target metric you specify. The service then iterates through ML algorithms paired with feature selections, where each iteration produces a model with a training score. The higher the score, the better the model is considered to “fit” your data.

Sounds good! I don’t have such a vast data science experience, so this kind of automated experimentation is just for people like me. There is no need for time consuming experimentation, because I don’t even know most of the algorithms.

Jatka lukemista “Azure Automated Machine Learning”

Databricks, Azure Synapse, surrogate keys, inserts instead of updates

Good Data Warehouse uses its own surrogate keys for dimension tables instead of natural key coming from a source. This way you can e.g. implement slowly changing dimensions later in the process. This time I will demonstrate how to generate surrogate keys using Databricks with Azure Synapse Analytics (f.k.a. Azure Data Warehouse).

I also demonstrate how to use inserts instead of updates when already existing row requires an update. Microsoft recommends using inserts instead of updates with Synapse. Basically this is the same with every Data Warehouse technology vendor.

I was pondering with three approaches how to generate surrogate keys when using Databricks and Azure Synapse:

  1. Use IDENTITY-column in Azure Synapse. Downside with this one is that values sometimes are not subsequent and even in some extraordinary cases might not be unique.
  2. Use Databricks’ MONOTONICALLY_INCREASING_ID-function.
  3. Use ROW_NUMBER functionality in Databricks’ SQL block.
Jatka lukemista “Databricks, Azure Synapse, surrogate keys, inserts instead of updates”

Live analysis of Finnish ice hockey league (Liiga) season 2019-20 and betting odds

I updated my analysis of Finnish ice hockey league to cover now season 2019-20. Goal is to update this analysis frequently almost after every round.

I also updated my betting odds machine learning model to forecast odds for coming games. Last season this model proved to be very successful.  In a nutshell where I should have lost 138 €, I had won 60€!  More details comments section.

Jatka lukemista “Live analysis of Finnish ice hockey league (Liiga) season 2019-20 and betting odds”

Azure Data Factory Mapping Data Flows

Azure Data Factory Mapping Data Flows is graphical user interface to build data transformation logic without coding like in SSIS. It generates behind scenes Databricks code (Scala or Python, I’m not sure) and use Databricks cluster to execute jobs.

I think this is the breakthrough feature that cloud data integration has really needed. My first touch to so called ETL tools was about 15 years ago when I started to use Informatica PowerCenter. I was really sold and thought there is no going back to scripting. You can imagine my disappointment when cloud technology started to emerge, but it missed native graphical ETL capabilities. In AWS you have had PaaS ETL services like Talend or Matillion, but you still need to manage your software. AWS has also AWS Glue as a SaaS offering, but I wasn’t convinced at all about that. This ADF Mapping Data Flows is the one I have been waiting for. Real SaaS ETL in the cloud. I have to confess I’m already used to use Databricks Notebooks and coding, but graphical user interface still lowers a barrier for new users a lot.

Jatka lukemista “Azure Data Factory Mapping Data Flows”

Azure Data Factory SSIS Integration Runtime error

I stumbled to strange error with our Azure Data Factory SSIS Integration Runtime, which I decided to share here if somebody faces the same problem. Infrastructure guys made some network changes, which lead to situation where our SSIS Integration Runtime didn’t work anymore. It gave some “MisconfiguredDnsServerOrNsgSettings” error message, which was quite natural and was corrected by returning original network settings.

Jatka lukemista “Azure Data Factory SSIS Integration Runtime error”

Using Azure Data Factory to copy only new on-premise files, process 0-n files and delete those afterwards

Last time I promised to blog about Azure Data Factory Data Flows, but decided to do this first. My business problem was to process files on On-Premise file share with SSIS without moving original files anywhere. Challenge was that with SSIS that is not easily done without possibility to move the original file. Maybe with log table and custom code. 

I decided to use Azure Data Factory (ADF) and Azure Blob Storage to tackle this challenge. Even ADF is missing couple of critical features, but I managed to use workarounds. I wanted to write about this, because I couldn’t find any good instructions for this process in one place.

In ADF there is a nice component called “Get Metadata”. Basically you could do this also with SSIS script component. I decided to use File Creation Date and store that into BatchControl table for the next run.

Jatka lukemista “Using Azure Data Factory to copy only new on-premise files, process 0-n files and delete those afterwards”

Using Azure Databricks with Azure Data Factory

Azure Databricks was released into preview on November 2017 and GA was on March 2018. Data Factory announced its support for Azure Databricks on April 2018 and I kind of missed that. Maybe because it was at the beginning profiled more as a Machine Learning part of the Data Warehouse infrastructure.

My interest was awaken when I saw a presentation held at PASS Summit where Databricks was positioned as central part of Microsoft’s modern data warehouse architecture.

Point is to use Data Factory as a data sourcing and pipeline orchestration tool with its wide connectivity capabilities to different data sources. Data is stored first into Azure, e.g. into blob storage. Databricks is used to do required data transformations before moving data into proper database (e.g. Azure SQL Database, Azure SQL DW V2, Azure Snowflake) for reporting and analysis purposes. Note also, there is no mentioning about SSIS in this architecture.

Jatka lukemista “Using Azure Databricks with Azure Data Factory”

Connecting Azure Data Factory SSIS Integration Runtime (IR) to your Azure VM (and your On-Premise data sources)

I wanted to test how I can connect other than public data sources to my Azure Data Factory SSIS IR. Namely I wanted to be able to read files from a file share. Originally I thought I would need so called Self-Hosted Integration Runtime, but actually you can connect SSIS IR to your On-Premise data sources. If you don’t have SSIS IR, then you would need Self-Hosted IR connecting On-Premise data sources. It’s basically installs own Azure VM to your Azure environment.

I tested also connection to On-Premise SQL Server in other environment. It was quite tricky and took me a while until I found this article:

According to it I needed to store my domain credentials into SSISDB like this: catalog.set_execution_credential @user='<your user name>’, @domain='<your domain name>’, @password='<your password>’

After that I finally got my connection to work. Of course other prerequisite was to open On-Premise firewall for connections from SSIS IR.

According to this document I was supposed to create my own Azure Virtual Network and connect my SSIS IR and my Azure VM to that. Creation of VNet is straightforward task.

Jatka lukemista “Connecting Azure Data Factory SSIS Integration Runtime (IR) to your Azure VM (and your On-Premise data sources)”