Azure Data Factory Mapping Data Flows

Azure Data Factory Mapping Data Flows is graphical user interface to build data transformation logic without coding like in SSIS. It generates behind scenes Databricks code (Scala or Python, I’m not sure) and use Databricks cluster to execute jobs.

I think this is the breakthrough feature that cloud data integration has really needed. My first touch to so called ETL tools was about 15 years ago when I started to use Informatica PowerCenter. I was really sold and thought there is no going back to scripting. You can imagine my disappointment when cloud technology started to emerge, but it missed native graphical ETL capabilities. In AWS you have had PaaS ETL services like Talend or Matillion, but you still need to manage your software. AWS has also AWS Glue as a SaaS offering, but I wasn’t convinced at all about that. This ADF Mapping Data Flows is the one I have been waiting for. Real SaaS ETL in the cloud. I have to confess I’m already used to use Databricks Notebooks and coding, but graphical user interface still lowers a barrier for new users a lot.

Jatka lukemista “Azure Data Factory Mapping Data Flows”

Azure Data Factory SSIS Integration Runtime error

I stumbled to strange error with our Azure Data Factory SSIS Integration Runtime, which I decided to share here if somebody faces the same problem. Infrastructure guys made some network changes, which lead to situation where our SSIS Integration Runtime didn’t work anymore. It gave some “MisconfiguredDnsServerOrNsgSettings” error message, which was quite natural and was corrected by returning original network settings.

Jatka lukemista “Azure Data Factory SSIS Integration Runtime error”

Using Azure Data Factory to copy only new on-premise files, process 0-n files and delete those afterwards

Last time I promised to blog about Azure Data Factory Data Flows, but decided to do this first. My business problem was to process files on On-Premise file share with SSIS without moving original files anywhere. Challenge was that with SSIS that is not easily done without possibility to move the original file. Maybe with log table and custom code. 

I decided to use Azure Data Factory (ADF) and Azure Blob Storage to tackle this challenge. Even ADF is missing couple of critical features, but I managed to use workarounds. I wanted to write about this, because I couldn’t find any good instructions for this process in one place.

In ADF there is a nice component called “Get Metadata”. Basically you could do this also with SSIS script component. I decided to use File Creation Date and store that into BatchControl table for the next run.

Jatka lukemista “Using Azure Data Factory to copy only new on-premise files, process 0-n files and delete those afterwards”

Using Azure Databricks with Azure Data Factory

Azure Databricks was released into preview on November 2017 and GA was on March 2018. Data Factory announced its support for Azure Databricks on April 2018 and I kind of missed that. Maybe because it was at the beginning profiled more as a Machine Learning part of the Data Warehouse infrastructure.

My interest was awaken when I saw a presentation held at PASS Summit where Databricks was positioned as central part of Microsoft’s modern data warehouse architecture.

Point is to use Data Factory as a data sourcing and pipeline orchestration tool with its wide connectivity capabilities to different data sources. Data is stored first into Azure, e.g. into blob storage. Databricks is used to do required data transformations before moving data into proper database (e.g. Azure SQL Database, Azure SQL DW V2, Azure Snowflake) for reporting and analysis purposes. Note also, there is no mentioning about SSIS in this architecture.

Jatka lukemista “Using Azure Databricks with Azure Data Factory”

Connecting Azure Data Factory SSIS Integration Runtime (IR) to your Azure VM (and your On-Premise data sources)

I wanted to test how I can connect other than public data sources to my Azure Data Factory SSIS IR. Namely I wanted to be able to read files from a file share. Originally I thought I would need so called Self-Hosted Integration Runtime, but actually you can connect SSIS IR to your On-Premise data sources. If you don’t have SSIS IR, then you would need Self-Hosted IR connecting On-Premise data sources. It’s basically installs own Azure VM to your Azure environment.

I tested also connection to On-Premise SQL Server in other environment. It was quite tricky and took me a while until I found this article:

According to it I needed to store my domain credentials into SSISDB like this: catalog.set_execution_credential @user='<your user name>’, @domain='<your domain name>’, @password='<your password>’

After that I finally got my connection to work. Of course other prerequisite was to open On-Premise firewall for connections from SSIS IR.

According to this document I was supposed to create my own Azure Virtual Network and connect my SSIS IR and my Azure VM to that. Creation of VNet is straightforward task.

Jatka lukemista “Connecting Azure Data Factory SSIS Integration Runtime (IR) to your Azure VM (and your On-Premise data sources)”

Liiga-kausi 2018-19

No niin, kesä ja golf-kausi alkavat olla ohi, niin voisi alkaa taas vaikka seuraamaan kiekkoa. Päivitin Liiga-analyysini koskemaan nyt kautta 2018-19 ja yritän muistaa päivittää tilastoja säännönmukaisesti.

Vielä ei kannata syvää luotaavia analyyseja tehdä. Ensin yksi poiminta viime kauden lopusta. Viime kautena analysoin myös maalien, laukauksien ja aloituksien välisiä korrelaatioita. Kauden lopussa korrelaatioluvut olivat maalit/aloitukset 0,23, maalit/laukaukset 0,25 ja aloitukset/laukaukset 0,33. Eli selkeää korrelaatiota oli havaittavissa maalien sekä laukauksien ja aloitusten välillä. Nyt alkukauden osalta vastaavat korrelaatioluvut ovat 0,09, -0,16 ja 0,54.

Jatka lukemista “Liiga-kausi 2018-19”

Päivitys: uusien autojen myynti 1-6 2018

Päivitin Trafin datan ja raporttini myös alkuvuoden automyyntien osalta. Kappalemääräisen kokonaismyynnin osalta kasvua näyttäisi olevan peräti 7,6 %. Merkkien osalta kärjessä edelleen Toyota, Volkswagen ja Skoda. Volvon (-5,3 %), Mercedeksen (-3,9 %) , Kian  (-9,6 %) ja BMW:n (-4,7 %) osuudet ovat hienoisessa laskussa, Renautin (-33,2 %) ja Hondan (-18,1 %) myynnit suuremmassa laskussa. Ford (+18,3 %) ja etenkin Hyundai (+80,9 %), Peugeot (+51,6 %) ja Seat (+42,8 %) ovat kovassa nousussa.

Jatka lukemista “Päivitys: uusien autojen myynti 1-6 2018”

Utililize Azure Virtual Machine file folder from your own computer

This time I like to share my experiences how I could utilize files saved on my Azure Virtual Machine. My ultimate goal was to test can I access those with Azure Data Factory SSIS Integration Runtime without copying the files into Azure File Share. However I decided to start trying to access those files from my own workstation with map network drive method and not with remote desktop connection.

First I shared a folder on my Azure Virtual Machine. Then I started googling and found some instructions how I could connect my computer to Azure Virtual Network (VNet). I had to create so called point-to-site connection to my Azure (VNet). I had already my VNet in place and my virtual machine connected to it. VNet is type “new”, I don’t know what it is called officially, but not Classic one.

I started from creating a Gateway subnet into my VNet. You can create it by selecting first Subnets and then by clicking “Gateway subnet”.

Jatka lukemista “Utililize Azure Virtual Machine file folder from your own computer”

Uudet moottoripyörät 1-6 2018

Päivitin vuoden 2017 moottoripyörien myyntiä koskevan analyysini tuoreella 1-6 2018 Trafin datalla. Muutin vähän käsittelysääntöjä ja rupesi oikein itseänikin epäilyttämään meniköhän kaikki oikein. Kasvua nimittäin vuoden takaiseen vastaavaan ajankohtaan olisi peräti 8,2 %! No tällä mennään kunnes jostain tulee varmistus, että väärin meni.

Jatka lukemista “Uudet moottoripyörät 1-6 2018”