Azure Data Factory Mapping Data Flows is graphical user interface to build data transformation logic without coding like in SSIS. It generates behind scenes Databricks code (Scala or Python, I’m not sure) and use Databricks cluster to execute jobs.
I think this is the breakthrough feature that cloud data integration has really needed. My first touch to so called ETL tools was about 15 years ago when I started to use Informatica PowerCenter. I was really sold and thought there is no going back to scripting. You can imagine my disappointment when cloud technology started to emerge, but it missed native graphical ETL capabilities. In AWS you have had PaaS ETL services like Talend or Matillion, but you still need to manage your software. AWS has also AWS Glue as a SaaS offering, but I wasn’t convinced at all about that. This ADF Mapping Data Flows is the one I have been waiting for. Real SaaS ETL in the cloud. I have to confess I’m already used to use Databricks Notebooks and coding, but graphical user interface still lowers a barrier for new users a lot.
I stumbled to strange error with our Azure Data Factory SSIS Integration Runtime, which I decided to share here if somebody faces the same problem. Infrastructure guys made some network changes, which lead to situation where our SSIS Integration Runtime didn’t work anymore. It gave some “MisconfiguredDnsServerOrNsgSettings” error message, which was quite natural and was corrected by returning original network settings.
Last time I
promised to blog about Azure Data Factory Data Flows, but decided to do this
first. My business problem was to process files on On-Premise file share with
SSIS without moving original files anywhere. Challenge was that with SSIS that
is not easily done without possibility to move the original file. Maybe with
log table and custom code.
to use Azure Data Factory (ADF) and Azure Blob Storage to tackle this
challenge. Even ADF is missing couple of critical features, but I managed to
use workarounds. I wanted to write about this, because I couldn’t find any good
instructions for this process in one place.
In ADF there
is a nice component called “Get Metadata”. Basically you could do this also
with SSIS script component. I decided to use File Creation Date and store that
into BatchControl table for the next run.
Azure Databricks was released into preview on November 2017 and GA was on March 2018. Data Factory announced its support for Azure Databricks on April 2018 and I kind of missed that. Maybe because it was at the beginning profiled more as a Machine Learning part of the Data Warehouse infrastructure.
My interest was awaken when I saw a presentation held at PASS Summit where Databricks was positioned as central part of Microsoft’s modern data warehouse architecture.
Point is to use Data Factory as a data sourcing and pipeline orchestration tool with its wide connectivity capabilities to different data sources. Data is stored first into Azure, e.g. into blob storage. Databricks is used to do required data transformations before moving data into proper database (e.g. Azure SQL Database, Azure SQL DW V2, Azure Snowflake) for reporting and analysis purposes. Note also, there is no mentioning about SSIS in this architecture.
I wanted to test how I can connect other than public data sources to my Azure Data Factory SSIS IR. Namely I wanted to be able to read files from a file share. Originally I thought I would need so called Self-Hosted Integration Runtime, but actually you can connect SSIS IR to your On-Premise data sources. If you don’t have SSIS IR, then you would need Self-Hosted IR connecting On-Premise data sources. It’s basically installs own Azure VM to your Azure environment.
I tested also connection to On-Premise SQL Server in other environment. It was quite tricky and took me a while until I found this article: https://docs.microsoft.com/en-us/sql/integration-services/lift-shift/ssis-azure-connect-with-windows-auth?view=sql-server-2017
According to it I needed to store my domain credentials into SSISDB like this: catalog.set_execution_credential @user='<your user name>’, @domain='<your domain name>’, @password='<your password>’
After that I finally got my connection to work. Of course other prerequisite was to open On-Premise firewall for connections from SSIS IR.
No niin, kesä ja golf-kausi alkavat olla ohi, niin voisi alkaa taas vaikka seuraamaan kiekkoa. Päivitin Liiga-analyysini koskemaan nyt kautta 2018-19 ja yritän muistaa päivittää tilastoja säännönmukaisesti.
Vielä ei kannata syvää luotaavia analyyseja tehdä. Ensin yksi poiminta viime kauden lopusta. Viime kautena analysoin myös maalien, laukauksien ja aloituksien välisiä korrelaatioita. Kauden lopussa korrelaatioluvut olivat maalit/aloitukset 0,23, maalit/laukaukset 0,25 ja aloitukset/laukaukset 0,33. Eli selkeää korrelaatiota oli havaittavissa maalien sekä laukauksien ja aloitusten välillä. Nyt alkukauden osalta vastaavat korrelaatioluvut ovat 0,09, -0,16 ja 0,54.
Päivitin Trafin datan ja raporttini myös alkuvuoden automyyntien osalta. Kappalemääräisen kokonaismyynnin osalta kasvua näyttäisi olevan peräti 7,6 %. Merkkien osalta kärjessä edelleen Toyota, Volkswagen ja Skoda. Volvon (-5,3 %), Mercedeksen (-3,9 %) , Kian (-9,6 %) ja BMW:n (-4,7 %) osuudet ovat hienoisessa laskussa, Renautin (-33,2 %) ja Hondan (-18,1 %) myynnit suuremmassa laskussa. Ford (+18,3 %) ja etenkin Hyundai (+80,9 %), Peugeot (+51,6 %) ja Seat (+42,8 %) ovat kovassa nousussa.
This time I like to share my experiences how I could utilize files saved on my Azure Virtual Machine. My ultimate goal was to test can I access those with Azure Data Factory SSIS Integration Runtime without copying the files into Azure File Share. However I decided to start trying to access those files from my own workstation with map network drive method and not with remote desktop connection.
First I shared a folder on my Azure Virtual Machine. Then I started googling and found some instructions how I could connect my computer to Azure Virtual Network (VNet). I had to create so called point-to-site connection to my Azure (VNet). I had already my VNet in place and my virtual machine connected to it. VNet is type “new”, I don’t know what it is called officially, but not Classic one.
I started from creating a Gateway subnet into my VNet. You can create it by selecting first Subnets and then by clicking “Gateway subnet”.
Päivitin vuoden 2017 moottoripyörien myyntiä koskevan analyysini tuoreella 1-6 2018 Trafin datalla. Muutin vähän käsittelysääntöjä ja rupesi oikein itseänikin epäilyttämään meniköhän kaikki oikein. Kasvua nimittäin vuoden takaiseen vastaavaan ajankohtaan olisi peräti 8,2 %! No tällä mennään kunnes jostain tulee varmistus, että väärin meni.