I learned just recently about new feature what Azure Machine Learning holds – Auto Machine Learning (AutoML). According to Microsoft’s documentation:
“Traditional machine learning model development is resource-intensive, requiring significant domain knowledge and time to produce and compare dozens of models. Apply automated ML when you want Azure Machine Learning to train and tune a model for you using the target metric you specify. The service then iterates through ML algorithms paired with feature selections, where each iteration produces a model with a training score. The higher the score, the better the model is considered to “fit” your data.”
Sounds good! I don’t have such a vast data science experience, so this kind of automated experimentation is just for people like me. There is no need for time consuming experimentation, because I don’t even know most of the algorithms.
Warehouse uses its own surrogate keys for dimension tables instead of natural
key coming from a source. This way you can e.g. implement slowly changing
dimensions later in the process. This time I will demonstrate how to generate
surrogate keys using Databricks with Azure Synapse Analytics (f.k.a. Azure Data
demonstrate how to use inserts instead of updates when already existing row
requires an update. Microsoft recommends using inserts instead of updates with
Synapse. Basically this is the same with every Data Warehouse technology
pondering with three approaches how to generate surrogate keys when using
Databricks and Azure Synapse:
Use IDENTITY-column in Azure
Synapse. Downside with this one is that values sometimes are not subsequent and
even in some extraordinary cases might not be unique.
Use ROW_NUMBER functionality in
Databricks’ SQL block.
Azure Data Factory Mapping Data Flows is graphical user interface to build data transformation logic without coding like in SSIS. It generates behind scenes Databricks code (Scala or Python, I’m not sure) and use Databricks cluster to execute jobs.
I think this is the breakthrough feature that cloud data integration has really needed. My first touch to so called ETL tools was about 15 years ago when I started to use Informatica PowerCenter. I was really sold and thought there is no going back to scripting. You can imagine my disappointment when cloud technology started to emerge, but it missed native graphical ETL capabilities. In AWS you have had PaaS ETL services like Talend or Matillion, but you still need to manage your software. AWS has also AWS Glue as a SaaS offering, but I wasn’t convinced at all about that. This ADF Mapping Data Flows is the one I have been waiting for. Real SaaS ETL in the cloud. I have to confess I’m already used to use Databricks Notebooks and coding, but graphical user interface still lowers a barrier for new users a lot.
I stumbled to strange error with our Azure Data Factory SSIS Integration Runtime, which I decided to share here if somebody faces the same problem. Infrastructure guys made some network changes, which lead to situation where our SSIS Integration Runtime didn’t work anymore. It gave some “MisconfiguredDnsServerOrNsgSettings” error message, which was quite natural and was corrected by returning original network settings.
Last time I
promised to blog about Azure Data Factory Data Flows, but decided to do this
first. My business problem was to process files on On-Premise file share with
SSIS without moving original files anywhere. Challenge was that with SSIS that
is not easily done without possibility to move the original file. Maybe with
log table and custom code.
to use Azure Data Factory (ADF) and Azure Blob Storage to tackle this
challenge. Even ADF is missing couple of critical features, but I managed to
use workarounds. I wanted to write about this, because I couldn’t find any good
instructions for this process in one place.
In ADF there
is a nice component called “Get Metadata”. Basically you could do this also
with SSIS script component. I decided to use File Creation Date and store that
into BatchControl table for the next run.
Azure Databricks was released into preview on November 2017 and GA was on March 2018. Data Factory announced its support for Azure Databricks on April 2018 and I kind of missed that. Maybe because it was at the beginning profiled more as a Machine Learning part of the Data Warehouse infrastructure.
My interest was awaken when I saw a presentation held at PASS Summit where Databricks was positioned as central part of Microsoft’s modern data warehouse architecture.
Point is to use Data Factory as a data sourcing and pipeline orchestration tool with its wide connectivity capabilities to different data sources. Data is stored first into Azure, e.g. into blob storage. Databricks is used to do required data transformations before moving data into proper database (e.g. Azure SQL Database, Azure SQL DW V2, Azure Snowflake) for reporting and analysis purposes. Note also, there is no mentioning about SSIS in this architecture.
I wanted to test how I can connect other than public data sources to my Azure Data Factory SSIS IR. Namely I wanted to be able to read files from a file share. Originally I thought I would need so called Self-Hosted Integration Runtime, but actually you can connect SSIS IR to your On-Premise data sources. If you don’t have SSIS IR, then you would need Self-Hosted IR connecting On-Premise data sources. It’s basically installs own Azure VM to your Azure environment.
I tested also connection to On-Premise SQL Server in other environment. It was quite tricky and took me a while until I found this article: https://docs.microsoft.com/en-us/sql/integration-services/lift-shift/ssis-azure-connect-with-windows-auth?view=sql-server-2017
According to it I needed to store my domain credentials into SSISDB like this: catalog.set_execution_credential @user='<your user name>’, @domain='<your domain name>’, @password='<your password>’
After that I finally got my connection to work. Of course other prerequisite was to open On-Premise firewall for connections from SSIS IR.
No niin, kesä ja golf-kausi alkavat olla ohi, niin voisi alkaa taas vaikka seuraamaan kiekkoa. Päivitin Liiga-analyysini koskemaan nyt kautta 2018-19 ja yritän muistaa päivittää tilastoja säännönmukaisesti.
Vielä ei kannata syvää luotaavia analyyseja tehdä. Ensin yksi poiminta viime kauden lopusta. Viime kautena analysoin myös maalien, laukauksien ja aloituksien välisiä korrelaatioita. Kauden lopussa korrelaatioluvut olivat maalit/aloitukset 0,23, maalit/laukaukset 0,25 ja aloitukset/laukaukset 0,33. Eli selkeää korrelaatiota oli havaittavissa maalien sekä laukauksien ja aloitusten välillä. Nyt alkukauden osalta vastaavat korrelaatioluvut ovat 0,09, -0,16 ja 0,54.
Päivitin Trafin datan ja raporttini myös alkuvuoden automyyntien osalta. Kappalemääräisen kokonaismyynnin osalta kasvua näyttäisi olevan peräti 7,6 %. Merkkien osalta kärjessä edelleen Toyota, Volkswagen ja Skoda. Volvon (-5,3 %), Mercedeksen (-3,9 %) , Kian (-9,6 %) ja BMW:n (-4,7 %) osuudet ovat hienoisessa laskussa, Renautin (-33,2 %) ja Hondan (-18,1 %) myynnit suuremmassa laskussa. Ford (+18,3 %) ja etenkin Hyundai (+80,9 %), Peugeot (+51,6 %) ja Seat (+42,8 %) ovat kovassa nousussa.