
September 10, 2019

Azure Data Factory - Execute Pipeline Activity

The Execute Pipeline activity is used to call another Azure Data Factory pipeline. This helps you develop reusable pipelines and call them from other pipelines.

For example, you can create one pipeline for archiving files (moving the blobs or files to different folders) after processing and call that pipeline from multiple pipelines. This reduces development time and increases code reusability.

Similarly, you can create a separate pipeline for sending emails in Azure Data Factory and re-use it in multiple pipelines.


In the following screenshot, we are calling two different pipelines (Archiving Pipeline, Send_Error_Message).
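
Behind the designer, each of these calls is just an Execute Pipeline activity in the pipeline JSON. Below is a minimal sketch of one such activity; the activity, pipeline, and parameter names are only illustrative:

{
    "name": "Call Archiving Pipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {
            "referenceName": "ArchivingPipeline",
            "type": "PipelineReference"
        },
        "parameters": {
            "sourceFolder": "@pipeline().parameters.sourceFolder"
        },
        "waitOnCompletion": true
    }
}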









For more information, please check the following
https://docs.microsoft.com/en-us/azure/data-factory/control-flow-execute-pipeline-activity




September 09, 2019

Azure Data Warehouse - Materialized View

Materialized View

Materialized Views are similar to Indexed Views in SQL Server. These are used to increase query performance and hide complex query logic from the users.

To re-enable a disabled materialized view, use the ALTER MATERIALIZED VIEW statement with the REBUILD option.

ALTER MATERIALIZED VIEW My_Indexed_View REBUILD;

Materialized views are visible within SQL Server Management Studio under the Views folder of the Azure SQL Data Warehouse instance.

Limitations

  • Materialized views are disabled during update or delete operations on the base tables.
  • Only Clustered Columnstore Index is supported.
  • Only Hash and Round Robin distributions are supported.

Sample Code

CREATE MATERIALIZED VIEW mv_test2
WITH (distribution = hash(i_category_id), FOR_APPEND)
AS
SELECT MAX(i.i_rec_start_date) as max_i_rec_start_date, MIN(i.i_rec_end_date) as min_i_rec_end_date, i.i_item_sk, i.i_item_id, i.i_category_id
FROM syntheticworkload.item i
GROUP BY i.i_item_sk, i.i_item_id, i.i_category_id


For more information, check the following link

https://docs.microsoft.com/en-us/sql/t-sql/statements/create-materialized-view-as-select-transact-sql?view=azure-sqldw-latest


September 08, 2019

Azure Data Factory - If Condition activity

The If Condition activity is used to perform different activities based on an input expression. In the following example, we check whether a file is present using the GetMetadata activity. If the file exists, the activities under "If True" are executed; if it does not exist, the activities under "If False" are executed.

Passing input values for If condition activity



Set validation expression in settings for If condition activity.


Adding activity for True or False conditions.
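
Putting these pieces together, a trimmed-down If Condition activity definition could look like the sketch below. The activity names are illustrative, and the true/false activity lists are left empty for brevity:

{
    "name": "If File Exists",
    "type": "IfCondition",
    "dependsOn": [
        { "activity": "Get File Metadata", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": {
        "expression": {
            "value": "@activity('Get File Metadata').output.exists",
            "type": "Expression"
        },
        "ifTrueActivities": [ ],
        "ifFalseActivities": [ ]
    }
}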


For more information, please check https://docs.microsoft.com/en-us/azure/data-factory/control-flow-if-condition-activity

September 07, 2019

Azure Data Factory - For Each Activity

"ForEach" activity in Azure Data Factory is used to loop thru each row of the input and perform the actions based on the data.

The following screenshot shows two activities: a GetMetadata activity and a ForEach loop. The output of GetMetadata is passed to the ForEach activity in its Settings. If the "Sequential" checkbox is selected, the inner activities are performed in sequence for all child items; if parallel execution is needed, leave "Sequential" unchecked.


"ForEach" activity can be used to an output of Lookup activity as well.

You can perform multiple activities for each item received as input to the ForEach activity. You can add these activities by going to Activities and selecting Edit Activities, as sketched below.
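
As a rough sketch, a ForEach activity that loops over the child items returned by a GetMetadata activity could be defined as follows (activity names are illustrative and the inner activities are omitted). Setting "isSequential" to false corresponds to un-checking the "Sequential" checkbox, and "batchCount" controls the degree of parallelism:

{
    "name": "ForEach File",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": false,
        "batchCount": 20,
        "items": {
            "value": "@activity('Get Folder Metadata').output.childItems",
            "type": "Expression"
        },
        "activities": [ ]
    }
}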

Limitations and workarounds

Here are some limitations of the ForEach activity and suggested workarounds.

  • Limitation: You can't nest a ForEach loop inside another ForEach loop (or an Until loop). Workaround: Design a two-level pipeline where the outer pipeline, with the outer ForEach loop, iterates over an inner pipeline that contains the nested loop.
  • Limitation: The ForEach activity has a maximum batchCount of 50 for parallel processing, and a maximum of 100,000 items. Workaround: Design a two-level pipeline where the outer pipeline, with the ForEach activity, iterates over an inner pipeline.

For more information, please check the documentation at  https://docs.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity

September 06, 2019

Azure Analysis Services

Supported Features

  • AAS is available in 3 tiers (Developer for development, Basic and Standard)
  • We can pause and resume the server when it is not in use to reduce cost.
  • Supports only the tabular model.
  • Scale out to up to 8 nodes when the load on the model increases.
  • Partitions are supported
  • Row-level permissions are supported.
  • Perspectives are supported.
  • Bi-directional relations are supported.
  • Metadata is stored in Azure Blob storage.

Limitations
  • A multidimensional model is not supported.
  • Power Pivot for SharePoint is not supported.



September 05, 2019

Azure SQL Server VS Azure Data Warehouse

Azure SQL Server
  • Database size limit is 4TB
  • Optimized for OLTP loads
  • Cross-database queries are supported
  • Automatic performance tuning of a database is supported


Azure Data Warehouse
  • Supports up to 1 petabyte (1,024 TB) of data.
  • Optimized for OLAP loads
  • MPP (Massively Parallel Processing) system (data is processed on multiple parallel nodes)
  • Polybase support to load data from multiple systems.
  • Cross-database queries are not supported.
  • Manual performance tuning.
  • You can pause and resume the database to reduce cost.

September 04, 2019

Azure Data Factory Stored Procedure Activity

The Stored Procedure activity is used to execute a stored procedure from Azure Data Factory. You can use it to maintain audit logs, update metadata information, track files in Azure blobs based on GetMetadata activity output, or create dynamic external tables in Azure Data Warehouse.

To pass parameter values to the stored procedure, you can use pipeline parameters, and use dynamic content to pass dynamic values. You can import all the stored procedure parameters using the import parameters option.
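
For reference, a simplified Stored Procedure activity definition might look like the following sketch (the linked service, procedure, and parameter names are only examples):

{
    "name": "Log File Audit",
    "type": "SqlServerStoredProcedure",
    "linkedServiceName": {
        "referenceName": "AzureSqlDatabaseLinkedService",
        "type": "LinkedServiceReference"
    },
    "typeProperties": {
        "storedProcedureName": "dbo.usp_InsertAuditLog",
        "storedProcedureParameters": {
            "FileName": {
                "value": "@pipeline().parameters.fileName",
                "type": "String"
            }
        }
    }
}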


For more information on the Stored Procedure activity, please check the following: https://docs.microsoft.com/en-us/azure/data-factory/transform-data-using-stored-procedure

September 03, 2019

Azure Data Factory Execute Pipeline Activity

The Execute Pipeline activity is used to call a different pipeline in Azure Data Factory. It is similar to the Execute Package Task in SSIS.

The main purpose of Execute Pipeline is to develop common pipelines that can be reused by multiple pipelines and called when needed. This helps reduce development time and code maintenance.

If the calling pipeline needs to wait until the called pipeline completes, set "waitOnCompletion" to true.
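
In the activity JSON this is the "waitOnCompletion" property inside typeProperties, for example (the pipeline name is illustrative):

"typeProperties": {
    "pipeline": {
        "referenceName": "ArchivingPipeline",
        "type": "PipelineReference"
    },
    "waitOnCompletion": true
}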

Check the following for more details https://docs.microsoft.com/en-us/azure/data-factory/control-flow-execute-pipeline-activity


September 02, 2019

Azure Data Factory Lookup activity

The Lookup activity in Azure Data Factory is different from the Lookup transformation in SSIS. In Azure Data Factory, the Lookup activity is used to execute a stored procedure and capture its output.

It can return the first row or multiple rows of data from the stored procedure. The output can be used as input to a ForEach loop.

The Lookup activity supports a maximum of 5,000 rows in its output.
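
A cut-down Lookup activity definition that calls a stored procedure and returns all rows might look like the sketch below (the dataset, linked dataset, and procedure names are illustrative). Setting "firstRowOnly" to false returns the full result set instead of only the first row:

{
    "name": "Lookup Files To Process",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderStoredProcedureName": "dbo.usp_GetFilesToProcess"
        },
        "dataset": {
            "referenceName": "AzureSqlConfigDataset",
            "type": "DatasetReference"
        },
        "firstRowOnly": false
    }
}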

For more details, please check the following documentation: https://docs.microsoft.com/en-us/azure/data-factory/control-flow-lookup-activity

September 01, 2019

Azure Data Analytics: U-SQL Skipping headers and outputting headers

To skip the headers while reading input files, use the skipFirstNRows parameter:

USING Extractors.Csv(skipFirstNRows:1)

For complete parameter details check the following
https://docs.microsoft.com/en-us/u-sql/functions/operators/extractors/extractor-parameters



To output the header row while writing the output file, use the outputHeader parameter:

USING Outputters.Csv(outputHeader:true);

For complete parameter details check the following
https://docs.microsoft.com/en-us/u-sql/functions/operators/outputters/outputter-parameters

August 31, 2019

Microsoft Azure: Free Microsoft Azure learning videos for different roles at Pluralsight


Go to the following link and register with your email. Learning paths for three roles are available free of cost, and each path is a sequence of videos.

The Azure environment can be used to build different kinds of platforms:
  1. Web-based platforms
  2. Big Data Platform
  3. AI platform

Choose the technologies based on the platform you are interested in.

August 30, 2019

Microsoft Azure: End to End technologies used in Azure for data storage and processing


Azure Storage (Blob storage) -- Used to store data files. Acts as a landing or storage area for data coming from multiple systems. Create multiple storage containers for landing, staging, processing, and archiving. Folders are virtual, and permissions are set at the container level. Storage cost is low, and cost and data availability differ across the Hot, Cool, and Archive tiers.

Azure Data Lake (Gen-1) -- Optimized for big data storage and processing. Data can be stored in a hierarchical format, security can be enabled at the folder level, and parallel reads and writes are supported. Storage cost is higher compared to Azure Storage.

Azure Data Lake (Gen-2) -- Optimized for big data storage and processing. Supports both object-based storage and hierarchical storage, combining the advantages of Azure Storage and Azure Data Lake (Gen-1).

Azure Data Lake Analytics -- Serverless processing of big data (U-SQL jobs).

Azure Databricks -- Cluster-based processing of big data using Spark.

Azure Data Factory -- Similar to SSIS and SQL Server Agent. Used to develop control flows and different activities for processing data. Jobs can be executed once or on a schedule. Only UTC time is supported for job schedules. You can use SSIS packages, U-SQL, Databricks Spark, and native Azure Data Factory activities to process the data.

Azure SQL Server -- Supports up to 4 TB of data. Mainly used for OLTP systems. If you need a full SQL Server instance, use Azure SQL Managed Instance; you can also create a single database.

Azure SQL Data Warehouse -- Used for OLAP data marts and for data sizes greater than 4 TB. MPP processing architecture: data is processed on multiple nodes and stored in Azure Storage. You can create external tables to access HDFS and other data sources. Not all SQL features are currently supported.

Azure Analysis Services -- Tabular model for Analysis services.

Power BI -- For Reporting purpose.

Azure DevOps -- For source code control and auto-deploy the code.

August 29, 2019

Azure Key Vault -- To store sensitive data


This technology is used to hide sensitive information such as SQL connection strings, SQL user names, and passwords. The advantage is that you define key-value pairs: you give the connection string a name (the secret), and the actual connection string value is hidden from all the applications.

This helps avoid storing connection strings in source control or in the applications themselves. In all applications and in source control, we refer to the secret only by its name.
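
For example, in Azure Data Factory a linked service can reference a Key Vault secret by name instead of embedding the connection string. A rough sketch of that fragment (the secret and linked service names are illustrative):

"connectionString": {
    "type": "AzureKeyVaultSecret",
    "store": {
        "referenceName": "AzureKeyVaultLinkedService",
        "type": "LinkedServiceReference"
    },
    "secretName": "SqlConnectionString"
}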

For more information, check the following

August 28, 2019

Azure Data Factory - GetMetaData activity



The GetMetadata activity is used to get information about files present in Azure Storage, such as file size, column count, last modified date, whether the file exists, and other metadata.

The following screenshot shows how to get information about all files present in a particular folder. The folder name is passed as a pipeline parameter.
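
A simplified JSON sketch of such a GetMetadata activity, with the folder name passed to the dataset as a parameter, could look like this (the dataset and parameter names are illustrative):

{
    "name": "Get Folder Metadata",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "InputFolderDataset",
            "type": "DatasetReference",
            "parameters": {
                "folderName": "@pipeline().parameters.folderName"
            }
        },
        "fieldList": [ "childItems", "lastModified", "exists" ]
    }
}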



The output of this activity can be used as input to a Stored Procedure activity, for example to store the metadata information in an Azure SQL Database.

For more information check the following

August 26, 2019

Azure Storage: Azure Storage VS Azure Data Lake Storage (Gen-1 & Gen-II)

Azure Storage
  • Object storage
  • Virtual folders - no real folders
  • No folder-level security
  • No folder-specific performance optimizations
  • Data is stored in containers as blobs
  • Storage cost is based on the Hot, Cool, and Archive tiers
  • Available in all regions globally
  • Data replication and redundancy options

Azure Data Lake Storage (Gen-1)
  • Hierarchical file system (supports nesting of files within folders)
  • Folder-level security via fine-grained access control lists
  • Performance optimization at the folder level
  • Parallel reads and writes
  • Scaled out over multiple nodes
  • Hadoop and big data optimizations
  • Optimized for analytics workloads

Azure Data Lake Storage (Gen-2)
  • Multi-modal storage combining features from both Azure Storage and ADLS (Gen-1)
  • Enable the Hierarchical Namespace to use it as a file system
  • Data can be accessed using the object storage endpoint or the file system endpoint
  • Object storage endpoint: wasb://containername@accountname.blob.core.windows.net/
  • File system endpoint: abfs://filesystemname@accountname.dfs.core.windows.net/
  • Optimized for analytics workloads
  • Leverages partition scans and partition pruning to improve query performance

August 25, 2019

Azure Data Analytics: How to access Azure Storage Blob data from Azure Data Lakes

1. Register your Azure Storage Blob account with Azure Data Lake Analytics

Ensure that your Windows Azure Blob Storage account is registered with your Azure Data Lake Analytics account. The steps below are copied from the "Registering Your Windows Azure Blob Storage account" documentation.
  1. Navigate to the Azure Portal and log in.
  2. Navigate to your Azure Data Lake Analytics Account.
  3. Select Data Sources under Settings.
  4. Verify whether your WABS account is listed. If yes, stop here. If no, continue to next step.
  5. Click Add Data Source.
  6. Select Azure Storage from the Storage Type drop-down list.
  7. Select Select Account from the Selection Method drop-down list.
  8. Select your WABS account from the Azure Storage drop-down list.
  9. Click Add

January 09, 2019

PowerShell: How to rename multiple files in a folder at the same time



To rename a single file, you can manually edit the file name. If there are multiple files where you want to change the name or part of the name, it is easy to rename them with PowerShell.

You can use rename-item with the -replace operator as shown below. To replace ".StoredProcedure" with "_" in the names of all files in the current directory, you can use the following syntax:

   # -replace uses a regular expression, so the literal dot is escaped
   dir | Rename-Item -NewName { $_.Name -replace "\.StoredProcedure", "_" }

For complete syntax and other examples, see the rename-item documentation.

January 04, 2019

SQL SERVER: How to refresh local cache in SQL Server Management Studio

When working in SQL Server Management Studio, if new tables are created or columns are added, the local cache is not refreshed immediately and the new objects or columns are flagged as errors in the Query window. To avoid this, we need to refresh the local cache of SSMS.

If you need to refresh the local cache of SSMS, you can use one of two methods.

First Method:
       With the SSMS window selected, press the following key combination on your keyboard:

CTRL + Shift + R

Second Method:
    Select the Query window in SSMS.
    In the Edit menu:
          Go to IntelliSense
          Click Refresh Local Cache