Search This Blog

March 05, 2021

Optimizing transactions in Azure Data Warehouse (Synapse)

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-develop-best-practices-transactions  

Minimally logged operations

The following operations are capable of being minimally logged:

  • CREATE TABLE AS SELECT (CTAS)
  • INSERT..SELECT
  • CREATE INDEX
  • ALTER INDEX REBUILD
  • DROP INDEX
  • TRUNCATE TABLE
  • DROP TABLE
  • ALTER TABLE SWITCH PARTITION
CTAS and INSERT...SELECT are both bulk load operations

Primary IndexLoad ScenarioLogging Mode
HeapAnyMinimal
Clustered IndexEmpty target tableMinimal
Clustered IndexLoaded rows do not overlap with existing pages in targetMinimal
Clustered IndexLoaded rows overlap with existing pages in targetFull
Clustered Columnstore IndexBatch size >= 102,400 per partition aligned distributionMinimal
Clustered Columnstore IndexBatch size < 102,400 per partition aligned distributionFull

DELETE is a fully logged operation. If you need to delete a large amount of data in a table or a partition, it often makes more sense to SELECT the data you wish to keep, which can be run as a minimally logged operation. To select the data, create a new table with CTAS. Once created, use RENAME to swap out your old table with the newly created table.

UPDATE is a fully logged operation. If you need to update a large number of rows in a table or a partition, it can often be far more efficient to use a minimally logged operation such as CTAS to do so

Both UPDATE and DELETE are fully logged operations and so these undo/redo operations can take significantly longer than equivalent minimally logged operations.


March 04, 2021

Transferring data to and from Azure

 

CapabilityAzure CLIAzCopyPowerShellAdlCopyPolyBase
Compatible platformsLinux, OS X, WindowsLinux, WindowsWindowsLinux, OS X, WindowsSQL Server, Azure Synapse
Optimized for big dataNoYesNoYes 1Yes 2
Copy to relational databaseNoNoNoNoYes
Copy from relational databaseNoNoNoNoYes
Copy to Blob storageYesYesYesNoYes
Copy from Blob storageYesYesYesYesYes
Copy to Data Lake StoreNoYesYesYesYes
Copy from Data Lake StoreNoNoYesYesYes



CapabilityAzure Storage ExplorerAzure portal *Azure Data Factory
Optimized for big dataNoNoYes
Copy to relational databaseNoNoYes
Copy from relational databaseNoNoYes
Copy to Blob storageYesNoYes
Copy from Blob storageYesNoYes
Copy to Data Lake StoreNoNoYes
Copy from Data Lake StoreNoNoYes
Upload to Blob storageYesYesYes
Upload to Data Lake StoreYesYesYes
Orchestrate data transfersNoNoYes
Custom data transformationsNoNoYes
Pricing modelFreeFreePay per usage

March 03, 2021

Data analytics technology in Azure

 

CapabilityPower BIJupyter NotebooksZeppelin NotebooksMicrosoft Azure Notebooks
Connect to big data cluster for advanced processingYesYesYesNo
Managed serviceYesYes 1Yes 1Yes
Connect to 100s of data sourcesYesNoNoNo
Offline capabilitiesYes 2NoNoNo
Embedding capabilitiesYesNoNoNo
Automatic data refreshYesNoNoNo
Access to numerous open source packagesNoYes 3Yes 3Yes 4
Data transformation/cleansing optionsPower Query, R40 languages, including Python, R, Julia, and Scala20+ interpreters, including Python, JDBC, and RPython, F#, R
PricingFree for Power BI Desktop (authoring), see pricing for hosting optionsFreeFreeFree
Multiuser collaborationYesYes (through sharing or with a multiuser server like JupyterHub)YesYes (through sharing)

[1] When used as part of a managed HDInsight cluster.

[2] With the use of Power BI Desktop.

[2] You can search the Maven repository for community-contributed packages.

[3] Python packages can be installed using either pip or conda. R packages can be installed from CRAN or GitHub. Packages in F# can be installed via nuget.org using the Paket dependency manager.

March 02, 2021

Data pipeline orchestration technology in Azure

 

Capability matrix

The following tables summarize the key differences in capabilities.

General capabilities

GENERAL CAPABILITIES
CapabilityAzure Data FactorySQL Server Integration Services (SSIS)Oozie on HDInsight
ManagedYesNoYes
Cloud-basedYesNo (local)Yes
PrerequisiteAzure SubscriptionSQL ServerAzure Subscription, HDInsight cluster
Management toolsAzure Portal, PowerShell, CLI, .NET SDKSSMS, PowerShellBash shell, Oozie REST API, Oozie web UI
PricingPay per usageLicensing / pay for featuresNo additional charge on top of running the HDInsight cluster

Pipeline capabilities

PIPELINE CAPABILITIES
CapabilityAzure Data FactorySQL Server Integration Services (SSIS)Oozie on HDInsight
Copy dataYesYesYes
Custom transformationsYesYesYes (MapReduce, Pig, and Hive jobs)
Azure Machine Learning scoringYesYes (with scripting)No
HDInsight On-DemandYesNoNo
Azure BatchYesNoNo
Pig, Hive, MapReduceYesNoYes
SparkYesNoNo
Execute SSIS PackageYesYesNo
Control flowYesYesYes
Access on-premises dataYesYesNo

Scalability capabilities

SCALABILITY CAPABILITIES
CapabilityAzure Data FactorySQL Server Integration Services (SSIS)Oozie on HDInsight
Scale upYesNoNo
Scale outYesNoYes (by adding worker nodes to cluster)
Optimized for big dataYesNoYes