Table of contents
1. Introduction
2. Azure Synapse SQL architecture
3. Synapse SQL architecture components
   3.1. Azure Storage
   3.2. Control node
   3.3. Compute nodes
   3.4. Data Movement Service
   3.5. Distributions
   3.6. Hash-distributed tables
   3.7. Round-robin distributed tables
   3.8. Replicated tables
4. Frequently Asked Questions
   4.1. What is Azure Synapse SQL?
   4.2. Is Azure Synapse a relational database?
   4.3. What is Azure Synapse used for?
5. Conclusion
Last Updated: Mar 27, 2024

Azure Synapse SQL

Author Sanjana Yadav

Introduction

Azure Synapse SQL is a big data analytics service that lets you query and analyze data using the T-SQL language. For data analysis, you can use the standard, ANSI-compliant SQL dialect used on SQL Server and Azure SQL Database.

In this article, we will learn about the Azure Synapse SQL architecture components. 

Azure Synapse SQL architecture

Synapse SQL uses a scale-out architecture to distribute data processing across multiple nodes. Because compute is separate from storage, you can scale compute independently of the data in your system.

The unit of scale for a dedicated SQL pool is an abstraction of compute power known as a Data Warehouse Unit (DWU).

A serverless SQL pool scales automatically to meet query resource requirements. As the topology changes over time due to nodes being added, removed, or failing over, it adapts and ensures that your query has enough resources and completes successfully. For example, the image below depicts a serverless SQL pool executing a query across four Compute nodes.

Dedicated SQL Pool (Source: Microsoft)

Serverless SQL Pool (Source: Microsoft)

Synapse SQL incorporates a node-based architecture. Applications connect to and issue T-SQL commands to a Control node, which serves as the single point of entry for Synapse SQL.

The Synapse SQL Control node uses a distributed query engine to optimize queries for parallel processing and then passes operations to the Compute nodes, which do their work in parallel.

The serverless SQL pool Control node uses the Distributed Query Processing (DQP) engine to optimize and orchestrate distributed execution of user queries by breaking them down into smaller queries that run on Compute nodes. Each smaller query is called a task and represents a distributed execution unit. A task reads files from storage, joins results from other tasks, and groups or orders data from other tasks.

The Compute nodes store all user data in Azure Storage and run the parallel queries. The Data Movement Service (DMS) is a system-level internal service that moves data across nodes as needed to run queries in parallel and return accurate results.

When using Synapse SQL with decoupled storage and compute, you can size compute power independently of your storage requirements. Scaling is automatic for serverless SQL pools, but for dedicated SQL pools you can:

  • Increase or decrease compute power within a dedicated SQL pool without moving data.
  • Pause compute capacity while keeping data intact, so you only pay for storage.
  • Resume compute capacity during operational hours.
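For dedicated SQL pools, scaling compute can be done with plain T-SQL. A minimal sketch, assuming a pool named MySqlPool and a target service level of DW200c (both placeholders); pausing and resuming are done through the Azure portal, PowerShell, or the REST API rather than T-SQL:

```sql
-- Run from the master database of the logical server.
-- 'MySqlPool' and 'DW200c' are illustrative placeholders.
ALTER DATABASE MySqlPool
MODIFY (SERVICE_OBJECTIVE = 'DW200c');
```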

Synapse SQL architecture components

Azure Storage

Synapse SQL uses Azure Storage to protect your user data. There is a separate charge for storage consumption because your data is stored and managed by Azure Storage.

You can query your data lake files using a serverless SQL pool, whereas a dedicated SQL pool can query and ingest data from your data lake files. When data is ingested into a dedicated SQL pool, it is sharded into distributions to improve system performance. When you define the table, you can specify which sharding pattern to use to distribute the data. The following sharding patterns are supported:

  • Hash
  • Round-robin
  • Replicate
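The sharding pattern is chosen in the table definition through the DISTRIBUTION option. A minimal sketch (table and column names are placeholders):

```sql
-- DISTRIBUTION accepts HASH(column), ROUND_ROBIN, or REPLICATE.
CREATE TABLE dbo.MyTable
(
    id  INT NOT NULL,
    val NVARCHAR(100)
)
WITH (DISTRIBUTION = ROUND_ROBIN);
```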

Control node

The Control node is the brain of the architecture. It is the front end that all applications and connections interact with.

The distributed query engine runs on the Control node to optimize and coordinate parallel queries in Synapse SQL. When you send a T-SQL query to the dedicated SQL pool, the Control node converts it into queries that run in parallel against each distribution.

The DQP engine runs on the Control node in a serverless SQL pool to optimize and coordinate distributed execution of user queries by splitting them into smaller queries that are executed on Compute nodes. It also assigns sets of files to each node to process.
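A serverless SQL pool query typically reads files directly from the data lake, and the DQP engine then splits it into tasks over those files. A sketch using OPENROWSET; the storage account, container, and path are placeholders:

```sql
-- Query Parquet files in the data lake from a serverless SQL pool.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;
```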

Compute nodes

The Compute nodes provide the computational power.

Distributions are mapped to Compute nodes in a dedicated SQL pool for processing. As you pay for more compute resources, the pool remaps distributions to the available Compute nodes. The dedicated SQL pool's service level determines the number of Compute nodes, which ranges from 1 to 60. Each Compute node has a unique node ID that can be seen in system views. Look for the pdw_node_id column in system views whose names begin with sys.pdw_nodes to see the Compute node ID. See Synapse SQL system views for a list of these system views.
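For example, one such system view lists the nodes themselves (a sketch; run against a dedicated SQL pool):

```sql
-- Each row is a node; pdw_node_id is the node's unique ID.
SELECT pdw_node_id, [type], name
FROM sys.dm_pdw_nodes;
```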

In a serverless SQL pool, each Compute node is assigned a task and a set of files on which to execute it. A task is a distributed query execution unit, one part of the query the user submitted. Automatic scaling ensures that enough Compute nodes are available to execute user queries.

Data Movement Service

Data Movement Service (DMS) is a data transport mechanism in a dedicated SQL pool that facilitates data movement across Compute nodes. Some queries need data movement for simultaneous queries to produce accurate results. When data movement is needed, DMS guarantees that the correct data is sent to the correct destination.
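You can see whether a query will need data movement by inspecting its distributed plan with EXPLAIN, which returns an XML plan in which movement steps (such as SHUFFLE_MOVE or BROADCAST_MOVE) appear as operations. A sketch with illustrative table and column names:

```sql
-- Show the distributed execution plan without running the query.
EXPLAIN
SELECT c.region, SUM(s.amount) AS total_sales
FROM dbo.Sales AS s
JOIN dbo.Customers AS c
    ON s.customer_id = c.customer_id
GROUP BY c.region;
```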

Distributions

A distribution is the fundamental unit of storage and processing for parallel queries running on distributed data in a dedicated SQL pool. When a query runs in a dedicated SQL pool, the work is split into 60 smaller queries that run in parallel.

Each of the 60 smaller queries runs on a different data distribution. Each Compute node manages one or more of the 60 distributions. A dedicated SQL pool with maximum compute resources has one distribution per Compute node; a dedicated SQL pool with minimum compute resources has all the distributions on one Compute node.
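You can inspect how a table's rows are spread over the 60 distributions with DBCC PDW_SHOWSPACEUSED (the table name is a placeholder):

```sql
-- Returns one row per distribution with row counts and space used,
-- which makes data skew easy to spot.
DBCC PDW_SHOWSPACEUSED('dbo.FactSales');
```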

Hash-distributed tables

A hash-distributed table can provide the best query performance for joins and aggregations on large tables.

To shard data into a hash-distributed table, a dedicated SQL pool uses a hash function to deterministically assign each row to one distribution. One of the columns in the table definition is designated as the distribution column, and the hash function uses its values to assign each row to a distribution.

The figure below shows how a complete (non-distributed) table is stored as a hash-distributed table.

Source: Microsoft 

  • Each row belongs to a single distribution.
  • Each row is assigned to one distribution through a deterministic hash algorithm.
  • The number of table rows varies across distributions.

The selection of a distribution column must consider performance factors like distinctness, data skew, and the types of queries executed on the system.
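A sketch of a hash-distributed fact table; the names are illustrative, and the distribution column here is chosen as a high-cardinality column that is frequently joined on:

```sql
CREATE TABLE dbo.FactSales
(
    sale_id     BIGINT NOT NULL,
    customer_id INT    NOT NULL,
    amount      DECIMAL(18, 2)
)
WITH
(
    -- Rows with the same customer_id land in the same distribution.
    DISTRIBUTION = HASH(customer_id),
    CLUSTERED COLUMNSTORE INDEX
);
```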

Round-robin distributed tables

A round-robin table is the simplest to create and delivers fast performance when used as a staging table for loads.

A round-robin distributed table distributes data evenly across the table without any further optimization. A distribution is first chosen at random, and then buffers of rows are assigned to distributions sequentially. Loading data into a round-robin table is fast, but hash-distributed tables often provide better query performance. Joins on round-robin tables require reshuffling data, which takes additional time.
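A common pattern is to load into a round-robin staging table and then redistribute with CREATE TABLE AS SELECT (CTAS); a sketch with illustrative names:

```sql
-- Fast load target: round-robin heap staging table.
CREATE TABLE dbo.StageSales
(
    sale_id     BIGINT,
    customer_id INT,
    amount      DECIMAL(18, 2)
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);

-- Redistribute by hash for better join and aggregation performance.
CREATE TABLE dbo.FactSales
WITH (DISTRIBUTION = HASH(customer_id), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.StageSales;
```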

Replicated tables

For small tables, a replicated table delivers the fastest query performance.

A replicated table caches a complete copy of the table on each Compute node. As a result, replicating a table eliminates the need to transfer data between Compute nodes before a join or aggregation. Replicated tables work best with small tables: extra storage is needed, and there is additional overhead when writing data, which makes replication impractical for large tables.
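A small dimension table is a typical candidate for replication; a sketch with illustrative names:

```sql
-- A full copy of this table is cached on every Compute node,
-- so joins against it need no data movement.
CREATE TABLE dbo.DimRegion
(
    region_id   INT NOT NULL,
    region_name NVARCHAR(50)
)
WITH (DISTRIBUTION = REPLICATE);
```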

The figure below depicts a replicated table that is cached on each compute node's initial distribution.
 

Source: Microsoft 

Frequently Asked Questions

What is Azure Synapse SQL?

Azure Synapse SQL is a big data analytics solution that lets you query and analyze data using the T-SQL programming language. For data analysis, you can utilize the standard ANSI-compliant SQL dialect used on SQL Server and Azure SQL Database.

Is Azure Synapse a relational database?

Azure Synapse Analytics is the next step in the evolution of Azure SQL Data Warehouse. Azure SQL Data Warehouse was a cloud-based, scale-out, massively parallel processing (MPP) relational database designed to analyze and store enormous amounts of data on the Microsoft Azure cloud platform.

What is Azure Synapse used for?

Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It enables you to query data at scale on your terms, using either serverless or dedicated resources.

Conclusion

In this article, we have extensively discussed Azure Synapse SQL. Our discussion mainly focused on its architecture and its architectural components.

We hope this blog has helped you enhance your Azure Synapse SQL knowledge. To learn more about Microsoft Azure, refer to our articles on Microsoft Azure Certification – Coding Ninjas Blog.

Refer to our guided paths on the Coding Ninjas Studio platform to learn more about DSA, DBMS, Competitive Programming, Python, Java, JavaScript, etc. 

Refer to the links: problems, top 100 SQL problems, resources, and mock tests to enhance your knowledge.

For placement preparations, visit interview experiences and interview bundle.

Do upvote our blog to help other ninjas grow. Happy Coding!
