Azure data platform end-to-end

This example scenario demonstrates how to use the extensive family of Azure Data Services to build a modern data platform capable of handling the most common data challenges in an organization.

The solution described in this article combines a range of Azure services that will ingest, process, store, serve, and visualize data from different sources, both structured and unstructured.

This solution architecture demonstrates how a single, unified data platform can be used to meet the most common requirements for:

  • Traditional relational data pipelines
  • Big data transformations
  • Unstructured data ingestion and enrichment with AI-based functions
  • Stream ingestion and processing following the Lambda architecture
  • Serving insights for data-driven applications and rich data visualization

Relevant use cases

This approach can also be used to:

  • Establish an enterprise-wide data hub consisting of a data warehouse for structured data and a data lake for semi-structured and unstructured data. This data hub becomes the single source of truth for your data.
  • Integrate relational data sources with unstructured datasets by using big data processing technologies.
  • Use semantic modeling and powerful visualization tools for simpler data analysis.

Architecture

Figure: Architecture for a modern data platform using Azure data services.

Note

  • The services covered by this architecture are only a subset of a much larger family of Azure services. Similar outcomes can be achieved by using other services or features not covered by this design.
  • Specific business requirements for your analytics use case might also require different services or features not considered in this design.

The data flows through the solution as follows (from the bottom up):

Relational databases

  1. Use Azure Data Factory pipelines to pull data from a wide variety of databases, both on-premises and in the cloud. Pipelines can be triggered on a pre-defined schedule, in response to an event, or explicitly via REST APIs (see the sketch after this list).
  2. Still within the Azure Data Factory pipeline, use Azure Data Lake Storage Gen2 to stage the data copied from the relational databases. You can save the data in delimited text format or compressed as Parquet files.
  3. Use Azure Synapse PolyBase capabilities for fast ingestion into your data warehouse tables.
  4. Load relevant data from the Azure Synapse data warehouse into Power BI datasets for data visualization. Power BI models implement a semantic model to simplify the analysis of business data and relationships.
  5. Business analysts use Power BI reports and dashboards to analyze data and derive business insights.
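
To illustrate the REST-triggered option in step 1, the following is a minimal sketch, assuming the azure-identity and azure-mgmt-datafactory Python packages, of starting a pipeline run through the Azure SDK (which wraps the same REST APIs). The subscription, resource group, factory, pipeline, and parameter names are hypothetical placeholders, not values from this architecture.

```python
# Minimal sketch: trigger an Azure Data Factory pipeline run through the
# Python SDK, which wraps the ADF REST API. All names are hypothetical.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

SUBSCRIPTION_ID = "<subscription-id>"     # placeholder
RESOURCE_GROUP = "rg-data-platform"       # hypothetical resource group
FACTORY_NAME = "adf-data-platform"        # hypothetical data factory
PIPELINE_NAME = "CopyRelationalToLake"    # hypothetical pipeline

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, SUBSCRIPTION_ID)

# Start the pipeline, optionally passing runtime parameters.
run = adf_client.pipelines.create_run(
    RESOURCE_GROUP, FACTORY_NAME, PIPELINE_NAME,
    parameters={"sourceTable": "dbo.Sales"},  # hypothetical parameter
)

# Poll the run status (Queued, InProgress, Succeeded, Failed, ...).
status = adf_client.pipeline_runs.get(
    RESOURCE_GROUP, FACTORY_NAME, run.run_id
).status
print(f"Pipeline run {run.run_id}: {status}")
```

The same run could equally be started by a schedule or event trigger defined on the pipeline; the SDK call above simply exercises the explicit REST path.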

Semi-structured data sources

  1. Use Azure Data Factory pipelines to pull data from a wide variety of semi-structured data sources, both on-premises and in the cloud. For example, you can ingest data from file-based locations containing CSV or JSON files, connect to NoSQL databases such as Azure Cosmos DB or MongoDB, or call REST APIs provided by SaaS applications that function as the data source for the pipeline.
  2. Still within the Azure Data Factory pipeline, use Azure Data Lake Storage Gen2 to save the original data copied from the semi-structured data source.
  3. Azure Data Factory Mapping Data Flows or Azure Databricks notebooks can then be used to process the semi-structured data and apply the necessary transformations before the data is used for reporting. You can save the resulting dataset as Parquet files in the data lake (see the notebook sketch after this list).
  4. Use Azure Synapse PolyBase capabilities for fast ingestion into your data warehouse tables.
  5. Load relevant data from the Azure Synapse data warehouse into Power BI datasets for data visualization. Power BI models implement a semantic model to simplify the analysis of business data and relationships.
  6. Business analysts use Power BI reports and dashboards to analyze data and derive business insights.
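
As a concrete sketch of step 3, the PySpark fragment below reads raw JSON from the data lake, applies a light transformation, and writes the curated result back as Parquet. The storage paths, key column, and filter are illustrative assumptions, not part of the reference design.

```python
# Minimal Azure Databricks notebook sketch (PySpark): curate raw JSON from
# the data lake into Parquet. Paths and column names are hypothetical; a
# real notebook would also configure access to ADLS Gen2.
from pyspark.sql import functions as F

raw_path = "abfss://raw@mydatalake.dfs.core.windows.net/orders/"          # hypothetical
curated_path = "abfss://curated@mydatalake.dfs.core.windows.net/orders/"  # hypothetical

# `spark` is predefined in Databricks notebooks; the JSON schema is inferred.
orders = spark.read.json(raw_path)

# Light cleansing plus a derived date column for partitioning.
curated = (
    orders
    .dropDuplicates(["orderId"])                           # hypothetical key
    .withColumn("orderDate", F.to_date("orderTimestamp"))  # hypothetical column
    .filter(F.col("amount") > 0)
)

# Persist as Parquet in the lake, ready for PolyBase ingestion (step 4).
curated.write.mode("overwrite").partitionBy("orderDate").parquet(curated_path)
```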

Unstructured data sources

  1. Use Azure Data Factory pipelines to pull data from a wide variety of unstructured data sources, both on-premises and in the cloud. For example, you can ingest video, image, or free-text log data from file-based locations. You can also call REST APIs provided by SaaS applications that function as the data source for the pipeline.
  2. Still within the Azure Data Factory pipeline, use Azure Data Lake Storage Gen2 to save the original data copied from the unstructured data source.
  3. You can invoke Azure Databricks notebooks from your pipeline to process the unstructured data. The notebook can use Cognitive Services APIs or invoke custom Azure Machine Learning models to generate insights from the unstructured data (an enrichment sketch follows this list). You can save the resulting dataset as Parquet files in the data lake.
  4. Use Azure Synapse PolyBase capabilities for fast ingestion into your data warehouse tables.
  5. Load relevant data from the Azure Synapse data warehouse into Power BI datasets for data visualization. Power BI models implement a semantic model to simplify the analysis of business data and relationships.
  6. Business analysts use Power BI reports and dashboards to analyze data and derive business insights.
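
As one hedged example of the AI enrichment in step 3, this sketch calls the Cognitive Services Text Analytics API from Python to extract key phrases from free-text records. The endpoint, key, and sample documents are placeholders; in the architecture, the documents would be read from the data lake.

```python
# Minimal sketch: enrich free-text data with Cognitive Services Text
# Analytics (key phrase extraction), using the azure-ai-textanalytics
# package. Endpoint, key, and documents are hypothetical placeholders.
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

endpoint = "https://<your-resource>.cognitiveservices.azure.com/"  # placeholder
key = "<cognitive-services-key>"                                   # placeholder

client = TextAnalyticsClient(endpoint, AzureKeyCredential(key))

# In the pipeline, these records would come from files in the data lake.
documents = [
    "Payment gateway timed out while processing order 1234.",
    "Customer praised the fast delivery and easy returns process.",
]

# Each result carries the key phrases detected in the matching document.
for doc, result in zip(documents, client.extract_key_phrases(documents)):
    if not result.is_error:
        print(doc, "->", result.key_phrases)
```

The extracted phrases (or any other model output) can then be joined back to the source records and saved as Parquet for ingestion into the warehouse.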

Streaming

  1. Use Azure Event Hubs to ingest data streams generated by a client application (a minimal producer sketch follows this list). Event Hubs ingests and stores the streaming data, preserving the sequence of events received. Consumers can then connect to the event hub and retrieve the messages for processing.
  2. Configure Event Hubs Capture to save a copy of the events in your data lake. This feature implements the “Cold Path” of the Lambda architecture pattern and allows you to perform historical and trend analysis on the stream data saved in your data lake, using tools such as Azure Databricks notebooks.
  3. Use a Stream Analytics job to implement the “Hot Path” of the Lambda architecture pattern and derive insights from the stream data in transit. Define at least one input for the data stream coming from your event hub, one query to process the input data stream, and one Power BI output to which the query results will be sent.
  4. Business analysts then use Power BI real-time datasets and dashboard capabilities to visualize the fast-changing insights generated by your Stream Analytics query.
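
For step 1, the following is a minimal sketch of a client application publishing JSON events with the Azure Event Hubs Python SDK (the azure-eventhub package); the connection string, hub name, and payload shape are assumptions for illustration.

```python
# Minimal sketch: a client application sending events to Azure Event Hubs
# (step 1 of the streaming flow). Connection string, event hub name, and
# payload are hypothetical.
import json
from azure.eventhub import EventData, EventHubProducerClient

CONN_STR = "<event-hubs-namespace-connection-string>"  # placeholder
EVENT_HUB_NAME = "telemetry"                           # hypothetical hub

producer = EventHubProducerClient.from_connection_string(
    CONN_STR, eventhub_name=EVENT_HUB_NAME
)

# Batch the events so a single send stays within the size limit.
with producer:
    batch = producer.create_batch()
    for reading in ({"deviceId": "dev-01", "temperature": 21.7},
                    {"deviceId": "dev-02", "temperature": 19.4}):
        batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)
```

From here the same events feed both paths: Event Hubs Capture writes them to the data lake (cold path) while the Stream Analytics job processes them in transit (hot path).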

Architecture components

The following Azure services have been used in the architecture:

  • Azure Data Factory
  • Azure Data Lake Storage Gen2
  • Azure Synapse Analytics
  • Azure Databricks
  • Azure Cosmos DB
  • Azure Cognitive Services
  • Azure Event Hubs
  • Azure Stream Analytics
  • Microsoft Power BI

If you need further training resources or access to technical documentation, the table below links to Microsoft Learn and to each service’s Technical Documentation.

| Azure Service | Microsoft Learn | Technical Documentation |
| --- | --- | --- |
| Azure Data Factory | Data ingestion with Azure Data Factory | Azure Data Factory Technical Documentation |
| Azure Synapse Analytics | Implement a Data Warehouse with Azure Synapse Analytics | Azure Synapse Analytics Technical Documentation |
| Azure Data Lake Storage Gen2 | Large Scale Data Processing with Azure Data Lake Storage Gen2 | Azure Data Lake Storage Gen2 Technical Documentation |
| Azure Cognitive Services | Cognitive Services Learning Paths and Modules | Azure Cognitive Services Technical Documentation |
| Azure Cosmos DB | Work with NoSQL data in Azure Cosmos DB | Azure Cosmos DB Technical Documentation |
| Azure Databricks | Perform data engineering with Azure Databricks | Azure Databricks Technical Documentation |
| Azure Event Hubs | Enable reliable messaging for Big Data applications using Azure Event Hubs | Azure Event Hubs Technical Documentation |
| Azure Stream Analytics | Implement a Data Streaming Solution with Azure Streaming Analytics | Azure Stream Analytics Technical Documentation |
| Power BI | Create and use analytics reports with Power BI | Power BI Technical Documentation |

Considerations

The technologies in this architecture were chosen because each provides the functionality necessary to handle the vast majority of data challenges in an organization. These services meet requirements for scalability and availability while helping organizations control costs.

Pricing

The ideal pricing tier and the total cost of each service in the architecture depend on the amount of data to be processed and stored and on the performance level you expect. Use the guidance below to learn more about how each service is priced:

  • Azure Synapse allows you to scale your compute and storage levels independently. Compute resources are charged per hour, and you can scale or pause these resources on demand. Storage resources are billed per terabyte, so your costs will increase as you ingest more data.
  • Data Factory costs are based on the number of read/write operations, monitoring operations, and orchestration activities performed in a workload. Your Data Factory costs will increase with each additional data stream and the amount of data processed by each one.
  • Power BI has different product options for different requirements. Power BI Embedded provides an Azure-based option for embedding Power BI functionality inside your applications.
