Exploring OLAP in simple words with examples and scenarios

What is OLAP? OLAP stands for Online Analytical Processing. Think of it as a high-powered magnifying glass for data. It allows businesses to look at their data from different perspectives and dimensions, turning raw data into insightful information.

Online analytical processing (OLAP) is a technology that organizes large business databases and supports complex analysis. It can be used to perform complex analytical queries without negatively affecting transactional systems.

Simple Example: Imagine you run a chain of toy stores. You have sales data for every store, every day, for multiple years, and for thousands of toys. Now, you want to answer questions like:

  • Which store had the most sales last December?
  • Which toy was the best-seller during the summer across all stores?
  • How did toy sales this year compare to the previous year?

With OLAP, you can "slice and dice" the data to answer these questions quickly.
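
To make "slice and dice" concrete, here is a minimal sketch using pandas and made-up toy-store data (not any specific OLAP product); it shows how the three questions above can be answered once sales are organized along dimensions such as store, toy, and time:

```python
import pandas as pd

# Tiny, invented sales fact table: one row per store, per toy, per day.
sales = pd.DataFrame({
    "store":  ["NYC", "NYC", "LA", "LA", "NYC", "LA"],
    "toy":    ["Robot", "Puzzle", "Robot", "Doll", "Puzzle", "Robot"],
    "date":   pd.to_datetime(["2023-12-05", "2023-12-20", "2023-12-11",
                              "2023-07-04", "2022-12-15", "2023-07-19"]),
    "amount": [120.0, 45.0, 200.0, 60.0, 80.0, 150.0],
})

# "Slice": keep only December 2023, then find the store with the most sales.
dec = sales[(sales["date"].dt.year == 2023) & (sales["date"].dt.month == 12)]
print(dec.groupby("store")["amount"].sum().idxmax())

# "Dice": summer months across all stores, best-selling toy.
summer = sales[sales["date"].dt.month.isin([6, 7, 8])]
print(summer.groupby("toy")["amount"].sum().idxmax())

# Year-over-year comparison: total sales per year.
print(sales.groupby(sales["date"].dt.year)["amount"].sum())
```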

Let's sum it up in technical terms:

 "The databases that a business uses to store all its transactions and records are called online transaction processing (OLTP) databases. These databases usually have records that are entered one at a time. Often they contain a great deal of information that is valuable to the organization. The databases that are used for OLTP, however, were not designed for analysis. Therefore, retrieving answers from these databases is costly in terms of time and effort. OLAP systems were designed to help extract this business intelligence information from the data in a highly performant way. This is because OLAP databases are optimized for heavy read, low write workloads."

 

Why We Use OLAP:

  1. Speed: OLAP databases are optimized for querying, making the retrieval of complex data faster.
  2. Flexibility: Users can look at the data from multiple dimensions. In our toy store example, dimensions could be time (years, months), locations (store branches), or products (different toys).
  3. Ad-hoc Queries: Unlike traditional databases where you might need predefined queries, OLAP allows spontaneous data exploration.

What Purpose It Solves:

  1. Informed Decision Making: Businesses can make decisions based on data-driven insights.
  2. Identifying Patterns: By analyzing data trends, businesses can capitalize on opportunities or address potential challenges.
  3. Enhanced Productivity: Since OLAP tools are user-friendly, even non-tech staff can explore data without relying on the IT department.

What Happens Without OLAP:

  1. Slower Insights: Traditional databases might not handle complex queries as efficiently.
  2. Limited Perspectives: Without the ability to "slice and dice" data, businesses might miss out on key insights.
  3. Dependency on IT: Non-tech employees might constantly need IT's help to pull and analyze data.

 

Use Cases:

  1. Business Reporting: Companies can generate monthly, quarterly, or annual reports to see performance metrics.
  2. Trend Analysis: Businesses can identify which products are becoming more popular over time.
  3. Budgeting and Forecasting: By analyzing past spending and revenue, companies can set future budgets.

In Conclusion: OLAP is like a Swiss Army knife for data. It offers a multifaceted view of information, ensuring businesses not only see the complete picture but also understand it. In today's competitive landscape, having an OLAP system is not just an advantage; it's a necessity for data-driven decision-making.

 


OLAP vs. OLTP: Differences, Relationship, and Choosing the Right One

What are OLAP and OLTP?

  • OLAP (Online Analytical Processing): It's like the "brain" of the data world. OLAP focuses on complex queries and involves vast amounts of data to assist in decision-making processes.
  • OLTP (Online Transaction Processing): Think of this as the "heartbeat." OLTP manages daily transactions and ensures smooth business operations.

Key Differences:

  1. Purpose:
    • OLAP: Used for data analysis and decision-making. It answers questions like "Which region had the highest sales last year?"
    • OLTP: Manages daily operations. It processes actions like placing an order or updating inventory.
  2. Database Design:
    • OLAP: Typically has a star or snowflake schema, making it efficient for complex queries.
    • OLTP: Usually follows a relational model, ensuring data integrity and fast transaction processing.
  3. Data Volume:
    • OLAP: Deals with large amounts of historical data.
    • OLTP: Manages current transactional data, which is comparatively smaller.
  4. Query Complexity:
    • OLAP: Complex, involving aggregations across multiple tables.
    • OLTP: Simple and quick, typically affecting only one record at a time.
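
As a rough illustration of this difference in query shape, here is a sketch using SQLite and invented table names (not a recommendation of SQLite for either workload): the OLTP-style statement touches a single record, while the OLAP-style query joins and aggregates across many rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
    CREATE TABLE orders   (order_id INTEGER PRIMARY KEY, product_id INTEGER,
                           amount REAL, order_date TEXT);
    INSERT INTO products VALUES (1, 'Robot', 'East'), (2, 'Puzzle', 'West');
    INSERT INTO orders   VALUES (10, 1, 120.0, '2023-12-05'),
                                (11, 2,  45.0, '2023-12-20'),
                                (12, 1, 200.0, '2023-07-11');
""")

# OLTP-style: a small write that affects exactly one record (e.g., correcting one order).
cur.execute("UPDATE orders SET amount = 130.0 WHERE order_id = 10")

# OLAP-style: an analytical read that joins and aggregates across tables
# ("Which region had the highest sales last year?").
cur.execute("""
    SELECT p.region, SUM(o.amount) AS total_sales
    FROM orders o JOIN products p ON o.product_id = p.product_id
    WHERE o.order_date BETWEEN '2023-01-01' AND '2023-12-31'
    GROUP BY p.region
    ORDER BY total_sales DESC
""")
print(cur.fetchall())
conn.close()
```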

Relationship Between the Two:

  • Data in OLTP systems often gets transferred to OLAP systems for analysis. You can think of OLTP as the frontline data gatherer, while OLAP is the analyzer.
  • OLTP handles the business's day-to-day operations, but once the data is old or needs analyzing, it's sent to OLAP systems.

When to Choose What:

  1. E-Commerce Store:
    • Scenario: Processing a customer's order.
      • Choose: OLTP, because it's a daily transaction.
    • Scenario: Analyzing the year's most popular products.
      • Choose: OLAP, as it requires aggregating past sales data.
  2. Bank:
    • Scenario: A customer transferring money to another account.
      • Choose: OLTP, as it's a real-time transaction.
    • Scenario: The bank wants to analyze the spending patterns of customers over the past five years.
      • Choose: OLAP, as it's a complex, data-heavy query.
  3. Hospital:
    • Scenario: Updating a patient's current health details during a visit.
      • Choose: OLTP, for immediate data entry and retrieval.
    • Scenario: Analyzing the efficiency of a treatment plan over the past decade.
      • Choose: OLAP, to get insights from historical data.

Conclusion: While OLAP and OLTP have different purposes, they're two sides of the same coin. OLTP ensures that businesses run smoothly day-to-day, while OLAP provides the insights needed for growth and improvement. Choosing between them depends on the task at hand: immediate transaction or deep analysis.

 

 


 

 

Scenario:

A multinational retail company wants to analyze its sales data. This data includes transaction details, product details, customer information, and store information. The primary objective is to understand sales performance, customer behaviors, and product popularity.

1. Data Ingestion:

Data Sources:

  • Transaction Data: Details of every sale made, including product ID, store ID, customer ID, date of purchase, and amount.
  • Product Data: Details of each product, including product ID, product name, category, and price.
  • Customer Data: Customer ID, name, age, gender, and location.
  • Store Data: Store ID, location, and size.

All this data is ingested into Azure Blob Storage from various point-of-sale systems, inventory systems, and customer management systems.

2. Data Movement:

Using Azure Data Factory:

  • Data from different sources is moved into Azure Synapse Analytics.
  • Basic cleaning is performed: removing duplicate entries, filling missing values, and correcting format inconsistencies.

3. Data Processing in Azure Synapse Analytics:

  • Transformation: Converting product prices from different currencies to a standard currency (e.g., USD).
  • Aggregation: Summing up total sales for each product, each store, and each region.
  • Calculation: Calculating the average sale amount, profit margins for products, and customer lifetime value.
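
A rough, local sketch of what these processing steps amount to logically, shown here with pandas and an invented exchange-rate table (in the scenario itself this would run as SQL or Spark inside Azure Synapse Analytics):

```python
import pandas as pd

# Invented sample transactions in mixed currencies.
tx = pd.DataFrame({
    "store_id":   ["S1", "S1", "S2"],
    "product_id": ["P1", "P2", "P1"],
    "currency":   ["EUR", "USD", "GBP"],
    "amount":     [100.0, 80.0, 50.0],
})
fx_to_usd = {"USD": 1.0, "EUR": 1.08, "GBP": 1.27}  # assumed rates, for illustration only

# Transformation: convert every sale to a standard currency (USD).
tx["amount_usd"] = tx["amount"] * tx["currency"].map(fx_to_usd)

# Aggregation: total sales per product and per store.
print(tx.groupby("product_id")["amount_usd"].sum())
print(tx.groupby("store_id")["amount_usd"].sum())

# Calculation: average sale amount across all transactions.
print(tx["amount_usd"].mean())
```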

4. Data Modeling in Azure Analysis Services:

  • Hierarchies:
    • Time Hierarchy: Year → Quarter → Month → Day.
    • Product Hierarchy: Category → Sub-category → Product Name.
    • Location Hierarchy: Country → State/Province → City → Store.
  • Metrics Defined: Total Sales, Average Sale, Total Profit, Number of Transactions.
  • Relationships: Between tables, like connecting transaction data to product data using product ID, allowing us to see which products have the highest sales.
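
The "relationships" idea is essentially a join key: once transaction rows are connected to product rows via product ID, sales can be rolled up along any hierarchy. A minimal sketch with invented data (the real model would be defined in Azure Analysis Services, not pandas):

```python
import pandas as pd

transactions = pd.DataFrame({
    "product_id": ["P1", "P2", "P1", "P3"],
    "amount":     [120.0, 45.0, 200.0, 60.0],
})
products = pd.DataFrame({
    "product_id": ["P1", "P2", "P3"],
    "name":       ["Robot", "Puzzle", "Doll"],
    "category":   ["Electronics", "Games", "Dolls"],
})

# Relationship: join transactions to products on the shared product_id key.
joined = transactions.merge(products, on="product_id")

# Roll up total sales along the product hierarchy (category -> product name).
by_product = joined.groupby(["category", "name"])["amount"].sum().sort_values(ascending=False)
print(by_product.head())   # highest-selling products first
```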

5. Data Visualization in Power BI:

With the OLAP model ready:

  • A business user can see which products are the top sellers in Q1 in the USA.
  • Drill-down to see top sellers in New York in January.
  • Slice-and-dice data to compare sales of male vs. female customers.
  • Observe trends, like a sudden increase in sales of winter products during December.

Summary:

So, in our example, raw transactional data from stores worldwide gets transformed into actionable insights. The company can easily understand its performance metrics at various granularities (like global sales vs. sales in a specific store on a specific day) and act on these insights to improve its business strategies.


Scenario:

A large hospital chain is aiming to improve patient care and optimize its operations. They want to analyze patient records, treatment histories, equipment usage, and staff schedules. Their primary objective is to understand patient recovery patterns, resource allocation, and staff efficiency.

1. Data Ingestion:

Data Sources:

  • Patient Records: Details like patient ID, name, age, ailment, admission date, and discharge date.
  • Treatment Data: Records of treatments prescribed, including medicine, surgeries, therapies, and their durations.
  • Equipment Usage Data: Which equipment was used, for which patient, duration, and date.
  • Staff Data: Staff ID, specialization, shifts, and leave records.

All these data sets are ingested into Azure Blob Storage from various hospital management systems.

2. Data Movement:

Using Azure Data Factory:

  • Data is moved into Azure Synapse Analytics.
  • Basic cleaning: Removing any incorrectly entered records, ensuring data privacy by masking certain patient details, and resolving discrepancies in drug names or equipment names.

3. Data Processing in Azure Synapse Analytics:

  • Transformation: Converting diverse date formats to a standard one (e.g., YYYY-MM-DD).
  • Aggregation: Counting the number of patients treated for each ailment, summing up equipment usage hours, and calculating staff working hours.
  • Calculation: Determining average recovery time for each ailment, calculating equipment downtime, and computing staff overtime.
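
As a small illustration of the calculation step (pandas with made-up patient records; in the scenario this would be a Synapse query), average recovery time per ailment is simply discharge date minus admission date, grouped by ailment:

```python
import pandas as pd

patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "ailment":    ["Flu", "Heart Attack", "Flu", "Heart Attack"],
    "admitted":   pd.to_datetime(["2023-01-02", "2023-01-05", "2023-02-01", "2023-02-10"]),
    "discharged": pd.to_datetime(["2023-01-06", "2023-01-20", "2023-02-04", "2023-03-01"]),
})

# Length of stay per patient, in days.
patients["recovery_days"] = (patients["discharged"] - patients["admitted"]).dt.days

# Average recovery time for each ailment.
print(patients.groupby("ailment")["recovery_days"].mean())
```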

4. Data Modeling in Azure Analysis Services:

  • Hierarchies:
    • Time Hierarchy: Year → Month → Week → Day.
    • Disease Hierarchy: Disease Category (like Cardiovascular) → Specific Disease (like Heart Attack) → Complications (if any).
    • Staff Hierarchy: Department → Role → Individual Staff Member.
  • Metrics Defined: Number of Patients, Average Recovery Time, Equipment Usage Hours, Staff Working Hours.
  • Relationships: Connecting patient records with treatment data to understand which treatments are most effective for specific ailments.

5. Data Visualization in Power BI:

With the OLAP model in place:

  • A hospital manager can analyze the average recovery time of patients suffering from cardiovascular diseases in the past year.
  • Drill down to see the recovery patterns of heart attack patients specifically.
  • Cross-examine data to check the efficiency of specific treatments or therapies on recovery times.
  • Understand patterns, like increased hospital admissions during flu seasons.

Summary:

In this example, diverse hospital data, ranging from patient records to staff schedules, gets transformed into actionable insights. Hospital management can gain a deeper understanding of patient care patterns, resource allocation efficiency, and potential areas of improvement. This not only optimizes hospital operations but also enhances patient care quality.

Explore Big Data & Azure services for it

Big Data refers to extremely large data sets that are beyond the ability of traditional databases and software tools to capture, store, manage, and analyze within a reasonable amount of time. The concept of Big Data isn't just about the volume of data, but also the variety and velocity.

Here's a breakdown:

  1. Volume: Refers to the immense amount of data generated every second. For instance, every day, billions of photos are uploaded to social media, millions of transactions happen online, and countless bytes of data are generated by IoT (Internet of Things) devices.
  2. Variety: Data comes in different forms. Traditional data types were structured (like databases with defined fields), but much of Big Data is unstructured or semi-structured. This includes text, images, sound, video, etc.
  3. Velocity: The speed at which data is being generated, processed, and made available. Consider social media posts that get created every millisecond or sensor data that's being generated every microsecond.

Other characteristics, like veracity (trustworthiness of data) and value (usefulness of the data), have also been discussed by some experts, but Volume, Variety, and Velocity are the core characteristics that define Big Data.

The importance of Big Data lies in the insights that can be drawn from it. With the right tools and analytical approaches, Big Data can provide valuable insights for businesses, scientific research, and many other areas, leading to more informed decisions, optimized processes, and innovative solutions.

Examples:

  1. Social media posts from millions of users.
  2. Purchase transactions from an online store.
  3. Sensor data from smart devices in a city.
  4. Medical records from hospitals.

Use Cases:

  1. Business Decisions: Companies analyze customer purchase patterns to tailor marketing or stock products.
  2. Healthcare: Predict disease outbreaks or optimize patient care.
  3. Smart Cities: Manage traffic, waste management, and energy use.
  4. Finance: Detect fraudulent transactions.
  5. Entertainment: Recommend movies or music based on preferences.

Benefits:

  1. Informed Decisions: Companies can make data-driven decisions.
  2. Efficiency: Processes can be streamlined based on data insights.
  3. Personalization: Tailor experiences for individuals based on their behavior.
  4. Innovation: New products/services based on what the data suggests.


Why the buzz about Big Data now?

  1. Explosion of Devices: Smartphones, smartwatches, IoT devices—all generate tons of data.
  2. Digital Revolution: More businesses operate online, generating more data.
  3. Affordable Storage: It's now cheaper to store large amounts of data.
  4. Advanced Tools: Modern software can process and analyze Big Data effectively.

Why wasn’t it popular earlier?

  1. Limited Data: Earlier, not as many digital devices or platforms existed.
  2. Storage Costs: Storing huge amounts of data was expensive.
  3. Processing Power: Computers weren’t as powerful or efficient in handling vast amounts of data.
  4. Awareness: Many didn't realize the potential benefits of analyzing vast data sets.

In essence, as technology has evolved, so has our ability to generate, store, and analyze data. Big Data provides powerful insights, leading to better decisions and innovative solutions, making it a hot topic in today's digital age.



Popular Azure Storage Options for Big Data:

  1. Azure Blob Storage
  2. Azure Data Lake Storage
  3. Azure SQL Data Warehouse (now part of Azure Synapse Analytics)
  4. Azure Cosmos DB
  5. Azure HDInsight
  6. Azure Databricks

It's important to note that the choice of storage or processing option depends on the nature and requirements of the data and the tasks you wish to perform on it. Let's explore these Azure Big Data services in a little more detail.



  1. Azure Blob Storage
    • Criteria: Ideal for storing large amounts of unstructured data, like documents, logs, backup data, media files, and more. Offers high availability and durability.
    • Example Scenarios:
      1. Media Hosting: A video streaming platform can use Blob Storage to store and stream videos to users.
      2. Backup & Archive: An enterprise wants to store backups of critical data securely offsite.
  2. Azure Data Lake Storage
    • Criteria: Best for big data analytics. It handles structured and unstructured data and integrates seamlessly with analytics frameworks like Hadoop and Spark.
    • Example Scenarios:
      1. Healthcare Analytics: Hospitals analyze patient data, treatment histories, and lab results to predict disease outbreaks.
      2. Financial Forecasting: Investment firms analyze historical data to predict stock market trends.
  3. Azure Synapse Analytics 
    • Criteria: When you need to store and query large datasets using SQL and require the scalability and analytics capability of a data warehouse. It is a unified, centralized service for the end-to-end ETL/ELT process.
    • Example Scenarios:
      1. Retail Sales Analysis: A chain store aggregates sales data from all its stores globally to glean insights about best-selling products.
      2. Customer Insights: A tech company analyzes user interactions with its software to improve features.
  4. Azure Cosmos DB
    • Criteria: For globally distributed applications requiring wide-reaching scalability and geographic distribution. It supports multiple data models: document, key-value, graph, and column-family.
    • Example Scenarios:
      1. E-commerce Platforms: An online store that serves customers worldwide and requires low latency for product recommendations and inventory checks.
      2. Social Networking Apps: An app that requires quick and globally distributed access to user profiles, posts, and friend networks.
  5. Azure HDInsight
    • Criteria: When you need cloud-based analytics service to process big data using popular frameworks like Hadoop, Spark, Hive, and more.
    • Example Scenarios:
      1. Log Analysis: A company analyzes logs from its web servers to understand user behavior and optimize website design.
      2. Genome Sequencing: Scientists analyze genomic sequences to conduct research in personalized medicine.
  6. Azure Databricks
    • Criteria: When collaborative analytics using Apache Spark is needed. Offers an integrated workspace for data science, data engineering, and business analytics.
    • Example Scenarios:
      1. Real-time Analytics: A ride-sharing app analyzes real-time data on car locations, user demand, and traffic to optimize ride allocations.
      2. Collaborative Research: Researchers from different backgrounds collaborate on a dataset to gain insights on climate change.

In choosing a service, consider factors like the nature of your data (structured vs. unstructured), volume, access speed requirements, geographical distribution, and the specific analytic tools you intend to use.


Let's go through some example scenarios for choosing the right data store, with reasoning.

Scenario 1: Global Online Retail Platform

  • Description: An e-commerce company operates in multiple countries and offers thousands of products. They require fast product search, user personalization, real-time inventory updates, and the ability to handle sudden surges in user traffic during sale events. They also want to provide consistent low-latency access to their customers globally.
    • Chosen Azure Service: Azure Cosmos DB
    • Reasoning:
      1. Global Distribution: Azure Cosmos DB is a globally distributed database service, meaning the e-commerce platform can replicate its data across multiple regions, ensuring users get low-latency access no matter where they are located.
      2. Scalability: During sale events when traffic surges, Cosmos DB can scale rapidly to accommodate the increased load.
      3. Multi-Data Models: It supports document, key-value, and graph models, catering to diverse data needs of an e-commerce platform like product catalogs, user carts, and recommendation graphs.

Scenario 2: Energy Utility Company's Data Analysis

  • Description: An energy utility company collects vast amounts of data from smart meters across a region. They need to store this data, analyze consumption patterns, forecast demand, and optimize the distribution. The data from the smart meters is vast, arriving in real-time, and requires advanced analytical tools for processing.
    • Chosen Azure Service: Azure Data Lake Storage (combined with Azure Databricks for processing)
    • Reasoning:
      1. Big Data Analytics: Azure Data Lake Storage is specifically designed for big data analytics. The utility company can store the vast amounts of structured and unstructured data streaming in from smart meters.
      2. Integration with Analytic Tools: It integrates seamlessly with analytic frameworks like Apache Spark (offered through Azure Databricks), allowing the company to process the data efficiently and gain insights.
      3. Scalability: As the number of smart meters increases or as the data collection frequency goes up, Azure Data Lake Storage can scale accordingly without performance hitches.








Functional and Non-functional Requirements for Data Store Selection

In today's digital age, where data drives decisions and operations, selecting the right data store is pivotal. The right choice can propel a business forward, while the wrong one can spell disaster. But how do you make this choice? Enter: functional and non-functional requirements.

Functional Requirements: The 'What'

Functional requirements address what the system is supposed to achieve. They outline the specific functionalities and capabilities the data store must support. Let's take a simple e-commerce website as an example:

  1. Product Searches: The site needs to let users search for products. Hence, the data store must support efficient search operations.
  2. Ordering System: Users should be able to place orders, implying the need for transactional support in the data store. Consistency is a must here.
  3. User Profiles: The site will have user profiles, suggesting the need for a database that can store structured user data, especially a key-value type.

Non-Functional Requirements: The 'How'

Non-functional requirements, on the other hand, delve into how the system operates. They depict the quality, performance, and other operational characteristics of the system.

Back to our e-commerce website example:

  1. Performance: The site should load quickly, and search results should appear within seconds. This calls for a data store optimized for performance.
  2. Data Security: User payment and personal information must be securely stored, emphasizing the need for strong encryption mechanisms.
  3. Scalability: As the user base grows, the data store should scale without hitches, ensuring that the increasing user demands are met without performance degradation.

Let's dive deeper into these requirements one by one:

1. Functional Requirements

Functional requirements are related to the specific functionalities that the data store should support.

  1. Data Model: The structure of your data plays a significant role. Do you need a relational model (like SQL databases), a document-based model (like MongoDB), a key-value store (like Redis), a columnar store (like Cassandra), or a graph model (like Neo4j)?
  2. Query Capability: Depending on the kind of queries you'll be running, some databases might be more appropriate than others. For instance, if you need complex joins and transactions, a relational database might be best. For flexible schema-less data retrieval, a document store might suffice.
  3. Consistency and Transaction Support: Some applications require strong data consistency and ACID transactions. Relational databases, such as PostgreSQL or MySQL, are designed with this in mind.
  4. Schema Flexibility: If your data schema is likely to evolve over time, then a schema-less or flexible-schema database like MongoDB or CosmosDB might be more suitable. As mentioned before, evolving schemas require flexibility.

    1. Fixed Schema: Traditional relational databases like Oracle, MySQL, or MS SQL.
    2. Flexible Schema: Document stores like MongoDB, Cosmos DB, or columnar databases like HBase.
  5. Data Size: The volume of data you expect to handle can influence your choice.
      1. Small to Medium: Relational databases like MySQL, PostgreSQL, or SQLite often suffice.
      2. Large: Columnar stores like Cassandra, or distributed systems like Hadoop or distributed versions of SQL databases can be more appropriate.
      3. Very Large (Big Data): Solutions like Hadoop HDFS, Google Bigtable, or Amazon S3 with Big Data processing tools like Spark might be needed.
  6. Data Relationship: The nature and complexity of the relationships between data entities can guide the choice.
      1. Simple Relations: Relational databases can handle this efficiently with JOIN operations.
      2. Complex Relations: Graph databases like CosmosDB graph model are designed to manage intricate relationships efficiently.
  7. Data Movement: If your system requires synchronization, migration, or streaming of data, this becomes crucial.
      1. Streaming: Kafka or RabbitMQ for event streaming, Spark Streaming for stream processing.
      2. Migration/Synchronization: Tools like Apache NiFi, Talend, or database-specific tools like Oracle GoldenGate.
  8. Data Lifecycle: How your data evolves and ages over time can dictate storage strategies and archival methods.
      1. Short-Lived Data: In-memory databases like Redis or Memcached are perfect for temporary data.
      2. Long-Term Storage with Occasional Access: Systems like Amazon Glacier or Hadoop HDFS can be more cost-effective.
      3. Data Archival and Retrieval: Stores with built-in lifecycle management, like Amazon S3's object lifecycle policies or Azure Blob Storage lifecycle management.



2. Non-Functional Requirements

Non-functional requirements are related to how the system operates, rather than what specific operations it supports.

  1. Scalability: If you anticipate a significant increase in data or query volume, consider databases that scale out easily. NoSQL databases like Cassandra or DynamoDB are known for their horizontal scalability.
  2. Availability and Fault Tolerance: If you need high availability, especially across multiple regions, it's essential to look into databases that offer replication, failover mechanisms, and distributed data storage.
  3. Latency: For applications that require low-latency data access (like real-time analytics), in-memory databases like Redis or in-memory options of relational databases can be beneficial.
  4. Durability: How critical is it that once data is written, it is never lost? Many databases provide durable storage mechanisms to ensure data safety even in case of system failures.
  5. Operational Ease: Consider the effort required to set up, maintain, monitor, and backup the database. Cloud offerings like Amazon RDS or Azure Cosmos DB provide managed database services that alleviate some operational concerns.
  6. Security: Features related to authentication, authorization, encryption (at rest and in transit), and auditing can be vital based on the sensitivity of the data.
  7. Cost: Total cost of ownership includes not only the cost of the database software (or service) but also hardware, operational, and maintenance costs.
  8. Integration and Ecosystem: How well does the data store integrate with your existing tools, systems, and processes? The available drivers, plugins, and community support can be deciding factors.
  9. Backup and Recovery: The ease with which you can backup data and recover from failures or data loss can be crucial, especially for critical applications.
  10. Compliance: If you're in a regulated industry, you might need databases that support specific compliance requirements, such as GDPR, HIPAA, or PCI DSS.

Functional and non-functional requirements act as the guiding stars in the journey of selecting the right data store. The 'what' and 'how' of system needs, represented by these requirements, ensure that the chosen data store not only aligns with the present needs but also scales for the future, offering the optimal mix of functionality and operational excellence. With platforms like Azure offering a plethora of database services, understanding these requirements becomes even more crucial to harness the full potential of the digital realm.

Please share your experience and feedback in the comments and help everyone to learn and grow along with you. Happy Learning 💪


Data Store Models - An Overview

 What are Data Store Models?

In simple terms, a data store model is a way data is structured, saved, and accessed in a database. Think of it like different types of shelves you'd use to organize items at home: some shelves are good for books, others for clothes, and some for tools. The type of shelf you choose depends on what you're storing. Similarly, the type of data store model you choose depends on the kind of data you're handling. Here are some common data store models:

Modern business systems manage increasingly large volumes of heterogeneous data. This heterogeneity means that a single data store is usually not the best approach. Instead, it's often better to store different types of data in different data stores, each focused toward a specific workload or usage pattern. The term polyglot persistence is used to describe solutions that use a mix of data store technologies.

The term "polyglot persistence" is like speaking multiple languages (polyglot means "many tongues" in Greek). It means that, instead of using one type of database for everything, you use different databases for different tasks, depending on what's best suited for each task.


Let's take an analogy to understand it better :)

Imagine you're building a house. You wouldn't use just a hammer for every job, right? Sometimes you'd use a saw, sometimes a drill, and other times a wrench. Each tool is best for a particular job. Similarly, in the digital world, you might use a relational database to store customer information, a graph database to understand their preferences and relationships, and a key-value store for fast session storage.


In essence, heterogeneous data or polyglot persistence is about using the right data storage tool (or database) for the right job, sometimes even combining multiple databases in a single solution to harness the strengths of each. This approach helps in optimizing performance, scalability, and functionality based on specific use cases and needs.


Let's take an example to cement what we've understood so far:

Imagine a large online shopping platform like Amazon. Such platforms handle a multitude of tasks, from managing user profiles and product listings to tracking orders and analyzing user behavior, and their approach to storing data efficiently is to use heterogeneous stores. Below are the databases of choice for various aspects of an e-commerce platform.

  • User Profiles: A document store like MongoDB is great because each user's data can be stored as a flexible "document" that can evolve over time without rigid schema constraints.
  • Products can be organized in a structured way: title, price, description, SKU, etc. This fits well into a relational database, where relationships, like which products fall under which categories, are essential.
  • When users add products to their shopping carts, quick access and high availability are crucial. A key-value store like Redis provides lightning-fast access where a user's ID (key) can quickly retrieve their cart items (value).
  • To suggest products to users based on their behavior and the behavior of similar users, a graph database is invaluable.
  • To analyze trends in user behavior, like which products are most viewed or what search terms are trending, a columnar store like Cassandra can efficiently query and analyze large datasets by columns.
  • When users leave reviews or comments on products, a system needs to provide quick searches, filter, and analyze large amounts of text. Elasticsearch, a search engine backed by a document store structure, excels in this, providing near real-time search results.


Benefits of Polyglot Persistence in this Scenario:

  • Flexibility: Each part of the e-commerce system can use a database best suited to its needs.
  • Scalability: As the platform grows, each database can be scaled independently based on its workload.
  • Performance: Each task, from cart management to product recommendations, can be optimized for maximum performance by leveraging the strengths of different databases.

Selecting the right data store for your requirements is a key design decision. There are literally hundreds of implementations to choose from among SQL and NoSQL databases. Data stores are often categorized by how they structure data and the types of operations they support. Let's explore the most-used database models one by one:



1) Relational database management systems

Relational databases organize data as a series of two-dimensional tables with rows and columns.

Think of them as organized spreadsheets where you store data in rows and columns, making it easy to relate data from different sheets.

An RDBMS typically implements a transactionally consistent mechanism that conforms to the ACID (Atomic, Consistent, Isolated, Durable) model for updating information.

An RDBMS typically supports a schema-on-write model, where the data structure is defined ahead of time, and all read or write operations must use the schema.

This model is very useful when strong consistency guarantees are important — where all changes are atomic, and transactions always leave the data in a consistent state. However, an RDBMS generally can't scale out horizontally without sharding the data in some way. Also, the data in an RDBMS must be normalized, which isn't appropriate for every data set.


Azure services

  • Azure SQL Database | (Security Baseline)
  • Azure Database for MySQL | (Security Baseline)
  • Azure Database for PostgreSQL | (Security Baseline)
  • Azure Database for MariaDB | (Security Baseline)

Workload

  • Records are frequently created and updated.
  • Multiple operations have to be completed in a single transaction.
  • Relationships are enforced using database constraints.
  • Indexes are used to optimize query performance.

Data Type

  • Database schemas are required and enforced.
  • Many-to-many relationships between data entities in the database.
  • Constraints are defined in the schema and imposed on any data in the database.
  • Data requires high integrity. Indexes and relationships need to be maintained accurately.
  • Data requires strong consistency. Transactions operate in a way that ensures all data are 100% consistent for all users and processes.

Examples

  • Inventory management
  • Order management
  • Reporting database
  • Accounting




 2) Document Databases

Imagine a digital filing cabinet where each file can have different types of information without a fixed format.

A document database stores a collection of documents, where each document consists of named fields and data. The data can be simple values or complex elements such as lists and child collections. Documents are retrieved by unique keys.

Typically, a document contains the data for a single entity, such as a customer or an order.

  • Azure Service: Azure Cosmos DB (with document data model)
  • Workload Type: Flexible and schema-less data operations
  • Data Type: JSON-like documents
    • Size of individual document data is relatively small.
    • Each document type can use its own schema.
    • Document data is semi-structured, meaning that data types of each field are not strictly defined.
  • Example Use Case: Content management systems (as each content piece might have different attributes), product catalogues
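
To see why a flexible, schema-less shape helps for something like a product catalogue, here is a sketch of two "documents" as plain Python dicts, each keyed by a unique ID and free to carry different fields (Cosmos DB or MongoDB documents are JSON, but the idea is the same):

```python
# Two catalogue documents with different shapes; no shared schema is enforced.
catalog = {
    "prod-001": {
        "name": "Robot Kit",
        "price": 49.99,
        "specs": {"batteries": "AA x4", "age_range": "8+"},
    },
    "prod-002": {
        "name": "Jigsaw Puzzle",
        "price": 12.50,
        "pieces": 1000,            # a field that prod-001 simply does not have
        "tags": ["family", "indoor"],
    },
}

# Documents are retrieved by their unique key.
print(catalog["prod-002"]["name"])

# New attributes can be added to one document without migrating the others.
catalog["prod-001"]["discount"] = 0.10
```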



3) Key/value stores

A key/value store associates each data value with a unique key. Most key/value stores only support simple query, insert, and delete operations.

Picture a giant locker where each compartment has a unique label, making it quick to find what's inside by its label.

Key/value stores are highly optimized for applications performing simple lookups, but are less suitable if you need to query data across different key/value stores. Key/value stores are also not optimized for querying by value.

A single key/value store can be extremely scalable, as the data store can easily distribute data across multiple nodes on separate machines.
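
A key/value store behaves much like a dictionary: one opaque value per key, retrieved by that key alone. A tiny sketch of the session-management use case, with a plain Python dict standing in for something like Azure Cache for Redis:

```python
import json
from typing import Optional

# Key -> value; the store neither understands nor indexes what is inside the value.
session_store = {}

def save_session(session_id: str, data: dict) -> None:
    # Values are typically stored as opaque serialized blobs (JSON, bytes, etc.).
    session_store[session_id] = json.dumps(data)

def load_session(session_id: str) -> Optional[dict]:
    raw = session_store.get(session_id)   # single-key lookup: no joins, no queries by value
    return json.loads(raw) if raw is not None else None

save_session("sess-42", {"user_id": 7, "cart": ["P1", "P3"]})
print(load_session("sess-42"))
```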



Azure services

  • Azure Cosmos DB for Table and Azure Cosmos DB for NoSQL | (Azure Cosmos DB Security Baseline)
  • Azure Cache for Redis | (Security Baseline)
  • Azure Table Storage | (Security Baseline)

Workload - Fast access, simple lookups

  • Data is accessed using a single key, like a dictionary.
  • No joins, locks, or unions are required.
  • No aggregation mechanisms are used.
  • Secondary indexes are generally not used.

Data type

  • Each key is associated with a single value.
  • There is no schema enforcement.
  • No relationships between entities.

Examples

  • Data caching
  • Session management
  • User preference and profile management
  • Product recommendation and ad serving.



4) Graph Databases:

Simple Explanation: Think of a web or network showing how different points are connected, like a family tree.

A graph database stores two types of information, nodes and edges. Edges specify relationships between nodes. Nodes and edges can have properties that provide information about that node or edge.

Graph databases can efficiently perform queries across the network of nodes and edges and analyze the relationships between entities. 

The following diagram shows an organization's personnel database structured as a graph. The entities are employees and departments, and the edges indicate reporting relationships and the departments in which employees work.

 




This structure makes it straightforward to perform queries such as "Find all employees who report directly or indirectly to Sarah" or "Who works in the same department as John?"
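
In a graph store this kind of query walks edges rather than joining tables. A minimal sketch of the "direct or indirect reports" question over an invented reporting graph, using a plain breadth-first traversal in Python (Cosmos DB would express this as a Gremlin traversal instead):

```python
from collections import deque

# Edges: manager -> list of direct reports (an invented org chart).
reports_to = {
    "Sarah": ["John", "Mary"],
    "John":  ["Alice"],
    "Mary":  ["Bob", "Carol"],
}

def all_reports(manager: str) -> set:
    """Everyone who reports to `manager` directly or indirectly (BFS over edges)."""
    found, queue = set(), deque(reports_to.get(manager, []))
    while queue:
        person = queue.popleft()
        if person not in found:
            found.add(person)
            queue.extend(reports_to.get(person, []))
    return found

print(all_reports("Sarah"))   # {'John', 'Mary', 'Alice', 'Bob', 'Carol'}
```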

 

  • Azure Service: Azure Cosmos DB (with graph data model)
  • Workload Type: Relationship-heavy queries
  • Data Type: Entities or nodes and their relationships
  • Example Use Case: Social networks (to find connections, friends of friends, etc.), organizational charts, recommendation engines.



5) Column-family databases

A column-family database organizes data into rows and columns. In its simplest form, a column-family database can appear very similar to a relational database, at least conceptually. The real power of a column-family database lies in its denormalized approach to structuring sparse data.

Envision a wardrobe where you don't store complete outfits together but instead put all shirts in one drawer, all pants in another, etc.

You can think of a column-family database as holding tabular data with rows and columns, but the columns are divided into groups known as column families. Each column family holds a set of columns that are logically related together and are typically retrieved or manipulated as a unit.

The following diagram shows an example with two column families, Identity and Contact Info. The data for a single entity has the same row key in each column-family. This structure, where the rows for any given object in a column family can vary dynamically, is an important benefit of the column-family approach, making this form of data store highly suited for storing structured, volatile data.
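
Conceptually, the layout described above can be pictured as rows keyed by a row key, where each row holds named column families and each family holds only the columns that particular row actually has. A rough sketch with invented data:

```python
# Row key -> column families -> columns. Rows may carry different columns per family (sparse data).
people = {
    "row-001": {
        "Identity":     {"first_name": "Ana", "last_name": "Silva"},
        "Contact Info": {"email": "ana@example.com", "phone": "555-0100"},
    },
    "row-002": {
        "Identity":     {"first_name": "Raj"},            # sparse: no last_name stored
        "Contact Info": {"email": "raj@example.com"},      # sparse: no phone stored
    },
}

# A whole column family is typically read or written as a unit.
print(people["row-001"]["Contact Info"])
```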






  • Azure Service: Azure Cosmos DB (with column-family data model) or Azure Data Explorer
  • Workload Type: Big data analytics, fast reads/writes
    • Most column-family databases perform write operations extremely quickly.
    • Update and delete operations are rare.
    • Designed to provide high throughput and low-latency access.
    • Supports easy query access to a particular set of fields within a much larger record.
    • Massively scalable.

 

  • Data Type: Columns of data
    • Data is stored in tables consisting of a key column and one or more column families.
    • Specific columns can vary by individual rows.
    • Individual cells are accessed via get and put commands

 

  • Example Use Case: Event logging systems (because you often query on specific attributes of the event rather than full event details)
    • Recommendations
    • Personalization
    • Sensor data
    • Telemetry
    • Messaging

Comparison or conclusion Note:

RDBMS, like Azure SQL Database, is great for scenarios where data integrity and relations are crucial, such as inventory management. You'd want to know how many of a particular product you have, which supplier it came from, and who purchased it, all tied together.

On the other hand, NoSQL models, offered under the umbrella of Azure Cosmos DB and other Azure services, provide flexibility and scalability for varied data structures, which are especially handy when dealing with vast amounts of diverse data or rapidly changing schemas.


6) Search Engine Databases:

Simple Explanation:

A search engine database is designed to make searching very fast, even in huge amounts of data.

Imagine a librarian who not only knows where every book is but can also quickly tell you which books mention a specific topic, phrase, or word, even if it's just in a single page.

A search engine database allows applications to search for information held in external data stores. A search engine database can index massive volumes of data and provide near real-time access to these indexes.

  • Azure Service: Azure Cognitive Search
  • Workload Type: Text search, full-text indexing, and complex search operations.
  • Data Type:
    • Semi-structured or unstructured text
    • Text with reference to structured data
  • Example Use Case: E-commerce product search (where users might search for products using various terms, and the system needs to quickly display relevant results).

Search engine databases like Azure Cognitive Search or Elasticsearch aren't typical databases in the RDBMS or NoSQL sense. Instead, they're optimized for searching.

They do this by creating an "index" of the data, much like the index at the back of a book. This index allows them to find relevant results very quickly without having to read through each "page" (or piece of data) every time.
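
A toy version of that "index at the back of a book" is an inverted index: a map from each word to the documents containing it, so a search never scans every document. A minimal sketch in Python (real engines such as Azure Cognitive Search or Elasticsearch add analyzers, ranking, and distribution on top of this idea):

```python
from collections import defaultdict

docs = {
    1: "red running shoes for men",
    2: "blue running jacket",
    3: "red winter jacket for women",
}

# Build the inverted index once: word -> set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

def search(query: str) -> set:
    """Return IDs of documents containing every word in the query."""
    ids = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*ids) if ids else set()

print(search("red jacket"))   # {3}
```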

They are often used in conjunction with other databases, where the primary database handles the regular data storage and transactions, while the search engine database handles search queries.

Please share your learnings to make this read more impactful, and of course leave your feedback in the comments.



