Data modeling languages for data warehouses and data lakes

By Vadim Zhdanov

All data modeling methods can be divided into two groups: structural and process. The relational paradigm, relevant for a typical DWH, focuses on defining entities and the relationships between them, so that these concepts can then be represented as related tables using ER (Entity-Relationship) diagrams. Data Flow Diagrams (DFDs) are used in business analysis to describe data flows. DFDs clearly show the exchange of information between data stores (not DWHs here, but rather source databases, consumers and other storage locations), the internal processes of the system and the entities external to it. This dynamic method can be used in the conceptual design of Big Data solutions, when the main sources and consumers of data, as well as the processes that handle them, ETL in particular, are determined. Such visualization is useful when designing a modern DWH integrated with data lakes and cloud services for advanced Big Data analytics. Returning to the structural design of DWHs, here are the three most common modeling techniques:

  • multidimensional, for data marts and the core level in the LSA structure, when the analytics needs of business users are known and correctly defined. In this case, the data mart is considered not a physical layer but a set of views over the model core.
  • relational (IDEF1X), for defining relationships between DWH tables, BI dashboards (data marts) and other target systems that are relatively stable and should be integrated with the DWH core model.
  • Data Vault, a hybrid approach that combines the merits of the third normal form (3NF) and the star schema used in a denormalized DWH.

In terms of tooling, all these methods are implemented in various CASE tools that let you create ER diagrams and check their correctness. Such tools are often included in DWH platforms, as is done, for example, in SAP BW/4HANA.

A multidimensional model is visually represented using a cube (or, in the case of more than three dimensions, a hypercube).

Multidimensional modeling is a method of modeling and visualizing data as a set of numerical or linguistic metrics or parameters that describe general aspects of an organization's operations. Typically, it focuses on numerical data such as sales, balances, profit or weight, or on objects that can be counted, such as articles, patents or books. The method rests on the following basic concepts: facts, attributes, dimensions, parameters (metrics), hierarchy and granularity.

A fact is a collection of related data items containing metrics and descriptive data. Each fact usually represents a data element that numerically describes the organization's activities: business transactions or events that can be used to analyze those activities or business processes. An attribute is a description of a characteristic of a real object in the domain; typically, an attribute contains a value, known in advance, that characterizes the fact.

A dimension is an interpretation of a fact from some point of view in the real world. Dimensions contain text values that are strongly related in meaning to each other; they are usually represented as the axes of a multidimensional space whose points are the facts associated with them. A dimension is specified by listing its members, where a dimension member is a unique name or identifier used to determine the position of an element. Often the members of a dimension are in a part-whole or parent-child relationship, which allows one or more hierarchies to be defined on the dimension. Each hierarchy can have multiple levels, and each member of a dimension must belong to exactly one level, so the levels split the dimension into disjoint subsets.

A parameter, metric or indicator is a numerical characteristic of a fact that measures the effectiveness of the organization's activities or business actions with respect to the dimensions. As a rule, a metric contains a previously unknown value of a fact characteristic; the specific values of the metric are described using variables. Granularity, finally, is the level of detail at which the facts are recorded.
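To make these concepts concrete, here is a minimal Python sketch of a small sales cube; the dimension names, members and figures are invented for illustration:

```python
from collections import defaultdict

# Axes of the multidimensional space; members are hypothetical.
dimensions = {
    "product": ["laptop", "phone"],
    "region":  ["north", "south"],
    "month":   ["2024-01", "2024-02"],
}

# Facts: points in the space, addressed by one member from each dimension
# and carrying a numeric metric (sales amount).
facts = {
    ("laptop", "north", "2024-01"): 120_000.0,
    ("laptop", "south", "2024-01"):  80_000.0,
    ("phone",  "north", "2024-02"):  45_000.0,
}

# Every fact must be addressed by exactly one member per dimension.
for product, region, month in facts:
    assert product in dimensions["product"]
    assert region in dimensions["region"]
    assert month in dimensions["month"]

# A parent-child hierarchy on the "month" dimension: month -> quarter.
month_to_quarter = {"2024-01": "2024-Q1", "2024-02": "2024-Q1"}

# Rolling the metric up one hierarchy level: from month grain to quarter grain.
by_quarter = defaultdict(float)
for (product, region, month), amount in facts.items():
    by_quarter[(product, region, month_to_quarter[month])] += amount

print(dict(by_quarter))
```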

From the point of view of their relationship to dimensions, facts can be divided into the following classes (illustrated in the sketch after the list):

  • additive facts. A fact is called additive if it is meaningful to sum it across any of the dimensions to obtain a useful result.
  • semi-additive facts. A fact is called semi-additive if it is meaningful to sum it across only some of the dimensions.
  • non-additive facts. A fact is called non-additive if summing it across any dimension yields no meaningful result.
  • numerical measures of intensity. A fact is called a numerical measure of intensity if, while non-additive over time, it can still be aggregated (for example, averaged) over a certain number of time periods.
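The difference between additive and semi-additive facts is easiest to see on data. The sketch below uses hypothetical daily account balances: a balance can be summed across accounts on a given date, but summing it over time is meaningless, so the last value (or an average) is used instead:

```python
# Hypothetical periodic snapshots: (account, date) -> balance.
balances = {
    ("acc1", "2024-01-01"): 100.0,
    ("acc1", "2024-01-02"): 150.0,
    ("acc2", "2024-01-01"): 200.0,
    ("acc2", "2024-01-02"): 200.0,
}

# Semi-additive: summing ACROSS accounts for one date is meaningful.
total_on_jan2 = sum(v for (acc, day), v in balances.items() if day == "2024-01-02")
print(total_on_jan2)  # 350.0 -- total balance held on that day

# Summing the same fact OVER time just double-counts money...
acc1_over_time = [v for (acc, day), v in balances.items() if acc == "acc1"]
print(sum(acc1_over_time))  # 250.0 -- not a real quantity

# ...so over the time dimension the closing (last) balance is reported instead.
print(acc1_over_time[-1])   # 150.0 -- closing balance, meaningful
```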

Facts in a multidimensional model are usually presented in the form of a fact table. In the logical entity-relationship model, a fact table is represented by an entity whose attributes are the facts (metrics or descriptions) and a composite key that links the fact table to the dimension tables with one-to-many relationships. Based on the level of detail of the facts, fact tables fall into three main categories (sketched after the list):

  • Transactional fact table. Stores facts that record individual events (transactions).
  • Periodic snapshot fact table. Collects facts that capture the current state of a particular line of business at regular intervals.
  • Accumulating (cumulative) snapshot fact table. Collects facts that capture the state a particular business process has reached as of the current point in time.
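The three categories differ in what one row represents. A minimal sketch with invented rows and column names:

```python
# 1. Transactional fact table: one row per event, written once, never updated.
order_transactions = [
    {"order_id": 1, "date_key": 20240105, "amount": 99.0},
    {"order_id": 2, "date_key": 20240106, "amount": 15.0},
]

# 2. Periodic snapshot fact table: one row per entity per period,
# capturing the state of the business line at that moment.
inventory_snapshots = [
    {"sku": "A", "month": "2024-01", "qty_on_hand": 40},
    {"sku": "A", "month": "2024-02", "qty_on_hand": 35},
]

# 3. Accumulating (cumulative) snapshot fact table: one row per process
# instance, updated as milestones are reached; None marks a stage not yet done.
order_fulfillment = [
    {"order_id": 1, "ordered": "2024-01-05", "shipped": "2024-01-07",
     "delivered": None},
]
```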

The main characteristics of the fact table are as follows.

  1. The fact table contains numerical parameters (metrics).
  2. Each fact table has a composite key made up of the primary keys of the dimension tables.
  3. As a rule, the fact table has a small number of fields, no more than about 20.
  4. The data in the fact table has the following properties:
    1. numerical parameters are used for aggregation and summation;
    2. data values must have the property of additivity or semi-additivity with respect to the dimensions in order to be summed;
    3. all data in the fact table must be uniquely identified through the keys of the dimension tables.

The main characteristics of the dimension table are as follows.

  • Dimension tables contain the data that details the facts.
  • Dimension tables contain descriptive information about the numerical values in the fact table.
  • Typically, denormalized dimension tables contain a large number of fields.
  • Dimension tables usually contain significantly fewer rows than fact tables.
  • Attributes of dimension tables are commonly used when visualizing data in reports and queries.

There are several schemas for multidimensional data modeling. Two of them are considered basic: the star schema and the snowflake schema. In more complex cases, so-called “multi-star” schemas with multiple fact tables are used. A star schema has one fact table and multiple denormalized dimension tables. A snowflake schema has one fact table and multiple normalized dimension tables.
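As an illustration, here is a sketch of both schemas in SQLite DDL (via Python's standard sqlite3 module); the table and column names are assumptions made for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Star schema: one fact table whose composite key is made up of the
# dimension tables' primary keys; dimensions are denormalized.
conn.executescript("""
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,
    product_name  TEXT,
    category_name TEXT            -- denormalized: category stored inline
);
CREATE TABLE dim_date (
    date_key  INTEGER PRIMARY KEY,
    full_date TEXT,
    month     TEXT,
    year      INTEGER
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    amount      REAL,             -- additive metric
    PRIMARY KEY (product_key, date_key)
);
""")

# Snowflake variant: the product dimension is normalized, so the category
# attributes move into their own table referenced from the product dimension.
conn.executescript("""
CREATE TABLE dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name TEXT
);
CREATE TABLE dim_product_snowflake (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
""")
conn.close()
```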

IDEF1X is a data modeling language for developing the semantics of data models. It is used to produce graphical information models that reflect the structure and semantics of information within an environment or system. The technique makes it possible to model data in a standard, consistent and predictable manner so that it can be managed as a resource. The main goals of the IDEF1X standard are to provide:

  • A tool for a complete understanding and analysis of information resources of organizations
  • A general means of representing and conveying the complexity of data
  • Methods for the general presentation of data required for the operation of the enterprise
  • Means for defining an application-independent representation of the data that can be verified by users and translated into a physical database design
  • Techniques for deriving an integrated data definition from existing data resources.

The main purpose of IDEF1X is to support integration. The integration approach focuses on capturing, managing and using a single semantic definition of a data resource called a “conceptual schema”. A “conceptual schema” provides a single, integrated definition of the data within an enterprise that is not biased towards any single use of the data and is independent of how the data is physically stored or accessed. The main purpose of a conceptual schema is to provide a consistent definition of the meanings of, and relationships between, data that can be used to integrate, share, and manage the integrity of data. A conceptual schema should have three important characteristics:

  • It is consistent with the business infrastructure and correct across all areas of application.
  • It is extensible, so that new data can be defined without changing previously defined data.
  • It is transformable, both into the required user views and into various data storage and access structures.

Data Vault is one of the data modeling methods used in DWH design. To keep the design simple and to ensure maximum flexibility of the DWH, a minimum of basic concepts is used (a DDL sketch follows the list):

  • Hub: a table that stores the main representation of a business entity from the functional standpoint of the subject area. The hub contains a unique and unchanging business key: one or more fields that identify the entity in business terms. In addition to the business key, the hub contains metadata fields: the time the entity was first loaded into the warehouse and its source.
  • Link: a table that represents a relationship or transaction between two or more business entities, linking them through their respective business keys. It contains the same metadata as the Hub: the load timestamp and the data source. The keys of the Hubs being linked also migrate into the Link, forming a composite key.
  • Satellite: a table with the contextual information for a Hub key. To remain flexible, the structure of Satellites should be capable of storing new or changed detail data. In addition to the key of the parent hub and its context data, the Satellite table contains the typical set of metadata. Satellites can thus store the history of context changes by adding a new record whenever an update occurs in the source system. It is customary to store the context from different source systems in separate satellite tables.
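A minimal DDL sketch of these three table types, again in SQLite via Python; the entity (“customer places order”), the names and the key choices are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: immutable business key plus load metadata.
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,  -- surrogate (e.g. hash) key
    customer_bk   TEXT UNIQUE,       -- business key, never changes
    load_dts      TEXT,              -- initial load timestamp
    record_source TEXT
);
CREATE TABLE hub_order (
    order_hk      TEXT PRIMARY KEY,
    order_bk      TEXT UNIQUE,
    load_dts      TEXT,
    record_source TEXT
);

-- Link: the migrated keys of the Hubs it connects, plus the same metadata.
CREATE TABLE link_customer_order (
    link_hk       TEXT PRIMARY KEY,
    customer_hk   TEXT REFERENCES hub_customer(customer_hk),
    order_hk      TEXT REFERENCES hub_order(order_hk),
    load_dts      TEXT,
    record_source TEXT
);

-- Satellite: context for a Hub key; a new row is added on every change,
-- so (parent key, load timestamp) identifies one version of the context.
CREATE TABLE sat_customer_details (
    customer_hk   TEXT REFERENCES hub_customer(customer_hk),
    load_dts      TEXT,
    name          TEXT,
    city          TEXT,
    record_source TEXT,
    PRIMARY KEY (customer_hk, load_dts)
);
""")
conn.close()
```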

The Data Vault model is created according to the following algorithm:

  1. defining Hubs based on the key business entities and their use in the subject area;
  2. identifying Links by finding possible relationships between business keys and understanding the context in which they operate;
  3. defining Satellites by modeling the context of each business entity and of the transactions (Links) connecting the Hubs;
  4. modeling point-in-time tables derived from the Satellites.

In this case, you must adhere to the following rules:

  1. Hub keys cannot be migrated to other Hubs, so as not to violate the flexibility and extensibility of the Data Vault modeling technique;
  2. Business keys and primary keys of the Hub never change;
  3. Hubs are connected only using entities of the Link type;
  4. The link must link at least two Hubs and can link with other entities of the Link type;
  5. Surrogate keys can be used in Hubs and Links, but not in Satellites;
  6. Hub keys always migrate to Links and to child Satellites with contextual data;
  7. The satellite can be linked to Hubs and Links;
  8. The satellite always contains a Load Date Time Stamp or a numeric foreign key that refers to a stand-alone table with a sequence of timestamps, such as a calendar;
  9. If a Hub has two or more Satellites, you can create a point-in-time table for the convenience of join operations (see the sketch after this list);
  10. Satellites record only changes, without duplicating rows;
  11. Data is distributed according to the structure of the Satellites based on the type of information and the rate of its change.
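To illustrate rule 9, here is a sketch of how a point-in-time (PIT) table is derived for a hub with two satellites that change at different rates; all keys, dates and names are hypothetical:

```python
# (hub key, load timestamp) pairs present in each satellite.
sat_a = {("c1", "2024-01-01"), ("c1", "2024-03-01")}
sat_b = {("c1", "2024-01-01"), ("c1", "2024-02-01"), ("c1", "2024-03-15")}

def latest_at(sat, key, snapshot):
    """Most recent satellite load timestamp for `key` at or before `snapshot`."""
    dates = [d for (k, d) in sat if k == key and d <= snapshot]
    return max(dates) if dates else None

# One PIT row per hub key per snapshot date: joins to each satellite can then
# use these precomputed timestamps instead of correlated "latest row" lookups.
snapshots = ["2024-01-31", "2024-02-28", "2024-03-31"]
pit = [
    {"customer": "c1", "snapshot": s,
     "sat_a_dts": latest_at(sat_a, "c1", s),
     "sat_b_dts": latest_at(sat_b, "c1", s)}
    for s in snapshots
]
for row in pit:
    print(row)
```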

Since DWHs historically evolved from relational DBMSs, they inherited the design approaches adopted in that area, in particular the use of 3NF, the third normal form, in which every non-key attribute of a table must depend on the full key and nothing but the key. In practice, however, this approach revealed the following disadvantages:

  • scalability and flexibility problems due to tightly coupled tables, where adding an extra parent table causes cascading changes to all underlying subordinate tables, and where inserting a new row under an existing parent key forces all child rows to be reassigned to the new parent key;
  • dependence of the primary key on time and on the aforementioned table coupling;
  • difficulties with loading in near-real-time mode;
  • time-consuming queries;
  • top-down design and the corresponding top-down implementation.

To get around these shortcomings, in the second half of the 1980s data modelers adopted the star schema, which is convenient for storing multidimensional indicators; the snowflake schema came into use later. This model, too, has a number of problems that complicate its application in a modern DWH:

  • isolation of subject-oriented information;
  • possible redundancy of data;
  • incompatible query structures;
  • difficulties of scalability;
  • inconsistent granularity of fact tables, which complicates their linking and also limits the design, scalability and flexibility of the data model;
  • problems of synchronization during downloads in near real-time mode;
  • limited presentation of corporate information;
  • inconvenient environment for data mining;
  • the need for top-down design with bottom-up implementation.

This is why the Data Vault method is now regarded as the most promising of these approaches.