Improve cardinality levels by changing data types and through summarization
We’ll go over two popular approaches for reducing the amount of data put into Import models. Remember that import models are loaded with data that has been compressed and optimized, and then the VertiPaq storage engine saves it to disc. Now you’ll learn about data compression strategies.
The key data reduction approaches used to decrease the data input into the import models will now be discussed.
Optimize data types
Considering the VertiPaq storage engine uses separate data structures for each column. Some of the features of optimizing column data types include –
- Firstly, by design, these data structures achieve the highest optimizations for numeric column data, which use-value encoding.
- Also, text and other non-numeric data, uses hash encoding.
- Further, it requires the storage engine to assign a numeric identifier to each unique text value contained in the column.
- Moreover, it is the numeric identifier, then, that is then stored in the data structure, requiring a hash lookup during storage and querying.
You can also transform raw text data into numeric numbers in some cases. For example, a text value may be regularly prefixed to a sales order number (e.g. “SD123456”). The order number value might be changed to a whole number without the prefix. It can lead to considerable data reduction in big tables, especially when the column includes unique or high cardinality values.
Group by and summarize
This is possibly one of the most effective techniques used to reduce a model size is to load pre-summarized data.
- Firstly, this technique can be used to raise the grain of fact-type tables.
- Secondly, there is a distinct trade-off, however, resulting in loss of detail.
- Moreover, deciding to summarize fact-type data always involves tradeoffs.
- Lastly, tradeoff could be mitigated by a Mixed model design, and this option is described in the Switch to Mixed mode technique.
A source sales fact table, for example, maintains one row per order line. However, by summarising all sales indicators and sorting by date, client, and product, considerable data reduction might be achieved. Consider that grouping by date at the monthly level might result in even more considerable data reduction. This might now result in a 99 percent decrease in model size, but reporting at the day or individual order level is no longer viable.