A better way to Model Data

Hello my name is Mark Hargraves, the creator of the Spider Schema. The Spider Schema has been successfully used in real world Business Intelligence Analytics / Reporting for over 20 years. It allows for Cross Dimensional Hierarchies with no changes to the underlying database schema, adding new metrics does not change the Fact Table Schema, faster Analytical processing times, up to 50% less data storage (depending on project), easier data modeling, fast distinct count processing, better layout in reports (no granularity issues), Big Data support, replacing both OLTP and OLAP models, plus so much more.

Recently I was contacted to repost my article about the Spider Schema Data Model with the latest changes. The original site, which is no longer running was spider-schema.info. I am planning on creating a new website that goes into far more detail about the Spider Schema in the near future. There are older articles, some have been removed talking about the Spider Schema and its value. Here is a link to a website that still has some info on it: Bridge Between OLTP and OLAP.

To start with, it was back in 2004 when a manager asked me to create a Cross Dimensional Hierarchy in a SSAS Multi-Dimensional cube (long before Tabular was available) where new Hierarchies could be added without changes to the underlying database schema. I was given 2 weeks to accomplish this.

To create a Cross Dimensional Hierarchy at that time, there was only one option: to build out a physical table to support the hierarchy. This is because all the relationships between Dimensions are stored in the fact table.

To accomplish this, I removed all the Dimensional Foreign Keys from the Fact Table, and created a new table which at first, I called the Intermediate Dimension (later I renamed it to the Relationship Dimension). When creating this new Dimension, I had to create a distinct set of those foreign keys to prevent duplicated data. What I found doing this over the years, is that depending on the data, a decent savings in rows (data) can be accomplished.

I then was able to load the data into SSAS with no issues, and create at will, as many hierarchies that spanned multiple dimensions as desired by simply including the Intermediate Dimension (where all the attributes were hidden). I also noticed that the Multidimensional Model with the new schema processed quite a bit faster since the keys in the “Intermediate Dimension” were all integers, and were already a distinct set of values.

I needed a name for the new Schema. When I viewed the Data Source View in SSAS, I noticed that the Schema was shaped like a Spider. Therefore, I named it the Spider Schema.

Believe it or not, I found modeling data was easier in the Spider Schema. In projects where I was hired over the next 10 years, I created both a Star and Spider Schema in the same Data Warehouse, and then different Cubes from them. As time passed, the organizations using the Analytics Platforms preferred the Spider Schema for several reasons. One being that they did not need both solutions, the second the Spider Schema was better in every way, and had no drawbacks.

After proving the viability of the new Schema, I created a website, and tried to promote the Spider Schema. I found myself spending half my evenings answering questions from people who wanted to try using the Spider Schema. Unfortunately, paying bills came before working for free, and eventually I stopped answering emails from new users. I let the website go, and focused on work.

Later in my career I came upon a new challenge, Big Data. Or, data that the Star Schema was not able to support. I was tasked with building a Data Warehouse Schema where the data for 5 years from a SAP ERP System could be loaded into SSAS. Many of the ERP transaction tables stored 10 million + rows / month.

I rolled up my sleeves, and went to work. Reflecting on a conversation I had years prior, an engineer who wrote the code for MS SQL Server explained to me, that smaller width tables (tables with less columns) were faster than larger tables with fewer rows.

Reflecting on this conversation, and having delt with another project like that, I then came up with a new design for the Fact Table. Instead of creating a column for each Measure in the Fact Table, I created a new Dimension Table that stored all the Measure Names in it, which I called Measures. For each Measure I wanted to store in the new fact table, I added the name of the Measure to the Measures Dimension. I then stored the Primary Key of the Measure Dimension in the Fact Table as a Foreign Key. In the end, the Fact table had the following columns:

1. Primary Key which was a Surrogate Key

2. Foreign Key to the “Intermediate Dimension”

3. Foreign Key to the Measure /Metric Dimension.

4. Measure / Metric value as a float.

Later, I renamed the Intermediate Dimension to the Relationship Dimension.

This is what the new Schema looked like:

What I got from this Schema was far more than I ever dreamed possible. Not only did it support significantly larger sets of data, but I could add metrics (or not include values) without making any changes to the underlying database Schema. This saved large amounts of data storage space as null or zero based values were not stored in the Fact Table.

If I had the Measures: Sales Qty and Price in my Fact Table; and later wanted to add Costs, I could simply add Cost as a value to my Measure Dimension, and the corresponding Measures to the Fact Table as they applied to a transaction. All with no changes to the Schema. Additionally, I no longer needed to include zero based values, saving even more space.

Even more incredible is how data appeared visually. I no longer had data granularity issues, where some data did not align with other data because it was stored in a different Fact Table, and that Fact Table had a different set of Dimensional Data. The data simply had a much better layout in all reports. I was able to customized how the data was stored at a transactional level.

I continued to use the Spider Schema as more time / projects passed. I was then hired to Create a new POC for an ERP System that used the Spider Schema as a replacement for both the OTLP and OLAP Schemas on a simple POC. The goal was to put data into a single Schema that could store data for both OLTP and OLAP needs. This project was also successful.

There is no limit on how many Relationship tables or Fact Tables can be used. I have created Relationship Dimensions to support numerous Date Dimensions (Effective From ~ Effective To) for Insurance Domain data.

I personally have procrastinated documenting and putting this information out to the public for several reasons. The first and main reason is time. Working full time, and having a side project like this never works for me. I need free time to think and put together my ideas without interruptions. Secondly, supporting the Spider Schema before had both positive and negative aspects to it.

Follow me in Social Media

Contacts