Disclaimer: I am a Microsoft employee, this post represents my personal journey using Microsoft products to explore data warehousing- all my information is gathered from public sources linked through the post
I'm going to tell you something now that might shock you: I don't know anything about data. Despite working as a Power BI educator for the past two years, anything beyond those bright yellow (now green) walls has remained a mystery to me - until now.
In my push to learn more about the world of Data Analysis and Engineering, I have had to challenge nearly every assumption I've ever made about data - from date tables (why do I need a separate table when there's a date column right here!) to visualisations (still a pie chart fan, don't hate me). I've had to teach myself core concepts that I never learned through traditional education.
One concept that has confounded me the most is the data warehouse. The idea of a data warehouse challenged the very core of my understanding of self-service Business Intelligence - simply point the tool to where the data is and start analysing. A data warehouse seemingly added extra complexity, processing, and required expertise that just didn't make sense to me.
I knew, of course, that there must be a reason that warehousing data was important, but I had just never spent the time to uncover more. With the recent launch of Microsoft Fabric (Check out my very cool and professional post on that here) it seemed like the right time to dive into the deep end and figure it out for myself.
Defining a Warehouse
Before I try creating one myself I thought it would be useful to formulate my definition of exactly what a data warehouse is, and why I would use one. After some extensive googling and some great YouTube videos (Shoutout to Alex The Analyst) I was able to come up with the following: A data warehouse is a centralised database created to unify business data for analytics and business intelligence.
So it turns out that, at least on a conceptual level, a data warehouse is much simpler than I thought. A warehouse lets you consolidate data from multiple sources (sales databases, CRM applications, ERP systems) into a single location, where it can be modelled, transformed, and distributed to ensure accurate data is available for reporting across your organisation. I like to think of a data warehouse as a "source of truth": a singular location (or several, depending on the scope and usage of data) that can be relied upon to provide accurate data for analysts to report on, and for business users to access through business intelligence tools like my beloved Power BI.
In my case as a Power BI user, I had originally thought that a data warehouse added additional work to the process of analysing data, but I was wrong. In many cases, a properly created and maintained data warehouse can reduce not only the time it takes to analyse data but also the processing power required, as transformation logic can be reused and processing happens before the data is accessed, meaning minimal work is required on the user's end to shape and form their data.
Now, there are a few drawbacks to the data warehousing paradigm. Data warehouses are traditionally set up to ingest and manage structured data in a relational format, which may not be appropriate in use cases that rely on semi-structured or unstructured data formats (more on that later).
On top of these constraints, setting up a data warehouse requires staff with the expertise to provision and manage the ETL processes that move data from its original sources into the warehouse itself. And depending on the cloud provider you choose and the volume of data you are moving, it can get very expensive very quickly.
Setting up a Data Warehouse
There are plenty of cloud providers and even on-premise solutions in the world of data warehousing, however as the subtitle of today's article suggests I wanted to explore how I could create my own data warehouse to get hands-on experience with the process end-to-end using Microsoft Fabric. As a reminder, Fabric is still in preview, so my experience here today may change as the product develops.
In a typical scenario where you might be considering a data warehouse, there are a lot of factors you would need to weigh up. Each vendor offers several different configurations that you would have to align with your storage, processing, and budget needs. Thankfully, Fabric simplifies this process - you can see licensing details here - but keep in mind that Microsoft solutions can tend to be on the more expensive side, so definitely do your research to see which vendor matches your use case. I will be starting small with my demo today, simply connecting a single data source to the warehouse for business intelligence.
Creating the Warehouse
Creating a data warehouse couldn't be easier in the Microsoft Fabric web view. What was previously a fairly complicated process in Synapse Analytics has been transformed into a single-click solution. Simply head over to your Fabric launch page and select "Synapse Data Warehouse" to get started. As a note: the warehouse will be created in the workspace you currently have selected, so if you're testing like I am, it's probably worth double-checking that you're inside "My Workspace" - although in a real-life deployment, the warehouse would likely live in a collaborative workspace created for it.
On the Synapse Data Warehouse homepage, click "Warehouse (Preview)" under the "New" heading to start.
You will be prompted to give your new data warehouse a name. I've chosen to go with "Test Warehouse", but feel free to choose a more meaningful name, remembering that it can be changed later. Fabric does give you the option to pre-fabricate (no pun intended) a sample warehouse pre-populated with data; this is a great way to get comfortable with the interface. However, I wanted to show an example using real data ingested through a Dataflow.
Once your empty warehouse is created, it's time to start filling it with data! A data warehouse in Fabric/Synapse Analytics is just a fancy MS SQL database (don't tell them I said that), meaning that not only can you load data via a Pipeline or Dataflow, you can even write T-SQL directly against the warehouse to define tables and views in the browser - neat!
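To give you a feel for what that looks like, here's a minimal sketch of the kind of T-SQL you could run in the warehouse's browser editor. The table and column names here are my own invention for illustration - they aren't the actual schema of the file I'll be ingesting later:

```sql
-- Hypothetical sketch: define a table to hold transit trip data
-- (table and column names are assumptions, not the real TfNSW schema)
CREATE TABLE dbo.TransitTrips
(
    TripDate    DATE,
    Travel_Mode VARCHAR(50),
    TripCount   BIGINT
);

-- Once data has landed, you can query it right in the browser
SELECT Travel_Mode, SUM(TripCount) AS TotalTrips
FROM dbo.TransitTrips
GROUP BY Travel_Mode;
```

In practice I'll be letting a Dataflow create and fill my tables for me, but it's nice to know the SQL door is open.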
Now that our warehouse is created, the next step is to start filling it with transformed, business-critical data that can be used for analysis. In today's demonstration, I'm going to use a really simple data source - a CSV file hosted on my OneDrive account. I love that Dataflows allow me to take even the simplest data file and convert it into something meaningful. The CSV file I'm using is provided by Transport for New South Wales (my state government's transit authority) and details trips made on all modes of public transport between July 2016 and May 2023. You can find the file here; be aware that you must sign up to the open data platform before downloading.
So, we have our data and a warehouse to store it; we simply have to bring the two together using a Dataflow. I've chosen a Dataflow to move the data due to my familiarity with them as a Power BI developer, but I am excited to try out data Pipelines as development on them continues.
Creating a Dataflow is just as easy as creating a warehouse: simply navigate to the workspace where your warehouse resides, click "New", and select "Dataflow Gen2".
Once the Dataflow is open, you can choose to import directly from a text or CSV file on the homepage:
You are then prompted to provide your CSV file. This functionality was recently updated to allow you to upload local data files as well, but today I'll point the path to my OneDrive where I've stored the file:
Once you have selected the CSV file and confirmed its contents, that's it! Power BI users will feel right at home at this stage: a Dataflow allows you to use Power Query to transform and filter your data, meaning that users will have exactly the data they need - no more and no less. As a quick example, let's filter out some unnecessary rows of data before they are loaded into our warehouse.
If you've never used the Power Query editor, it's incredibly straightforward to learn. I'll select the "Travel_Mode" column of my data and quickly filter out all rows relating to bus travel, as my warehouse will primarily be used to analyse train usage.
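Behind the scenes, each click in the editor is recorded as a step in the Power Query (M) language. A filter like the one above produces a step along these lines - note that the exact column name and the preceding step name depend on your file, so treat this as a sketch rather than the precise code Fabric generates:

```m
// Sketch of the applied step Power Query records for this filter.
// "Source" and the "Travel_Mode" column name are assumptions about the file.
#"Filtered rows" = Table.SelectRows(Source, each [Travel_Mode] <> "Bus")
```

You never have to write this yourself, but peeking at the generated M in the advanced editor is a great way to understand what your transformations are actually doing.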
Once your transformations are complete, we can set the data destination of this Dataflow at the bottom right-hand side of the screen and select our data warehouse:
After answering a few questions about where my warehouse is and how I want the ingestion of data to behave (creating new tables, appending to existing ones), we are set to publish our Dataflow! Simply select "Publish" at the bottom right side of the screen and grab a cup of tea - the Dataflow will run as soon as it is published, and once it's complete you can customise the frequency of data ingestion from your workspace.
Note: You might notice a few "staging artefacts" appear in your workspace when you create a Dataflow or Pipeline. These are only visible during the Fabric preview and traditionally would not be visible to the average user.
You can check the progress of your Dataflow in your workspace, and once it's complete you can open your warehouse back up and see the results! Data that was once raw and fragmented is now cleaned, transformed, and ready to be analysed. Once the data is successfully ingested into the warehouse it can also be modelled to establish relationships (even between tables pulled from separate sources), but hey, considering I just figured out what a warehouse was, we can explore those options at a later date.
Finally, once our data has been successfully stored in the warehouse, we can get to work. One of my favourite features of the Fabric Data Warehouse is the automatically generated Power BI Dataset that comes with every warehouse. Without any work, I can head back to my workspace, select the Dataset, and start building a Power BI report in seconds without leaving the browser.
What about Lakehouses?
So I've finally mustered the courage to figure out exactly what a data warehouse is, and why I would use one. But, I hear you say, what about the brand new Lakehouse? Thankfully, a Lakehouse is a very similar concept to a data warehouse; however, it can house structured, semi-structured, and unstructured data in one location while maintaining all of the features of a warehouse, like SQL querying and modelling. A Lakehouse is a great approach for organisations that need to provide that same "source of truth" to both data analysis and data science workloads.
I'm very glad I took the time to finally explore the world of data warehousing, I still have plenty to learn, but the discoveries I made while writing this article gave me a really clear idea of how I can use a data warehouse, and how I can talk to data professionals about their usage.
On top of this, I realised that it's really easy to make assumptions when learning something new - if I had spent even a small amount of time exploring what a warehouse was instead of assuming it was irrelevant to me as a Power BI developer, I may have been able to leverage its power much earlier in my data journey.
New tools like Microsoft Fabric blur the line between data analysis and data engineering, and I'm hoping that my experiences in this article will help me to be more open to learning more technical concepts and workflows as I continue my journey in the world of data.
As always, thank you for taking the time to read my thoughts, I hope this has been useful, if not entertaining. If you want to hear more of my thoughts and see my day-by-day progress as I learn more about data analysis, consider following my Twitter account here, or adding me on LinkedIn here.