Welcome to the second installment in our series on Hydra External Tables! Follow us on Twitter, LinkedIn, or Discord to be the first to hear about the next installment in our series and other Hydra announcements. Part 1 of the series is available here.
Hydra is an open source Snowflake alternative that transforms Postgres into an elastic data warehouse with columnar storage and external tables. In part 1, we discussed using external tables to access and join data that live outside Hydra. In this post, we are going to discuss the reverse direction - populate data to the Hydra data warehouse from other databases. In this tutorial, we use RDS as the external database to Hydra.
Why populate data to Hydra from foreign databases?
In part 1 of our External Tables series, we showed how External Tables can be used to quickly run ad-hoc analytics queries on data outside of a Hydra data warehouse. When an External Table’s data size is large, the query becomes more expensive. No need to fear - you can also use External Tables to continuously duplicate data to your Hydra data warehouse automatically! For large data volumes, create a synchronous duplicate of the data inside the warehouse for faster query execution. Check out this tutorial to see how!
Data Modeling 101: Star Schema
In the tutorial below, we will adopt a data modeling approach known as the star schema. Let’s run through some basic concepts.
Star schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a star shape. It is a fundamental approach that is widely used to develop or build a data warehouse. It requires classifying model tables as either dimension or fact.
Fact tables record measurements or metrics for a specific event. A fact table contains dimension key columns that relate to dimension tables and numeric measure columns. Dimension tables describe business entities - the things you model. Entities can include products, people, places, and concepts including time itself. A dimension table contains a key column (or columns) that acts as a unique identifier and other descriptive columns. Fact tables are best as columnar tables, while dimension tables may be best as row tables due to their size and rate of updates.
The following diagram shows the star schema that we are going to model for the tutorial:
Tutorial
In this tutorial, we are going to have a sales fact columnar table, a customers dimension row table, and a products dimension row table in Hydra. The customers row table is auto-synced from RDS Postgres using a trigger where RDS is the canonical source of this table.
The following is a companion video for this tutorial:
Setup
In Hydra, we create the sales, products, and customers table. We then add some sample data to the sales and products table. The customers table is empty and will be auto-populated from RDS at a later step.
Link to the Hydra columnar table
In RDS Postgres, we create an external table linking to the Hydra customers table:
Trigger to sync with the Hydra columnar table
In RDS, we create customers as a row table and add a trigger to sync with the Hydra customers table:
Query from Hydra columnar table
In Hydra, check that the data is populated from RDS to the Hydra customers table:
Perform analysis in Hydra
In Hydra, we can perform interesting data analysis like product sales by country:
Wrap Up
This post explored populating dimension data from RDS Postgres to Hydra using a star schema model. We have shown the ability to auto-sync data outside your data warehouse using a trigger.
We will explore more Hydra integration patterns in the next few posts. Make sure to follow us on Twitter, LinkedIn, or Discord to get notified. In the meantime, you can learn more about External Tables in our documentation. The Hydra database is open sourced on GitHub; please star it if you like what you see in this post 🙂.