Jupyter is one of the most widely used tools for data exploration, so it's no surprise that data scientists spend most of their time in front of a Jupyter notebook slicing and dicing data. In many cases, the data lives in a Postgres database!
Traditionally, data scientists must pull data from Postgres with SQL and visualize it with Python libraries such as matplotlib. There are two big problems with this approach:
- Postgres isn't optimized for analytical queries. Slow analytics can infect and degrade performance of a transactional workload also running in Postgres.
- Manual, slow process to move data from Postgres to a local machine, load data into memory with Python for visualization. Visualization is also then limited by the memory available locally.
In this blog post, we'll demonstrate how to finally solve these issues with Hydra and JupySQL. Hydra and JupySQL allows users to plot large-scale data sets by:
- Pushing the aggregation step to Hydra
- Running SQL from a notebook
- Using Hydra’s columnar storage, parallel execution, and vectorization on Postgres for speedy analytics
Let’s dive in!
What is JupySQL?
JupySQL is better SQL in Jupyter notebooks. JupySQL is a fork of ipython-sql. To install it, run pip install jupysql.
What is Hydra?
Hydra is building the best data warehouse for developers. Hydra is open source, simple, and speeds through the toughest analytics by adding columnar storage, vectorization, and query parallelization to Postgres. Check out the Hydra repo and add a Github star ⭐!
Tutorial
With JupySQL we’ll combine Jupyter Notebooks and Hydra. This enables data scientists to easily query data in SQL directly from Jupyter. By using Hydra’s Postgres, hundreds or thousands of gigabytes of data can be combined and leveraged. Locally, you can experiment with importing and combining datasets using both Hydra's columnar tables, Postgres built-in row tables, and external data using FDWs.
In the example below, we’ll use the NYC taxi data set, run analytics on average trip time, and efficiently create a histogram on 1.4M observations with Postgres.
Requirements
To run this tutorial, you need to install the following Python packages:
To run the \copy command, you need pgspecial<2. We recommend you getting it via conda install since pip install might give you some issues.
You also need Docker installed and running to start the Hydra instance.
Starting a Hydra instance
Let's fetch the docker image and start the Hydra instance. The next command will take a couple of minutes as it needs to fetch the image from the repository:
Data download
Now, let's fetch some sample data. We'll be using the NYC taxi dataset:
As you can see, this dataset contains ~1.4M rows.
Uploading data to Hydra
We're ready to upload our data; we'll load the JupySQL extension and start the database connection:
Let's create the table; note that Hydra adds a USING columnar option to the CREATE TABLE statement, which will optimize storage for analytical queries.
Let's now upload the data:
Let's now query our data:
We see that the ~1.4M are there. Let's take a look at the first rows:
Hydra allows us to perform analytical queries efficiently. Let's compute the average trip distance given the passenger count:
JupySQL comes with powerful plotting capabilities. Let's create a histogram of trip distance:
We cannot see much since there are some outliers. Let's find the 99th percentile:
Now, let's use this value as a cutoff:
Much better! We just created a histogram of 1.4M observations!
Where to go from here
Clean up
To finish the tutorial, let's shut down the container:
Package versions
For reproducibility, these are the package versions we used:
Conclusion
In this blog post, we walked through how to use Hydra and JupySQL to analyze and plot datasets. Used together, Hydra and JupySQL enable data practitioners to run fast analytical queries and plot large-scale datasets by pushing the aggregation steps to Hydra. If you want to learn more, check out Hydra and JupySQL documentation!