How To Use pandas resample on a Database

Naren Krishna

Sep 19, 2023 3 min read

Articles
How To Use pandas resample on a Database image

Introduction to pandas resample

A common time series challenge is having data represented with the wrong time interval for your use case — Perhaps you want your data expressed per hour, but it’s expressed per minute, making it hard to join your data with another hourly data set.

The solution? Resampling. In this article, we describe pandas resample + provide some examples, and then show how you can use it at scale in your database. For more examples of pandas time series functionality (including additional pandas resample details), check out our Professional Pandas article: “Python for Finance: Pandas Resample, Groupby, and Rolling.”

What Does pandas resample do?

pandas resample allows for easy frequency conversion and aggregations across intervals for dataframes and series with certain types of indexes: DatetimeIndex, PeriodIndex, and TimedeltaIndex.

It supports a variety of frequencies to resample by, including minutes, hours, days, quarters, and years and several aggregations to be applied on the data.

There are two types of resampling: downsampling and upsampling. We’ll walk through simple examples of both.

Let’s start by imagining we have a Series with a DatetimeIndex minutes_time and several integer values:
minutes_time	        
2000-01-01 00:00:00	  0
2000-01-01 00:01:00	  1
2000-01-01 00:02:00	  2
2000-01-01 00:03:00	  3
2000-01-01 00:04:00	  4
2000-01-01 00:05:00	  5
2000-01-01 00:06:00	  6
2000-01-01 00:07:00	  7
2000-01-01 00:08:00	  8
Name: a, dtype: int64

In downsampling, we use a wider frequency to grouping the DatetimeIndex by the desired interval. In the examples below, we downsample by interval 3 minutes and 10 minutes, applying the aggregations min and mean respectively.

ser.resample("3min").min()
minutes_time	        
2000-01-01 00:00:00	  0
2000-01-01 00:03:00	  3
2000-01-01 00:06:00	  6
Name: a, dtype: int64
ser.resample("10min").mean()
minutes_time
2000-01-01            4
Name: a, dtype: int64

In upsampling, we use a narrower frequency, resulting in many rows being populated with NaN values. In the example below, we upsample by the interval 30 seconds with the aggregation min.

ser.resample("30S").min()
minutes_time	        
2000-01-01 00:00:00	  0.0
2000-01-01 00:00:30	  NaN
2000-01-01 00:01:00	  1.0
2000-01-01 00:01:30	  NaN
2000-01-01 00:02:00	  2.0
2000-01-01 00:02:30	  NaN
2000-01-01 00:03:00	  3.0
2000-01-01 00:03:30	  NaN
2000-01-01 00:04:00	  4.0
2000-01-01 00:04:30	  NaN
2000-01-01 00:05:00	  5.0
2000-01-01 00:05:30	  NaN
2000-01-01 00:06:00	  6.0
2000-01-01 00:06:30	  NaN
2000-01-01 00:07:00	  7.0
2000-01-01 00:07:30	  NaN
2000-01-01 00:08:00	  8.0
Name: a, dtype: float64

Now that we have seen some basic examples of how pandas resample works, let’s try it out on some large data!

pandas resample on a Database

Ponder lets you use the pandas API on data in your data warehouse, and it does this by converting your pandas code to SQL. We currently support resample on Snowflake, DuckDB, and BigQuery. For more details on our resample API coverage, see our Supported API documentation here.

Let’s go through an example of downsampling on Snowflake. First we initialize Ponder and connect it to our Snowflake data warehouse:

import snowflake.connector
import modin.pandas as pd
import ponder

# Create a Ponder Snowflake Connection Object
snowflake_con = snowflake.connector.connect(
    user=******,
    password=******,
    account=******,
    role=******,
    database=******,
    schema=******,
    warehouse=******,
)

ponder.init()
ponder.configure(default_connection=snowflake_con)
Next we read a Snowflake table with 150 million rows into our dataframe and set the index O_ORDERDATE, which is a datetime column.
df_orders = pd.read_sql("SELECT * FROM PONDER_DEMO.ORDERS_150M", snowflake_con)
df_orders.set_index("O_ORDERDATE", inplace=True)
df_orders

Finally, we resample by a yearly interval and aggregate the results by performing a mean.

df_orders.resample("Y").mean()

In the above table, you can see that the index is now grouped by every year in the index, and the values are an average per group of each of the columns. Our resampling operation was successful!

This 150M-row resample.mean operation took only ~18 seconds on an x-small Snowflake warehouse, while the equivalent operation took ~2 minutes in pandas (2.3 GHz 8-Core Intel Core i9 MacBook Pro). That’s a ~6.5X speedup!

Summary

In this post, we discussed what pandas resample does and how to use it. We then demonstrated running pandas.resample directly in your data warehouse with Ponder.

If you want to give resample a try on your own data warehouse, Try Ponder Today!

Try Ponder Today

Start running your Python data science workflows in your database within minutes!

Get started

Ready to level up your Pandas game?

Try Ponder now