pandas apply in your database

Karthik Velayutham

Jul 20, 2023 3 min read

Articles
pandas apply in your database image

pandas apply is a powerful operation that lets users execute custom operations on their DataFrames. In this article, we’ll explore how pandas apply works and demonstrate how you can leverage Ponder to directly perform pandas apply on your database.

What is pandas.DataFrame.apply?

pandas apply applies a Python function along an axis of the DataFrame. The function can usually be any user-defined function or one defined in an external library, etc. This functionality provides users a lot of flexibility as they can perform custom computations or analyses at a row or column-wise level. For example, users may specify a custom formula for calculating a particular metric of interest or define a custom scoring function to evaluate the result of a model.

Here is an example of how apply works in practice, we create an example dataframe with two columns. We use apply to create a third column c based on the average of the two existing columns a and b. The operation is applied along the columns as specified by axis=1.

apply allows users to perform custom functions on every value in your dataframe without having to loop over every value in a DataFrame/Series explicitly. It is well known that looping over DataFrames can be slow and inefficient, making your pandas code difficult to optimize. Note that while apply is often considered better practice than looping over something like iterrows(), it is not a vectorized implementation and can still lead to slowness. 

At Ponder, our goal is to provide the same seamless experience of pandas on top of your existing data warehouses. Thus, supporting apply to work in the same fashion you would expect in pandas has been a priority for us. We currently support apply only on Snowflake and DuckDB, but are working hard to get it to work on other DB vendors as well. Here's an example of us running apply with Ponder on Snowflake:

Here, we do a simple calculation where we add 1000 to every value in TRIP_DISTANCE.

We can do something more advanced here to calculate the fare price per passenger, while custom-handling the cases where the passenger count is zero to avoid division by zero error.

How does apply work in Ponder?

The ability to run arbitrary Python functions and leverage pandas apply to apply them to a dataframe seems like magic. Let’s take a look at how this works underneath the hood. 

To run arbitrary Python functions within your database, we create a temporary Python user-defined table function (UDTF) in Snowflake based on the function that the user submitted in the apply.

As an aside, you might have heard of UDFs and are wondering how UDTFs are different. UDFs are scalar functions designed to return a row of values, so the results will be a single row/column. For example, in the pandas apply example at the beginning of this post, every value in column c is equal to the corresponding value in column a plus the corresponding value in column b divided by two. UDTFs are just like UDFs but instead, they return a table of values. So if we want the output to be a DataFrame, UDTFs are the way to go.

Summary

In this quick post, we demonstrate how you can run pandas apply on your DataFrame directly with the data in your data warehouse. If you want to try this out on your own, sign up for Ponder for free here!

Ready to level up your Pandas game?

Try Ponder now