Scalable Data Science with Ponder on DuckDB

Bala Atur

May 3, 2023 5 min read

Articles
Scalable Data Science with Ponder on DuckDB image

Ponder was founded on the belief that your data science code should not have to change when your data storage platform does. Your data could reside on your laptop, in a database on prem, or in a cloud database. You should be able to use the same code on all these environments and get the same results, while taking advantage of the scalability benefits of your data storage platform.

In keeping with this belief, we’re proud to announce support for DuckDB: Now you can run pandas and NumPy with DuckDB as a backend! Quack quack!

Why Ponder and DuckDB 🦆?

Ponder lets you run your data science workflows (pandas, NumPy) directly in your database, be it Snowflake, BigQuery, or DuckDB. With Ponder, you get the same Python experience you love, but with the power and scalability of data warehouses. Ponder is particularly focused on the pandas API, since that’s a ubiquitous and essential API for data analysis, data science, and AI, with about a quarter of all developers using it frequently. Learn more about Ponder here.

Try Ponder Today

Run pandas and NumPy on DuckDB within minutes!

Get started

DuckDB is a new database that is proving itself to be a robust choice for single-machine, in-process workloads with lots of exciting features announced practically every week.

At Ponder we’re all about meeting the users where they are rather than making them come out of their comfort zone. So DuckDB, thanks to its seamless use as a Python library, was an obvious database for us to support the pandas API on.

What can you do with Ponder and DuckDB?

First, let’s import ponder:

import ponder
ponder.init()

If you have a DuckDB file with data – you can connect to it:

import duckdb
duckdb_con = duckdb.connect("PONDER_DEMO.duckdb"))

You’re off to the races! Now for some pandas. Let’s connect to the CUSTOMER table, which contains 15 million rows.

import modin.pandas as pd
df = pd.read_sql("CUSTOMER", duckdb_con)
time: 812 ms

With Ponder on DuckDB, you can print out a preview of the dataframe — this takes less than 150 milliseconds for our 15-million-row table.

df.head()
		C_CUSTKEY	C_NAME			C_ADDRESS				C_NATIONKEY	C_PHONE		C_ACCTBAL	C_MKTSEGMENT	C_COMMENT
0		13885025	Customer013885025	dI1vjji2UTIEg7a ROATLoYjhb2GtgTM3gw0	17		27-231-844-3541	6307.11		AUTOMOBILE	ogged accounts sleep slyly above t
1		13885026	Customer013885026	90DSwDWWEKzw UIUusyEFsOAkz0E		20		30-235-590-2760	5608.46		BUILDING	quickly final instructions are even, special ...
2		13885027	Customer013885027	a9D8I6SUAPYILYd				19		29-263-210-9815	423.35		BUILDING	wake against the blithely even platelets. bli...
3		13885028	Customer013885028	i,RUQhWznIR3ur6ww,NnJtH3gGTJh		16		26-460-477-8749	7204.22		MACHINERY	lphins sleep after the permanently even packag...
4		13885029	Customer013885029	Xa2A2gh2tbYkZZQl7JxHzL3lbib0e7		4		14-429-237-5917	2402.35		FURNITURE	ial pinto beans. carefully even deposits acros...
time: 125 ms
df.tail()
		C_CUSTKEY	C_NAME			C_ADDRESS					C_NATIONKEY	C_PHONE		C_ACCTBAL	C_MKTSEGMENT	C_COMMENT
14999995	6735020		Customer006735020	vFwQuSJOkRomktiUnXahGp Pg77e3t SGJT20c		5		15-936-504-6793	9498.60		MACHINERY	ular requests cajole slyly after the furiously...
14999996	6735021		Customer006735021	qnsn6jO2qRVJH					4		14-316-923-3675	8984.93		HOUSEHOLD	unts. slyly bold theodolites for the quickly f...
14999997	6735022		Customer006735022	pwWapiNOg5bplf1Ej1WxiRbkigjqONoLN,qJyXEB	2		12-300-419-7326	-214.79		HOUSEHOLD	. carefully ironic deposits along the bold
14999998	6735023		Customer006735023	3xH7YWkqWK,LGZqt bn3Z				3		13-412-507-8812	6904.76		BUILDING	cajole blithely. unusual requests are quickly...
14999999	6735024		Customer006735024	RMJtYJmzCaepvQoUnk				18		28-599-151-2961	7336.20		FURNITURE	even accounts sleep carefully furiously specia...
time: 141 ms

iloc is one of the bread-and-butter Pandas operations that relies on dataframe order. It works just fine, despite DuckDB not having an implicit notion of order — this is all handled by Ponder.

df.iloc[9:99]
	C_CUSTKEY	C_NAME			C_ADDRESS				C_NATIONKEY	C_PHONE		C_ACCTBAL	C_MKTSEGMENT	C_COMMENT
9	13885034	Customer013885034	klO6szIwiZQZocKxFoPu3X3			17		27-270-874-8214	4966.78		AUTOMOBILE	ly express requests. regular, regular accounts...
10	13885035	Customer013885035	SsSgo0uVhQITphOknU vG2okivc Q		0		10-179-934-9269	6760.22		BUILDING	inal foxes grow even, ironic accounts. fi
11	13885036	Customer013885036	C7yL4WlTfA52				20		30-754-512-8897	9097.36		FURNITURE	ic requests. furiously bold foxes use slyly. p...
12	13885037	Customer013885037	79pAnnVAIslyYZ8D4ZheoZJ3hlLX GB		2		12-217-386-5781	8464.12		AUTOMOBILE	sly unusual warthogs are slyly re
13	13885038	Customer013885038	rgXUxZQZZJ6NgEpG1mR8rCii		13		23-705-292-5162	2577.91		AUTOMOBILE	the slyly regular packages. slyly pending acc...
...	...		...			...					...		...		...		...		...
94	13885119	Customer013885119	m9kM1sItxYHb51vR8YSks4soN		4		14-371-781-3754	6150.07		BUILDING	pinto beans sleep around the quickly bold
95	13885120	Customer013885120	POdHO90vuwxSqvAr1B8u,jZTyo6KOvi3Llb	18		28-494-277-9975	6168.26		FURNITURE	refully furiously ironic theodolites. foxes ca...
96	13885121	Customer013885121	BC7D,qGJD0zO4V				18		28-355-267-8882	3988.74		HOUSEHOLD	requests according to the instructions wake a...
97	13885122	Customer013885122	tRzsmr KaBuU4if2ec4k6lS ou61ZTSx9jBO	14		24-573-283-5373	2823.08		AUTOMOBILE	d pinto beans. quickly final ideas cajole care...
98	13885123	Customer013885123	68H0SW9,3mCb3ZLXftQRu,1kg9VZmX1amySgWv	0		10-312-865-3027	2442.52		MACHINERY	refully pending deposits are slowly around the...
90 rows x 8 columns
time: 188 ms

One-hot encoding is a common operation in machine learning, used to transform categorical data to boolean data. One-hot encoding involves moving data to the columns, which isn’t easy in SQL, but is a single line in pandas. All done in less than 200 milliseconds:

pd.get_dummies(df, columns="C_MKTSEGMENT")
		C_CUSTKEY	C_NAME			C_ADDRESS					C_NATIONKEY	C_PHONE		C_ACCTBAL	C_COMMENT						C_MKTSEGMENT_AUTOMOBILE	C_MKTSEGMENT_BUILDING	C_MKTSEGMENT_FURNITURE	C_MKTSEGMENT_HOUSEHOLD	C_MKTSEGMENT_MACHINERY
0		13885025	Customer013885025	dI1vjji2UTIEg7a ROATLoYjhb2GtgTM3gw0		17		27-231-844-3541	6307.11		ogged accounts sleep slyly above t			1			0			0			0			0
1		13885026	Customer013885026	90DSwDWWEKzw UIUusyEFsOAkz0E			20		30-235-590-2760	5608.46		quickly final instructions are even, special ...	0			1			0			0			0
2		13885027	Customer013885027	a9D8I6SUAPYILYd					19		29-263-210-9815	423.35		wake against the blithely even platelets. bli...	0			1			0			0			0
3		13885028	Customer013885028	i,RUQhWznIR3ur6ww,NnJtH3gGTJh			16		26-460-477-8749	7204.22		lphins sleep after the permanently even packag...	0			0			0			0			1
4		13885029	Customer013885029	Xa2A2gh2tbYkZZQl7JxHzL3lbib0e7			4		14-429-237-5917	2402.35		ial pinto beans. carefully even deposits acros...	0			0			1			0			0
...		...		...			...						...		...		...		...							...			...			...			...			...
14999995	6735020		Customer006735020	vFwQuSJOkRomktiUnXahGp Pg77e3t SGJT20c		5		15-936-504-6793	9498.60		ular requests cajole slyly after the furiously...	0			0			0			0			1
14999996	6735021		Customer006735021	qnsn6jO2qRVJH					4		14-316-923-3675	8984.93		unts. slyly bold theodolites for the quickly f...	0			0			0			1			0
14999997	6735022		Customer006735022	pwWapiNOg5bplf1Ej1WxiRbkigjqONoLN,qJyXEB	2		12-300-419-7326	-214.79		. carefully ironic deposits along the bold		0			0			0			1			0
14999998	6735023		Customer006735023	3xH7YWkqWK,LGZqt bn3Z				3		13-412-507-8812	6904.76		cajole blithely. unusual requests are quickly...	0			1			0			0			0
14999999	6735024		Customer006735024	RMJtYJmzCaepvQoUnk				18		28-599-151-2961	7336.20		even accounts sleep carefully furiously specia...	0			0			1			0			0
15000000 rows x 12 columns
time: 187 ms

You have csv files you want to work with along with your data in DuckDB? No problem, just set the default connection as our DuckDB connection and we can now read the CSV as a staged table in DuckDB:

ponder.configure(default_connection=duckdb_con)
# read_csv - staging the table in duckdb
df_csv = pd.read_csv("orders_500k.csv", header=0)
time: 328 ms

Now you can work with this csv file in DuckDB.

df_csv
	O_ORDERKEY	O_CUSTKEY	O_ORDERSTATUS	O_TOTALPRICE	O_ORDERDATE	O_ORDERPRIORITY	O_CLERK		O_SHIPPRIORITY	O_COMMENT
0	247085090	5053234		O		26421.99	1996-03-11	4-NOT SPECIFIED	Clerk000022806	0		special, even accounts. slyly regular requests
1	246557251	5781565		O		150232.09	1996-03-11	2-HIGH		Clerk000096944	0		gular accounts sleep furiously close theodolit...
2	247971940	1066733		O		274524.17	1996-03-11	3-MEDIUM	Clerk000079674	0		to beans sleep above the furiously silent depe...
3	246548960	13399567	O		226992.85	1996-03-11	3-MEDIUM	Clerk000079967	0		beans? idle foxes engage slyly even accounts. ...
4	247110821	11027485	O		211890.95	1996-03-11	2-HIGH		Clerk000074934	0		sleep even excuses. furious
...	...		...		...		...		...		...		...		...		...
499995	184305477	6530809		O		186301.68	1995-08-19	4-NOT SPECIFIED	Clerk000015395	0		l deposits. even, final asymptotes cajole
499996	185794656	10073543	O		138755.17	1995-08-19	4-NOT SPECIFIED	Clerk000024626	0		ts. quickly regular accounts s
499997	184303879	1424971		O		233532.34	1995-08-19	2-HIGH		Clerk000003942	0		pecial ideas. instructions sleep carefully acr...
499998	185796034	5808431		O		105537.88	1995-08-19	5-LOW		Clerk000037482	0		ructions wake: carefully even packages serve b...
499999	184299360	2022085		O		305221.11	1995-08-19	4-NOT SPECIFIED	Clerk000090622	0		ymptotes-- final, unusual accounts sleep slyly...
500000 rows x 9 columns
time: 31 ms

You can even mix-and match! Want to join the csv data with data in DuckDB?

joined = df.merge(df_csv, left_on=["C_CUSTKEY"], right_on=["O_CUSTKEY"], how="inner")

Let’s look at the results, shall we?

joined
	C_CUSTKEY	C_NAME			C_ADDRESS				C_NATIONKEY	C_PHONE		C_ACCTBAL	C_MKTSEGMENT	C_COMMENT						O_ORDERKEY	O_CUSTKEY	O_ORDERSTATUS	O_TOTALPRICE	O_ORDERDATE	O_ORDERPRIORITY	O_CLERK		O_SHIPPRIORITY	O_COMMENT
0	13885025	Customer013885025	dI1vjji2UTIEg7a ROATLoYjhb2GtgTM3gw0	17		27-231-844-3541	6307.11		AUTOMOBILE	ogged accounts sleep slyly above t			411537380	13885025	O		199527.43	1996-05-18	4-NOT SPECIFIED	Clerk000033809	0		ronic requests sleep slyly.
1	13885039	Customer013885039	uME8pef0iTDauXB,MWl VK56Q9T		23		33-756-622-9157	6859.99		AUTOMOBILE	ckages are along the carefully				356399233	13885039	F		125630.09	1994-03-04	1-URGENT	Clerk000056446	0		instructions are ironic excuses. unusual, bold...
2	13885127	Customer013885127	SyUZUT9KUP4S7eBg3aN6oE			24		34-798-434-7839	7572.26		FURNITURE	le slyly grouches. regular dependencies despit...	23341472	13885127	O		188930.45	1995-12-12	1-URGENT	Clerk000027310	0		ages. carefully quick asym
3	13885162	Customer013885162	fSQLEymWgpYCm				0		10-412-546-1921	3522.57		BUILDING	bold, special asymptotes boost unusual packag...	539010085	13885162	O		211823.20	1995-12-02	3-MEDIUM	Clerk000026175	0		requests nag blithely silent pinto beans. slyly
4	13885192	Customer013885192	Tv7Off82RTuc0EQK18NyquXppnGhoHMXgaris4E	17		27-230-872-7428	8792.09		MACHINERY	lly. blithely regular packages are always. flu...	334752419	13885192	F		164490.44	1994-11-05	2-HIGH		Clerk000031240	0		ites integrate after the slyly unusual account...
...	...		...			...					...		...		...		...		...							...		...		...		...		...		...		...		...		...
499995	6734917		Customer006734917	fZU2IYF36md				15		25-846-890-1785	1446.16		AUTOMOBILE	kly about the foxes. slyly furious instruction...	120997858	6734917		O		194310.36	1996-08-30	3-MEDIUM	Clerk000094421	0		he pending requests. pending dol
499996	6734950		Customer006734950	S 6MMlaUQl0S3RWvowKMmkWwNsZo		6		16-486-258-1685	7244.71		MACHINERY	es. furiously bold asymptotes affix furiously ...	326034115	6734950		F		241537.75	1992-05-28	5-LOW		Clerk000095347	0		requests. ideas are after the blithely silent d
499997	6734953		Customer006734953	JK98pi9hSGFHrSZi7H			21		31-422-460-9577	4066.42		HOUSEHOLD	final, regular requests wake blithely careful...	346830726	6734953		P		189769.51	1995-06-02	2-HIGH		Clerk000097403	0		o nag carefully ironically ironic excuses? car...
499998	6734975		Customer006734975	FDxHc31TP0WQmyYCgDs9z			3		13-250-121-7346	6673.06		BUILDING	nts! furiously ironic accounts within the furio		84176453	6734975		O		167591.78	1995-11-28	2-HIGH		Clerk000094241	0		platelets at the ironic ideas print
499999	6735002		Customer006735002	scG9vo6cJArfr				20		30-468-868-9026	9117.82		HOUSEHOLD	es. slyly regular accounts integrate c			54334753	6735002		O		177607.25	1996-04-01	3-MEDIUM	Clerk000049829	0		blithely. theodolites haggle furiously
500000 rows x 17 columns
time: 1.92 s

pandas’ support for datetime operations — especially since it was originally designed for use in financial time-series analysis — is exceptional. Now you can get the same support on DuckDB.

dt_index_df = joined.set_index("O_ORDERDATE")
dt_index_df
		C_CUSTKEY	C_NAME			C_ADDRESS				C_NATIONKEY	C_PHONE		C_ACCTBAL	C_MKTSEGMENT	C_COMMENT						O_ORDERKEY	O_CUSTKEY	O_ORDERSTATUS	O_TOTALPRICE	O_ORDERPRIORITY	O_CLERK		O_SHIPPRIORITY	O_COMMENT	
O_ORDERDATE																
1996-05-18	13885025	Customer013885025	dI1vjji2UTIEg7a ROATLoYjhb2GtgTM3gw0	17		27-231-844-3541	6307.11		AUTOMOBILE	ogged accounts sleep slyly above t			411537380	13885025	O		199527.43	4-NOT SPECIFIED	Clerk000033809	0		ronic requests sleep slyly.
1994-03-04	13885039	Customer013885039	uME8pef0iTDauXB,MWl VK56Q9T		23		33-756-622-9157	6859.99		AUTOMOBILE	ckages are along the carefully				356399233	13885039	F		125630.09	1-URGENT	Clerk000056446	0		instructions are ironic excuses. unusual, bold...
1995-12-12	13885127	Customer013885127	SyUZUT9KUP4S7eBg3aN6oE			24		34-798-434-7839	7572.26		FURNITURE	le slyly grouches. regular dependencies despit...	23341472	13885127	O		188930.45	1-URGENT	Clerk000027310	0		ages. carefully quick asym
1995-12-02	13885162	Customer013885162	fSQLEymWgpYCm				0		10-412-546-1921	3522.57		BUILDING	bold, special asymptotes boost unusual packag...	539010085	13885162	O		211823.20	3-MEDIUM	Clerk000026175	0		requests nag blithely silent pinto beans. slyly
1994-11-05	13885192	Customer013885192	Tv7Off82RTuc0EQK18NyquXppnGhoHMXgaris4E	17		27-230-872-7428	8792.09		MACHINERY	lly. blithely regular packages are always. flu...	334752419	13885192	F		164490.44	2-HIGH		Clerk000031240	0		ites integrate after the slyly unusual account...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1996-08-30	6734917		Customer006734917	fZU2IYF36md				15		25-846-890-1785	1446.16		AUTOMOBILE	kly about the foxes. slyly furious instruction...	120997858	6734917		O		194310.36	3-MEDIUM	Clerk000094421	0		he pending requests. pending dol
1992-05-28	6734950		Customer006734950	S 6MMlaUQl0S3RWvowKMmkWwNsZo		6		16-486-258-1685	7244.71		MACHINERY	es. furiously bold asymptotes affix furiously ...	326034115	6734950		F		241537.75	5-LOW		Clerk000095347	0		requests. ideas are after the blithely silent d
1995-06-02	6734953		Customer006734953	JK98pi9hSGFHrSZi7H			21		31-422-460-9577	4066.42		HOUSEHOLD	final, regular requests wake blithely careful...	346830726	6734953		P		189769.51	2-HIGH		Clerk000097403	0		o nag carefully ironically ironic excuses? car...
1995-11-28	6734975		Customer006734975	FDxHc31TP0WQmyYCgDs9z			3		13-250-121-7346	6673.06		BUILDING	nts! furiously ironic accounts within the furio		84176453	6734975		O		167591.78	2-HIGH		Clerk000094241	0		platelets at the ironic ideas print
1996-04-01	6735002		Customer006735002	scG9vo6cJArfr				20		30-468-868-9026	9117.82		HOUSEHOLD	es. slyly regular accounts integrate c			54334753	6735002		O		177607.25	3-MEDIUM	Clerk000049829	0		blithely. theodolites haggle furiously
500000 rows x 16 columns
time: 907 ms

How about some longer running operations? This next one allows for a rather involved datetime operation, resample, that standardizes the sampling frequency of the data.

dt_index_df.resample("4W")["O_TOTALPRICE"].mean()
O_ORDERDATE
1992-01-05    150849.248284
1992-04-26    151684.353106
1992-05-24    150648.910094
1994-02-27    151260.644243
1994-06-19    151104.287942
1994-07-17    150348.825665
1994-08-14    150693.089773
1994-09-11    151487.427604
1994-10-09    150433.473222
1994-11-06    151049.277947
1994-12-04    153190.038529
1995-01-01    150739.200809
1995-01-29    150783.800724
1995-02-26    151288.344218
1995-03-26    150063.333556
1995-05-21    151170.144389
1995-06-18    151102.866605
1995-07-16    150971.583738
1995-08-13    151443.335792
1995-11-05    151015.251253
1995-12-03    150644.367542
1996-01-28    151879.618669
1996-02-25    153108.216507
1996-03-24    150530.048388
1996-04-21    150544.570320
1996-05-19    152791.916291
1996-07-14    153617.900821
1996-08-11    151389.791442
1996-11-03    150459.415614
1996-12-29    152999.267568
1997-01-26    150501.075554
1997-02-23    150909.408596
Name: O_TOTALPRICE, dtype: float64
time: 3.41 s

We already saw that we can support linear algebra operations. We can also more explicitly support NumPy-style matrix operations too:

arr = pd.get_dummies(df, columns="C_MKTSEGMENT").select_dtypes("number").to_numpy()
time: 47 ms

If we check the type and shape, we can see that it’s a Modin NumPy array with 15 million rows:

type(arr)
modin.numpy.arr.array
arr.shape
(15000000, 8)

Ponder with DuckDB v. vanilla pandas

OK so great, you’ve seen how you can translate pandas and NumPy to DuckDB via Ponder. But why not just use vanilla pandas, and how does Ponder with DuckDB work anyway?

Why use Ponder with DuckDB?

Two reasons: Performance, and scale. We’ll talk about performance here, and explore scale in a later post.

Why does this improve performance? DuckDB is fast (you can read more about DuckDB’s performance v. pandas here), and through Ponder, you can use the pandas API on top of DuckDB.

For example, pandas takes 11.8 seconds to create a dataframe from a file with about 10 million rows:

import pandas
df = pandas.read_csv("orders_big.csv")
df
	O_ORDERKEY	O_CUSTKEY	O_ORDERSTATUS	O_TOTALPRICE	O_ORDERDATE	O_ORDERPRIORITY	O_CLERK		O_SHIPPRIORITY	O_COMMENT
0	247085090	5053234		O		26421.99	1996-03-11		4-NOT SPECIFIED	Clerk000022806	0		special, even accounts. slyly regular requests
1	246557251	5781565		O		150232.09	1996-03-11		2-HIGH		Clerk000096944	0		gular accounts sleep furiously close theodolit...
2	247971940	1066733		O		274524.17	1996-03-11		3-MEDIUM	Clerk000079674	0		to beans sleep above the furiously silent depe...
3	246548960	13399567	O		226992.85	1996-03-11		3-MEDIUM	Clerk000079967	0		beans? idle foxes engage slyly even accounts. ...
4	247110821	11027485	O		211890.95	1996-03-11		2-HIGH		Clerk000074934	0		sleep even excuses. furious
...	...		...		...		...		...			...		...		...		...
9999995	184305477	6530809		O		186301.68	1995-08-19		4-NOT SPECIFIED	Clerk000015395	0		l deposits. even, final asymptotes cajole
9999996	185794656	10073543	O		138755.17	1995-08-19		4-NOT SPECIFIED	Clerk000024626	0		ts. quickly regular accounts s
9999997	184303879	1424971		O		233532.34	1995-08-19		2-HIGH		Clerk000003942	0		pecial ideas. instructions sleep carefully acr...
9999998	185796034	5808431		O		105537.88	1995-08-19		5-LOW		Clerk000037482	0		ructions wake: carefully even packages serve b...
9999999	184299360	2022085		O		305221.11	1995-08-19		4-NOT SPECIFIED	Clerk000090622	0		ymptotes-- final, unusual accounts sleep slyly...
10000000 rows × 9 columns
time: 11.8 s

Ponder on DuckDB gets the job done in 5.75 seconds. There’s similar speedups for other operations as well — DuckDB is able to make use of parallelism via multiple cores, sophisticated query optimization, and compression/vectorization to provide these benefits, all out of the box.

df = pd.read_csv("orders_big.csv")
df
	O_ORDERKEY	O_CUSTKEY	O_ORDERSTATUS	O_TOTALPRICE	O_ORDERDATE	O_ORDERPRIORITY	O_CLERK		O_SHIPPRIORITY	O_COMMENT
0	247085090	5053234		O		26421.99	1996-03-11	4-NOT SPECIFIED	Clerk000022806	0		special, even accounts. slyly regular requests
1	246557251	5781565		O		150232.09	1996-03-11	2-HIGH		Clerk000096944	0		gular accounts sleep furiously close theodolit...
2	247971940	1066733		O		274524.17	1996-03-11	3-MEDIUM	Clerk000079674	0		to beans sleep above the furiously silent depe...
3	246548960	13399567	O		226992.85	1996-03-11	3-MEDIUM	Clerk000079967	0		beans? idle foxes engage slyly even accounts. ...
4	247110821	11027485	O		211890.95	1996-03-11	2-HIGH		Clerk000074934	0		sleep even excuses. furious
...	...		...		...		...		...		...		...		...		...
9999995	184305477	6530809		O		186301.68	1995-08-19	4-NOT SPECIFIED	Clerk000015395	0		l deposits. even, final asymptotes cajole
9999996	185794656	10073543	O		138755.17	1995-08-19	4-NOT SPECIFIED	Clerk000024626	0		ts. quickly regular accounts s
9999997	184303879	1424971		O		233532.34	1995-08-19	2-HIGH		Clerk000003942	0		pecial ideas. instructions sleep carefully acr...
9999998	185796034	5808431		O		105537.88	1995-08-19	5-LOW		Clerk000037482	0		ructions wake: carefully even packages serve b...
9999999	184299360	2022085		O		305221.11	1995-08-19	4-NOT SPECIFIED	Clerk000090622	0		ymptotes-- final, unusual accounts sleep slyly...
10000000 rows x 9 columns
time: 5.75 s

How does Ponder work with DuckDB under the hood?

To the end user, coding in pandas with Ponder on DuckDB might at first feel indistinguishable from normal pandas (with the exception that it’s faster), but under the hood, the two are completely different. Ponder on DuckDB relies on DuckDB for storage and compute. You can think of Ponder as letting users use a pandas API on top of data warehouses.

Moving between the pandas API and SQL

With Ponder on DuckDB, it’s also very easy to persist your dataframe as a DuckDB table and query it with SQL. To save the dataframe, all you need to do is use the pandas to_sql command:

df.to_sql("New Orders", index=False, con=duckdb_con)
time: 3.16 s

Now you can work with it using SQL directly, via DuckDB. When we pull all table names, you can see the “New Orders” table listed among the results:

duckdb_con.execute('SELECT * FROM duckdb_tables;').df()
	database_name	database_oid	schema_name	schema_oid	table_name		table_oid	internal	temporary	has_primary_key	estimated_size	column_count	index_count	check_constraint_count	sql
0	PONDER_DEMO	4		main		856		CUSTOMER		862		False		False		False		15000000	8		0		0			CREATE TABLE "CUSTOMER"("C_CUSTKEY" INTEGER, ...
1	PONDER_DEMO	4		main		856		ORDERS_1M		864		False		False		False		1000000		9		0		0			CREATE TABLE "ORDERS_1M"("O_ORDERKEY" INTEGER,...
2	PONDER_DEMO	4		main		856		CUSTOMER_TEMP		866		False		False		False		15000000	8		0		0			CREATE TABLE "CUSTOMER_TEMP"("C_CUSTKEY" INTEG...
3	PONDER_DEMO	4		main		856		ORDERS_1M_TEMP		868		False		False		False		1000000		9		0		0			CREATE TABLE "ORDERS_1M_TEMP"("O_ORDERKEY" INT...
4	PONDER_DEMO	4		main		856		New Orders		1435		False		False		False		10000000	9		0		0			CREATE TABLE "New Orders"("O_ORDERKEY" BIGINT,...
5	temp		1307		main		1308		PONDER_OJTYPFNERN	1318		False		True		False		10000000	11		0		0			CREATE TABLE "PONDER_OJTYPFNERN"("O_ORDERKEY" ...
time: 78 ms

And finally, you can work with it again using the pandas API through the read_sql command:

new_orders = pd.read_sql("New Orders", con=duckdb_con)
new_orders
	O_ORDERKEY	O_CUSTKEY	O_ORDERSTATUS	O_TOTALPRICE	O_ORDERDATE	O_ORDERPRIORITY	O_CLERK		O_SHIPPRIORITY	O_COMMENT
0	247085090	5053234		O		26421.99	1996-03-11	4-NOT SPECIFIED	Clerk000022806	0		special, even accounts. slyly regular requests
1	246557251	5781565		O		150232.09	1996-03-11	2-HIGH		Clerk000096944	0		gular accounts sleep furiously close theodolit...
2	247971940	1066733		O		274524.17	1996-03-11	3-MEDIUM	Clerk000079674	0		to beans sleep above the furiously silent depe...
3	246548960	13399567	O		226992.85	1996-03-11	3-MEDIUM	Clerk000079967	0		beans? idle foxes engage slyly even accounts. ...
4	247110821	11027485	O		211890.95	1996-03-11	2-HIGH		Clerk000074934	0		sleep even excuses. furious
...	...		...		...		...		...		...		...		...		...
9999995	184305477	6530809		O		186301.68	1995-08-19	4-NOT SPECIFIED	Clerk000015395	0		l deposits. even, final asymptotes cajole
9999996	185794656	10073543	O		138755.17	1995-08-19	4-NOT SPECIFIED	Clerk000024626	0		ts. quickly regular accounts s
9999997	184303879	1424971		O		233532.34	1995-08-19	2-HIGH		Clerk000003942	0		pecial ideas. instructions sleep carefully acr...
9999998	185796034	5808431		O		105537.88	1995-08-19	5-LOW		Clerk000037482	0		ructions wake: carefully even packages serve b...
9999999	184299360	2022085		O		305221.11	1995-08-19	4-NOT SPECIFIED	Clerk000090622	0		ymptotes-- final, unusual accounts sleep slyly...
10000000 rows x 9 columns
time: 1.03 s

Conclusion

As you can see through the demo above, you can run pandas operations on DuckDB seamlessly, while leveraging the parallel execution of DuckDB.

The best part is, if you decide to move to some other database (on premise or on cloud), all your code remains EXACTLY the same. (The only thing that changes is the connection.) That’s the magic of Ponder: we speak both pandas and SQL so your databases (and you) don’t have to!

If you’re interested in trying Ponder, reach out here!

Ready to level up your Pandas game?

Try Ponder now