TLDR: In this post, we compare Pandas vs. SQL on the first of three axes: conciseness. We discuss 10 operations that are much more straightforward in Pandas than in SQL, spanning data cleaning, machine learning, feature engineering, linear algebra, and spreadsheet functions.
In our previous post we discussed how Pandas dataframes are much like a food court, while databases offering SQL are like Michelin-star restaurants, and how both dataframes and databases have been around in some form since the 1970s. In this post we’ll dive into the comparison between Pandas and SQL on conciseness: the ability to express sophisticated data transformation operations succinctly.
So, given that both Pandas and SQL support data transformation and analysis, one may ask: Is Pandas really more concise than SQL, the original database query language? In this post we show that there are many operations that are much easier to do in Pandas than in SQL. We’ll walk through ten examples, most of which use this Goodreads dataset from Kaggle, listed below with handy links!
- Normalizing columns uniformly
- One-hot encoding
- Cleaning columns with missing values
- Covariance computation
- Element-wise combination of dataframes
- Transpose
- Programmatically operate on column names
- Point updates
- Pivot & Melt
- Moving average of a time series
Or you can skip ahead to the conclusion!
Prelude: loading data
We use Pandas to load the Goodreads data from the Kaggle CSV file. The CSV doesn't correctly escape commas in string values, e.g., the author "Sam Bass Warner, Jr./Sam B. Warner" on line 3350 is not surrounded by quotation marks. We can have Pandas skip these malformed lines by passing read_csv the parameter on_bad_lines='skip'.
Pandas will also try to infer data types from the CSV, and will by default make the bookID, isbn, and isbn13 columns numerical, whereas we prefer to treat them as strings, so we override this step.
In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv("books.csv", on_bad_lines='skip', dtype={'bookID': str, 'isbn': str, 'isbn13': str})
1. Normalizing columns uniformly
Normalization is a common feature engineering operation; it involves scaling the numerical values in a column so that they fall in a certain range, or so that the resulting distribution of values has low variance. Normalizing input features prior to machine learning model training helps the eventual models achieve high accuracy. With Pandas, we can easily pick out only the numerical columns and normalize them all at once:
In [2]:
x = df.select_dtypes(include='number').columns
(df[x] - df[x].mean())/df[x].std()
Out [2]:
SELECT
(
(average_rating - AVG(average_rating) OVER ()) /
STDDEV_POP(average_rating) OVER ()
),
(
(num_pages - AVG(num_pages) OVER ()) /
STDDEV_POP(num_pages) OVER ()
),
(
(ratings_count - AVG(ratings_count) OVER ()) /
STDDEV_POP(ratings_count) OVER ()
),
(
(text_reviews_count - AVG(text_reviews_count) OVER ()) /
STDDEV_POP(text_reviews_count) OVER ()
)
FROM
Books
Note that there is a way to filter columns by data type in SQL via auxiliary metadata tables (e.g., sys.tables or information_schema), but that approach is equally clunky, and it is not universally supported across all SQL engines.
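As an aside, the paragraph above also mentioned scaling values into a fixed range. Under the assumption that min-max scaling to [0, 1] is the variant wanted, that too is a one-liner in Pandas; this is a small sketch of our own (num_cols is a name we introduce here), not part of the original notebook:
# Min-max scaling of all numeric columns to the [0, 1] range
num_cols = df.select_dtypes(include='number').columns
(df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())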
2. One-hot encoding
Most machine learning packages expect numeric or boolean matrices as input. One-hot encoding is a feature engineering operation that is commonly used to transform a single categorical attribute to a set of binary attributes, which machine learning packages like Scikit-learn can interpret more easily.
For example, we might want to one-hot encode language_code:
In [3]:
df['language_code']
Out [3]:
To one-hot encode the column with Pandas, we’d write:
In [4]:
pd.get_dummies(df['language_code'])
Out [4]:
Doing the same thing in SQL is much harder. For each column we want to one-hot encode, we have to write as many clauses as there are distinct values of that column. Imagine a column with thousands of distinct values (e.g., the more than 3,000 counties in the US): the resulting SQL query would involve thousands of clauses, which is practically impossible for a human being to write correctly by hand.
SELECT
IF(language_code = 'eng', 1, 0) as eng,
IF(language_code = 'en-US', 1, 0) as en_US,
IF(language_code = 'fre', 1, 0) as fre,
[... 20 additional IF clauses ...]
IF(language_code = 'nor', 1, 0) as nor,
IF(language_code = 'tur', 1, 0) as tur,
IF(language_code = 'gla', 1, 0) as gla,
IF(language_code = 'ale', 1, 0) as ale
FROM
Books
There are other complications. Before we can even write the SQL query above, we need to identify the distinct values in the column we want to one-hot encode, which requires a separate query, such as:
SELECT
DISTINCT language_code
FROM
Books
Finally, whenever the set of distinct values in the column changes, the SQL query has to change too. If we forget to update it, we may silently fail to encode some values, impacting downstream applications and models.
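It's also worth noting (a small sketch of our own, not from the original post) that pd.get_dummies can be applied to the whole dataframe, expanding language_code into binary columns while leaving everything else untouched, which is convenient when assembling features for a library like Scikit-learn:
# One-hot encode language_code in place, keeping the remaining columns;
# the new columns are named language_code_eng, language_code_en-US, and so on
encoded = pd.get_dummies(df, columns=['language_code'])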
3. Clean columns with missing values
Imagine that we want to discard columns where too many rows are missing a value for that column, as denoted by the value NULL or NA. Doing this is a common data cleaning step in machine learning, because such columns are unlikely to be valuable features for the eventual model.
In our example, we discard columns where more than 1% of the records are missing values. In Pandas, we can generate a boolean mask telling whether each entry in the dataframe is null, take the mean of that mask down each column to get the fraction of missing values, and use a boolean indexer to keep only the columns where that fraction stays below 1%:
In [5]:
df.loc[:, df.isnull().mean() < .01].columns
Out [5]:
SELECT
(COUNT(CASE WHEN book_id IS NULL THEN 1 END) /
CAST(COUNT(*) AS FLOAT)) AS book_id_null_fraction,
(COUNT(CASE WHEN title IS NULL THEN 1 END) /
CAST(COUNT(*) AS FLOAT)) AS title_null_fraction,
[... 8 additional clauses ...]
(COUNT(CASE WHEN publication_date IS NULL THEN 1 END) /
CAST(COUNT(*) AS FLOAT)) AS publication_date_null_fraction,
(COUNT(CASE WHEN publisher IS NULL THEN 1 END) /
CAST(COUNT(*) AS FLOAT)) AS publisher_null_fraction
FROM Books
We can then write another statement that keeps only the columns where the fraction is smaller than .01. However, this is a painful process, and the time it takes to write this query is proportional to the number of columns. The extensive duplication of query logic across clauses also makes it more likely that errors creep in.
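For comparison with that two-statement SQL process, here is the same Pandas logic unrolled into two explicit steps; the variable names below are our own:
# Step 1: fraction of missing values in each column
null_frac = df.isnull().mean()
# Step 2: keep only the columns below the 1% threshold
clean_df = df.loc[:, null_frac < .01]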
4. Covariance computation
The covariance matrix gives some sense of how each pair of variables varies together. It's used in principal component analysis, a common technique in exploratory data analysis and machine learning. It's also useful as a step in feature selection: removing extraneous features to help make machine learning algorithms run more efficiently and effectively. With Pandas we can get a matrix of covariances across all numerical columns with:
In [6]:
df.cov() * (len(df) - 1)
Out [6]:
Note that we multiply df.cov() by len(df) - 1 because Pandas normalizes the covariance. The reason this is so simple is that Pandas lets you treat the dataframe as a matrix and apply linear algebraic operations to it.
With SQL we would first use a WITH clause to calculate the mean for each column, then sum the product (X - X_m) * (Y - Y_m) for each pair of variables X and Y, where X_m and Y_m are the corresponding means.
WITH t1 AS (
SELECT
text_reviews_count,
AVG(text_reviews_count) OVER() AS avg_text_reviews_count,
ratings_count,
AVG(ratings_count) OVER() AS avg_ratings_count,
num_pages,
AVG(num_pages) OVER() AS avg_num_pages,
average_rating,
AVG(average_rating) OVER() AS avg_average_rating
FROM Books
)
SELECT
SUM(
(text_reviews_count - avg_text_reviews_count) *
(ratings_count - avg_ratings_count)
) AS text_reviews_count_vs_ratings_count,
SUM(
(text_reviews_count - avg_text_reviews_count) *
(num_pages - avg_num_pages )
) AS text_reviews_count_vs_num_pages,
[... additional clauses ...]
SUM(
(ratings_count - avg_ratings_count) *
(average_rating - avg_average_rating)
) AS ratings_count_vs_average_rating,
SUM(
(num_pages - avg_num_pages) *
(average_rating - avg_average_rating)
) AS num_pages_vs_average_rating
FROM t1
Again, this is a very long query, with a number of clauses proportional to the square of the number of attributes. The query also returns a single flat row of pairwise values rather than reproducing the covariance matrix itself.
Some databases, e.g., Redshift and BigQuery, have a correlation function and/or let us write a user-defined function to compute correlation, instead of writing the correlation formula out for each pair of columns. Even then, we would still have to write a clause for each pair of numerical columns, or join the entire table against itself, as in the second solution linked here.
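As a quick sanity check of the formula above (our own addition, not part of the original post), we can confirm for one pair of columns that the sum of products of deviations matches the unnormalized Pandas covariance:
# Verify SUM((X - X_m) * (Y - Y_m)) against Pandas for one pair of columns
# (assumes these two columns have no missing values)
col_x, col_y = df['ratings_count'], df['text_reviews_count']
manual = ((col_x - col_x.mean()) * (col_y - col_y.mean())).sum()
from_pandas = df[['ratings_count', 'text_reviews_count']].cov().iloc[0, 1] * (len(df) - 1)
assert np.isclose(manual, from_pandas)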
5. Element-wise combination of dataframes
In [7]:
# We don't have a second ratings source handy, so use a copy of the
# Goodreads data as a stand-in "Otherreads" dataset
otherreads_df = df.copy()
# Keep only the numerical columns of each dataframe
goodreads_numerical = df.select_dtypes(include='number')
otherreads_numerical = otherreads_df.select_dtypes(include='number')
# Weighted element-wise combination of the two dataframes
.8 * goodreads_numerical + .2 * otherreads_numerical
Out [7]:
In SQL, we'd have to write out the weighted combination for each numerical column, and a separate SELECT subclause for each non-numerical column. In Pandas, we effectively joined on the row number, but in SQL there's no concept of row number, and we have to assume that a column like bookID is common between the datasets so that we can join them. Once again, the size of this query is proportional to the number of attributes.
SELECT
.8 * Goodreads.average_rating + .2 * Otherreads.average_rating AS average_rating,
.8 * Goodreads.num_pages + .2 * Otherreads.num_pages AS num_pages,
.8 * Goodreads.ratings_count + .2 * Otherreads.ratings_count AS ratings_count,
.8 * Goodreads.text_reviews_count + .2 * Otherreads.text_reviews_count as text_reviews_count
FROM
Goodreads
LEFT JOIN
Otherreads
ON
Goodreads.bookID = Otherreads.bookID
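And if the two sources really did share only a bookID key, as the SQL above assumes, the weighted combination stays short in Pandas once both sides are indexed by that key. A minimal sketch, assuming bookID uniquely identifies rows (the variable names are ours):
# Align both dataframes on bookID, then combine their numeric columns
g = df.set_index('bookID').select_dtypes(include='number')
o = otherreads_df.set_index('bookID').select_dtypes(include='number')
.8 * g + .2 * o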
6. Transpose
Transposing a dataframe flips it so that rows become columns and columns become rows. In Pandas this is a one-liner:
In [8]:
df.T
Out [8]:
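Standard SQL has no general-purpose transpose operator, so flipping a table would mean pivoting it by hand or reshaping the data outside the database. In Pandas, transposing a small slice is also a handy way to eyeball a few wide records; a small sketch of our own:
# Show the first three books with their columns laid out as rows
df.head(3).T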
7. Programmatically operate on column names
Often, when a dataset has many columns, it can be overwhelming to examine them all at once. You may want to focus your attention on a subset of columns with similar names, since similar names usually indicate similar functions. Pandas makes it easy to select columns whose names match certain criteria. For example, to select all columns of our dataset that contain the string isbn, we'd do:
In [9]:
df.filter(regex='isbn')
Out [9]:
Selecting column names that match a pattern requires filtering based on the table's schema, which isn't easily possible with a SQL query (except via the tedious information_schema approach described in section 1 above).
Instead, we have to search on our own for columns whose names match our target string and manually select them, but that becomes difficult and error-prone when the number of columns is large. We’d then have to update the list of columns in our query whenever the table schema changes.
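The same idea extends to other programmatic operations on column names. The lines below are small sketches of our own, not from the original post:
# Select every column whose name contains the substring 'count'
df.filter(like='count')
# Rewrite the column names themselves, e.g. lowercasing them all
df_lower = df.rename(columns=str.lower)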
8. Point Updates
Often, some rows of a dataset have errors or anomalies that break subsequent data processing steps. In such cases, it's useful to fix the errors and anomalies point by point, directly. For example, suppose that we find out that the average_rating value of just the first row is wrong, and we need to fix it. In Pandas we would write:
In [10]:
# Overwrite the average_rating of the first row (row position 0)
df.iloc[0, df.columns.get_loc('average_rating')] = 4.0
9. Pivot and Melt
In [11]:
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two','two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6],
'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df
Out [11]:
In [12]:
df.pivot(index='foo', columns='bar', values='baz')
Out [12]:
To perform the same pivot on the above dataset in SQL, we’d have to write something like:
SELECT
foo,
MIN(CASE WHEN bar = 'A' THEN baz END) AS A,
MIN(CASE WHEN bar = 'B' THEN baz END) AS B,
MIN(CASE WHEN bar = 'C' THEN baz END) AS C
FROM
Data
GROUP BY foo
The above query has a separate CASE expression in the SELECT clause for each possible value of the columns column, bar. Consequently, the length of the query is proportional to the number of distinct values in bar. Once again, if the distinct values change, the query has to change.
Pandas also supports melt, the inverse of pivot. Some SQL engines, such as Microsoft SQL Server and Spark SQL, support unpivoting (for example via an UNPIVOT clause), but others do not.
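To make that inverse relationship concrete, here is a small sketch of our own that melts the pivoted result back into the original long format:
# Pivot to wide format, then melt back to the original long format
pivoted = df.pivot(index='foo', columns='bar', values='baz')
pivoted.reset_index().melt(id_vars='foo', var_name='bar', value_name='baz')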
10. Moving average of a time series
In [13]:
import numpy as np
# Synthetic daily counts for 2018: one row of site visits, one of purchases
visits = pd.DataFrame(np.random.randint(0, 1000, size=(2, 365)),
                      columns=pd.date_range(start='1/1/2018', end='12/31/2018'),
                      index=['visits', 'purchases'])
visits = visits.rename_axis('date', axis='columns')
visits
Out [13]:
We might want a rolling weekly average to smooth out noise in the data. In Pandas, we can do:
In [14]:
average = visits.rolling(7, axis=1).mean().dropna(axis=1)
# Label the rows as rolling weekly averages
average.index += " (r. w. a.)"
average
Out [14]:
In SQL, an equivalent query would look like:
SELECT
(2018_01_01 +
2018_01_02 +
2018_01_03 +
2018_01_04 +
2018_01_05 +
2018_01_06 +
2018_01_07) / 7 AS 2018_01_07,
(2018_01_02 +
2018_01_03 +
2018_01_04 +
2018_01_05 +
2018_01_06 +
2018_01_07 +
2018_01_08) / 7 AS 2018_01_08,
[...]
(2018_12_25 +
2018_12_26 +
2018_12_27 +
2018_12_28 +
2018_12_29 +
2018_12_30 +
2018_12_31) / 7 AS 2018_12_31
FROM
Interactions
We've abbreviated the query here, but the full SELECT clause would have 2,513 lines! It would also grow as the number of dates in the dataset grew.
Most databases have window operations for these kinds of queries, but those windows span rows, not columns. Your database might let you pivot the data so that the dates are rows instead of columns, but that would require an extra query. Pandas, on the other hand, lets you run such operations across either axis simply by setting the axis argument of the rolling window function.
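As a side note, the same weekly average can also be computed along rows by transposing so that dates become rows, rolling down the dates, and transposing back; this sketch (our own addition) is a useful fallback, since newer Pandas releases have been deprecating axis=1 for rolling windows:
# Equivalent rolling weekly average computed along rows after a transpose
visits.T.rolling(7).mean().dropna().T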
Conclusion
- Data cleaning and feature engineering: Pandas has convenient utilities for normalization, one-hot encoding, and dropping sparsely populated features. It’s possible to do these things in SQL, but with much more code.
- Operations coming from linear algebra and spreadsheets: Pandas has utilities for spreadsheet-style operations like pivoting and unpivoting, while not all SQL engines do. Pandas also makes it convenient to perform matrix operations like addition, transposition, and point updates, while SQL makes these much harder, and some of them are not possible at all in many SQL engines.
- Working fluently with metadata: When working with Pandas dataframes, we can access and manipulate metadata as easily as data in our queries. Consequently, it was easy to select all numerical columns, or to select all columns whose names matched a pattern.
In the next post in our Pandas vs. SQL series (post 3 of 4), we argue Pandas is the more flexible language. Read more here!
About Ponder
Ponder’s mission is to improve user productivity without changing their workflows, and we do this by letting you run data science tools like pandas, at scale and securely, directly in your data warehouse or database.
It shouldn’t matter whether your data is small, medium, or large: the same data pipelines should run as-is, without any additional effort from the user.