TLDR: In this post, we compare Pandas vs. SQL on the first of three axes: conciseness. We discuss 10 operations that are much more straightforward in Pandas than in SQL, spanning data cleaning, machine learning, feature engineering, linear algebra, and spreadsheet functions.
In our previous post we discussed how Pandas dataframes are much like a food court, while databases offering SQL are like Michelin-star restaurants, and how both dataframes and databases have been around in some form since the 1970s. In this post we’ll dive into the comparison between Pandas and SQL on conciseness: the ability to express sophisticated data transformation operations succinctly.
So, given that both Pandas and SQL support data transformation and analysis, one may ask: Is Pandas really more concise than SQL, the original database query language? In this post we show that there are many operations that are much easier to do in Pandas than in SQL. We’ll walk through ten examples, most of which use this Goodreads dataset from Kaggle, listed below with handy links!
- Normalizing columns uniformly
- One-hot encoding
- Cleaning columns with missing values
- Covariance computation
- Element-wise combination of dataframes
- Transpose
- Programmatically operate on column names
- Point updates
- Pivot & Melt
- Moving average of a time series
Or you can skip ahead to the conclusion!
Prelude: loading data
We use Pandas to load the Goodreads data from the Kaggle CSV file. The CSV doesn't correctly escape commas in string values, e.g., the author "Sam Bass Warner, Jr./Sam B. Warner" on line 3350 is not surrounded by quotation marks. We can have Pandas skip these malformed lines by passing read_csv the parameter on_bad_lines='skip'.
Pandas will also try to infer data types from the CSV, and will by default make the bookID, isbn, and isbn13 columns numerical, whereas we prefer to treat them as strings, so we override this step.
In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv("books.csv", on_bad_lines='skip', dtype={'bookID': str, 'isbn': str, 'isbn13': str})
1. Normalizing columns uniformly
Normalization is a common feature engineering operation; it involves scaling the numerical values in a column so that they fall in a certain range, or so that the resulting distribution of values has low variance. Normalizing input features prior to machine learning model training helps the eventual models achieve high accuracy. With Pandas, we can easily pick out only the numerical columns and normalize them all at once:
In [2]:
x = df.select_dtypes(include='number').columns
(df[x] - df[x].mean())/df[x].std()
Out [2]:
SELECT
(
(average_rating - AVG(average_rating) OVER ()) /
STDDEV_POP(average_rating) OVER ()
),
(
(num_pages - AVG(num_pages) OVER ()) /
STDDEV_POP(num_pages) OVER ()
),
(
(ratings_count - AVG(ratings_count) OVER ()) /
STDDEV_POP(ratings_count) OVER ()
),
(
(text_reviews_count - AVG(text_reviews_count) OVER ()) /
STDDEV_POP(text_reviews_count) OVER ()
)
FROM
Books
Note that there is a way to filter columns by data type in SQL via auxiliary metadata tables (e.g., sys.tables or information_schema), but that approach is equally clunky, and it is not universally supported across all SQL engines.
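As an aside, the paragraph above also mentioned scaling values into a fixed range. Under the assumption that min-max scaling to [0, 1] is the variant wanted, that too is a one-liner in Pandas; this is a small sketch of our own (num_cols is a name we introduce here), not part of the original notebook:
# Min-max scaling of all numeric columns to the [0, 1] range
num_cols = df.select_dtypes(include='number').columns
(df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())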
2. One-hot encoding
Most machine learning packages expect numeric or boolean matrices as input. One-hot encoding is a feature engineering operation that is commonly used to transform a single categorical attribute to a set of binary attributes, which machine learning packages like Scikit-learn can interpret more easily.
For example, we might want to one-hot encode language_code:
In [3]:
df['language_code']
Out [3]:
To one-hot encode the column with Pandas, we’d write:
In [4]:
pd.get_dummies(df['language_code'])
Out [4]:
Doing the same thing in SQL is much harder. For each column we want to one-hot encode, we have to write as many clauses as there are distinct values of that column. Imagine a column with thousands of distinct values (e.g., the more than 3,000 counties in the US): the resulting SQL query would involve thousands of clauses, which is practically impossible for a human being to write correctly by hand.
SELECT
IF(language_code = 'eng', 1, 0) as eng,
IF(language_code = 'en-US', 1, 0) as en_US,
IF(language_code = 'fre', 1, 0) as fre,
[... 20 additional IF clauses ...]
IF(language_code = 'nor', 1, 0) as nor,
IF(language_code = 'tur', 1, 0) as tur,
IF(language_code = 'gla', 1, 0) as gla,
IF(language_code = 'ale', 1, 0) as ale
FROM
Books
There are other complications. Before we can even write the SQL query above, we need to identify the distinct values in the column we want to one-hot encode, which requires a separate query, such as:
SELECT
DISTINCT language_code
FROM
Books
Finally, whenever the set of distinct values in the column changes, the SQL query has to change too. If we forget to update it, we may silently fail to encode some values, impacting downstream applications and models.
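It's also worth noting (a small sketch of our own, not from the original post) that pd.get_dummies can be applied to the whole dataframe, expanding language_code into binary columns while leaving everything else untouched, which is convenient when assembling features for a library like Scikit-learn:
# One-hot encode language_code in place, keeping the remaining columns;
# the new columns are named language_code_eng, language_code_en-US, and so on
encoded = pd.get_dummies(df, columns=['language_code'])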
3. Clean columns with missing values
Imagine that we want to discard columns where too many rows are missing a value for that column, as denoted by the value NULL or NA. Doing this is a common data cleaning step in machine learning, because such columns are unlikely to be valuable features for the eventual model.
In our example, we discard columns where more than 1% of the records are missing values. In Pandas, we can generate a boolean mask telling whether each entry in the dataframe is null, take the mean of that mask down each column to get the fraction of missing values, and use a boolean indexer to keep only the columns where that fraction stays below 1%:
In [5]:
df.loc[:, df.isnull().mean() < .01].columns
Out [5]:
SELECT
(COUNT(CASE WHEN book_id IS NULL THEN 1 END) /
CAST(COUNT(*) AS FLOAT)) AS book_id_null_fraction,
(COUNT(CASE WHEN title IS NULL THEN 1 END) /
CAST(COUNT(*) AS FLOAT)) AS title_null_fraction,
[... 8 additional clauses ...]
(COUNT(CASE WHEN publication_date IS NULL THEN 1 END) /
CAST(COUNT(*) AS FLOAT)) AS publication_date_null_fraction,
(COUNT(CASE WHEN publisher IS NULL THEN 1 END) /
CAST(COUNT(*) AS FLOAT)) AS publisher_null_fraction
FROM Books
We can then write another statement that keeps only the columns where the fraction is smaller than .01. However, this is a painful process, and the time it takes to write this query is proportional to the number of columns. The extensive duplication of query logic across clauses also makes it more likely that errors creep in.
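For comparison with that two-statement SQL process, here is the same Pandas logic unrolled into two explicit steps; the variable names below are our own:
# Step 1: fraction of missing values in each column
null_frac = df.isnull().mean()
# Step 2: keep only the columns below the 1% threshold
clean_df = df.loc[:, null_frac < .01]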
4. Covariance computation
The covariance matrix gives some sense of how each pair of variables varies together. It's used in principal component analysis, a common technique in exploratory data analysis and machine learning. It's also useful as a step in feature selection: removing extraneous features to help make machine learning algorithms run more efficiently and effectively. With Pandas we can get a matrix of covariances across all numerical columns with:
In [6]:
df.cov() * (len(df) - 1)
Out [6]:
Note that we multiply df.cov() by len(df) - 1 because Pandas normalizes the covariance. The reason this is so simple is that Pandas lets you treat the dataframe as a matrix and apply linear algebraic operations to it.
With SQL we would first use a WITH clause to calculate the mean for each column, then sum the product (X - X_m) * (Y - Y_m) for each pair of variables X and Y, where X_m and Y_m are the corresponding means.
WITH t1 AS (
SELECT
text_reviews_count,
AVG(text_reviews_count) OVER() AS avg_text_reviews_count,
ratings_count,
AVG(ratings_count) OVER() AS avg_ratings_count,
num_pages,
AVG(num_pages) OVER() AS avg_num_pages,
average_rating,
AVG(average_rating) OVER() AS avg_average_rating
FROM Books
)
SELECT
SUM(
(text_reviews_count - avg_text_reviews_count) *
(ratings_count - avg_ratings_count)
) AS text_reviews_count_vs_ratings_count,
SUM(
(text_reviews_count - avg_text_reviews_count) *
(num_pages - avg_num_pages )
) AS text_reviews_count_vs_num_pages,
[... additional clauses ...]
SUM(
(ratings_count - avg_ratings_count) *
(average_rating - avg_average_rating)
) AS ratings_count_vs_average_rating,
SUM(
(num_pages - avg_num_pages) *
(average_rating - avg_average_rating)
) AS num_pages_vs_average_rating
FROM t1
Again, this is a very long query, with a number of clauses proportional to the square of the number of attributes. The query also returns a single flat row of pairwise values rather than reproducing the covariance matrix itself.
Some databases, e.g., Redshift and BigQuery, have a correlation function and/or let us write a user-defined function to compute correlation, instead of writing the correlation formula out for each pair of columns. Even then, we would still have to write a clause for each pair of numerical columns, or join the entire table against itself, as in the second solution linked here.
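As a quick sanity check of the formula above (our own addition, not part of the original post), we can confirm for one pair of columns that the sum of products of deviations matches the unnormalized Pandas covariance:
# Verify SUM((X - X_m) * (Y - Y_m)) against Pandas for one pair of columns
# (assumes these two columns have no missing values)
col_x, col_y = df['ratings_count'], df['text_reviews_count']
manual = ((col_x - col_x.mean()) * (col_y - col_y.mean())).sum()
from_pandas = df[['ratings_count', 'text_reviews_count']].cov().iloc[0, 1] * (len(df) - 1)
assert np.isclose(manual, from_pandas)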
5. Element-wise combination of dataframes
In [7]:
# We don't have a second ratings source handy, so use a copy of the
# Goodreads data as a stand-in "Otherreads" dataset
otherreads_df = df.copy()
# Keep only the numerical columns of each dataframe
goodreads_numerical = df.select_dtypes(include='number')
otherreads_numerical = otherreads_df.select_dtypes(include='number')
# Weighted element-wise combination of the two dataframes
.8 * goodreads_numerical + .2 * otherreads_numerical
Out [7]:
In SQL, we'd have to write out the weighted combination for each numerical column, and a separate SELECT subclause for each non-numerical column. In Pandas, we effectively joined on the row number, but in SQL there's no concept of row number, and we have to assume that a column like bookID is common between the datasets so that we can join them. Once again, the size of this query is proportional to the number of attributes.
SELECT
.8 * Goodreads.average_rating + .2 * Otherreads.average_rating AS average_rating,
.8 * Goodreads.num_pages + .2 * Otherreads.num_pages AS num_pages,
.8 * Goodreads.ratings_count + .2 * Otherreads.ratings_count AS ratings_count,
.8 * Goodreads.text_reviews_count + .2 * Otherreads.text_reviews_count as text_reviews_count
FROM
Goodreads
LEFT JOIN
Otherreads
ON
Goodreads.bookID = Otherreads.bookID
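And if the two sources really did share only a bookID key, as the SQL above assumes, the weighted combination stays short in Pandas once both sides are indexed by that key. A minimal sketch, assuming bookID uniquely identifies rows (the variable names are ours):
# Align both dataframes on bookID, then combine their numeric columns
g = df.set_index('bookID').select_dtypes(include='number')
o = otherreads_df.set_index('bookID').select_dtypes(include='number')
.8 * g + .2 * o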
6. Transpose
Transposing a dataframe flips it so that rows become columns and columns become rows. In Pandas this is a one-liner:
In [8]:
df.T
Out [8]:
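Standard SQL has no general-purpose transpose operator, so flipping a table would mean pivoting it by hand or reshaping the data outside the database. In Pandas, transposing a small slice is also a handy way to eyeball a few wide records; a small sketch of our own:
# Show the first three books with their columns laid out as rows
df.head(3).T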
7. Programmatically operate on column names
Often, when a dataset has many columns, it can be overwhelming to examine them all at once. You may want to focus your attention on a subset of columns with similar names, since similar names usually indicate similar functions. Pandas makes it easy to select columns whose names match certain criteria. For example, to select all columns of our dataset that contain the string isbn, we'd do:
In [9]:
df.filter(regex='isbn')
Out [9]:
Selecting column names that match a pattern requires filtering based on the table's schema, which isn't easily possible with a SQL query (except via the tedious information_schema approach described in section 1 above).
Instead, we have to search on our own for columns whose names match our target string and manually select them, but that becomes difficult and error-prone when the number of columns is large. We’d then have to update the list of columns in our query whenever the table schema changes.
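The same idea extends to other programmatic operations on column names. The lines below are small sketches of our own, not from the original post:
# Select every column whose name contains the substring 'count'
df.filter(like='count')
# Rewrite the column names themselves, e.g. lowercasing them all
df_lower = df.rename(columns=str.lower)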
8. Point Updates
Often, some rows of a dataset have errors or anomalies that break subsequent data processing steps. In such cases, it's useful to fix the errors and anomalies point by point, directly. For example, suppose that we find out that the average_rating value of just the first row is wrong, and we need to fix it. In Pandas we would write:
In [10]:
# Overwrite the average_rating of the first row (row position 0)
df.iloc[0, df.columns.get_loc('average_rating')] = 4.0
9. Pivot and Melt
In [11]:
df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two','two'],
'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
'baz': [1, 2, 3, 4, 5, 6],
'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df
Out [11]:
In [12]:
df.pivot(index='foo', columns='bar', values='baz')
Out [12]:
To perform the same pivot on the above dataset in SQL, we’d have to write something like:
SELECT
foo,
MIN(CASE WHEN bar = 'A' THEN baz END) AS A,
MIN(CASE WHEN bar = 'B' THEN baz END) AS B,
MIN(CASE WHEN bar = 'C' THEN baz END) AS C
FROM
Data
GROUP BY foo
The above query has a separate CASE expression in the SELECT clause for each possible value of the columns column, bar. Consequently, the length of the query is proportional to the number of distinct values in bar. Once again, if the distinct values change, the query has to change.
Pandas also supports melt, the inverse of pivot. Some SQL engines, such as Microsoft SQL Server and Spark SQL, support unpivoting (for example via an UNPIVOT clause), but others do not.
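To make that inverse relationship concrete, here is a small sketch of our own that melts the pivoted result back into the original long format:
# Pivot to wide format, then melt back to the original long format
pivoted = df.pivot(index='foo', columns='bar', values='baz')
pivoted.reset_index().melt(id_vars='foo', var_name='bar', value_name='baz')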
10. Moving average of a time series
In [13]:
import numpy as np
# Synthetic daily counts for 2018: one row of site visits, one of purchases
visits = pd.DataFrame(np.random.randint(0, 1000, size=(2, 365)),
                      columns=pd.date_range(start='1/1/2018', end='12/31/2018'),
                      index=['visits', 'purchases'])
visits = visits.rename_axis('date', axis='columns')
visits
Out [13]:
We might want a rolling weekly average to smooth out noise in the data. In Pandas, we can do:
In [14]:
average = visits.rolling(7, axis=1).mean().dropna(axis=1)
# Label the rows as rolling weekly averages
average.index += " (r. w. a.)"
average
Out [14]:
In SQL, an equivalent query would look like:
SELECT
(2018_01_01 +
2018_01_02 +
2018_01_03 +
2018_01_04 +
2018_01_05 +
2018_01_06 +
2018_01_07) / 7 AS 2018_01_07,
(2018_01_02 +
2018_01_03 +
2018_01_04 +
2018_01_05 +
2018_01_06 +
2018_01_07 +
2018_01_08) / 7 AS 2018_01_08,
[...]
(2018_12_25 +
2018_12_26 +
2018_12_27 +
2018_12_28 +
2018_12_29 +
2018_12_30 +
2018_12_31) / 7 AS 2018_12_31
FROM
Interactions
We've abbreviated the query here, but the full SELECT clause would have 2,513 lines! It would also grow as the number of dates in the dataset grew.
Most databases have window operations for these kinds of queries, but those windows span rows, not columns. Your database might let you pivot the data so that the dates are rows instead of columns, but that would require an extra query. Pandas, on the other hand, lets you run such operations across either axis simply by setting the axis argument of the rolling window function.
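As a side note, the same weekly average can also be computed along rows by transposing so that dates become rows, rolling down the dates, and transposing back; this sketch (our own addition) is a useful fallback, since newer Pandas releases have been deprecating axis=1 for rolling windows:
# Equivalent rolling weekly average computed along rows after a transpose
visits.T.rolling(7).mean().dropna().T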
Conclusion
- Data cleaning and feature engineering: Pandas has convenient utilities for normalization, one-hot encoding, and dropping sparsely populated features. It’s possible to do these things in SQL, but with much more code.
- Operations coming from linear algebra and spreadsheets: Pandas has utilities for spreadsheet-style operations like pivoting and unpivoting, while not all SQL engines do. Pandas also makes it convenient to perform matrix operations like addition, transposition, and point updates, while SQL makes these much harder, and some of them are not possible at all in many SQL engines.
- Working fluently with metadata: When working with Pandas dataframes, we can access and manipulate metadata as easily as data in our queries. Consequently, it was easy to select all numerical columns, or to select all columns whose names matched a pattern.
In the next post in our Pandas vs. SQL series (post 3 of 4), we argue Pandas is the more flexible language. Read more here!
About Ponder
Ponder’s mission is to improve user productivity without changing their workflows, and we do this by letting you run data science tools like pandas, at scale and securely, directly in your data warehouse or database.
It shouldn’t matter whether your data is small, medium, or large: the same data pipelines should run as-is, without any additional effort from the user.