TLDR: In this post, we compare Pandas vs. SQL on the second of three axes: flexibility. We present eight ways in which the Pandas dataframe data model provides additional flexibility that makes it a better fit for data science and machine learning than the relational model underlying SQL.
We continue on our quest to understand the relative benefits of Pandas and SQL — two popular and powerful languages for data manipulation and analysis. This is the third post in a series of posts comparing Pandas and SQL; here are the two previous posts: Pandas vs. SQL – Part 1: The Food Court and the Michelin-Style Restaurant and Pandas vs. SQL – Part 2: Pandas Is More Concise.
As we’ve seen in the previous post on Pandas vs. SQL, Pandas has over 600 functions that let you operate on data in a variety of powerful ways that are either impossible or extremely hard to express in SQL, spanning key machine learning, linear algebra, featurization, and data cleaning operations. In this post, we demonstrate how the Pandas dataframe data model (or abstraction) provides additional flexibility that makes it a better fit for data science and machine learning than the relational model underlying SQL. So, while the previous post focused on the dataframe algebra, this post will focus on the dataframe data model.
For easy lookup, here is a handy list of the multiple ways Pandas dataframes are more flexible than their relational/SQL counterparts:
- Moving data to metadata and back
- Row and column metadata
- Mixed type columns
- Flexible schemas
- Metadata transformations
- Columnar operations
- Hierarchical metadata
- Tolerance to differences in metadata
Or, you can skip ahead to the conclusion!
We’re going to start with a simple dataset, akin to the ones you might get after loading in data from a spreadsheet. This is a dataset containing the medals tally for the 2020 Summer Olympics for the best performing countries.
In [1]:
import pandas as pd

df = pd.DataFrame({'A': ['Name', 'USA', 'China', 'Japan', 'Britain', 'ROC'], 'B': ['Gold', 39, 38, None, 22, 20], 'C': ['Ranking', 1, 2, 3, 4, '-'], 'D': ['Continent','North America', 'Asia', 'Asia', 'Europe', 'Europe/Asia']})
df
Out [1]:
A B C D
0 Name Gold Ranking Continent
1 USA 39 1 North America
2 China 38 2 Asia
3 Japan None 3 Asia
4 Britain 22 4 Europe
5 ROC 20 - Europe/Asia
1. In Pandas, you can move data to metadata — and back! In SQL, you cannot.
Perhaps one of the first things you may notice about the dataset above is that we have non-descriptive column names like A, B, C, ..., while the actual column names are part of the data (row 0). Ideally, we'd want to replace A, B, C, ... with these column names. We can do so in the following way:
In [2]:
header = df.iloc[0]
df = df[1:]
df.columns = header
df
All we did was extract the header row into header, trim the dataframe to retain the last n-1 rows, and set the column names via df.columns.
Out [2]:
Name Gold Ranking Continent
1 USA 39 1 North America
2 China 38 2 Asia
3 Japan None 3 Asia
4 Britain 22 4 Europe
5 ROC 20 - Europe/Asia
One can similarly move information in the inverse direction, from the metadata (column headers) back into the data, as you can see below.
In [3]:
df_header = df.columns.to_frame()
df_header = df_header.T
df = pd.concat([df_header, df])
df
What we did here is to extract the column names as a dataframe via to_frame(), transpose the dataframe to make it a row via the single letter T, and finally concatenate the row with the old dataframe vertically via pd.concat.
Out [3]:
Name Gold Ranking Continent
0 Name Gold Ranking Continent
1 USA 39 1 North America
2 China 38 2 Asia
3 Japan None 3 Asia
4 Britain 22 4 Europe
5 ROC 20 - Europe/Asia
In SQL, it is virtually impossible to move data to the metadata (column names) and back.
2. Pandas supports row AND column metadata; SQL only has column metadata.
While Pandas supports column metadata (i.e., column labels) like databases, Pandas also supports row-wise metadata in the form of row labels. This is convenient if we want to organize and refer to data in an intuitive manner. Here is an example; we start by reloading the dataframe afresh.
In [4]:
df = pd.DataFrame({'Name': ['USA', 'China', 'Japan', 'Britain', 'ROC'], 'Gold': [39, 38, None, 22, 20], 'Ranking': [1, 2, 3, 4, '-'], 'Continent': ['North America', 'Asia', 'Asia', 'Europe', 'Europe/Asia']})
df
Out [4]:
Name Gold Ranking Continent
0 USA 39.0 1 North America
1 China 38.0 2 Asia
2 Japan NaN 3 Asia
3 Britain 22.0 4 Europe
4 ROC 20.0 - Europe/Asia
And now all we need to do is to use set_index to set the row label, or index. Here we set it to be the Name column in the data.
In [5]:
df_name_index = df.set_index('Name')
df_name_index
Out [5]:
Gold Ranking Continent
Name
USA 39.0 1 North America
China 38.0 2 Asia
Japan NaN 3 Asia
Britain 22.0 4 Europe
ROC 20.0 - Europe/Asia
Just like column labels, we can also move the row labels back to the data, using the reset_index
command.
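For instance, a minimal one-line sketch using df_name_index from above:

# Move the 'Name' row labels back into an ordinary data column.
df_name_index.reset_index()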
Databases do not have a notion of a row label. You could claim that row label functionality can be replicated with a PRIMARY KEY / UNIQUE
keyword, but that's not quite true — a row label doesn't actually need to be unique. Just another way Pandas is flexible!
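To illustrate, here is a quick sketch setting a non-unique row label, something a PRIMARY KEY or UNIQUE constraint could never allow:

# 'Asia' appears twice as a row label; Pandas happily allows duplicates.
df.set_index('Continent')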
3. Pandas allows mixed type columns; SQL only supports single-typed columns
Because it is used in data science to operate on data in various stages of cleaning, Pandas does not enforce a strict type per column. Looking at our example again:
In [6]:
df = pd.DataFrame({'Name': ['USA', 'China', 'Japan', 'Britain', 'ROC'], 'Gold': [39, 38, None, 22, 20], 'Ranking': [1, 2, 3, 4, '-'], 'Continent': ['North America', 'Asia', 'Asia', 'Europe', 'Europe/Asia']})
df
Out [6]:
Name Gold Ranking Continent
0 USA 39.0 1 North America
1 China 38.0 2 Asia
2 Japan NaN 3 Asia
3 Britain 22.0 4 Europe
4 ROC 20.0 - Europe/Asia
Here, the Ranking column contains both strings and integers. If we were to inspect the types of the values in that column, we would see the following:
In [7]:
df['Ranking'].apply(type).value_counts()
Out [7]:
<class 'int'>    4
<class 'str'>    1
Name: Ranking, dtype: int64
What we did was to count the values of each type via value_counts, after having inferred the type of each value in the column via an apply. As you can see in the output, the Ranking column contains four integers and one string ('-'). If we want to coerce the string values into null values instead, so that the column has a single type, we can do it in the following way:
In [8]:
df['Ranking'] = pd.to_numeric(df['Ranking'], errors='coerce')
df['Ranking'].apply(type).value_counts()
Out [8]:
<class 'float'>    5
Name: Ranking, dtype: int64
4. In Pandas, the output schema can vary flexibly based on data; in SQL, the schema is fixed
In SQL, the schema of the output — that is, the set of columns and their types — is fully determined by the schema of the input and the query. Pandas does not impose this limitation: the output schema can vary based on the data, providing additional flexibility. Here is an example, once again on our familiar dataset:
In [9]:
df = pd.DataFrame({'Name': ['USA', 'China', 'Japan', 'Britain', 'ROC'], 'Gold': [39, 38, None, 22, 20], 'Ranking': [1, 2, 3, 4, '-'], 'Continent': ['North America', 'Asia', 'Asia', 'Europe', 'Europe/Asia']})
df_expanded = pd.get_dummies(df, columns=['Continent'])
df_expanded
Out [9]:
Name Gold Ranking Continent_Asia Continent_Europe Continent_Europe/Asia Continent_North America
0 USA 39.0 1 0 0 0 1
1 China 38.0 2 1 0 0 0
2 Japan NaN 3 1 0 0 0
3 Britain 22.0 4 0 1 0 0
4 ROC 20.0 - 0 0 1 0
As you can see, the get_dummies function creates a new Boolean column for each distinct value in the Continent column. This is incredibly helpful for machine learning, since most machine learning packages only accept numeric data, so all string columns need to be converted to numeric ones. If we had omitted the last row, corresponding to ROC, from the dataset, there would not be a column corresponding to Continent_Europe/Asia. In this way, the output columns can depend on the data — neat!
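To make that data dependence concrete, here is a small sketch (df_no_roc is a hypothetical name) that drops the ROC row before encoding; the Continent_Europe/Asia column simply never appears:

# One-hot encode after dropping the ROC row.
df_no_roc = df[df['Name'] != 'ROC']
pd.get_dummies(df_no_roc, columns=['Continent']).columns
# The resulting columns no longer include Continent_Europe/Asia.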
5. Pandas allows you to transform metadata (column/row labels) flexibly; in SQL you cannot
Pandas recognizes that the schema is always in flux, and gives us superpowers to flexibly clean metadata, not just data, in easy programmatic ways. Here, if we find the Continent_ prefix awkward to keep repeating across columns, we can delete it across all of the columns with a single command:
In [10]:
df_expanded.columns = df_expanded.columns.str.replace('Continent_', '')
df_expanded
Out [10]:
Name Gold Ranking Asia Europe Europe/Asia North America
0 USA 39.0 1 0 0 0 1
1 China 38.0 2 1 0 0 0
2 Japan NaN 3 1 0 0 0
3 Britain 22.0 4 0 1 0 0
4 ROC 20.0 - 0 0 1 0
And poof, just like that, it’s gone! Unfortunately, SQL doesn’t give you the ability to operate on column names in bulk the way Pandas does; you’ll need to manually specify how each column name will change, one rename at a time.
6. Pandas allows you to operate on columns just like you do rows; SQL does not
Unlike SQL, which treats rows and columns differently, Pandas treats them the same. Starting from df_name_index above, which used the Name column as a row index/label, we can simply transpose it to have the rows be the columns and the columns be the rows, as follows:
In [11]:
df_name_index.T
Out [11]:
Name USA China Japan Britain ROC
Gold 39.0 38.0 NaN 22.0 20.0
Ranking 1 2 3 4 -
Continent North America Asia Asia Europe Europe/Asia
What’s cool about this is that Pandas automatically infers the types of the new columns without the user having to do any work. In this case, all of these columns are mixed-type ones, since they have both integers and strings. Organizing the data this way often makes it easier to compare. Beyond transposing, we can also apply most functions along columns just as we do along rows; we already covered examples of this in our previous blog post.
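As a minimal sketch of this symmetry (using df_name_index from above), the same reduction runs along either axis by changing a single argument:

# Non-null count per column (axis=0, the default)...
df_name_index.count()
# ...and per row, by switching to axis=1.
df_name_index.count(axis=1)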
7. Pandas lets you have hierarchical metadata; SQL does not
Hierarchical metadata helps us organize and present information in more human-readable and intuitive ways. Fortunately, Pandas supports both hierarchical row and column labels. Say we want to create hierarchical row labels:
In [12]:
df = pd.DataFrame({'Name': ['USA', 'China', 'Japan', 'Britain', 'ROC'], 'Gold': [39, 38, None, 22, 20], 'Ranking': [1, 2, 3, 4, '-'], 'Continent': ['North America', 'Asia', 'Asia', 'Europe', 'Europe/Asia']})
df.set_index(['Continent','Name'])
Out [12]:
Gold Ranking
Continent Name
North America USA 39.0 1
Asia China 38.0 2
Japan NaN 3
Europe Britain 22.0 4
Europe/Asia ROC 20.0 -
As we can see above, we created a hierarchy of row labels consisting of pairs of continents and country names via the set_index operator. By transposing this result using the T operator, we get an example of hierarchical column labels:
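In [13]:
df.set_index(['Continent','Name']).T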
Out [13]:
Continent North America Asia Europe Europe/Asia
Name USA China Japan Britain ROC
Gold 39.0 38.0 NaN 22.0 20.0
Ranking 1 2 3 4 -
In the result above, the first row of column labels captures the continents, while the second row captures the country names. Unlike Pandas, SQL and relational databases don’t support hierarchical metadata. The best proxy for this is the use of semi-structured data formats (e.g., JSON, XML), which relational databases often support as a special data type within a column. However, enforcing a hierarchical schema on those columns is difficult, as is operating on these nested data formats.
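One payoff of hierarchical row labels is easy selection by an outer level. A short sketch (df_hier is a hypothetical name) that pulls out all the Asian countries at once:

# Select every row whose outer 'Continent' label is 'Asia'.
df_hier = df.set_index(['Continent', 'Name'])
df_hier.loc['Asia']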
8. Pandas is tolerant when operating on multiple dataframes with different metadata; SQL isn’t
Pandas operations are tolerant to differences in schema when operating on multiple dataframes. We'll illustrate this by considering a common binary operation, concat. Say we have another dataframe alongside our present dataframe, containing information about three additional countries: Australia, France, and the Netherlands. We can use the concat operator to perform an ordered union of the two dataframes, which have matching schemas, as below:
In [14]:
df = pd.DataFrame({'Name': ['USA', 'China', 'Japan', 'Britain', 'ROC'], 'Gold': [39, 38, None, 22, 20], 'Ranking': [1, 2, 3, 4, '-'], 'Continent': ['North America', 'Asia', 'Asia', 'Europe', 'Europe']})
df2 = pd.DataFrame({'Name': ['Australia', 'Netherlands', 'France'], 'Gold': [17, 10, 10], 'Ranking': [5, 6, 7], 'Continent': ['Australia', 'Europe', 'Europe']})
pd.concat([df,df2])
Out [14]:
Name Gold Ranking Continent
0 USA 39.0 1 North America
1 China 38.0 2 Asia
2 Japan NaN 3 Asia
3 Britain 22.0 4 Europe
4 ROC 20.0 - Europe
0 Australia 17.0 5 Australia
1 Netherlands 10.0 6 Europe
2 France 10.0 7 Europe
The resulting dataframe has eight rows. This was an easy case, analogous to SQL's UNION ALL operator (concat keeps duplicates and ordering, unlike a plain UNION). Now suppose the schemas of the two dataframes were not the same; specifically, suppose df2 did not have the gold medal tally column. Here's what we get with Pandas:
In [15]:
df = pd.DataFrame({'Name': ['USA', 'China', 'Japan', 'Britain', 'ROC'], 'Gold': [39, 38, None, 22, 20], 'Ranking': [1, 2, 3, 4, '-'], 'Continent': ['North America', 'Asia', 'Asia', 'Europe', 'Europe']})
df2 = pd.DataFrame({'Name': ['Australia', 'Netherlands', 'France'], 'Ranking': [5, 6, 7], 'Continent': ['Australia', 'Europe', 'Europe']})
pd.concat([df,df2])
Out [15]:
Name Gold Ranking Continent
0 USA 39.0 1 North America
1 China 38.0 2 Asia
2 Japan NaN 3 Asia
3 Britain 22.0 4 Europe
4 ROC 20.0 - Europe
0 Australia NaN 5 Australia
1 Netherlands NaN 6 Europe
2 France NaN 7 Europe
As we can see, Pandas takes a "best effort" approach, continuing to perform an ordered union (i.e., a concatenation), filling in NaN
(i.e., null) values for the columns that are absent in one of the two dataframes. Here, SQL would give an error, since the schemas need to match for a union.
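And if we would rather drop the mismatched columns than fill them with nulls, concat gives us the choice; a brief sketch using its join parameter:

# Keep only the columns shared by both dataframes (their schema intersection).
pd.concat([df, df2], join='inner')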
Conclusion
As we saw in the examples above, the dataframe abstraction in Pandas is a lot more flexible than the relational model underlying SQL. This additional flexibility often makes representing and consuming data a lot more intuitive from an end-user standpoint. Moreover, when performing data cleaning, this added flexibility is essential: data is often dirty, with heterogeneous typing and imprecise, unspecified, and/or misaligned schemas. Finally, the Pandas data model abandons arbitrary distinctions: rows and columns are equivalent, data and metadata are equivalent — making it a lot more convenient to express many important data science and machine learning operations.
In the next post in our Pandas vs. SQL series (post 4 of 4), we argue Pandas is the more convenient language. Read more here!
About Ponder
Ponder’s mission is to improve user productivity without changing their workflows, and we do this by letting you run data science tools like pandas, at scale and securely, directly in your data warehouse or database.
It shouldn’t matter whether your data is small, medium, or large: the same data pipelines should run as-is, without any additional effort from the user.
Try Ponder Today
Start running pandas and NumPy in your database within minutes!