Pandas vs. SQL – Part 3: Pandas Is More Flexible

Aditya Parameswaran

Jul 26, 2022


TLDR: In this post, we compare Pandas vs. SQL on the second of three axes: flexibility. We present eight ways in which the Pandas dataframe data model provides additional flexibility that makes it a better fit for data science and machine learning than the relational model underlying SQL.

We continue on our quest to understand the relative benefits of Pandas and SQL — two popular and powerful languages for data manipulation and analysis. This is the third post in a series of posts comparing Pandas and SQL; here are the two previous posts: Pandas vs. SQL – Part 1: The Food Court and the Michelin-Style Restaurant and Pandas vs. SQL – Part 2: Pandas Is More Concise.

As we’ve seen in the previous post on Pandas vs. SQL, Pandas has over 600 functions that let you operate on data in a variety of powerful ways that are either impossible or extremely hard to express in SQL, spanning a range of key machine learning, linear algebra, featurization, and data cleaning operations. In this post, we demonstrate how the Pandas dataframe data model (or abstraction) provides additional flexibility that makes it a better fit for data science and machine learning than the relational model underlying SQL. So, while the previous post focused on the dataframe algebra, this post will focus on the dataframe data model.

For easy lookup, here is a handy list of the multiple ways Pandas dataframes are more flexible than their relational/SQL counterparts:

  1. Moving data to metadata and back
  2. Row and column metadata
  3. Mixed type columns
  4. Flexible schemas
  5. Metadata transformations
  6. Columnar operations
  7. Hierarchical metadata
  8. Tolerance to differences in metadata

Or, you can skip ahead to the conclusion!

We’re going to start with a simple dataset, akin to the ones you might get after loading in data from a spreadsheet. This is a dataset containing the medal tally for the best-performing countries at the 2020 Summer Olympics.

 In [1]:
import pandas as pd

df = pd.DataFrame({'A': ['Name', 'USA', 'China', 'Japan', 'Britain', 'ROC'], 'B': ['Gold', 39, 38, None, 22, 20], 'C': ['Ranking', 1, 2, 3, 4, '-'], 'D': ['Continent', 'North America', 'Asia', 'Asia', 'Europe', 'Europe/Asia']})
df
 Out [1]:
	A	B	C	D
0	Name	Gold	Ranking	Continent
1	USA	39	1	North America
2	China	38	2	Asia
3	Japan	None	3	Asia
4	Britain	22	4	Europe
5	ROC	20	-	Europe/Asia

1. In Pandas, you can move data to metadata — and back! In SQL, you cannot.

Perhaps one of the first things you may notice about the dataset above is that we have non-descriptive column names like A, B, C, ..., while the actual column names are part of the data (row 0). Ideally, we'd want to replace A, B, C, ... with these column names. We can do so in the following way:

 In [2]:
header = df.iloc[0]
df = df[1:]
df.columns = header
df
All we did was extract the header row into header, trim the dataframe to retain the last n-1 rows, and set the column names via df.columns.
 Out [2]:
	Name	Gold	Ranking	Continent
1	USA	39	1	North America
2	China	38	2	Asia
3	Japan	None	3	Asia
4	Britain	22	4	Europe
5	ROC	20	-	Europe/Asia

One can similarly move information in the inverse direction, from the metadata (column headers) to the data, as you can see below.

 In [3]:
df_header = df.columns.to_frame()
df_header = df_header.T
df = pd.concat([df_header, df])
df
What we did here is extract the column names as a dataframe via to_frame(), transpose that dataframe into a row via T, and finally concatenate the row with the old dataframe vertically via pd.concat.
 Out [3]:
	Name	Gold	Ranking	Continent
0	Name	Gold	Ranking	Continent
1	USA	39	1	North America
2	China	38	2	Asia
3	Japan	None	3	Asia
4	Britain	22	4	Europe
5	ROC	20	-	Europe/Asia

In SQL, it is virtually impossible to move data to the metadata (column names) and back.
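As a side note, the three steps above are easy to package into a reusable helper. Here is a minimal sketch (promote_header is our own name for illustration, not a pandas API):

import pandas as pd

def promote_header(df: pd.DataFrame) -> pd.DataFrame:
    # Promote row 0 to column labels, then renumber the remaining rows.
    header = df.iloc[0]                 # grab the first row
    df = df[1:]                         # drop it from the data
    df.columns = header                 # install it as column metadata
    return df.reset_index(drop=True)    # restore a default 0..n-1 row index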

2. Pandas supports row AND column metadata; SQL only has column metadata.

While Pandas supports column metadata (i.e., column labels) just like databases do, it also supports row-wise metadata in the form of row labels. This is convenient if we want to organize and refer to data in an intuitive manner. Here is an example; we start by reloading the dataframe afresh.

 In [4]:
df = pd.DataFrame({'Name': ['USA', 'China', 'Japan', 'Britain', 'ROC'], 'Gold': [39, 38, None, 22, 20], 'Ranking': [1, 2, 3, 4, '-'], 'Continent': ['North America', 'Asia', 'Asia', 'Europe', 'Europe/Asia']})
df
 Out [4]:
	Name	Gold	Ranking	Continent
0	USA	39.0	1	North America
1	China	38.0	2	Asia
2	Japan	NaN	3	Asia
3	Britain	22.0	4	Europe
4	ROC	20.0	-	Europe/Asia
And now all we need to do is to use set_index to set the row label or index. Here we set it to be the Name column in the data.
 In [5]:
df_name_index = df.set_index('Name')
df_name_index
 Out [5]:
	Gold	Ranking	Continent
Name			
USA	39.0	1	North America
China	38.0	2	Asia
Japan	NaN	3	Asia
Britain	22.0	4	Europe
ROC	20.0	-	Europe/Asia
Just like column labels, we can also move the row labels back to the data, using the reset_index command.
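To make that concrete, here is a minimal sketch, continuing from df_name_index above:

df_restored = df_name_index.reset_index()  # 'Name' becomes an ordinary column again
df_restored                                # back to a default 0..n-1 integer index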
Databases do not have a notion of a row label. You could claim that row label functionality can be replicated with a PRIMARY KEY or UNIQUE constraint, but that's not quite true: a row label doesn't actually need to be unique. Just another way Pandas is flexible!

3. Pandas allows mixed type columns; SQL only supports single-typed columns

Because Pandas is used in data science, where data is operated on at various stages of cleaning, it does not enforce a strict type per column. Looking at our example again:

 In [6]:
df = pd.DataFrame({'Name': ['USA', 'China', 'Japan', 'Britain', 'ROC'], 'Gold': [39, 38, None, 22, 20], 'Ranking': [1, 2, 3, 4, '-'], 'Continent': ['North America', 'Asia', 'Asia', 'Europe', 'Europe/Asia']})
df
 Out [6]:
	Name	Gold	Ranking	Continent
0	USA	39.0	1	North America
1	China	38.0	2	Asia
2	Japan	NaN	3	Asia
3	Britain	22.0	4	Europe
4	ROC	20.0	-	Europe/Asia
Here, the Ranking column contains both strings and integers. If we inspect the types of that column, we see the following:
 In [7]:
df['Ranking'].apply(type).value_counts()
 Out [7]:
<class 'int'>    4
<class 'str'>    1
Name: Ranking, dtype: int64
What we did was count the values of each type via value_counts after having inferred the type of each value in the column via an apply. As you can see in the output, the Ranking column contains four integers and one string ('-'). If we instead want to coerce the string values into nulls so that the column has a single type, we can do it in the following way:
 In [8]:
df['Ranking'] = pd.to_numeric(df['Ranking'], errors='coerce')
df['Ranking'].apply(type).value_counts()
 Out [8]:
<class 'float'>    5
Name: Ranking, dtype: int64
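As a quick sanity check, here is a small sketch continuing from the cell above: the coerced column now has a single float type, with the former '-' showing up as a null.

df['Ranking'].dtype         # dtype('float64')
df['Ranking'].isna().sum()  # 1 -- the coerced '-' for ROC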

4. In Pandas, the output schema can vary flexibly based on data; in SQL, the schema is fixed

In SQL, the schema of the output (that is, the set of columns and their types) is constrained by the schema of the input and the query. Pandas does not impose this limitation: the output schema can vary based on the data, providing additional flexibility. Here is an example, once again on our familiar dataset:
 In [9]:
df = pd.DataFrame({'Name': ['USA', 'China', 'Japan', 'Britain', 'ROC'], 'Gold': [39, 38, None, 22, 20], 'Ranking': [1, 2, 3, 4, '-'], 'Continent': ['North America', 'Asia', 'Asia', 'Europe', 'Europe/Asia']})
df_expanded = pd.get_dummies(df, columns=['Continent'])
df_expanded
 Out [9]:
	Name	Gold	Ranking	Continent_Asia	Continent_Europe	Continent_Europe/Asia	Continent_North America
0	USA	39.0	1	0		0			0			1
1	China	38.0	2	1		0			0			0
2	Japan	NaN	3	1		0			0			0
3	Britain	22.0	4	0		1			0			0
4	ROC	20.0	-	0		0			1			0
As you can see, the get_dummies function creates a new Boolean column for each value in the Continent column. This is incredibly helpful for machine learning: most machine learning packages only accept numeric data, so all string columns need to be converted to numeric ones. If we had omitted the last row corresponding to ROC from the dataset, there would not be a column corresponding to Continent_Europe/Asia. In this way, the output columns can depend on the data — neat!
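One caveat for machine learning pipelines: if you need the output schema to be stable across datasets (say, so train and test data featurize identically), you can pin the category set up front. Here is a hedged sketch; the continents list is our assumption about the full set of values:

continents = ['North America', 'Asia', 'Europe', 'Europe/Asia']  # assumed full set
df_cat = df.copy()
df_cat['Continent'] = pd.Categorical(df_cat['Continent'], categories=continents)
pd.get_dummies(df_cat, columns=['Continent'])  # emits a column per category, even if absent from the data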

5. Pandas allows you to transform metadata (column/row labels) flexibly; in SQL you cannot

Pandas recognizes that the schema is always in flux, and gives us superpowers to flexibly clean not just data but also metadata, in easy programmatic ways. Here, if we find the Continent_ prefix awkward to keep repeating across columns, we can delete it from all of the columns with one single command:
 In [10]:
df_expanded.columns = df_expanded.columns.str.replace('Continent_', '')
df_expanded
 Out [10]:
	Name	Gold	Ranking	Asia	Europe	Europe/Asia	North America
0	USA	39.0	1	0	0	0		1
1	China	38.0	2	1	0	0		0
2	Japan	NaN	3	1	0	0		0
3	Britain	22.0	4	0	1	0		0
4	ROC	20.0	-	0	0	1		0

And poof, just like that, it’s gone! Unfortunately, SQL doesn’t give you the ability to operate on column names in the same way as Pandas. You’ll need to manually specify how each column name will change.
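The same cleanup can also be spelled with rename, which accepts a function applied to every column label. A one-line sketch (str.removeprefix assumes Python 3.9+):

df_expanded = df_expanded.rename(columns=lambda c: c.removeprefix('Continent_'))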

6. Pandas allows you to operate on columns just like you do rows; SQL does not

Unlike SQL, which treats rows and columns differently, Pandas treats them the same. Starting from df_name_index above, which used the Name column as a row index/label, we can simply transpose it so that the rows become the columns and the columns become the rows, as follows:
 In [11]:
df_name_index.T
 Out [11]:
Name		USA		China	Japan	Britain	ROC
Gold		39.0		38.0	NaN	22.0	20.0
Ranking		1		2	3	4	-
Continent	North America	Asia	Asia	Europe	Europe/Asia

What’s cool about this is that Pandas automatically infers the types of the new columns without the user having to do any work. In this case, all of these columns are mixed-type ones, since they contain both numbers and strings. Organizing the data this way often makes it easier to compare. Beyond transposing, we can also apply most functions along columns just as we do along rows; we already covered examples of this in our previous blog post.
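To see that type inference in action, here is a one-line sketch: each transposed column mixes numbers and strings, so pandas assigns it the catch-all object dtype.

df_name_index.T.dtypes  # object dtype for USA, China, Japan, Britain, and ROC alike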

7. Pandas lets you have hierarchical metadata; SQL does not

Hierarchical metadata helps us organize and present information in more human-readable and intuitive ways. Fortunately, Pandas supports both hierarchical row and column labels. Say we want to create hierarchical row labels:

 In [12]:
df = pd.DataFrame({'Name': ['USA', 'China', 'Japan', 'Britain', 'ROC'], 'Gold': [39, 38, None, 22, 20], 'Ranking': [1, 2, 3, 4, '-'], 'Continent': ['North America', 'Asia', 'Asia', 'Europe', 'Europe/Asia']})
df.set_index(['Continent','Name'])
 Out [12]:
			Gold	Ranking
Continent	Name		
North America	USA	39.0	1
Asia		China	38.0	2
		Japan	NaN	3
Europe		Britain	22.0	4
Europe/Asia	ROC	20.0	-
As we can see above, we created a hierarchy of row labels, consisting of continent and country name pairs, via the set_index operator. By transposing this result using the T operator, we get an example of hierarchical column labels:
 In [13]:
df.set_index(['Continent','Name']).T
 Out [13]:
Continent	North America	Asia		Europe	Europe/Asia
Name		USA		China	Japan	Britain	ROC
Gold		39.0		38.0	NaN	22.0	20.0
Ranking		1		2	3	4	-

In the result above, the first row of column labels captures the continents, while the second row captures the country names. Unlike Pandas, SQL and relational databases don’t support hierarchical metadata. The best proxy for this is the use of semi-structured data formats (e.g., JSON, XML), which relational databases often support as a special data type within a column. However, enforcing a hierarchical schema on those columns is difficult, as is operating on these nested data formats.
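Hierarchical row labels also pay off when selecting data. A minimal sketch, reusing the dataframe from In [12]:

df_multi = df.set_index(['Continent', 'Name'])
df_multi.loc['Asia']  # selects both China and Japan in one step, via the outer level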

8. Pandas is tolerant when operating on multiple dataframes with different metadata; SQL isn’t

Pandas operations are tolerant to differences in schema when operating on multiple dataframes. We'll illustrate this with a common operation on pairs of dataframes: concat. Say we have another dataframe alongside our present one, containing information about three additional countries: Australia, Netherlands, and France. We can use the concat operator to perform an ordered union of the two dataframes, which have matching schemas, as below:
 In [14]:
df = pd.DataFrame({'Name': ['USA', 'China', 'Japan', 'Britain', 'ROC'], 'Gold': [39, 38, None, 22, 20], 'Ranking': [1, 2, 3, 4, '-'], 'Continent': ['North America', 'Asia', 'Asia', 'Europe', 'Europe']})
df2 = pd.DataFrame({'Name': ['Australia', 'Netherlands', 'France'], 'Gold': [17, 10, 10], 'Ranking': [5, 6, 7], 'Continent': ['Australia', 'Europe', 'Europe']})
pd.concat([df,df2])
 Out [14]:
	Name		Gold	Ranking	Continent
0	USA		39.0	1	North America
1	China		38.0	2	Asia
2	Japan		NaN	3	Asia
3	Britain		22.0	4	Europe
4	ROC		20.0	-	Europe
0	Australia	17.0	5	Australia
1	Netherlands	10.0	6	Europe
2	France		10.0	7	Europe
The resulting dataframe has eight rows. This was an easy case, analogous to SQL's UNION ALL. Now suppose the schemas of the two dataframes were not the same; specifically, suppose df2 lacked the Gold medal tally column. Here's what we get with Pandas:
 In [15]:
df = pd.DataFrame({'Name': ['USA', 'China', 'Japan', 'Britain', 'ROC'], 'Gold': [39, 38, None, 22, 20], 'Ranking': [1, 2, 3, 4, '-'], 'Continent': ['North America', 'Asia', 'Asia', 'Europe', 'Europe']})
df2 = pd.DataFrame({'Name': ['Australia', 'Netherlands', 'France'],  'Ranking': [5, 6, 7], 'Continent': ['Australia', 'Europe', 'Europe']})
pd.concat([df,df2])
 Out [15]:
	Name		Gold	Ranking	Continent
0	USA		39.0	1	North America
1	China		38.0	2	Asia
2	Japan		NaN	3	Asia
3	Britain		22.0	4	Europe
4	ROC		20.0	-	Europe
0	Australia	NaN	5	Australia
1	Netherlands	NaN	6	Europe
2	France		NaN	7	Europe
As we can see, Pandas takes a "best effort" approach, continuing to perform an ordered union (i.e., a concatenation), filling in NaN (i.e., null) values for the columns that are absent in one of the two dataframes. Here, SQL would give an error, since the schemas need to match for a union.
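If silently introduced NaN values are not what you want, concat can be made stricter. A small sketch using its join parameter:

pd.concat([df, df2], join='inner')  # keeps only the shared columns, dropping Gold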

Conclusion

As we saw in the examples above, the dataframe abstraction in Pandas is a lot more flexible than the relational model underlying SQL. This additional flexibility often makes representing and consuming data a lot more intuitive from an end-user standpoint. Moreover, when performing data cleaning, this added flexibility is essential: data is often dirty, with heterogeneous typing and imprecise, unspecified, and/or misaligned schemas. Finally, the Pandas data model abandons arbitrary distinctions — rows and columns are equivalent, and data and metadata are equivalent — making it a lot more convenient to express many important data science and machine learning operations.

In the next post in our Pandas vs. SQL series (post 4 of 4), we argue Pandas is the more convenient language. Read more here!

About Ponder

Ponder’s mission is to improve user productivity without changing their workflows, and we do this by letting you run data science tools like pandas, at scale and securely, directly in your data warehouse or database.

It shouldn’t matter whether your data is small, medium, or large: the same data pipelines should run as-is, without any additional effort from the user.
