Professional Pandas: Indexing with Pandas Iloc

Matt Harrison

Aug 15, 2023 24 min read

Articles

Professional Pandas: Indexing with Pandas Iloc image

Introduction to Pandas Iloc

In this post, I describe how Pandas iloc (.iloc) can be used for indexing, but I also argue that you rarely need to use it. There are almost always cleaner options.

We start by giving a high-level description of Pandas iloc. Then we discuss each of the possible input types, with examples. And finally, we talk about the dangers of magic numbers (numbers that appear in your code with no explanation). Throughout, I’ll present alternatives to Pandas iloc and describe why I think they’re more useful.

This is the fourth piece in our Professional Pandas series on teaching best practices about writing professional-grade Pandas code, and a follow-up to a companion piece on Pandas Loc. If you have questions or topics that you would like to dive into, please reach out on Twitter to @ponderdata or @__mharrison__, and if you’d like to see the notebook version of this blogpost, you can find it on Github here.

Try Ponder Today

Start running your Python data science workflows in your database within minutes!

Get started

Introduction to Pandas Iloc
What Is Pandas Iloc?
What Inputs Does Pandas Iloc Accept?
How to Use Pandas Iloc with Each Input Type
Code Readability and the Danger of Magic Numbers
Conclusion

What is Pandas Iloc?

.iloc stands for integer location. Often folks say it is used to subset a portion of a dataframe or series. We can use it to pull out rows or columns based on location.

You will hear this referred to as a function or a method. Both of those are incorrect. Pandas iloc is a property, and we leverage the property by using an index operation (this is with square brackets). You are probably familiar with indexing dictionaries, lists, or strings.

For example, to find the first character of a string, we can do this:

>>> name = 'Matt'
>>> name[0]
'M'

Or we can get the last item in a list with an index of -1:

>>> names = ['Paul', 'George', 'Ringo', 'John']
>>> name[-1]
'John'

Indexing with .iloc differs from indexing with a string or list. A list is one-dimensional while a dataframe is two-dimensional and you can slice it in both dimensions at once. .iloc takes up to two arguments: the row index selector (required) and an optional column index selector (by passing in a tuple). Pandas recognizes if you pass in a tuple and slices in two dimensions. You can't do this with lists or strings.

Inside the brackets following .iloc, you can place a row selector and an optional column selector. The syntax looks like this:

dataframe.iloc[row_selector, column_selector]

While .iloc leverages integer-based indexing, Pandas offers other indexing functions. You can index directly on DataFrames, Series, and GroupBy objects. But you can also use .loc.

We discussed label-based indexing with Pandas loc in our last professional Pandas post. Pandas loc uses labels instead of integer positions to select data.

A Strong Opinion: Don’t Use Pandas Iloc in Production

I find myself using .iloc for inspection and exploratory data analysis and not for production operations. Generally, I suggest you leverage .loc instead of .iloc because it is more explicit and less error-prone.

As we go through the many .iloc examples in the post, I'll show alternatives that I think are cleaner, but here's one sneak peak: Some people use .iloc to pull off the top 10 or bottom 10 entries, but I prefer to use .head or .tail because they are more explicit.

What Inputs Does Pandas Iloc Accept?

Given that caveat, let's explore options for .iloc. I will try to include real-world examples gleaned from my hard drive.

The documentation states that we can index with the following values (and what follows are direct quotes):

An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
A tuple of row and column indexes. The tuple elements consist of one of the above inputs, e.g. (0, 1).

Let’s see how these work.

How to Use Pandas Iloc with Each Input Type

Part 1: Using Pandas Iloc to Index with an Integer (e.g., 5)

Here is an example that pulls out rows of a stock using an integer. Let’s load Pandas and the data. This data is daily stock data from the first half of 2023.

import pandas as pd

url = 'https://github.com/mattharrison/datasets/raw/master/data/tickers-raw.csv'
price_df = pd.read_csv(url)

price_df

	Date				BRK-B	DIS	F	IBM	MMM	NFLX	PG	QCOM	TSLA	UPS	^GSPC
0	2022-12-22 00:00:00-05:00	302.69	86.67	10.51	137.30	118.20	297.75	150.30	109.25	125.35	172.42	3822.39
1	2022-12-23 00:00:00-05:00	306.49	88.01	10.55	138.05	116.79	294.96	150.72	109.41	123.15	173.79	3844.82
2	2022-12-27 00:00:00-05:00	305.55	86.37	10.41	138.80	116.87	284.17	152.04	108.05	109.10	173.72	3829.25
3	2022-12-28 00:00:00-05:00	303.43	84.17	10.17	136.46	115.00	276.88	150.07	105.59	112.71	170.46	3783.22
4	2022-12-29 00:00:00-05:00	309.06	87.18	10.72	137.47	117.21	291.12	150.69	108.42	121.82	172.56	3849.28
...	...				...	...	...	...	...	...	...	...	...	...	...
119	2023-06-15 00:00:00-04:00	339.82	92.94	14.45	138.40	103.81	445.27	148.45	123.61	255.90	179.00	4425.84
120	2023-06-16 00:00:00-04:00	338.31	91.32	14.42	137.48	104.54	431.96	149.54	122.68	260.54	178.58	4409.59
121	2023-06-20 00:00:00-04:00	338.67	89.75	14.22	135.96	102.30	434.70	148.16	119.82	274.45	177.27	4388.71
122	2023-06-21 00:00:00-04:00	338.61	88.64	14.02	133.69	101.47	424.45	149.44	115.76	259.46	173.63	4365.69
123	2023-06-22 00:00:00-04:00	336.91	88.39	14.28	130.91	100.27	420.61	149.79	116.36	261.47	172.37	4368.72
124 rows × 12 columns

Here is the code example, from the book Algorithmic Short Selling with Python.

The first row stores the Cost of the stock in another dataframe. The last row gets the current Price of a stock. (Don't try to run this one -- We haven't initialized a dataframe called port, so you'll get an error.)

port['Cost'] = price_df.iloc[0,:]
port['Price'] = price_df.iloc[-1,:]

You can see that the 0 and -1 are the row selectors. This first example should pull out the first row. Because Pandas doesn't really have a datatype for a row, it presents it as a series.

price_df.iloc[0,:]

Date     2022-12-22 00:00:00-05:00
BRK-B                       302.69
DIS                          86.67
F                            10.51
IBM                          137.3
MMM                          118.2
NFLX                        297.75
PG                           150.3
QCOM                        109.25
TSLA                        125.35
UPS                         172.42
^GSPC                      3822.39
Name: 0, dtype: object

The author of this code included a slice for the column selector as well. The bare slice, :, selects all of the columns. (Which is unneeded in this code). This returns a single row of a dataframe. This code does the same without requiring the column selector:

price_df.iloc[0]

Date     2022-12-22 00:00:00-05:00
BRK-B                       302.69
DIS                          86.67
F                            10.51
IBM                          137.3
MMM                          118.2
NFLX                        297.75
PG                           150.3
QCOM                        109.25
TSLA                        125.35
UPS                         172.42
^GSPC                      3822.39
Name: 0, dtype: object

Note that if you pass in a list with 0 in it for the row selector, you get back a dataframe with a single row, which is the type we will explore in the next section:

price_df.iloc[[0]]

	Date				BRK-B   DIS	F	IBM	MMM	NFLX	PG	QCOM	TSLA	UPS	^GSPC
0	2022-12-22 00:00:00-05:00	302.69	86.67	10.51	137.3	118.2	297.75	150.3	109.25	125.35	172.42	3822.39

Part 2: Using Pandas Iloc to Index with a List or Array of Integers (e.g., [4, 3, 0])

If you pass in a list of integers to the row selector, Pandas will return a dataframe with rows from those positions. You can also pass in a list of integer positions for column selectors.

I looked through my hard drive and had a hard time coming up with an example for this. So I searched further, and found one in the Pandas source code.

This is the test_sort_values_categorical method that tests if sorting works (pandas/tests/series/methods/test_sort_values.py). It uses Pandas iloc to manually re-arrange the rows by hardcoded values so they are in the correct order:

# multi-columns sort
# GH#7848
df = DataFrame(
    {"id": [6, 5, 4, 3, 2, 1], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
df["grade"] = Categorical(df["raw_grade"], ordered=True)
df["grade"] = df["grade"].cat.set_categories(["b", "e", "a"])

# sorts 'grade' according to the order of the categories
result = df.sort_values(by=["grade"])
expected = df.iloc[[1, 2, 5, 0, 3, 4]]
tm.assert_frame_equal(result, expected)

This is actually a good use for .iloc. It is small, synthetic, and easy to understand. In the real world, I have much bigger data and manually reordering the rows by index wouldn't make sense. Hence the .sort_values method.

Note that you can repeat rows or columns using .iloc:

vehicle_url = 'https://github.com/mattharrison/datasets/raw/master/data/vehicles.csv.zip'
df = pd.read_csv(vehicle_url)

df

	barrels08	barrelsA08	charge120	charge240	city08	city08U	cityA08	cityA08U	cityCD	cityE	...	mfrCode	c240Dscr	charge240b	c240bDscr	createdOn			modifiedOn			startStop	phevCity	phevHwy	phevComb
0	15.695714	0.0		0.0		0.0		19	0.0	0	0.0		0.0	0.0	...	NaN	NaN		0.0		NaN		Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	NaN		0		0	0
1	29.964545	0.0		0.0		0.0		9	0.0	0	0.0		0.0	0.0	...	NaN	NaN		0.0		NaN		Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	NaN		0		0	0
2	12.207778	0.0		0.0		0.0		23	0.0	0	0.0		0.0	0.0	...	NaN	NaN		0.0		NaN		Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	NaN		0		0	0
3	29.964545	0.0		0.0		0.0		10	0.0	0	0.0		0.0	0.0	...	NaN	NaN		0.0		NaN		Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	NaN		0		0	0
4	17.347895	0.0		0.0		0.0		17	0.0	0	0.0		0.0	0.0	...	NaN	NaN		0.0		NaN		Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	NaN		0		0	0
...	...		...		...		...		...	...	...	...		...	...	...	...	...		...		...		...				...				...		...		...	...
41139	14.982273	0.0		0.0		0.0		19	0.0	0	0.0		0.0	0.0	...	NaN	NaN		0.0		NaN		Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	NaN		0		0	0
41140	14.330870	0.0		0.0		0.0		20	0.0	0	0.0		0.0	0.0	...	NaN	NaN		0.0		NaN		Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	NaN		0		0	0
41141	15.695714	0.0		0.0		0.0		18	0.0	0	0.0		0.0	0.0	...	NaN	NaN		0.0		NaN		Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	NaN		0		0	0
41142	15.695714	0.0		0.0		0.0		18	0.0	0	0.0		0.0	0.0	...	NaN	NaN		0.0		NaN		Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	NaN		0		0	0
41143	18.311667	0.0		0.0		0.0		16	0.0	0	0.0		0.0	0.0	...	NaN	NaN		0.0		NaN		Tue Jan 01 00:00:00 EST 2013	Tue Jan 01 00:00:00 EST 2013	NaN		0		0	0
41144 rows × 83 columns

df.iloc[[0,0,1,0], [-1, 1, 1]]

	phevComb	barrelsA08	barrelsA08
0	0		0.0		0.0
0	0		0.0		0.0
1	0		0.0		0.0
0	0		0.0		0.0

Part 3: Using Pandas Iloc to Index with a Slice Object with Ints (e.g., 1:7)

Generally, when code uses slices with .iloc, it is pulling off the first or last rows. .iloc[:5] will return the first five rows. (Rows from positions 0, 1, 2, 3, and 4.) Note that this is the half-open interval. It includes the first position (0 if not defined) and up to but not including the last position.

This also works for the column selector. .iloc[7:20, -5:] returns 13 rows starting at index position 7 (the eighth row) and the last five columns. One of the nice attributes of the half-open interval is that the length of the result is the end of the slice minus the first part of the slice. 20 minus seven is 13.

A negative index position means count back from the end. You could use df.iloc[:, 81:83] if you want the last two columns of a dataframe with 83 columns, or you could use df.iloc[:, -2:]. The -2 could be interpreted as len(columns)-2 or 81. The bare slice for the row selector, :, returns all of the rows.

But again, there are better options than Pandas iloc: 95% of .iloc usage with slices of rows can be replaced with .head or .tail and improve readability substantially.

Here’s a longer example of code I use when doing principal component analysis (PCA) visualization. Let’s run PCA on the Fuel Economy data from the US government. (We’ll see Pandas iloc come into play when it comes time to visualize the PCA results.)

from feature_engine import encoding
from sklearn import base, compose, pipeline, preprocessing, set_config
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
set_config(transform_output='pandas')

def tweak_autos(autos):
    cols = ['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive', 'eng_dscr',
        'fuelCost08', 'make', 'model', 'trany', 'range', 'createdOn', 'year']

    return (autos
     [cols]
     .assign(cylinders=autos.cylinders.fillna(0).astype('int8'),
             displ=autos.displ.fillna(0).astype('float16'),
             drive=autos.drive.fillna('Other').astype('category'),
             automatic=autos.trany.str.contains('Auto').fillna(0).astype('int8'),
             speeds=autos.trany.str.extract(r'(\d)+').fillna('20').astype('int8'),
             createdOn=pd.to_datetime(autos.createdOn.replace({' EDT': '-04:00',
                ' EST': '-05:00'}, regex=True), utc=True).dt.tz_convert('America/New_York'),
             ffs=autos.eng_dscr.str.contains('FFS').fillna(0).astype('int8')
            )
     .astype({'highway08': 'int8', 'city08': 'int16', 'comb08': 'int16', 'fuelCost08':

              'int16',
              'range': 'int16',  'year': 'int16', 'make': 'category'})
     .loc[:, ['city08', 'comb08', 'highway08', 'cylinders', 'displ', 'drive',
       'fuelCost08', 'make', 'model', 'range', 'createdOn', 'year',
       'automatic', 'speeds', 'ffs']]
    )

class TweakTransformer(base.BaseEstimator, base.TransformerMixin):
    def fit(self, X):
        return self
    def transform(self, X):
        return tweak_autos(X)

def get_pipeline():
    categorical_features = ['drive', 'make', 'model']
    numeric_features = ['city08', 'comb08', 'highway08',
        'cylinders', 'displ', 'fuelCost08',
       'range', 'year', 'automatic', 'speeds', 'ffs']

    num_pipeline = pipeline.Pipeline(steps=[
        ('std', preprocessing.StandardScaler())
    ])

    cat_pipeline = pipeline.Pipeline(steps=[
        ('oh', encoding.OneHotEncoder(top_categories=5, drop_last=True))
    ])

    pipe = pipeline.Pipeline([
        ('tweak', TweakTransformer()),
        ('ct', compose.ColumnTransformer(transformers=[
            ('num', num_pipeline, numeric_features),
            ('cat', cat_pipeline, categorical_features),
            ])),
      ]
    )
    return pipe

def get_pca_pipeline():
    scaler = StandardScaler()
    pca = PCA()
    pl = pipeline.make_pipeline(scaler, pca)
    return pl

With that code in place, let's make our cleanup pipeline to generate the data for machine learning, X:

pl = get_pipeline()
X = pl.fit_transform(df)
X

	num__city08	num__comb08	num__highway08	num__cylinders	num__displ	num__fuelCost08	num__range	num__year	num__automatic	num__speeds	...	cat__make_Chevrolet	cat__make_Ford	cat__make_Dodge	cat__make_GMC	cat__make_Toyota	cat__model_F150 Pickup 2WD	cat__model_F150 Pickup 4WD	cat__model_Mustang	cat__model_Jetta	cat__model_Truck 2WD
0	0.079809	0.049985	0.064077	-0.939606	-0.930490	-0.553207	-0.060845	-1.484011	-1.486392	-0.113886	...	0			0		0		0		0			0				0				0			0			0
1	-1.185086	-1.253042	-1.358900	3.512290	1.180007	2.271333	-0.060845	-1.484011	-1.486392	-0.113886	...	0			0		0		0		0			0				0				0			0			0
2	0.585768	0.831800	1.098970	-0.939606	-0.785428	-1.240257	-0.060845	-1.484011	-1.486392	-0.113886	...	0			0		1		0		0			0				0				0			0			0
3	-1.058597	-1.253042	-1.617623	1.286342	1.399021	2.271333	-0.060845	-1.484011	0.672770	-0.909896	...	0			0		1		0		0			0				0				0			0			0
4	-0.173170	-0.210621	-0.194646	-0.939606	-0.785428	0.515538	-0.060845	-0.766025	-1.486392	-0.113886	...	0			0		0		0		0			0				0				0			0			0
...	...		...		...		...		...		...		...		...		...		...		...	...			...		...		...		...			...				...				...			...			...
41139	0.079809	0.180287	0.193439	-0.939606	-0.785428	-0.705884	-0.060845	-0.766025	0.672770	-0.511891	...	0			0		0		0		0			0				0				0			0			0
41140	0.206299	0.310590	0.452162	-0.939606	-0.785428	-0.782223	-0.060845	-0.766025	-1.486392	-0.113886	...	0			0		0		0		0			0				0				0			0			0
41141	-0.046680	0.049985	-0.065284	-0.939606	-0.785428	-0.553207	-0.060845	-0.766025	0.672770	-0.511891	...	0			0		0		0		0			0				0				0			0			0
41142	-0.046680	0.049985	-0.065284	-0.939606	-0.785428	-0.553207	-0.060845	-0.766025	-1.486392	-0.113886	...	0			0		0		0		0			0				0				0			0			0
41143	-0.299660	-0.340923	-0.453369	-0.939606	-0.785428	0.820894	-0.060845	-0.766025	0.672770	-0.511891	...	0			0		0		0		0			0				0				0			0			0
41144 rows × 26 columns

Once we have that data, let’s send it through our PCA pipeline, which standardizes the data and then feeds it to the PCA algorithm:

pca_pl = get_pca_pipeline()
pca_pl.fit_transform(X)

	pca0		pca1		pca2		pca3		pca4		pca5		pca6		pca7		pca8		pca9		...	pca16		pca17		pca18		pca19		pca20		pca21		pca22		pca23		pca24		pca25
0	0.477062	-2.095174	0.770188	-0.212903	-0.951630	0.676938	-0.238926	0.289078	-0.740275	0.155848	...	0.349791	-0.496253	0.237550	0.038151	0.212713	-0.236530	-0.012753	-0.174440	-0.007384	0.032632
1	-4.268353	0.083709	0.988989	-0.406348	-0.790531	0.107578	-0.072892	0.070620	-0.463815	0.187307	...	-0.679322	2.164472	0.463913	-0.271684	1.396367	0.003845	1.387906	-0.348073	-0.282840	-0.020434
2	2.223967	-2.811101	-0.325520	-0.617319	-0.598618	-1.267804	2.235811	0.444152	0.445463	1.456438	...	0.114138	-0.023109	-0.611794	-0.436919	0.084360	-0.158280	0.081188	-0.178606	0.345765	-0.033851
3	-4.042764	-0.429124	1.000171	-1.017005	-0.974965	-1.251741	2.801752	-0.073256	0.909080	1.261115	...	-0.430511	-0.581322	-0.308522	0.426849	1.287281	0.616496	0.055086	-0.331410	-0.077058	-0.038770
4	-0.139146	-2.032831	-0.538644	-0.316546	1.840327	0.798375	-0.194294	0.064586	-0.862334	0.194610	...	0.510479	-0.051578	0.403818	0.149321	-0.555183	0.342289	-0.105411	0.038149	0.184141	0.019709
...	...		...		...		...		...		...		...		...		...		...		...	...		...		...		...		...		...		...		...		...		...
41139	1.377499	-1.709720	-0.699271	0.007333	-0.305877	-0.638405	0.296353	0.026869	0.397360	-0.318153	...	0.376670	0.103955	0.519335	0.167599	-0.254011	-0.122864	-0.182352	-0.038450	-0.003055	-0.039244
41140	1.748701	-2.100211	-0.734197	0.048718	-0.347412	-0.195885	-0.149926	0.314707	-0.301994	0.008568	...	-0.135854	0.938395	0.307274	0.093213	-0.328319	-0.172653	-0.215587	-0.025868	0.035057	-0.001246
41141	0.307412	-1.652104	-0.416253	-0.326038	1.831733	0.360856	0.128497	-0.178601	-0.178784	-0.122508	...	1.101649	-1.133470	0.475791	-0.331684	-0.444559	-0.334098	-0.069290	-0.028727	0.083650	-0.072446
41142	0.450261	-2.073459	-0.552671	-0.268423	1.773376	0.814024	-0.303813	0.101635	-0.884443	0.196430	...	0.564450	-0.309094	0.270824	-0.248844	-0.524381	-0.398215	-0.113874	-0.010273	0.008102	-0.069861
41143	-0.593758	-1.626213	-0.475329	-0.373315	1.902766	0.350590	0.274209	-0.229919	-0.155403	-0.130760	...	1.017916	-0.824843	0.644316	0.279251	-0.484714	0.571243	-0.071891	0.035067	0.187376	-0.022398
41144 rows × 26 columns

Having run PCA on the data, let’s look at the components that make up the weights to calculate the embeddings. (We’ll only display the first two as an example.)

pca_pl.named_steps['pca'].components_[0:2]

array([[ 0.37501081,  0.38968423,  0.39554789, -0.34148913, -0.3483539 ,
        -0.36421128,  0.19347438,  0.0875742 , -0.05200177,  0.07680492,
        -0.03992407,  0.24258046, -0.17168667, -0.11188447,  0.04104331,
        -0.01628179, -0.05080095, -0.03737196, -0.05577404, -0.09616649,
         0.04123887, -0.03649398, -0.03769309, -0.01286445,  0.03708651,
         0.00463668],
       [ 0.05963524,  0.0652074 ,  0.07807361,  0.19236311,  0.17215519,
         0.07042231,  0.08233002,  0.54374148,  0.25799294,  0.3409357 ,
        -0.49177226, -0.16771429,  0.05033319, -0.09918001,  0.292591  ,
         0.20112942, -0.02188441, -0.05225876, -0.09589855,  0.01754167,
        -0.03934466, -0.02033149, -0.03234882, -0.00660427, -0.01698125,
        -0.0693592 ]])

Let’s throw that into a dataframe to make it easier to understand. The rows represent the weights for columns for each principal component. The larger weights have more of an impact on the principal components.

def components_to_df(pca_components, feature_names):
    idxs = [f'PC{i}' for i in range(len(pca_components))]
    return pd.DataFrame(pca_components, columns=feature_names,
                            index=idxs)
comps = components_to_df(pca_pl.named_steps['pca'].components_, X.columns)
comps.head()

	num__city08	num__comb08	num__highway08	num__cylinders	num__displ	num__fuelCost08	num__range	num__year	num__automatic	num__speeds	...	cat__make_Chevrolet	cat__make_Ford	cat__make_Dodge	cat__make_GMC	cat__make_Toyota	cat__model_F150 Pickup 2WD	cat__model_F150 Pickup 4WD	cat__model_Mustang	cat__model_Jetta	cat__model_Truck 2WD
PC0	0.375011	0.389684	0.395548	-0.341489	-0.348354	-0.364211	0.193474	0.087574	-0.052002	0.076805	...	-0.050801		-0.037372	-0.055774	-0.096166	0.041239		-0.036494			-0.037693			-0.012864		0.037087		0.004637
PC1	0.059635	0.065207	0.078074	0.192363	0.172155	0.070422	0.082330	0.543741	0.257993	0.340936	...	-0.021884		-0.052259	-0.095899	0.017542	-0.039345		-0.020331			-0.032349			-0.006604		-0.016981		-0.069359
PC2	0.269448	0.235022	0.178760	0.130405	0.186508	0.123961	0.520875	-0.112167	0.020663	-0.230661	...	0.079818		0.162466	0.010807	0.112550	-0.092597		0.155051			0.026904			0.147292		-0.063889		0.046304
PC3	-0.058300	-0.047605	-0.030073	-0.030055	-0.049874	-0.067175	-0.144811	0.074356	-0.011682	0.081386	...	-0.205989		0.670696	-0.139969	-0.198711	-0.051163		0.368876			0.331516			0.367203		0.014556		-0.050755
PC4	0.094409	0.065148	0.012979	-0.003285	0.039323	0.091275	0.254591	-0.028703	0.008091	-0.102726	...	-0.016995		0.085440	-0.074755	0.150861	-0.051971		-0.090184			0.366551			-0.098427		-0.043081		-0.155391
5 rows × 26 columns

We often want to create a bar plot of those weights that have the most impact. However, we have 26 principal components and 26 weights for each of those components. This makes the plot really hard to interpret.

(comps
 .plot.bar()
)

Usually, we will inspect the weights for the first few components. Here, we see .iloc pulling out the first three rows!

This is still a little hard to interpret because many columns are included that clutter the plot and make it hard to interpret. Let’s limit the components and move the index to the side. (Here’s where Pandas iloc enters the example!)

(comps
 .iloc[:3]
 .plot.bar()
 .legend(bbox_to_anchor=(1,1))
)

Let's limit the plot only to show columns whose absolute value exceeds some threshold. We combine .pipe with limit_cols to perform this operation:

def limit_cols(df_, limit):
    cols = (df_
            .abs()
            .gt(limit)
            .any()
            .pipe(lambda s: s[s].index)
           )
    return df_.loc[:, cols]

(comps
 .iloc[:3]
 .pipe(limit_cols, limit=.32)
 .plot.bar()
 .legend(bbox_to_anchor=(1,1))
)

This looks much better. It appears that PC0 is impacted most by the city, combined, and highway mileage. PC1 looks like a proxy for year. PC2 is impacted by range, indicating the division between vehicles with electric and combustion engines.

This was an extended example, but probably the most common use I see of .iloc: Pulling off the first (or last) few rows.

My preference would be to rewrite the code without Pandas iloc, like this:

def limit_cols(df_, limit):
    cols = (df_
            .abs()
            .gt(limit)
            .any()
            .pipe(lambda s: s[s].index)
           )
    return df_.loc[:, cols]

num_components = 3

(comps
 #.iloc[:3]
 .head(num_components)
 .pipe(limit_cols, limit=.32)
 .plot.bar()
 .legend(bbox_to_anchor=(1,1))
)

Part 4: Using Pandas Iloc to Index with a Boolean Array

This is a weird one. Generally, when we speak of boolean arrays with Pandas, we refer to a series with the same index as the dataframe (or series) we are working with but with boolean (true / false) values.

Here is a boolean array of vehicles with city mileage above 60:

df.city08 > 60

0        False
1        False
2        False
3        False
4        False
         ...  
41139    False
41140    False
41141    False
41142    False
41143    False
Name: city08, Length: 41144, dtype: bool

Let's see if we can use this boolean array to select the rows of cars with city mileage above 60 in combination with .iloc:

df.iloc[df.city08 > 60]

This doesn’t work in Pandas — We get a NotImplementedError:

NotImplementedError: iLocation based boolean indexing on an integer type is not available

In this case, the term boolean array (for the .iloc docstring) means a list (or NumPy array) with the same length as the number of rows in the dataframe. I'll convert the boolean array to a Python list and try the code again:

df.iloc[list(df.city08 > 60)]

	barrels08	barrelsA08	charge120	charge240	city08	city08U		cityA08	cityA08U	cityCD	cityE	...	mfrCode	c240Dscr		charge240b	c240bDscr		createdOn			modifiedOn			startStop	phevCity	phevHwy	phevComb
7138	0.240		0.0		0.0		0.0		81	0.0000		0	0.0		0.0	41.0000	...	NaN	NaN			0.0		NaN			Tue Jan 01 00:00:00 EST 2013	Thu Jul 07 00:00:00 EDT 2016	N		0		0	0
7139	0.282		0.0		0.0		0.0		81	0.0000		0	0.0		0.0	41.0000	...	NaN	NaN			0.0		NaN			Tue Jan 01 00:00:00 EST 2013	Thu Jul 07 00:00:00 EDT 2016	N		0		0	0
8143	0.282		0.0		0.0		0.0		81	0.0000		0	0.0		0.0	41.0000	...	NaN	NaN			0.0		NaN			Tue Jan 01 00:00:00 EST 2013	Thu Jul 07 00:00:00 EDT 2016	N		0		0	0
8144	0.312		0.0		0.0		0.0		74	0.0000		0	0.0		0.0	46.0000	...	NaN	NaN			0.0		NaN			Tue Jan 01 00:00:00 EST 2013	Thu Jul 07 00:00:00 EDT 2016	N		0		0	0
8147	0.270		0.0		0.0		0.0		84	0.0000		0	0.0		0.0	40.0000	...	NaN	NaN			0.0		NaN			Tue Jan 01 00:00:00 EST 2013	Thu Jul 07 00:00:00 EDT 2016	N		0		0	0
...	...		...		...		...		...	...		...	...		...	...	...	...	...			...		...			...				...				...		...		...	...
34563	0.156		0.0		0.0		8.5		138	138.1100	0	0.0		0.0	24.4045	...	TSL	standard charger	0.0		80 amp dual charger	Thu May 02 00:00:00 EDT 2019	Thu May 02 00:00:00 EDT 2019	N		0		0	0
34564	0.150		0.0		0.0		9.5		140	140.4200	0	0.0		0.0	24.0030	...	TSL	standard charger	0.0		80 amp dual charger	Thu May 02 00:00:00 EDT 2019	Tue Jun 04 00:00:00 EDT 2019	N		0		0	0
34565	0.180		0.0		0.0		12.0		115	114.9369	0	0.0		0.0	29.3248	...	TSL	standard charger	8.0		72 amp dual charger	Thu May 02 00:00:00 EDT 2019	Tue Jun 04 00:00:00 EDT 2019	N		0		0	0
34566	0.192		0.0		0.0		12.0		104	104.2314	0	0.0		0.0	32.3367	...	TSL	standard charger	8.0		72 amp dual charger	Thu May 02 00:00:00 EDT 2019	Tue Jun 04 00:00:00 EDT 2019	N		0		0	0
34567	0.210		0.0		0.0		12.0		98	98.1135		0	0.0		0.0	34.3531	...	TSL	standard charger	8.0		72 amp dual charger	Thu May 02 00:00:00 EDT 2019	Tue Jun 04 00:00:00 EDT 2019	N		0		0	0
191 rows × 83 columns

That works! However, I have never found that this felt like the tool to reach for to limit rows of data. I prefer the .query method:

df.query('city08 > 60')

	barrels08	barrelsA08	charge120	charge240	city08	city08U		cityA08	cityA08U	cityCD	cityE	...	mfrCode	c240Dscr		charge240b	c240bDscr		createdOn			modifiedOn			startStop	phevCity	phevHwy	phevComb
7138	0.240		0.0		0.0		0.0		81	0.0000		0	0.0		0.0	41.0000	...	NaN	NaN			0.0		NaN			Tue Jan 01 00:00:00 EST 2013	Thu Jul 07 00:00:00 EDT 2016	N		0		0	0
7139	0.282		0.0		0.0		0.0		81	0.0000		0	0.0		0.0	41.0000	...	NaN	NaN			0.0		NaN			Tue Jan 01 00:00:00 EST 2013	Thu Jul 07 00:00:00 EDT 2016	N		0		0	0
8143	0.282		0.0		0.0		0.0		81	0.0000		0	0.0		0.0	41.0000	...	NaN	NaN			0.0		NaN			Tue Jan 01 00:00:00 EST 2013	Thu Jul 07 00:00:00 EDT 2016	N		0		0	0
8144	0.312		0.0		0.0		0.0		74	0.0000		0	0.0		0.0	46.0000	...	NaN	NaN			0.0		NaN			Tue Jan 01 00:00:00 EST 2013	Thu Jul 07 00:00:00 EDT 2016	N		0		0	0
8147	0.270		0.0		0.0		0.0		84	0.0000		0	0.0		0.0	40.0000	...	NaN	NaN			0.0		NaN			Tue Jan 01 00:00:00 EST 2013	Thu Jul 07 00:00:00 EDT 2016	N		0		0	0
...	...		...		...		...		...	...		...	...		...	...	...	...	...			...		...			...				...				...		...		...	...
34563	0.156		0.0		0.0		8.5		138	138.1100	0	0.0		0.0	24.4045	...	TSL	standard charger	0.0		80 amp dual charger	Thu May 02 00:00:00 EDT 2019	Thu May 02 00:00:00 EDT 2019	N		0		0	0
34564	0.150		0.0		0.0		9.5		140	140.4200	0	0.0		0.0	24.0030	...	TSL	standard charger	0.0		80 amp dual charger	Thu May 02 00:00:00 EDT 2019	Tue Jun 04 00:00:00 EDT 2019	N		0		0	0
34565	0.180		0.0		0.0		12.0		115	114.9369	0	0.0		0.0	29.3248	...	TSL	standard charger	8.0		72 amp dual charger	Thu May 02 00:00:00 EDT 2019	Tue Jun 04 00:00:00 EDT 2019	N		0		0	0
34566	0.192		0.0		0.0		12.0		104	104.2314	0	0.0		0.0	32.3367	...	TSL	standard charger	8.0		72 amp dual charger	Thu May 02 00:00:00 EDT 2019	Tue Jun 04 00:00:00 EDT 2019	N		0		0	0
34567	0.210		0.0		0.0		12.0		98	98.1135		0	0.0		0.0	34.3531	...	TSL	standard charger	8.0		72 amp dual charger	Thu May 02 00:00:00 EDT 2019	Tue Jun 04 00:00:00 EDT 2019	N		0		0	0
191 rows × 83 columns

Part 5: Using Pandas Iloc to Index with a Callable Function

Here’s another option: passing in a function. You can pass in a function to the index operation or a function for the row selector and/or the column selector. This function should return a single position, a slice, a list, or a tuple of those.

This is very rare in the wild. Here's an example I found: Finding peaks. I'm going to use the find_peaks function from the SciPy library.

The find_peaks function works with NumPy arrays and Pandas series and returns the index locations where the values are above the neighbors that surround them. (You can customize the behavior to constrain the definition of a "peak" as well. We'll use the out-of-the-box behavior.)

from scipy.signal import find_peaks

find_peaks(price_df.IBM)

(array([  2,   4,   7,   9,  14,  19,  24,  28,  31,  34,  36,  38,  48,
         53,  56,  59,  65,  68,  70,  72,  76,  80,  83,  85,  91,  94,
        103, 107, 117, 119]),
 {})

Note the return type of this is a tuple. This won’t work directly with .iloc:

price_df.IBM.iloc[find_peaks]

This throws a ValueError. (For the sake of brevity, I won’t share the whole traceback.)

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

This is a great place to use lambda. Let's use lambda to adapt the result of find_peaks:

price_df.IBM.iloc[lambda ser: find_peaks(ser)[0]]

2      138.80
4      137.47
7      138.97
9      140.05
14     142.18
19     138.25
24     131.86
28     133.46
31     132.52
34     135.50
36     134.57
38     133.20
48     128.44
53     123.89
56     123.02
59     124.87
65     127.97
68     130.28
70     130.36
72     129.27
76     126.42
80     124.66
83     124.20
85     125.26
91     121.99
94     122.02
103    128.18
107    129.48
117    137.60
119    138.40
Name: IBM, dtype: float64

Ok, that works but what is a real life example? We can use it to highlight peaks in a plot:

import matplotlib.pyplot as plt

def plot_with_highlights(ser):
    ax = ser.plot()
    ser.iloc[lambda ser: find_peaks(ser)[0]].plot(linestyle='',
                                                  marker='o')
    return ser

(price_df
 .IBM
 .pipe(plot_with_highlights)
)

0      137.30
1      138.05
2      138.80
3      136.46
4      137.47
        ...  
119    138.40
120    137.48
121    135.96
122    133.69
123    130.91
Name: IBM, Length: 124, dtype: float64

Part 6: Using Pandas Iloc to Index with a Tuple

Finally, the documentation suggests we can also pass in row selectors. We can do this by passing in a tuple. The docstrings states: “A tuple of row and column indexes. The tuple elements consist of one of the above inputs, e.g. (0, 1).”

I'll note that this only applies to dataframes. Using .iloc with series only supports row selectors (because there aren't any columns).

When using .iloc with dataframes, these row selectors can be any of the options we provided above. I have presented some examples of these above.

One thing to note. With .iloc and a dataframe, you can pull out:

The value of a single cell as a:
- Scalar value
- Column series
- Row Series
- Dataframe
A Column as a:
- Column series
- Dataframe
A Row as a
- Row series
- Dataframe
A subset of a Dataframe

Here is an example of how we can change the types of data results. Looking at our price data, let’s pull out the first BRK-B value. Here’s the whole data:

price_df

	Date				BRK-B	DIS	F	IBM	MMM	NFLX	PG	QCOM	TSLA	UPS	^GSPC
0	2022-12-22 00:00:00-05:00	302.69	86.67	10.51	137.30	118.20	297.75	150.30	109.25	125.35	172.42	3822.39
1	2022-12-23 00:00:00-05:00	306.49	88.01	10.55	138.05	116.79	294.96	150.72	109.41	123.15	173.79	3844.82
2	2022-12-27 00:00:00-05:00	305.55	86.37	10.41	138.80	116.87	284.17	152.04	108.05	109.10	173.72	3829.25
3	2022-12-28 00:00:00-05:00	303.43	84.17	10.17	136.46	115.00	276.88	150.07	105.59	112.71	170.46	3783.22
4	2022-12-29 00:00:00-05:00	309.06	87.18	10.72	137.47	117.21	291.12	150.69	108.42	121.82	172.56	3849.28
...	...				...	...	...	...	...	...	...	...	...	...	...
119	2023-06-15 00:00:00-04:00	339.82	92.94	14.45	138.40	103.81	445.27	148.45	123.61	255.90	179.00	4425.84
120	2023-06-16 00:00:00-04:00	338.31	91.32	14.42	137.48	104.54	431.96	149.54	122.68	260.54	178.58	4409.59
121	2023-06-20 00:00:00-04:00	338.67	89.75	14.22	135.96	102.30	434.70	148.16	119.82	274.45	177.27	4388.71
122	2023-06-21 00:00:00-04:00	338.61	88.64	14.02	133.69	101.47	424.45	149.44	115.76	259.46	173.63	4365.69
123	2023-06-22 00:00:00-04:00	336.91	88.39	14.28	130.91	100.27	420.61	149.79	116.36	261.47	172.37	4368.72
124 rows × 12 columns

Here is a scalar result of the value we want:

price_df.iloc[0, 1]

302.69

We can pull this out as a column. Note that the index stays the same as the original data, and the series has the column name:

# a column (Series)
price_df.iloc[[0], 1]

0    302.69
Name: BRK-B, dtype: float64

Here is the value pulled out as a row. Because Pandas doesn’t have a row type, it changes the row to a series. The column name moves to the index, and the index value moves to the name of the series.

# a row (converted to a Series)
price_df.iloc[0, [1]]

BRK-B    302.69
Name: 0, dtype: object

Finally, we can get the result as a dataframe.

# A Dataframe
price_df.iloc[[0], [1]]

	BRK-B
0	302.69

What type of value should you pull out of your data? It depends on what you are doing. I find that when I work with row selectors, generally I want dataframes out. When I'm using the .assign method, I generally want a column (series) as a result.

Code Readability and the Danger of Magic Numbers

I’m a huge proponent of writing code that is easy to read, reuse, and share with others. If I have to choose between “easy to write” and “easy to read” code, I will choose easy-to-read every time! In fact, I will often write more code to make it easier to read.

So let’s talk about something that is easy to write, but often makes code hard to read, and comes up a lot when people use Pandas iloc: magic numbers.

In the world of programming, magic numbers refer to literal numbers that appear directly in the source code without any explanation. They are “magic” in that they “just work.” Over time, it becomes harder and harder for you (and certainly for your colleagues) to understand why these specific numbers were chosen, which can lead to errors and difficulties when updating or modifying the code.

For instance, consider the following:

selected_data = sales_data.iloc[0:5, 2:4]

In this case, 0, 5, 2, and 4 are magic numbers. Did you want the first five rows? Are those five rows specific to data at a particular time? If you run the code later, will the rows be in the same location? Similarly, with the columns? Did you just want the third and fourth column? What if more columns are added? This code might work, but the intent is not clear.

One of the most effective ways to avoid magic numbers is to use meaningful variable names instead. For example, instead of directly using numbers with .iloc, we could define variables that describe what these numbers represent:

start_row = 0
end_row = 5
start_column = 2
end_column = 4
selected_data = sales_data.iloc[start_row:end_row, start_column:end_column]

In this case, we’ve replaced the magic numbers 0 and 5 with the variables ‘start_row’ and ‘end_row’. This makes the code much more understandable: it’s clear that we’re selecting the rows from ‘start_row’ to ‘end_row’.

But my preference, if we wanted the first five rows and two specific columns, would be to avoid Pandas iloc entirely:

num_rows = 5
(sales_data
  .head(num_rows)
  .loc[:, ['Country', 'Sales_Amount']]
)

Although this requires more typing, this code makes our intent clear. When doing quick and dirty analysis, I might use the .iloc version, but when I start working with others, I will refactor.

Conclusion

Thanks for reading part two of our ultimate guide to indexing in Pandas. You should now be aware of some of the capabilities of  .iloc, which lets you pull out rows or columns based on integer location. This property is handy, but it is usually possible (and preferable) to refactor .iloc out of our code.

If you liked this, check out our previous Professional Pandas posts on Pandas loc, the Pandas merge method, and the Pandas assign method / chaining.

About Ponder

Ponder is the company that lets you run your Python data workflows (in Pandas or NumPy) securely, seamlessly, and scalably in your data warehouse or database. Sign up for a free trial!

Try Ponder Today

Start running your Python data science workflows in your database within minutes!

Get started

News

Oct 23, 2023

🐼 ❤️ ❄️

We are excited to announce Snowflake’s intent to acquire Ponder to bring Ponder’s Python data science innovations to its customers and to accelerate the growth of the Modin community.

Articles

Oct 3, 2023

Professional Pandas: Handling Missing Data With Pandas Dropna

This is the fifth in a series of blog posts that teach how to write professional-quality pandas code. We start by discussing pandas dropna generally and going over a simple example. Then we talk about identifying missing values, when to drop data, and how to drop entire rows that are missing.

Articles

Sep 19, 2023

How To Use pandas resample on a Database

In this article, we describe pandas resample + provide some examples, and then show how you can use it at scale in your database.

Ready to level up your Pandas game?

Try Ponder now

Professional Pandas: Indexing with Pandas Iloc

Introduction to Pandas Iloc

Try Ponder Today

Table of Contents

What is Pandas Iloc?

A Strong Opinion: Don’t Use Pandas Iloc in Production

What Inputs Does Pandas Iloc Accept?

How to Use Pandas Iloc with Each Input Type

Part 1: Using Pandas Iloc to Index with an Integer (e.g., 5)

Part 2: Using Pandas Iloc to Index with a List or Array of Integers (e.g., [4, 3, 0])

Part 3: Using Pandas Iloc to Index with a Slice Object with Ints (e.g., 1:7)

Part 4: Using Pandas Iloc to Index with a Boolean Array

Part 5: Using Pandas Iloc to Index with a Callable Function

Part 6: Using Pandas Iloc to Index with a Tuple

Code Readability and the Danger of Magic Numbers

Conclusion

About Ponder

Try Ponder Today

Read more:

🐼 ❤️ ❄️

Professional Pandas: Handling Missing Data With Pandas Dropna

How To Use pandas resample on a Database

Ready to level up your Pandas game?