Pandas is one of the most important libraries in data science and data engineering. In pandas, tabular data is represented by the DataFrame class.
In this post, I want to talk about indexing a DataFrame.
Suppose you define a DataFrame as follows:
import numpy as np
import pandas as pd
K = pd.DataFrame(np.random.rand(5,6))
0 0.457355 0.695109 0.960173 0.895233 0.913107 0.997462
1 0.159627 0.006112 0.751829 0.641470 0.430603 0.005721
2 0.167967 0.232892 0.000698 0.646807 0.359331 0.859992
3 0.114184 0.332704 0.224112 0.058897 0.547509 0.734783
4 0.623049 0.403003 0.384613 0.663572 0.866130 0.084359
You cannot index the values of K the way you would index a NumPy array:
K[0,0]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File “”, line 1, in
K[0,0]
File “/home/kazem/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py”, line 2800, in __getitem__
indexer = self.columns.get_loc(key)
File “/home/kazem/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/range.py”, line 353, in get_loc
return super().get_loc(key, method=method, tolerance=tolerance)
File “/home/kazem/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py”, line 2648, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File “pandas/_libs/index.pyx”, line 111, in pandas._libs.index.IndexEngine.get_loc
File “pandas/_libs/index.pyx”, line 135, in pandas._libs.index.IndexEngine.get_loc
File “pandas/_libs/index_class_helper.pxi”, line 109, in pandas._libs.index.Int64Engine._check_type
KeyError: (0, 0)
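The KeyError comes from the fact that square brackets on a DataFrame look up column labels, so K[0,0] is read as a request for a column labelled (0, 0). The following is a minimal sketch of position-based alternatives; the seeded generator is an assumption added only for reproducibility and is not part of the original post.

```python
import numpy as np
import pandas as pd

# Seed chosen only so the example is reproducible (an assumption, not from the post).
rng = np.random.default_rng(0)
K = pd.DataFrame(rng.random((5, 6)))

# Plain brackets select by column label: this is the column labelled 0.
first_column = K[0]

# Position-based scalar access: row 0, column 0.
value_iloc = K.iloc[0, 0]
value_iat = K.iat[0, 0]  # scalar-only accessor

print(value_iloc == value_iat)
```

Both iloc and iat take integer positions, which is the closest match to array-style indexing.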
Row selection :
Suppose you want to extract the first row. You can use the iloc indexer, which selects by integer position:
K.iloc[0]
or
K.iloc[0,:]
Out[ ]:
0 0.457355
1 0.695109
2 0.960173
3 0.895233
4 0.913107
5 0.997462
Name: 0, dtype: float64
To select the first row, it is enough to pass 0 as the row index.
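Note that a single integer returns a Series, as the output above shows. If you prefer to keep a one-row DataFrame, passing a one-element list is one way to do it. A small sketch, again with an assumed seed for reproducibility:

```python
import numpy as np
import pandas as pd

# Seed is an assumption for reproducibility, not from the post.
rng = np.random.default_rng(0)
K = pd.DataFrame(rng.random((5, 6)))

row_as_series = K.iloc[0]    # scalar position -> Series with 6 entries
row_as_frame = K.iloc[[0]]   # list with one position -> DataFrame with 1 row

print(type(row_as_series).__name__, row_as_series.shape)
print(type(row_as_frame).__name__, row_as_frame.shape)
```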
Multiple row selection :
If you want to extract the first and third rows, you can pass a list of positions:
K.iloc[[0,2],:]
Out[14]:
0 1 2 3 4 5
0 0.457355 0.695109 0.960173 0.895233 0.913107 0.997462
2 0.167967 0.232892 0.000698 0.646807 0.359331 0.859992
As you can see, the first and third rows are extracted.
Now, if you want to extract the first through third rows:
In [19]: K.iloc[range(0,3),:]
Out[19]:
0 1 2 3 4 5
0 0.457355 0.695109 0.960173 0.895233 0.913107 0.997462
1 0.159627 0.006112 0.751829 0.641470 0.430603 0.005721
2 0.167967 0.232892 0.000698 0.646807 0.359331 0.859992
The same approach works for columns too.
K.iloc[:,range(0,3)]
Out[20]:
0 1 2
0 0.457355 0.695109 0.960173
1 0.159627 0.006112 0.751829
2 0.167967 0.232892 0.000698
3 0.114184 0.332704 0.224112
4 0.623049 0.403003 0.384613
In the code above, we extract the first through third columns.
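The range-based selections above can also be written with slice notation, which is the more common idiom. The sketch below uses a small deterministic DataFrame (my own choice, so the result is easy to check by eye) instead of random values.

```python
import numpy as np
import pandas as pd

# Deterministic values 0..29, chosen here only to make the result easy to read.
K = pd.DataFrame(np.arange(30).reshape(5, 6))

rows_by_range = K.iloc[range(0, 3), :]  # as in the post
rows_by_slice = K.iloc[0:3, :]          # same rows, slice notation
cols_by_slice = K.iloc[:, 0:3]          # first three columns
block = K.iloc[0:3, 0:3]                # rows and columns together

print(block)
```

As with Python slices in general, the end position of an iloc slice is excluded, so 0:3 covers positions 0, 1, and 2.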
Binary Indexing:
In some applications, it is necessary to index a DataFrame with a binary (boolean) vector. The vector used for binary indexing must satisfy the following conditions:
- The length of the vector must equal the number of rows (for row selection) or the number of columns (for column selection) of the DataFrame.
- The elements of the vector must be of type bool.
Example :
import numpy as np
import pandas as pd
K = pd.DataFrame(np.random.rand(5,6))
idx = [1,0,1,1,0]
We want to extract the first, third, and fourth rows of the input DataFrame (K). Every position in idx that holds a one marks a selected row. K has five rows, so idx has five elements too.
x = [bool(d) for d in idx]
idx is a list of integers, so the line above converts each element to bool; integers would otherwise be read as row positions.
Now we can simply apply the indexing:
K.iloc[x,:]
Out[26]:
0 1 2 3 4 5
0 0.457355 0.695109 0.960173 0.895233 0.913107 0.997462
2 0.167967 0.232892 0.000698 0.646807 0.359331 0.859992
3 0.114184 0.332704 0.224112 0.058897 0.547509 0.734783
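To see why the conversion to bool matters, the sketch below compares the integer list with its boolean version, and also applies a mask along the columns. The deterministic DataFrame and the names mask and col_mask are illustrative choices of mine, not from the original post.

```python
import numpy as np
import pandas as pd

# Deterministic values so the selected rows are easy to identify.
K = pd.DataFrame(np.arange(30).reshape(5, 6))
idx = [1, 0, 1, 1, 0]

# Integers are treated as positions: rows 1, 0, 1, 1, 0 (with repeats).
by_position = K.iloc[idx, :]

# Booleans act as a mask: rows where the mask is True (0, 2, 3).
mask = np.array(idx, dtype=bool)
by_mask = K.iloc[mask, :]

# A column mask needs one entry per column (6 here).
col_mask = [True, False, True, False, False, True]
by_col_mask = K.iloc[:, col_mask]

print(list(by_position.index), list(by_mask.index), list(by_col_mask.columns))
```

Converting with np.array(idx, dtype=bool) has the same effect as the list comprehension in the post.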
Binary Indexing in a NumPy array :
import numpy as np
G = np.round(10*np.random.rand(6,3))
out :
array([[8., 7., 1.],
[9., 9., 6.],
[4., 9., 9.],
[3., 5., 5.],
[4., 2., 6.],
[2., 0., 9.]])
rb = G <= 5
out :
array([[False, False, True],
[False, False, False],
[ True, False, False],
[ True, True, True],
[ True, True, False],
[ True, True, False]])
G[rb] = 0  # set every element where rb is True to zero
out :
array([[8., 7., 0.],
[9., 9., 6.],
[0., 9., 9.],
[0., 0., 0.],
[0., 0., 6.],
[0., 0., 9.]])
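The NumPy steps above can be reproduced deterministically by starting from the array printed in the post rather than from random numbers. This sketch also shows what G[rb] returns when read instead of assigned, and np.where as a non-mutating alternative; the names selected, G_zeroed, and G_where are mine.

```python
import numpy as np

# The array shown in the post, typed in by hand so the result is reproducible.
G = np.array([[8., 7., 1.],
              [9., 9., 6.],
              [4., 9., 9.],
              [3., 5., 5.],
              [4., 2., 6.],
              [2., 0., 9.]])

rb = G <= 5            # boolean mask with the same shape as G

selected = G[rb]       # 1-D array of the masked elements, in row-major order

G_zeroed = G.copy()    # copy so the original G stays untouched
G_zeroed[rb] = 0       # in-place assignment through the mask

G_where = np.where(rb, 0, G)  # same result without modifying any array

print(selected)
print(G_zeroed)
```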