Indexing and selecting data
We will create simple time-series data to illustrate indexing functionality.
For creating the time-series data, I am using the numpy library.
Importing libraries
import pandas as pd import numpy as np
Creating data
dates = pd.date_range( '5/5/2021' , periods = 10 ) data = pd.DataFrame(np.random.randn( 10 , 5 ), index = dates, columns = [ 'A' , 'B' , 'C' , 'D' , 'E' ]) print(dates) data
DatetimeIndex(['2021-05-05', '2021-05-06', '2021-05-07', '2021-05-08', '2021-05-09', '2021-05-10', '2021-05-11', '2021-05-12', '2021-05-13', '2021-05-14'], dtype='datetime64[ns]', freq='D')
Basic Operations
Swapping two columns in DataFrame. In the code below, we are swapping column A with column B.
data[['B', 'A']] = data[['A', 'B']]
data[['A', 'B']]
Slicing operator
Some important operations used while slicing the data.
# skips first three rows from the dataframe data[ 3 :] # selects first three rows from the dataframe data[: 3 ] # selects from fourth to the seventh row from the dataframe data[ 3 : 7 ] # selects last two rows from the dataframe data[ -2 :] # selects all the rows except the last two rows from the dataframe data[: -2 ]
# select every second row starting from fourth-row data[ 3 :: 2 ]
# reverse all the rows data[:: -1 ]
Let us make the following changes before studying further operations.
data[ 'Val' ] = [ 'a' , 'b' , 'c' , 'd' , 'e' , 'f' , 'g' , 'h' , 'i' , 'j' ] for i in range(len(data)): data[ 'date' ] = data.index[i] data.index = [ 0 , 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 ]
data
We are adding a new column named “Val” and a new column as “date”, it will contain an index of the row and replace the index with normal numbers.
# selecting sub-dataframe from second to fifth row and from third to fifth column data.iloc[ 1 : 5 , 2 : 5 ]
Selection by label
# selecting all rows between 1 and 6 data.loc[ 1 : 6 ]
Selection by callable
# printing all values of data['A'] whose value is greater than 0 data[ 'A' ].loc[ lambda s : s > 0 ]
0 0.540595 1 1.939366 3 0.844723 4 0.106978 7 1.707557 Name: A, dtype: float64
# print all rows from dataframe where data['B'] is greater than zero data.loc[ lambda data : data[ 'B' ] > 0 , :]
Concatenate dataframe
Let us create DataFrames to illustrate merge, join, concatenate and compare operations.
Creating DataFrames
d1 = pd.DataFrame( { "X" : [ "X1" , "X2" , "X3" , "X4" ], "Y" : [ "Y1" , "Y2" , "Y3" , "Y4" ], "Z" : [ "Z1" , "Z2" , "Z3" , "Z4" ], }, index = [ 1 , 2 , 3 , 4 ] ) d2 = pd.DataFrame( { "X" : [ "X5" , "X6" , "X7" , "X8" ], "Y" : [ "Y5" , "Y6" , "Y7" , "Y8" ], "Z" : [ "Z5" , "Z6" , "Z7" , "Z8" ], }, index = [ 5 , 6 , 7 , 8 ] ) d3 = pd.DataFrame( { "X" : [ "X9" , "X10" , "X11" , "X12" ], "Y" : [ "Y9" , "Y10" , "Y11" , "Y12" ], "Z" : [ "Z9" , "Z10" , "Z11" , "Z12" ], }, index = [ 9 , 10 , 11 , 12 ] )
Concatenating DataFrames
We will concatenate all the three DataFrames in a single DataFrame.
res = pd.concat([d1, d2, d3]) res
Concatenating dataframe using key values so that we can easily identify each chunk by a key value.
frames = [d1, d2, d3] res = pd.concat(frames, keys = [ "a" , "b" , "c" ]) res
Now the res DataFrame is multi-indexed where key-value separated the each concatenated DataFrame.
MultiIndex([('a', 1), ('a', 2), ('a', 3), ('a', 4), ('b', 5), ('b', 6), ('b', 7), ('b', 8), ('c', 9), ('c', 10), ('c', 11), ('c', 12)], )
Let us see the DataFrame mapped with “a” value.
Joins
Let us create an additional DataFrame for illustration purpose.
d4 = pd.DataFrame( { "X" : [ "X2" , "X5" , "X6" ], "Y" : [ "Y2" , "Y5" , "Y6" ], "W" : [ "W2" , "W5" , "W6" ] }, index = [ 2 , 5 , 6 ] ) d4
First, we will perform outer join (default in the concatenate function) for axis = 1 and axis = 0.
An outer join will take a union of the datasets. Axis = 1 denotes the union of DataFrame with respect to column labels, and axis = 0 denotes the union of DataFrame with respect to row labels (index).
Outer join, axis = 1
pd.concat([d1, d4], axis = 1 , join = "outer" )
Outer join, axis = 0
pd.concat([d1, d4], axis = 0 , join = "outer" )
Now, we will perform inner join for axis = 1 and axis = 0.
An inner join will take an intersection of the datasets. Axis = 1 denotes the intersection of DataFrame with respect to column labels, and axis = 0 denotes the intersection of DataFrame with respect to row labels (index).
Inner join, axis = 1
pd.concat([d1, d4], axis = 1 , join = "inner" )
Inner join, axis = 1
pd.concat([d1, d4], axis = 0 , join = "inner" )
Working with text data
There are two ways to store text data in pandas.
Object
StringDtype
tdata = pd.Series([ "hello" , "world" , "happy" , "coding" ]) tdata
0 hello 1 world 2 happy 3 coding dtype: object
tdata = pd.Series([ "hello" , "world" , "happy" , "coding" ], dtype = "string" ) tdata
0 hello 1 world 2 happy 3 coding dtype: string
We can also use dtype = pd.StringDtype().
String methods
# converts all strings to upper case tdata.str.upper() # converts all strings to lowercase tdata.str.lower() # returns the length of each string tdata.str.len()
Output for the .str.len() function.
Index Cleaning
We will create index of strings for illustration purpose.
idx = pd.Index([ " hello i am soham " , " hello world" , "happy coding !" , "code everyday! " ], dtype = "string" )
We can see that the above index data has trailing as well as leading whitespaces. We will use the “.str.strip()” function to remove the extra spaces.
Index(['hello i am soham', 'hello world', 'happy coding !', 'code everyday!'], dtype='object')
Similarly to remove leading extra spaces use the ".str.lstrip()" function, and "str.rstrip()" function to remove trailing extra spaces.
Splitting and replacing strings
sdata = pd.Series([ " hello i am soham " , " hello world" , "happy coding !" , "code everyday! " ], dtype = "string" )
sdata
0 hello i am soham 1 hello world 2 happy coding ! 3 code everyday! dtype: string
Let us split every string in the Series wherever spaces occur using the ".str.split()" function.
0 [, hello, i, am, soham, ] 1 [, hello, world] 2 [happy, coding, !] 3 [code, everyday!, , , , ] dtype: object
Let us replace every space in the Series with “_” (underscore) using ".str.replace()" function.
sdata.str.replace( " " , "_" )
0 _hello_i_am_soham_ 1 _hello_world 2 happy_coding_! 3 code_everyday!____ dtype: string
Let us modify the sdata using the function we have seen till now. Changing the entire sdata to uppercase, replacing spaces with “_” (underscore), and removing trailing and leading spaces.
sdata.str.strip().str.replace( " " , "_" ).str.upper()
0 HELLO_I_AM_SOHAM 1 HELLO_WORLD 2 HAPPY_CODING_! 3 CODE_EVERYDAY! dtype: string
Plots
Below is a simple plot of DataFrame.
Normal plot
p_df = pd.DataFrame(np.random.rand( 15 , 4 ), columns=[ "A" , "B" , "C" , "D" ]) p_df.plot(figsize=( 15 , 6 ))
Bar plot
For further plots I will use matplotlib.
import matplotlib.pyplot as plt
# creating subplots using pyplot
figure, axes = plt.subplots( 2 , 2 )
# setting height and width of plot figure.set_figheight( 12 ) figure.set_figwidth( 15 )
# plot[0][0] simple bar plot p_df.plot.bar(ax = axes[ 0 ][ 0 ])
# plot[0][1] stacked bar plot p_df.plot.bar(stacked = True , color = [ "red" , "orange" , "green" , "darkblue" ], ax = axes[ 0 ][ 1 ])
# plot[1][0] horizontal bar plot p_df.plot.barh(ax = axes[ 1 ][ 0 ])
# plot[1][1] horizontal stacked bar plot p_df.plot.barh(stacked = True , ax = axes[ 1 ][ 1 ])
Histogram
We will create a DataFrame for illustration purpose.
p_df = pd.DataFrame( { "a" : np.random.randn( 1000 ) + 1 , "b" : np.random.randn( 1000 ), "c" : np.random.randn( 1000 ) - 1 , }, columns=[ "a" , "b" , "c" ], )
figure, axes = plt.subplots( 2 , 2 ) figure.set_figheight( 12 ) figure.set_figwidth( 15 )
# simple histogram p_df.plot.hist(alpha= 0.3 , ax = axes[ 0 ][ 0 ])
# stacked histogram p_df.plot.hist(stacked = True , ax = axes[ 0 ][ 1 ])
# horizontal histogram p_df.plot.hist(orientation = "horizontal" , ax=axes[ 1 ][ 0 ], cumulative = True , color = "red" , alpha = 0.4 )
# histogram using .diff() method for “a” column p_df[ "a" ].diff().hist()
Scatter plot
We will create a DataFrame for illustration purpose.
p_df = pd.DataFrame(np.random.rand( 100 , 4 ), columns=[ "A" , "B" , "C" , "D" ])
figure, axes = plt.subplots( 2 , 2 ) figure.set_figheight( 12 ) figure.set_figwidth( 15 )
# scatter plot A vs B p_df.plot.scatter(x = "A" , y= "B" , color = "red" , ax = axes[ 0 ][ 0 ])
# scatter plot A vs B
# The keyword C may be given as the name of a column to provide colors for each point p_df.plot.scatter(x= "A" , y= "B" , c= "C" , s= 50 , ax = axes[ 0 ][ 1 ])
# scatter plot A vs B p_df.plot.scatter(x= "A" , y= "B" , c= "C" , cmap= "viridis" , s= 50 , ax = axes[ 1 ][ 0 ])
# scatter plot A vs B p_df.plot.scatter(x= "A" , y= "B" , s=p_df[ "C" ] * 200 , ax = axes[ 1 ][ 1 ])
Hexbin plots
p_df = pd.DataFrame(np.random.randn( 500 , 2 ), columns=[ "a" , "b" ]) p_df[ "b" ] = p_df[ "b" ] + np.arange( 500 ) p_df.plot.hexbin(x= "a" , y= "b" , gridsize= 25 );
FAQs
Why are pandas faster? Pandas are faster because they used Numpy under the hood.
Are pandas series immutable? Series are size immutable.
How many data structures are there in pandas? There are two data structures in pandas i.e., Series and DataFrame.
Key Takeaways
In this article, we have learned some advanced functions in pandas that we come across while preprocessing and analyzing the data.
Data manipulation using pandas.
Handling textual data.
Different graphs.
Want to learn more about Machine Learning ? Here is an excellent course that can guide you in learning. Check out this problem - Largest Rectangle in Histogram
Happy Coding!