Table of contents
1.
Introduction
2.
Reading datasets
3.
Indexing and selecting data
3.1.
Basic Operations
3.2.
Slicing operator
3.3.
Selection by label
3.4.
Selection by callable
4.
Concatenate dataframe
4.1.
Concatenating DataFrames
4.2.
Joins
5.
Working with text data
6.
String methods
6.1.
Index Cleaning
6.2.
Splitting and replacing strings
7.
Plots
7.1.
Normal plot
7.2.
Bar plot
7.3.
Histogram
7.4.
Scatter plot
7.5.
Hexbin plots
7.6.
FAQs
7.7.
Key Takeaways
Last Updated: Mar 27, 2024

Advanced functions in Pandas

Author soham Medewar
0 upvote
Career growth poll
Do you think IIT Guwahati certified course can help you in your career?

Introduction

Pandas is a python library that is used to manipulate datasets. There are lots of functions in Pandas, and remembering every one of them is quite a challenging task. In this article, we will learn some essential and most used functions used while preparing a model.

Reading datasets

The first thing we need to do while preparing any machine learning model is to read the dataset. Below are some techniques to read the dataset.

Reading .csv data file.

data = pd.read_csv("filename.csv")

Reading .json data.

data = pd.read_json("filename.json")

Reading .html data.

data = pd.read_html("filename.html")

In all the above codes, data variable stores the dataset. There are other formats to read and write files they are as follows:

  • XML file: read_xml() and to_xml()
  • Pickle file: read_pickle() and to_pickle()
  • SQL file: read_sql() and to_sql()
  • HTML file: read_html() and to_html()

Indexing and selecting data

We will create simple time-series data to illustrate indexing functionality.

For creating the time-series data, I am using the numpy library.

Importing libraries

import pandas as pd
import numpy as np

Creating data

dates = pd.date_range('5/5/2021', periods = 10)
data = pd.DataFrame(np.random.randn(105),
                    index = dates,
                  columns = ['A''B''C''D''E'])
print(dates)
data

 

DatetimeIndex(['2021-05-05', '2021-05-06', '2021-05-07', '2021-05-08',
              '2021-05-09', '2021-05-10', '2021-05-11', '2021-05-12',
              '2021-05-13', '2021-05-14'],
              dtype='datetime64[ns]', freq='D')

Basic Operations

Swapping two columns in DataFrame. In the code below, we are swapping column A with column B.

data[['B', 'A']] = data[['A', 'B']]

data[['A', 'B']]

Slicing operator

Some important operations used while slicing the data.

# skips first three rows from the dataframe
data[3:]

# selects first three rows from the dataframe
data[:3]

# selects from fourth to the seventh row from the dataframe
data[3:7]

# selects last two rows from the dataframe
data[-2:]

# selects all the rows except the last two rows from the dataframe
data[:-2]

 

# select every second row starting from fourth-row
data[3::2]

# reverse all the rows
data[::-1]

Let us make the following changes before studying further operations.

data['Val'] = ['a''b''c''d''e''f''g''h''i''j'
forin range(len(data)):
    data['date'] = data.index[i]
data.index = [0,1,2,3,4,5,6,7,8,9]

data

We are adding a new column named “Val” and a new column as “date”, it will contain an index of the row and replace the index with normal numbers.

# selecting sub-dataframe from second to fifth row and from third to fifth column 
data.iloc[1:52:5]

Selection by label

# selecting all rows between 1 and 6
data.loc[16]

Selection by callable

# printing all values of data['A'] whose value is greater than 0
data['A'].loc[lambda s : s > 0]

 

0    0.540595
1    1.939366
3    0.844723
4    0.106978
7    1.707557
Name: A, dtype: float64

 

# print all rows from dataframe where data['B'] is greater than zero
data.loc[lambda data : data['B'] > 0, :]

Concatenate dataframe

Let us create DataFrames to illustrate merge, join, concatenate and compare operations.

Creating DataFrames

d1 = pd.DataFrame(
    {
        "X": ["X1","X2","X3","X4"],
        "Y": ["Y1","Y2","Y3","Y4"],
        "Z": ["Z1","Z2","Z3","Z4"],
    },
    index = [1,2,3,4]
)
d2 = pd.DataFrame(
    {
        "X": ["X5","X6","X7","X8"],
        "Y": ["Y5","Y6","Y7","Y8"],
        "Z": ["Z5","Z6","Z7","Z8"],
    },
    index = [5,6,7,8]
)
d3 = pd.DataFrame(
    {
        "X": ["X9","X10","X11","X12"],
        "Y": ["Y9","Y10","Y11","Y12"],
        "Z": ["Z9","Z10","Z11","Z12"],
    },
    index = [9,10,11,12]
)

Concatenating DataFrames

We will concatenate all the three DataFrames in a single DataFrame.

res = pd.concat([d1, d2, d3])
res

Concatenating dataframe using key values so that we can easily identify each chunk by a key value.

frames = [d1, d2, d3]
res = pd.concat(frames, keys = ["a""b""c"])
res

Now the res DataFrame is multi-indexed where key-value separated the each concatenated DataFrame.

res.index

 

MultiIndex([('a',  1),
            ('a',  2),
            ('a',  3),
            ('a',  4),
            ('b',  5),
            ('b',  6),
            ('b',  7),
            ('b',  8),
            ('c',  9),
            ('c', 10),
            ('c', 11),
            ('c', 12)],
          )

Let us see the DataFrame mapped with “a” value.

res.loc["a"]

Joins

Let us create an additional DataFrame for illustration purpose.

d4 = pd.DataFrame(
    {
        "X" : ["X2""X5""X6"],
        "Y" : ["Y2""Y5""Y6"],
        "W" : ["W2""W5""W6"]
    },
    index = [256]
)
d4

First, we will perform outer join (default in the concatenate function) for axis = 1 and axis = 0.

An outer join will take a union of the datasets. Axis = 1 denotes the union of DataFrame with respect to column labels, and axis = 0 denotes the union of DataFrame with respect to row labels (index).

Outer join, axis = 1

pd.concat([d1, d4], axis = 1, join = "outer")

Outer join, axis = 0

pd.concat([d1, d4], axis = 0, join = "outer")

Now, we will perform inner join for axis = 1 and axis = 0.

An inner join will take an intersection of the datasets. Axis = 1 denotes the intersection of DataFrame with respect to column labels, and axis = 0 denotes the intersection of DataFrame with respect to row labels (index).

Inner join, axis = 1

pd.concat([d1, d4], axis = 1, join = "inner")

Inner join, axis = 1

pd.concat([d1, d4], axis = 0, join = "inner")

Working with text data

There are two ways to store text data in pandas.

  1. Object
  2. StringDtype
tdata = pd.Series(["hello""world""happy""coding"])
tdata

 

0     hello
1     world
2     happy
3    coding
dtype: object

 

tdata = pd.Series(["hello""world""happy""coding"], dtype = "string")
tdata

 

0     hello
1     world
2     happy
3    coding
dtype: string

We can also use dtype = pd.StringDtype().

String methods

# converts all strings to upper case
tdata.str.upper()

# converts all strings to lowercase
tdata.str.lower()

# returns the length of each string
tdata.str.len()

Output for the .str.len() function.

0    5
1    5
2    5
3    6
dtype: Int64

Index Cleaning

We will create index of strings for illustration purpose.

idx = pd.Index([" hello i am soham "" hello world""happy coding !""code everyday!    "], dtype = "string")

We can see that the above index data has trailing as well as leading whitespaces. We will use the “.str.strip()” function to remove the extra spaces.

idx.str.strip()

 

Index(['hello i am soham', 'hello world', 'happy coding !', 'code everyday!'], dtype='object')

Similarly to remove leading extra spaces use the ".str.lstrip()" function, and "str.rstrip()" function to remove trailing extra spaces.

Splitting and replacing strings

sdata = pd.Series([" hello i am soham "" hello world""happy coding !""code everyday!    "], dtype = "string")

sdata

 

0     hello i am soham 
1           hello world
2        happy coding !
3    code everyday!    
dtype: string

Let us split every string in the Series wherever spaces occur using the ".str.split()" function.

sdata.str.split(" ")

 

0    [, hello, i, am, soham, ]
1             [, hello, world]
2           [happy, coding, !]
3    [code, everyday!, , , , ]
dtype: object

Let us replace every space in the Series with “_” (underscore) using ".str.replace()" function.

sdata.str.replace(" ""_")

 

0    _hello_i_am_soham_
1          _hello_world
2        happy_coding_!
3    code_everyday!____
dtype: string

Let us modify the sdata using the function we have seen till now. Changing the entire sdata to uppercase, replacing spaces with “_” (underscore), and removing trailing and leading spaces.

sdata.str.strip().str.replace(" ""_").str.upper()

 

0    HELLO_I_AM_SOHAM
1         HELLO_WORLD
2      HAPPY_CODING_!
3      CODE_EVERYDAY!
dtype: string

Plots

Below is a simple plot of DataFrame.

Normal plot

p_df = pd.DataFrame(np.random.rand(154), columns=["A""B""C""D"])
p_df.plot(figsize=(15,6))

Bar plot

For further plots I will use matplotlib.

import matplotlib.pyplot as plt

 

# creating subplots using pyplot

figure, axes = plt.subplots(2,2)

# setting height and width of plot
figure.set_figheight(12)
figure.set_figwidth(15)

# plot[0][0] simple bar plot
p_df.plot.bar(ax = axes[0][0])

# plot[0][1] stacked bar plot
p_df.plot.bar(stacked = True, color = ["red""orange""green""darkblue"], ax = axes[0][1])

# plot[1][0] horizontal bar plot
p_df.plot.barh(ax = axes[1][0])

# plot[1][1] horizontal stacked bar plot
p_df.plot.barh(stacked = True, ax = axes[1][1])

Histogram

We will create a DataFrame for illustration purpose.

p_df = pd.DataFrame(
    {
        "a": np.random.randn(1000) + 1,
        "b": np.random.randn(1000),
        "c": np.random.randn(1000) - 1,
    },
    columns=["a""b""c"],
)

 

figure, axes = plt.subplots(2,2)
figure.set_figheight(12)
figure.set_figwidth(15)

# simple histogram 
p_df.plot.hist(alpha=0.3, ax = axes[0][0])

# stacked histogram
p_df.plot.hist(stacked = True, ax = axes[0][1])

# horizontal histogram
p_df.plot.hist(orientation = "horizontal", ax=axes[1][0], cumulative = True, color = "red", alpha = 0.4)

# histogram using .diff() method for “a” column
p_df["a"].diff().hist()

Scatter plot

We will create a DataFrame for illustration purpose.

p_df = pd.DataFrame(np.random.rand(100,4), columns=["A""B""C""D"])

 

figure, axes = plt.subplots(2,2)
figure.set_figheight(12)
figure.set_figwidth(15)

# scatter plot A vs B
p_df.plot.scatter(x = "A", y="B", color = "red", ax = axes[0][0])

# scatter plot A vs B

#​​ The keyword C may be given as the name of a column to provide colors for each point
p_df.plot.scatter(x="A", y="B", c="C", s=50, ax = axes[0][1])

# scatter plot A vs B
p_df.plot.scatter(x="A", y="B", c="C", cmap="viridis", s=50, ax = axes[1][0])

# scatter plot A vs B
p_df.plot.scatter(x="A", y="B", s=p_df["C"] * 200, ax = axes[1][1])

Hexbin plots

p_df = pd.DataFrame(np.random.randn(5002), columns=["a""b"])
p_df["b"] = p_df["b"] + np.arange(500)
p_df.plot.hexbin(x="a", y="b", gridsize=25);

 

FAQs

  1. Why are pandas faster?
    Pandas are faster because they used Numpy under the hood.
     
  2. Are pandas series immutable?
    Series are size immutable.
     
  3. How many data structures are there in pandas?
    There are two data structures in pandas i.e., Series and DataFrame.

Key Takeaways

  • In this article, we have learned some advanced functions in pandas that we come across while preprocessing and analyzing the data.
  • Data manipulation using pandas.
  • Handling textual data.
  • Different graphs.


Want to learn more about Machine Learning? Here is an excellent course that can guide you in learning. 
Check out this problem - Largest Rectangle in Histogram

Happy Coding!

Live masterclass