Table of contents
1. Introduction
2. Features
3. Installation
4. What type of data do Pandas handle?
5. Reading and writing tabular data
6. Selecting subset and dropping a column of a DataFrame
7. Creating plots in pandas
8. Creating new columns from existing ones
9. Calculating summary statistics
10. Reshaping the layout of the tables
11. Manipulating textual data
12. FAQs
13. Key Takeaways
Last Updated: Mar 27, 2024

Getting started with Pandas

Author: Soham Medewar

Introduction

Pandas is a Python library for analyzing, cleaning, and manipulating data. The name is derived from “panel data”, an econometric term for multidimensional structured datasets, and is also a play on the phrase “Python data analysis”. It provides data structures for working with numerical tables and time series.

Features

1. Handling data

As a data analysis library, pandas provides a fast and efficient way to explore and manage data through its two data structures, Series and DataFrame.

2. Input and output tools

It has a wide range of input and output tools that make data reading and data writing much easier and faster.

3. Visualization

Visualization is an important part of data science: spotting patterns and trends usually requires plotting the data. Pandas has built-in plotting methods that help to visualize a dataset.

4. Performing mathematical operations

Transforming data statistically requires many mathematical operations. Pandas provides built-in functions for most of them, from simple arithmetic to more complex calculations.

Installation

For Conda 

conda install pandas

For a specific version in conda (replace "version" with the release you need)

conda install pandas=version

For Ubuntu

sudo apt-get install python3-pandas

For pip (when running the command from a Jupyter notebook cell, prefix it with "!")

pip install pandas

You can install pandas on your local machine in any of these ways.

After the installation process, import the pandas library to start using it.

Importing library

import pandas as pd
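
A quick way to confirm that the installation worked is to print the installed version. This is a minimal check, assuming pandas was installed into the same Python environment you are running:

```python
import pandas as pd

# If the import succeeds, pandas is installed; __version__ is the release string.
installed_version = pd.__version__
print(installed_version)
```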

What type of data do Pandas handle?

Pandas supports two data structures: Series and DataFrame. A Series is a one-dimensional labeled array that can hold data of any type (int, float, string, Python object, etc.). A DataFrame is a two-dimensional data structure arranged in rows and columns. It is a heterogeneous, tabular structure with labeled axes (rows and columns), so each column can hold a different type of data.

DataFrame representation in pandas.


Let us take an example of a Series and DataFrame data structure.

data = pd.Series([1, 2, 3, 4, 5, 6, 7], index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])
data

data = pd.DataFrame({
    'State' : ['Maharashtra', 'West Bengal', 'Uttar Pradesh', 'Jharkhand', 'Telangana'],
    'Capital' : ['Mumbai', 'Kolkata', 'Lucknow', 'Ranchi', 'Hyderabad'],
    'Average Literacy Rate' : [84.8, 80.5, 73.0, 74.3, 72.8]
})
data

Getting information about DataFrame.

data.info()
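
To make the relationship between the two structures concrete, here is a small sketch showing label-based lookup on a Series and that a single DataFrame column is itself a Series. The variable names (marks, states, rates) are illustrative and not part of the article's dataset.

```python
import pandas as pd

# A Series: one-dimensional values with a label for each position.
marks = pd.Series([1, 2, 3, 4, 5, 6, 7], index=list("abcdefg"))
third = marks["c"]          # look up one value by its label
subset = marks[["a", "g"]]  # look up several labels at once

# A DataFrame: each column may have its own data type.
states = pd.DataFrame({
    "State": ["Maharashtra", "West Bengal"],
    "Average Literacy Rate": [84.8, 80.5],
})
rates = states["Average Literacy Rate"]  # a single column is a Series

print(third)
print(type(rates).__name__, rates.dtype)
print(states.shape)  # (number of rows, number of columns)
```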

Reading and writing tabular data

The most common way of reading tabular data is from a .csv file, using the “read_csv” function.

data = pd.read_csv("goodreads.csv")

To write the DataFrame into the .csv file we use the “to_csv” command.

data.to_csv("dataframestored.csv")

A file named “dataframestored.csv” is created in the current working directory.

Similar functions such as “read_excel” and “to_excel” read and write data from an Excel file.
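
One detail worth knowing when writing a .csv file: by default, “to_csv” also writes the row index as an extra first column. The sketch below uses an in-memory string instead of a real file so that it runs anywhere; the small frame DataFrame is made up for illustration.

```python
import io
import pandas as pd

frame = pd.DataFrame({"title": ["A", "B"], "pages": [100, 250]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
with_index = frame.to_csv()                # row index written as an unnamed column
without_index = frame.to_csv(index=False)  # row index left out

# Reading the first version back produces an extra "Unnamed: 0" column.
reloaded = pd.read_csv(io.StringIO(with_index))
clean = pd.read_csv(io.StringIO(without_index))

print(list(reloaded.columns))
print(list(clean.columns))
```

Passing index=False is usually what you want when the index is just the default row numbering.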

Selecting subset and dropping a column of a DataFrame

A dataset usually contains some columns that are needed and others that are not, so for analysis we often select a subset of the DataFrame. For model training, we may also have to drop some columns. The code snippets below show how to select and drop columns of a DataFrame.

Here we select four columns from the data DataFrame.

sub_data = data[['title', 'language', 'pages', 'no_ratings']]
sub_data.head(4)

We are dropping a few columns for example purposes.

data = data.drop(['description', 'awards', 'author'], axis=1)
data.head(4)
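
A subset can also be selected by rows. The sketch below, on a small made-up books DataFrame (not the goodreads file), combines a row condition with a column list using “.loc”, and shows that “drop” returns a new DataFrame rather than changing the original.

```python
import pandas as pd

books = pd.DataFrame({
    "title": ["A", "B", "C"],
    "language": ["English", "Hindi", "English"],
    "pages": [120, 340, 560],
})

# Double brackets select several columns and return a DataFrame.
two_cols = books[["title", "pages"]]

# A boolean condition selects rows; .loc combines it with a column list.
long_english = books.loc[
    (books["language"] == "English") & (books["pages"] > 200),
    ["title", "pages"],
]

# drop() returns a new DataFrame; `books` itself is unchanged.
trimmed = books.drop(columns=["language"])

print(long_english)
print(list(books.columns), list(trimmed.columns))
```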

Creating plots in pandas

We will create a DataFrame with a height column and a weight column.

data = pd.DataFrame({
    'Height (cm)' : [175, 150, 198, 122, 153, 159, 164, 179, 181, 133, 145],
    'Weight (kgs)' : [55, 60, 85, 36, 52, 45, 88, 65, 73, 55, 60]
})

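A basic line plot of the DataFrame: the x-axis shows the row index, and the y-axis shows the values of each column. Plotting in pandas relies on the Matplotlib library, so it must be installed as well.

figsize=(7, 4) is an optional argument that sets the width and height of the figure in inches.

data.plot(figsize=(7, 4))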

A line plot of the Height column only, where the x-axis represents the index and the y-axis represents height.

color='magenta' sets the line color in the plot.

data['Height (cm)'].plot(ylabel="Height", xlabel="Index", figsize=(7, 4), color='magenta')

A scatter plot of the Height and Weight columns: the x-axis represents the Height column and the y-axis represents the Weight column.

data.plot.scatter(x="Height (cm)", y="Weight (kgs)", figsize=(7, 4), color='darkblue', marker='*')

A box plot for the entire DataFrame.

data.plot.box(figsize=(7, 4))

Two area subplots for Height and Weight.

data.plot.area(figsize=(7, 4), subplots=True)

Creating new columns from existing ones

Let us consider the DataFrame of height and weight from the previous example. We will add an extra column to the DataFrame: weight divided by height. We will then use that column to calculate the Body Mass Index (BMI) and add it to the DataFrame as well.

data['Division of weight by height'] = data['Weight (kgs)'] / data['Height (cm)']

data

Now we will calculate the Body Mass Index for each row. The formula for Body Mass Index is weight (kg) / height² (m²). Since the height is stored in centimetres, dividing the weight by the height twice and multiplying by 10000 (that is, 100²) gives the same result.

data["BMI"] = (data['Division of weight by height']*10000) / data['Height (cm)']
data
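
The same BMI calculation can also be written in one step by converting the height to metres first. This is a minimal sketch on the first three rows of the height and weight data; the names people and height_m are illustrative.

```python
import pandas as pd

people = pd.DataFrame({
    "Height (cm)": [175, 150, 198],
    "Weight (kgs)": [55, 60, 85],
})

# Convert centimetres to metres, then apply weight / height**2 to every row.
height_m = people["Height (cm)"] / 100
people["BMI"] = people["Weight (kgs)"] / height_m**2

# The article's formula: (weight / height_cm) * 10000 / height_cm.
via_article_formula = (
    (people["Weight (kgs)"] / people["Height (cm)"]) * 10000 / people["Height (cm)"]
)

print(people.round(1))
```

Both versions are element-wise: the arithmetic is applied to whole columns at once, with no explicit loop over the rows.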

Calculating summary statistics

Let us calculate summary statistics for the DataFrame above.

The average height and the average weight, using the “.mean()” method.

print(data['Height (cm)'].mean())
print(data['Weight (kgs)'].mean())

 

159.9090909090909
61.27272727272727

The median weight and the median height of the DataFrame using the “.median()” method.

data[['Weight (kgs)', 'Height (cm)']].median()

 

Weight (kgs)     60.0
Height (cm)     159.0
dtype: float64

Let us calculate several summary statistics for multiple columns at once using the “.describe()” method.

data[['Weight (kgs)', 'Height (cm)', 'BMI']].describe()

We can also specify a combination of aggregating statistics for multiple columns simultaneously using the “.agg()” method.

data.agg(
    {
        "Height (cm)" : ["mean", "std", "max"],
        "Weight (kgs)" : ["median", "count", "max"],
        "BMI" : ["mean", "count", "max"]
    }
)

Let us add another column to the existing DataFrame to illustrate the groupby function.

data["Gender"] = ["Boy", "Girl", "Girl", "Boy", "Girl", "Boy", "Girl", "Girl", "Boy", "Boy", "Boy"]
data.head(4)

Now we will calculate the mean using the "groupby()" function. The mean is calculated separately for boys and girls.

data.groupby("Gender")["Height (cm)"].mean()

 

Gender
Boy     152.5
Girl    168.8
Name: Height (cm), dtype: float64

Counting the number of records by category. Here we are counting the total number of girls and boys in the DataFrame.

data["Gender"].value_counts()

 

Boy     6
Girl    5
Name: Gender, dtype: int64
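
“groupby()” and “.agg()” can be combined to get several statistics per group in a single table. The sketch below rebuilds the height, weight, and gender columns from this section so that it runs on its own.

```python
import pandas as pd

data = pd.DataFrame({
    "Height (cm)": [175, 150, 198, 122, 153, 159, 164, 179, 181, 133, 145],
    "Weight (kgs)": [55, 60, 85, 36, 52, 45, 88, 65, 73, 55, 60],
    "Gender": ["Boy", "Girl", "Girl", "Boy", "Girl", "Boy",
               "Girl", "Girl", "Boy", "Boy", "Boy"],
})

# One row per gender, one column per requested statistic.
summary = data.groupby("Gender")["Height (cm)"].agg(["mean", "min", "max", "count"])
print(summary)
```

The mean and count columns match the separate “groupby().mean()” and “value_counts()” results shown above.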

Reshaping the layout of the tables

Sorting the table by a specific column. Here we sort the table by the height column in ascending order.

data.sort_values(by="Height (cm)").head(11)

Sorting the same table in descending order.

data.sort_values(by="Height (cm)", ascending=False).head(11)
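
“sort_values” also accepts a list of columns, with a separate sort direction for each. A minimal sketch on a four-row sample (the variable names ordered and renumbered are illustrative):

```python
import pandas as pd

data = pd.DataFrame({
    "Gender": ["Boy", "Girl", "Girl", "Boy"],
    "Height (cm)": [175, 150, 198, 122],
})

# Sort by gender first (A to Z), then by height within each gender, tallest first.
ordered = data.sort_values(by=["Gender", "Height (cm)"], ascending=[True, False])

# sort_values keeps the original row labels; reset_index renumbers them from 0.
renumbered = ordered.reset_index(drop=True)

print(ordered)
print(renumbered)
```

Note that sorting returns a new DataFrame and leaves data itself in its original order.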

Manipulating textual data

Let us add a name column to the DataFrame to understand manipulation of textual data.

data['Name'] = ["Ganesh", "Sanskruti", "Tulsi", "Mayank", "Bhanu", "Shrihari", "Soma", "Divya", "Rishab", "Rajat", "Debojeet"]
data

Printing all the students' names in uppercase using the “str.upper()” function.

data["Name"].str.upper()

 

0        GANESH
1     SANSKRUTI
2         TULSI
3        MAYANK
4         BHANU
5      SHRIHARI
6          SOMA
7         DIVYA
8        RISHAB
9         RAJAT
10     DEBOJEET
Name: Name, dtype: object

Similarly, we can use the “.str.lower()” function to convert all the strings to lowercase.

We can also check whether a particular name appears in the Name column of the DataFrame. Each row is marked True if its name contains the given text, and False otherwise.

data["Name"].str.contains("Rishab")

 

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8      True
9     False
10    False
Name: Name, dtype: bool
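
The True/False result of “str.contains” is most useful as a row filter. The sketch below uses five of the names from this section; the search text "sh" is an arbitrary choice for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["Ganesh", "Sanskruti", "Tulsi", "Rishab", "Rajat"],
    "Height (cm)": [175, 150, 198, 181, 133],
})

# str.contains treats the pattern as a regular expression by default;
# case=False makes the match ignore upper/lower case.
mask = data["Name"].str.contains("sh", case=False)
matches = data[mask]  # keep only the rows where the mask is True

# Other string methods work element-wise in the same way.
starts_with_r = data["Name"].str.startswith("R")
name_lengths = data["Name"].str.len()

print(matches)
print(name_lengths.tolist())
```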

FAQs

  1. How much memory can pandas use?
    Pandas loads the whole dataset into memory, so it is limited by the available RAM. It is very efficient with small data (usually from 100MB up to 1GB), where performance is rarely a concern.
     
  2. Are pandas DataFrames thread-safe?
    No, pandas DataFrames are not thread-safe.
     
  3. Can a Pandas series object hold data of different types?
    Yes. A pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).
     
  4. Is pandas multithreaded?
    No. By default, pandas executes its functions as a single process using a single CPU core. That works just fine for smaller datasets, where the difference in speed is rarely noticeable.
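
The answers to questions 1 and 3 can be checked directly. This is a small sketch with made-up values: a Series with mixed values falls back to the generic object type, and “memory_usage” reports how many bytes each column of a DataFrame occupies.

```python
import pandas as pd

# Mixed Python values are stored with the generic "object" dtype.
mixed = pd.Series([1, "two", 3.0])
# Values of a single numeric type get a compact numeric dtype.
numbers = pd.Series([1, 2, 3])
print(mixed.dtype, numbers.dtype)

# memory_usage(deep=True) reports bytes per column, including string contents.
frame = pd.DataFrame({"n": range(1000), "s": ["x"] * 1000})
usage = frame.memory_usage(deep=True)
print(usage)
```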

Key Takeaways

  • In this article, we have seen the installation of pandas.
  • Pandas data structure format.
  • Basic functions with examples.



Happy Coding!
