Do you think IIT Guwahati certified course can help you in your career?
No
Introduction
Pandas is a python library that is used to manipulate datasets. There are lots of functions in Pandas, and remembering every one of them is quite a challenging task. In this article, we will learn some essential and most used functions used while preparing a model.
Reading datasets
The first thing we need to do while preparing any machine learning model is to read the dataset. Below are some techniques to read the dataset.
Reading .csv data file.
data = pd.read_csv("filename.csv")
Reading .json data.
data = pd.read_json("filename.json")
Reading .html data.
data = pd.read_html("filename.html")
In all the above codes, data variable stores the dataset. There are other formats to read and write files they are as follows:
XML file: read_xml() and to_xml()
Pickle file: read_pickle() and to_pickle()
SQL file: read_sql() and to_sql()
HTML file: read_html() and to_html()
Indexing and selecting data
We will create simple time-series data to illustrate indexing functionality.
For creating the time-series data, I am using the numpy library.
Importing libraries
import pandas as pd import numpy as np
Creating data
dates = pd.date_range('5/5/2021', periods = 10) data = pd.DataFrame(np.random.randn(10, 5), index = dates, columns = ['A', 'B', 'C', 'D', 'E']) print(dates) data
Swapping two columns in DataFrame. In the code below, we are swapping column A with column B.
data[['B', 'A']] = data[['A', 'B']]
data[['A', 'B']]
Slicing operator
Some important operations used while slicing the data.
# skips first three rows from the dataframe data[3:]
# selects first three rows from the dataframe data[:3]
# selects from fourth to the seventh row from the dataframe data[3:7]
# selects last two rows from the dataframe data[-2:]
# selects all the rows except the last two rows from the dataframe data[:-2]
# select every second row starting from fourth-row data[3::2]
# reverse all the rows data[::-1]
Let us make the following changes before studying further operations.
data['Val'] = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] for i in range(len(data)): data['date'] = data.index[i] data.index = [0,1,2,3,4,5,6,7,8,9]
data
We are adding a new column named “Val” and a new column as “date”, it will contain an index of the row and replace the index with normal numbers.
# selecting sub-dataframe from second to fifth row and from third to fifth column data.iloc[1:5, 2:5]
Selection by label
# selecting all rows between 1 and 6 data.loc[1: 6]
Selection by callable
# printing all values of data['A'] whose value is greater than 0 data['A'].loc[lambda s : s > 0]
First, we will perform outer join (default in the concatenate function) for axis = 1 and axis = 0.
An outer join will take a union of the datasets. Axis = 1 denotes the union of DataFrame with respect to column labels, and axis = 0 denotes the union of DataFrame with respect to row labels (index).
Outer join, axis = 1
pd.concat([d1, d4], axis = 1, join = "outer")
Outer join, axis = 0
pd.concat([d1, d4], axis = 0, join = "outer")
Now, we will perform inner join for axis = 1 and axis = 0.
An inner join will take an intersection of the datasets. Axis = 1 denotes the intersection of DataFrame with respect to column labels, and axis = 0 denotes the intersection of DataFrame with respect to row labels (index).
# converts all strings to upper case tdata.str.upper()
# converts all strings to lowercase tdata.str.lower()
# returns the length of each string tdata.str.len()
Output for the .str.len() function.
0 5 1 5 2 5 3 6 dtype: Int64
Index Cleaning
We will create index of strings for illustration purpose.
idx = pd.Index([" hello i am soham ", " hello world", "happy coding !", "code everyday! "], dtype = "string")
We can see that the above index data has trailing as well as leading whitespaces. We will use the “.str.strip()” function to remove the extra spaces.
idx.str.strip()
Index(['hello i am soham', 'hello world', 'happy coding !', 'code everyday!'], dtype='object')
Similarly to remove leading extra spaces use the ".str.lstrip()" function, and "str.rstrip()" function to remove trailing extra spaces.
Splitting and replacing strings
sdata = pd.Series([" hello i am soham ", " hello world", "happy coding !", "code everyday! "], dtype = "string")
sdata
0 hello i am soham 1 hello world 2 happy coding ! 3 code everyday! dtype: string
Let us split every string in the Series wherever spaces occur using the ".str.split()" function.
Let us modify the sdata using the function we have seen till now. Changing the entire sdata to uppercase, replacing spaces with “_” (underscore), and removing trailing and leading spaces.
# scatter plot A vs B p_df.plot.scatter(x = "A", y="B", color = "red", ax = axes[0][0])
# scatter plot A vs B
#​​ The keyword C may be given as the name of a column to provide colors for each point p_df.plot.scatter(x="A", y="B", c="C", s=50, ax = axes[0][1])
# scatter plot A vs B p_df.plot.scatter(x="A", y="B", c="C", cmap="viridis", s=50, ax = axes[1][0])
# scatter plot A vs B p_df.plot.scatter(x="A", y="B", s=p_df["C"] * 200, ax = axes[1][1])