Introduction
The name pandas is derived from "panel data" and is also a play on "Python data analysis". It is a fast, flexible, and easy-to-use open-source data analysis and manipulation library built on top of Python. It can import data from a variety of file formats, and it offers dedicated data structures and operations for working with numerical tables and time-series data.
Pandas provides two main data structures:
- Pandas Series
- Pandas DataFrame
Before moving on to these two data structures, let's take a look at the installation process in case you don't have pandas installed on your machine.
Pandas is a Python library, so you need Python installed on your system first. You can then install it either with pip, the Python package installer (pip install pandas), or with conda (conda install pandas). The Anaconda distribution already includes pandas along with other commonly used libraries. With that, you should have pandas installed!
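A quick way to confirm that the installation worked is to import the library and print its version. This is only a minimal check, and the version string you see depends on what was installed on your machine.

```python
# Minimal installation check: the import raises ImportError if pandas is missing.
import pandas as pd

installed_version = pd.__version__  # version string of the installed pandas
print("pandas version:", installed_version)
```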
Now let's continue with the two data structures mentioned above.
Pandas Series
A pandas Series is a one-dimensional labeled array capable of storing any data type. You can think of it as a single column in an Excel sheet whose rows are labeled. The labels need not be unique, but they must be of a hashable type.
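To make the point about non-unique labels concrete, here is a small sketch. The values and labels are made up for illustration; the only thing it aims to show is what a lookup returns when a label is repeated.

```python
import pandas as pd

# Labels may repeat; here the label 'a' appears twice (illustrative data).
dup = pd.Series([10, 20, 30], index=['a', 'b', 'a'])

# Selecting a repeated label returns every matching element as a Series.
matches = dup.loc['a']
print(matches)

# Selecting a label that occurs once returns a single value.
single = dup.loc['b']
print(single)
```

Repeated labels are allowed, but a lookup on them no longer returns a single value, which is worth keeping in mind when choosing an index.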
Let's take a look at a simple example and compare it with a NumPy array.
import pandas as pd
import numpy as np
s1 = pd.Series([10,20,30,40,50]) #series is indexed
s2 = np.array([1,2,3,4,5])
print("s1:")
print(s1)
print("\ns2:")
print(s2)
We can replace the default integer index by passing explicit labels through the index argument when defining the Series. A pandas Series can also be created from a dictionary or a list; with a dictionary, the keys become the index labels.
# changing the index labels
s3 = pd.Series([1,2,3], index=['a','b','c'])
print("s3:")
print(s3)
# We can also create series object from a dictionary
s4 = pd.Series({'a':1, 'b':2, 'c':3})
print("s4:")
print(s4)
Elements in Series can be easily accessed by either their position or by using a label if it's defined.
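As a minimal sketch of both access styles, the example below rebuilds a small labeled Series and reads from it by position and by label. Using .iloc for positions and .loc for labels is one way to keep the intent explicit; the variable names are illustrative.

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

first_by_position = s.iloc[0]   # position-based access
second_by_label = s.loc['b']    # label-based access
subset = s.loc[['a', 'c']]      # passing several labels returns a Series

print(first_by_position, second_by_label)
print(subset)
```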
Pandas DataFrame
A pandas DataFrame is a two-dimensional, labeled, size-mutable, and potentially heterogeneous data structure. It is like an Excel sheet: the data is laid out in rows and columns, and the DataFrame provides numerous functions to extract, analyze, and manipulate that data.
Let's take a look at a simple example of creating a DataFrame.
pd.DataFrame({"Name":['Ritik', 'Suveer', 'Aman'], "Marks":[100,99,99.5]})
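The terms "heterogeneous" and "size-mutable" can be illustrated with the same data. The sketch below inspects the per-column dtypes and then adds a column in place; the "Passed" column and its threshold are hypothetical and only there for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ['Ritik', 'Suveer', 'Aman'], "Marks": [100, 99, 99.5]})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # each column carries its own dtype

# Size-mutable: a new column can be added to the existing DataFrame.
# "Passed" and the 99.5 threshold are made up for this example.
df["Passed"] = df["Marks"] >= 99.5
print(df)
```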
Creating a pandas DataFrame from a list of lists:
my_list = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16],
           [17, 18, 19, 20]]
df = pd.DataFrame(my_list)
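When a DataFrame is built from a list of lists, each inner list becomes a row and the columns get default integer labels. A possible variation is to pass column names through the columns argument; the names used here are hypothetical.

```python
import pandas as pd

my_list = [[1, 2, 3, 4],
           [5, 6, 7, 8]]

# Without `columns`, pandas labels the columns 0, 1, 2, 3.
# The names below are chosen only for illustration.
df_named = pd.DataFrame(my_list, columns=['w', 'x', 'y', 'z'])

print(df_named)
print(df_named['y'])   # select one column by its label
```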
Moving forward, let's look at some simple functions that are helpful when working with rows and columns while analyzing data.
You can download the dataset from https://www.kaggle.com/imakash3011/customer-personality-analysis?select=marketing_campaign.csv
Reading Data:
data = pd.read_csv('marketing_campaign.csv', sep='\t')  # the file is tab-separated
# print(data) would print the complete data
data.head()    # shows the first 5 rows
data.head(10)  # shows the first 10 rows
Similarly, data.tail() shows the last rows of the data (five by default).
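If you don't have the dataset at hand, the same calls can be tried on a small in-memory table. The sketch below uses made-up tab-separated text in place of the real file, so the column names and values are assumptions and not taken from the Kaggle dataset.

```python
import io
import pandas as pd

# Hypothetical tab-separated data standing in for a file on disk.
raw = "ID\tIncome\n1\t58138\n2\t46344\n3\t71613\n4\t26646\n5\t58293\n6\t62513\n7\t55635\n"

# read_csv accepts a file-like object as well as a path; sep='\t' splits on tabs.
sample = pd.read_csv(io.StringIO(raw), sep='\t')

print(sample.head())    # first 5 rows by default
print(sample.tail(2))   # last 2 rows
print(sample.shape)     # (rows, columns)
```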
We can also read JSON or HTML data by passing a URL to the respective function, for example:
json_data = pd.read_json('URL_of_JSON')
html_data = pd.read_html('html_url')
Note that pd.read_html returns a list of DataFrames, one for each table found on the page.
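To try pd.read_json without a network connection, you can pass it a file-like object instead of a URL. The JSON text below is made up for illustration; a real URL or file path would be used in its place.

```python
import io
import pandas as pd

# Hypothetical JSON records; in practice this could come from a URL or a file.
json_text = '[{"name": "a", "value": 1}, {"name": "b", "value": 2}]'

# Each JSON object becomes one row, and its keys become the column names.
json_data = pd.read_json(io.StringIO(json_text))
print(json_data)
```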