Introduction
When working with data in Python, the pandas library is a core tool for data manipulation and analysis. Writing import pandas as pd gives the library a short alias, pd, so that calls such as pd.DataFrame() stay concise and readable.
In this blog, we will learn what the pandas library is, how to import it, and look at examples that use import pandas as pd.
What is Pandas?
Pandas is an open-source library for Python used for data manipulation and analysis. It provides easy-to-use data structures and functions for handling data efficiently. With Pandas, large numeric tables can be manipulated in a few lines of code.
The name Pandas is often expanded as "Python Data Analysis Library." It is a very popular and powerful open-source data analysis tool that is widely used for Data Science and Machine Learning tasks.
Pandas in Data Science
Pandas is a powerful open-source data manipulation and analysis library for Python. It is widely used in the field of data science for tasks related to data cleaning, exploration, and analysis. The name "Pandas" is derived from the term "Panel Data," which is an econometrics term for multidimensional structured data sets.
Working of Pandas
The working of Pandas involves using its core data structures, primarily the DataFrame and Series, to manipulate and analyze data effectively. Below are the key aspects of how Pandas works:
1. Importing Pandas: To use Pandas, you need to import it into your Python script or Jupyter Notebook.
import pandas as pd
2. Data Structures: Pandas offers two core data structures, DataFrame and Series. A DataFrame is a two-dimensional table with labeled axes (rows and columns); it can be thought of as a spreadsheet or SQL table. A Series is a one-dimensional labeled array capable of holding any data type.
3. Creating DataFrames and Series: You can create a DataFrame by passing a dictionary of lists or NumPy arrays to the pd.DataFrame() constructor.
Creating a Series is similar, but with a single list or array.
import numpy as np  # needed for np.nan
series = pd.Series([1, 3, 5, np.nan, 6, 8])
4. Data Cleaning: Pandas provides methods for handling missing data, such as dropna() to drop missing values and fillna() to fill missing values.
df.dropna() # Drop rows with missing values
df.fillna(value=0) # Fill missing values with a specific value
5. Data Selection and Indexing: You can select specific columns or rows using column names or boolean indexing.
# Selecting a column
df['Name']
# Filtering data based on a condition
df[df['Age'] > 30]
6. Data Manipulation: Pandas allows for various operations like arithmetic operations, string operations, and applying functions to data.
# Adding a new column
df['Age_2_years_later'] = df['Age'] + 2
7. Grouping and Aggregation: Grouping data using groupby() and applying aggregate functions.
# Grouping by 'City' and calculating the mean age in each city
df.groupby('City')['Age'].mean()
8. Merging and Concatenating: Combining multiple DataFrames using merge() or concat().
# Merging two DataFrames based on a common column
pd.merge(df1, df2, on='common_column')
# Concatenating DataFrames vertically
pd.concat([df1, df2])
9. Input/Output: Reading and writing data from/to various file formats, such as CSV, Excel, SQL databases.
# Reading from CSV
df = pd.read_csv('data.csv')
# Writing to CSV
df.to_csv('output.csv', index=False)
10. Time Series Data: Pandas supports time series analysis with functionalities like date range creation, resampling, and shifting.
# Creating a date range
date_range = pd.date_range('2023-01-01', '2023-12-31', freq='D')
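The numbered snippets above assume that df, df1, and df2 already exist. The following is a minimal sketch that strings several of those steps together on a small DataFrame. All names and values in it (the extra player, the regions table, the ages) are made up for illustration and are not data from this article.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; one Age value is deliberately missing.
df = pd.DataFrame({
    'Name': ['Rahul', 'Rohit', 'Virat', 'Hardik'],
    'Age': [25, 35, np.nan, 31],
    'City': ['Vizag', 'Mumbai', 'Delhi', 'Mumbai'],
})

# Data cleaning: both calls return new objects and leave df unchanged.
dropped = df.dropna()             # rows with any missing value removed
filled = df.fillna({'Age': 0})    # missing ages replaced with 0

# Selection: boolean indexing keeps rows where the condition is True.
over_30 = df[df['Age'] > 30]

# Manipulation: arithmetic on a column is applied element-wise.
df['Age_2_years_later'] = df['Age'] + 2

# Grouping: mean() skips missing values by default.
mean_age_by_city = df.groupby('City')['Age'].mean()

# Merging on a shared column (an inner join by default).
regions = pd.DataFrame({'City': ['Mumbai', 'Delhi'], 'Region': ['West', 'North']})
merged = pd.merge(df, regions, on='City')

# Time series: one entry per day for January 2023.
january = pd.date_range('2023-01-01', '2023-01-31', freq='D')

print(mean_age_by_city)
print(merged[['Name', 'City', 'Region']])
```

Note that dropna() and fillna() do not modify df in place, so their results have to be assigned to a name if you want to keep them.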
How to import Pandas as pd
Once you have installed Pandas (for example, with pip install pandas), you can import it as pd. By convention, pandas is imported with the pd alias, which means we write pd in the code instead of writing "pandas" each time.
An alias is an alternate name that can be used to refer to the same thing. We can also import pandas without an alias, but using one is more convenient.
import pandas as pd
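To make the difference concrete, here is a small sketch comparing the aliased and non-aliased forms. The variable names s1 and s2 are illustrative only.

```python
import pandas
import pandas as pd

# Both names are bound to the same module object.
print(pd is pandas)

# Without the alias, every call spells out the full module name.
s1 = pandas.Series([1, 2, 3])

# With the alias, the same call is shorter.
s2 = pd.Series([1, 2, 3])

# The two Series hold identical data.
print(s1.equals(s2))
```

The alias changes only how the module is referred to in your code, not what it does.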
Examples of import Pandas as pd
Let us consider an example to understand this:
import pandas as pd

# Creating a dictionary with sample data
data = {
    'Name': ['Rahul', 'Rohit', 'Virat'],
    'Age': [25, 30, 18],
    'City': ['Vizag', 'Mumbai', 'Delhi']
}

# Creating a DataFrame from the dictionary
df = pd.DataFrame(data)

# Displaying the DataFrame
print("Original DataFrame:")
print(df)

# Adding a new column
df['Playing Style'] = ['Classic', 'Hitman', 'Perfectionist']

# Displaying the DataFrame after adding a new column
print("\nDataFrame with a new column:")
print(df)
Data Handling: We can easily manage and explore data using the data structures provided by pandas, Series and DataFrames. They help us present our data in an organized manner and work with it using various methods.
Data manipulation: Pandas library provides a range of functions to manipulate data, including filtering, sorting, grouping, joining, merging, and reshaping data.
Support for file formats: Pandas can read and write a wide range of file formats, such as CSV, Excel, and JSON, so data from different sources can be manipulated and analyzed in the same way.
Data cleaning: Pandas provides functions to handle missing data and remove duplicates, and its filtering tools can be used to deal with outliers. Real-world data can be very messy, so pandas helps tidy it up so that it becomes easy to work on.
Data analysis: Pandas provides several statistical and mathematical functions to perform data analysis, including descriptive statistics, correlation analysis, and time series analysis.
Data visualization: Pandas integrates with other Python libraries, such as Matplotlib and Seaborn, to create high-quality data visualizations that make the data easier to understand.
Data Filtering: We can filter the data according to the evaluation we want to perform on it. We can also remove repeated data by keeping only the unique values.
Mathematics: Pandas provides functions for carrying out mathematical operations on our data, such as sums, means, and element-wise arithmetic. We can also sort the data into the order we need.
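A few of the features listed above can be sketched on a tiny table. The player and score columns are hypothetical and chosen only to show duplicates, a missing value, sorting, and summary statistics.

```python
import pandas as pd

# Hypothetical data with one duplicated row and one missing score.
df = pd.DataFrame({
    'player': ['A', 'B', 'B', 'C'],
    'score': [50.0, 72.0, 72.0, None],
})

deduped = df.drop_duplicates()                # remove the repeated 'B' row
cleaned = deduped.dropna(subset=['score'])    # drop rows without a score
ranked = cleaned.sort_values('score', ascending=False)
unique_players = df['player'].unique()        # distinct values in a column

print(ranked)
print(cleaned['score'].describe())            # count, mean, std, min, quartiles, max
```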
You can install Pandas using pip, the Python package manager. Open the terminal or command prompt and run the following command:
pip install pandas
This installs the latest version of Pandas available for your Python environment.
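One simple way to confirm that the installation worked is to import the library and print its version. This is a minimal check, and the exact version string depends on your environment.

```python
import pandas as pd

# Prints the installed version as a string; the value varies by environment.
print(pd.__version__)
```

If the import fails with ModuleNotFoundError, pandas was likely installed into a different Python environment than the one running the script.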
Data Structures in Pandas
Pandas provides two primary data structures:
Series: A one-dimensional labeled array that can hold data of any type (integers, floats, strings, etc.). A Series can be created from a list, array, or dictionary; each element is associated with an index label.
DataFrame: A two-dimensional labeled data structure with columns of potentially different types. You can look at a DataFrame as a collection of Series objects, where each column is a Series. You can create DataFrames from various data sources, including CSV files, Excel files, SQL databases, and JSON data.
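The relationship between the two structures can be seen in a short sketch. The names and ages reuse the sample values from the earlier example, and the variable names are illustrative.

```python
import pandas as pd

# A Series built from a dictionary uses the keys as index labels.
ages = pd.Series({'Rahul': 25, 'Rohit': 30, 'Virat': 18})
print(ages['Rohit'])    # label-based lookup

# A DataFrame built from a dictionary of lists; each key becomes a column.
df = pd.DataFrame({'Name': ['Rahul', 'Rohit'], 'Age': [25, 30]})

# Selecting a single column returns a Series.
print(type(df['Age']))

# Each column has its own dtype.
print(df.dtypes)
```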
How to create a Series in Pandas?
To create a Series, we can use the pd.Series() constructor of pandas. The example code below shows how.
Code Implementation
import pandas as pd
import numpy as np

# An empty Series
ser = pd.Series()
print(ser)

# A Series created from a NumPy array
data = np.array(['n', 'i', 'n', 'j', 'a', 's'])
ser = pd.Series(data)
print(ser)
In the above code, we first created an empty Series named ser and printed it. We then created a Series from the NumPy array data, which holds the values ['n', 'i', 'n', 'j', 'a', 's'], and printed that as well.
Some Common Operations using Pandas
In this section, we will look at some operations we can perform after importing pandas as pd. We will be using a file as well. For the given examples, the file name is just filename.csv, and its contents are:
The CSV file contains sales data for different products (column_name) on different dates (date). Two further columns (column_name1 and column_name2) contain additional sales information. The column names are date, column_name, column_name1, and column_name2, matching the column names used in the pandas code below.
You can copy and paste these columns in a notepad, save it as filename.csv, and use it for the examples below.
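If you do not have the file at hand, one option is to generate a stand-in with pandas itself. The rows below are hypothetical placeholder values that only follow the column layout described above; they are not the original file contents.

```python
import pandas as pd

# Hypothetical rows that match the described column layout.
sample = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-03'],
    'column_name': ['A', 'B', 'A', 'B', 'C'],
    'column_name1': [10, 20, 30, 40, 50],
    'column_name2': [1.5, 2.5, 3.5, 4.5, 5.5],
})

# index=False keeps the row numbers out of the file.
sample.to_csv('filename.csv', index=False)

# Read the file back to confirm that it was written as expected.
check = pd.read_csv('filename.csv')
print(check.shape)
```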
Reading CSV file
Let us look at how to read a CSV file, as it is a basic yet important concept used often while working on data analysis.
Code Implementation
import pandas as pd
data = pd.read_csv('filename.csv')
print(data.head())
Explanation
The code above reads the CSV file using the read_csv() function and prints the first 5 rows of the resulting DataFrame using the head() function. This is a common workflow when working with pandas and CSV files. It can be easily modified to suit different requirements, such as reading different file formats or printing different parts of the DataFrame.
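As a sketch of such modifications, the example below reads CSV text from an in-memory buffer so that it runs without a file on disk, then prints different parts of the DataFrame. The rows are hypothetical, and parse_dates is used here on the assumption that the date column holds ISO-formatted dates.

```python
import io
import pandas as pd

# Stand-in for a file on disk; read_csv accepts any file-like object.
csv_text = io.StringIO(
    "date,column_name,column_name1,column_name2\n"
    "2023-01-01,A,10,1.5\n"
    "2023-01-02,B,20,2.5\n"
    "2023-01-03,A,30,3.5\n"
)

# parse_dates converts the listed columns to datetime values.
data = pd.read_csv(csv_text, parse_dates=['date'])

print(data.head(2))    # first two rows instead of the default five
print(data.tail(1))    # last row
print(data.dtypes)     # the date column is now a datetime type
```

For other formats, pandas offers similar readers such as pd.read_excel() and pd.read_json().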
Selecting Rows and Columns
After reading a CSV file, how would you select rows and columns for further processing? Let us look at how to do that.
Code Implementation
import pandas as pd
data = pd.read_csv('filename.csv')
subset = data.loc[data['column_name'] == 'A', ['column_name1', 'column_name2']]
print(subset.head())
Explanation
This code snippet demonstrates how to use pandas to filter a CSV file based on specific criteria and select specific columns.
The subset = data.loc[...] line filters the DataFrame data based on one condition: column_name should have a value of 'A'. The filtered data is then stored in a new DataFrame object named subset, which only contains the columns column_name1 and column_name2.
print(subset.head()): This line prints the first 5 rows of the DataFrame subset using the head() function.
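The same pattern extends to more than one condition. The sketch below uses a small in-memory DataFrame with hypothetical values in place of the CSV file, and the threshold of 15 is arbitrary.

```python
import pandas as pd

# Hypothetical data using the same column names as the CSV example.
data = pd.DataFrame({
    'column_name': ['A', 'B', 'A', 'A'],
    'column_name1': [10, 20, 30, 40],
    'column_name2': [1.5, 2.5, 3.5, 4.5],
})

# Each condition needs its own parentheses; & means "and", | means "or".
mask = (data['column_name'] == 'A') & (data['column_name1'] > 15)

# Rows are chosen by the boolean mask, columns by the list of names.
subset = data.loc[mask, ['column_name1', 'column_name2']]
print(subset)
```

Python's plain and and or keywords do not work element-wise on Series, which is why the & and | operators are used here.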
Grouping and Aggregating Data
Let us look at how we can group the data based on columns together and aggregate values of the data after grouping using the ‘groupby’ and the ‘agg’ functions.
Code Implementation
import pandas as pd
data = pd.read_csv('filename.csv')
grouped_data = data.groupby('column_name').agg({'column_name1': 'sum', 'column_name2': 'mean'})
print(grouped_data.head())
Explanation
This code demonstrates how to group data in a CSV file by a specific column and perform aggregate functions on the grouped data using pandas. Here is a step-by-step breakdown of what is happening:
grouped_data = data.groupby('column_name').agg({'column_name1': 'sum', 'column_name2': 'mean'}): This line groups the DataFrame data by a specific column named column_name. The groupby() function in pandas groups the data based on the values in the specified column.
The agg() function is used to apply aggregate functions to the grouped data. In this case, we are computing the sum of column_name1 and the mean of column_name2 for each group. The resulting data is stored in a new DataFrame object called grouped_data.
print(grouped_data.head()): This line prints the first 5 rows of the DataFrame grouped_data using the head() function.
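To see what the grouped result looks like without the CSV file, here is a minimal sketch on an in-memory DataFrame. The values are hypothetical and chosen so that the sums and means are easy to verify by hand.

```python
import pandas as pd

# Hypothetical data using the same column names as the CSV example.
data = pd.DataFrame({
    'column_name': ['A', 'B', 'A', 'B'],
    'column_name1': [10, 20, 30, 40],
    'column_name2': [1.0, 2.0, 3.0, 4.0],
})

# One aggregate function per column, given as a dictionary.
grouped_data = data.groupby('column_name').agg(
    {'column_name1': 'sum', 'column_name2': 'mean'}
)
print(grouped_data)

# The group keys become the index; reset_index() turns them back into a column.
flat = grouped_data.reset_index()
print(flat)
```

Here group A has a sum of 40 and a mean of 2.0, and group B has a sum of 60 and a mean of 3.0.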
Advantages of Pandas
Here are some of the advantages of the Pandas Library.
Easy to use: Pandas is easy to pick up and doesn't require much prerequisite knowledge. Basic Python coding skills are enough to get started.
Data Merging: Functions such as merge() and concat() make it easy to combine data, even when the datasets are large.
Efficient Data Structures: As discussed earlier, Pandas uses data structures such as the one-dimensional Series and the two-dimensional DataFrame, which make data analysis and manipulation very easy.
Data flexibility: Data can be reshaped, filtered, and customized in many ways with Pandas, and DataFrames make it efficient to work with data from various files.
Less code: Pandas lets you complete even extensive tasks with little code. A few of its functions are often enough; you just need to know when to use which.
Frequently Asked Questions
Why is import pandas as pd used?
It provides a shorthand alias ('pd') for the Pandas library, making code more concise and readable.
How do I import pandas locally?
To import pandas locally, first install it using pip by running pip install pandas in your terminal or command prompt. Then, in your Python script or notebook, import it with import pandas as pd to use the library.
How to import pandas in Python script?
To import pandas in a Python script, use the following line of code: import pandas as pd
This imports the pandas library and allows you to use it with the alias pd in your script.
What does import pandas as pd mean?
import pandas as pd means importing the pandas library and giving it the alias pd. This allows you to refer to pandas functions and objects using pd, making your code shorter and more readable.
Conclusion
Pandas is a versatile library that provides a wide range of tools for data manipulation and analysis, making it an essential tool for data scientists and analysts working with structured data.