Python Pandas: Installation, Series, Labels, DataFrames, Indexing

What is Pandas?

Pandas is a Python library used for working with data sets.
It has functions for analyzing, cleaning, exploring, and manipulating data.
The name "Pandas" refers to both "Panel Data" and "Python Data Analysis". The library was created by Wes McKinney in 2008.

Why Use Pandas?

Pandas allows us to analyze large data sets and draw conclusions based on statistical theories.
Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.

What Can Pandas Do?

Pandas can answer questions about the data, such as:

Is there a correlation between two or more columns?
What is the average value?
What is the maximum value?
What is the minimum value?
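
As a rough sketch of how these questions map onto Pandas calls, the snippet below builds a small DataFrame and asks for the correlation, average, maximum, and minimum. The column names and numbers are made up for illustration and are not part of the original tutorial data.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
df = pd.DataFrame({
    "duration": [60, 45, 30, 60, 45],
    "calories": [400, 300, 200, 420, 310],
})

# Is there a correlation between two columns? (Pearson correlation by default)
correlation = df["duration"].corr(df["calories"])

# What is the average, maximum, and minimum value of a column?
average = df["calories"].mean()
highest = df["calories"].max()
lowest = df["calories"].min()

print(correlation)
print(average, highest, lowest)
```

A correlation close to 1 suggests the two columns tend to rise together; a value close to 0 suggests little linear relationship.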

Pandas is also able to delete rows that are not relevant or that contain wrong values, such as empty or NULL values. This is called cleaning the data.
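
One possible way to remove rows with empty values is the dropna method. The sketch below assumes a small DataFrame with one missing value; the data is invented for illustration.

```python
import pandas as pd

# Hypothetical data with one missing value (None is stored as NaN)
df = pd.DataFrame({
    "cars": ["BMW", "Volvo", "Ford"],
    "passings": [3, None, 2],
})

# dropna() returns a new DataFrame without the rows that contain missing values
cleaned = df.dropna()

print(cleaned)
```

Note that dropna leaves the original DataFrame unchanged and returns a cleaned copy.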

Installation

If Python and pip are already installed on your system, installing Pandas is straightforward.

Install it using this command:

C:\Users\Your Name>pip install pandas

If this command fails, use a Python distribution that already has Pandas installed, such as Anaconda (which also ships with the Spyder IDE).
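
To confirm that the installation worked, one simple check is to import the package and print its version. This is only a quick sanity check; the version number you see depends on what was installed.

```python
import pandas

# __version__ holds the installed Pandas version as a string
installed_version = pandas.__version__
print(installed_version)
```

If the import raises a ModuleNotFoundError, Pandas is not installed in the Python environment you are running.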

Importing Pandas

Once Pandas is installed, import it in your applications with the import keyword:

import pandas

Example:


        import pandas

        mydataset = {
          'cars': ["BMW", "Volvo", "Ford"],
          'passings': [3, 7, 2]
        }

        myvar = pandas.DataFrame(mydataset)

        print(myvar)

Pandas as pd

Pandas is usually imported under the pd alias.

Create an alias with the as keyword while importing:

import pandas as pd

Now the Pandas package can be referred to as pd instead of pandas.

Example:


        import pandas as pd

        mydataset = {
          'cars': ["BMW", "Volvo", "Ford"],
          'passings': [3, 7, 2]
        }

        myvar = pd.DataFrame(mydataset)

        print(myvar)

What is a Series?

A Pandas Series is like a column in a table.

It is a one-dimensional array holding data of any type.

Example (creating a simple Pandas Series from a list):


        import pandas as pd

        a = [1, 7, 2]
        myvar = pd.Series(a)
        print(myvar)
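
To make the "one-dimensional array" description concrete, the sketch below inspects a few standard Series attributes (ndim, size, dtype). It reuses the list from the example above.

```python
import pandas as pd

a = [1, 7, 2]
myvar = pd.Series(a)

# ndim is the number of dimensions, size is the number of values
dimensions = myvar.ndim
count = myvar.size

print(dimensions)     # 1, because a Series is one-dimensional
print(count)          # 3 values
print(myvar.dtype)    # the type of the stored values (an integer type here)
print(myvar.tolist()) # back to a plain Python list
```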

Labels

If nothing else is specified, the values are labeled with their index number. The first value has index 0, the second value has index 1, and so on.

This label can be used to access a specified value.

Example:


        import pandas as pd

        a = [1, 7, 2]
        myvar = pd.Series(a)

        # Access the first value through its default label, 0
        print(myvar[0])

Create Labels

With the index argument, you can name your own labels.

Example (Creating your own labels):


        import pandas as pd

        a = [1, 7, 2]
        myvar = pd.Series(a, index = ["x", "y", "z"])
        print(myvar)

When you have created labels, you can access an item by referring to the label.

Example:


        import pandas as pd

        a = [1, 7, 2]
        myvar = pd.Series(a, index = ["x", "y", "z"])

        # Access the value labeled "y"
        print(myvar["y"])
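
Labels can also come from a dictionary: when a Series is built from a dict, the keys are used as the labels. The sketch below uses made-up keys and values purely for illustration.

```python
import pandas as pd

# Hypothetical data: the dictionary keys become the labels
calories = {"day1": 420, "day2": 380, "day3": 390}
myvar = pd.Series(calories)

print(myvar)

# Access a value by its label, as before
print(myvar["day2"])
```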

Pandas DataFrames

DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables.

There are several ways to create a DataFrame. One way is to use a dictionary whose keys become the column names. For example:


        import pandas as pd

        # Each key is a column name, each list holds that column's values
        data = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
                "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
                "area": [8.516, 17.10, 3.286, 9.597, 1.221],
                "population": [200.4, 143.5, 1252, 1357, 52.98]}

        brics = pd.DataFrame(data)
        print(brics)

In the new brics DataFrame, Pandas has labeled each row with a default index value, 0 through 4. If you would like different index values, such as two-letter country codes, assign a list of labels to the index attribute:


        import pandas as pd

        data = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
                "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
                "area": [8.516, 17.10, 3.286, 9.597, 1.221],
                "population": [200.4, 143.5, 1252, 1357, 52.98]}

        brics = pd.DataFrame(data)

        # Set the index for brics (ISO two-letter country codes)
        brics.index = ["BR", "RU", "IN", "CN", "ZA"]

        print(brics)
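
As an alternative sketch, the row labels can be passed through the index argument when the DataFrame is created, instead of being assigned afterwards. Only two columns are used here to keep the example short.

```python
import pandas as pd

data = {"country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"]}

# Pass the row labels directly when the DataFrame is created
brics = pd.DataFrame(data, index=["BR", "RU", "IN", "CN", "ZA"])

print(brics)
print(brics.index.tolist())
```

Both approaches give the same result; which one reads better is mostly a matter of taste.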

Another way to create a DataFrame is by importing a CSV file. Assuming a file named cars.csv is stored in the working directory, it can be imported using pd.read_csv:

        # Import pandas as pd
        import pandas as pd

        # Import the cars.csv data: cars
        cars = pd.read_csv('cars.csv')

        # Print out cars
        print(cars)
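
The contents of cars.csv are not shown here, so the sketch below uses a small, hypothetical stand-in held in a string. It relies on read_csv accepting a file-like object as well as a file path, and uses the index_col argument to turn the first column into the row labels.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of cars.csv (invented for illustration)
csv_text = """code,country,cars_per_cap,drives_right
US,United States,809,True
AUS,Australia,731,False
JAP,Japan,588,False
"""

# index_col=0 uses the first column ("code") as the row labels
cars = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(cars)
```

Without index_col, Pandas would keep the code column as ordinary data and label the rows 0, 1, 2.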

Indexing DataFrames

There are several ways to index a Pandas DataFrame. One of the easiest ways to do this is by using square bracket notation.

The example below uses square brackets to select columns from a DataFrame of car data. You can use either single or double brackets: single brackets return a Pandas Series, while double brackets return a Pandas DataFrame.


        import pandas as pd

        data = {"country": ["United States", "Australia", "Japan", "India", "Russia", "Morocco", "Egypt"],
                "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
                "drives_right": [True, False, False, False, False, True, True]}

        cars = pd.DataFrame(data)
        cars.index = ["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"]

        # Print out cars_per_cap column as Pandas Series
        print(cars['cars_per_cap'])

        # Print out cars_per_cap column as Pandas DataFrame
        print(cars[['cars_per_cap']])

        # Print out DataFrame with cars_per_cap and country columns
        print(cars[['cars_per_cap', 'country']])
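
To see the difference between single and double brackets for yourself, one option is to check the type and shape of each result. The sketch below uses a shortened, three-row version of the data.

```python
import pandas as pd

cars = pd.DataFrame({"country": ["United States", "Australia", "Japan"],
                     "cars_per_cap": [809, 731, 588]},
                    index=["US", "AUS", "JAP"])

single = cars["cars_per_cap"]     # single brackets: a Series
double = cars[["cars_per_cap"]]   # double brackets: a DataFrame with one column

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```

The Series has a one-dimensional shape, while the DataFrame keeps two dimensions even though it has a single column.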

Square brackets with a slice can also be used to access observations (rows) of a DataFrame. For example:


        import pandas as pd

        data = {"country": ["United States", "Australia", "Japan", "India", "Russia", "Morocco", "Egypt"],
                "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
                "drives_right": [True, False, False, False, False, True, True]}

        cars = pd.DataFrame(data)
        cars.index = ["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"]

        # Print out first 4 observations
        print(cars[0:4])

        # Print out fifth and sixth observations
        print(cars[4:6])
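
As with Python lists, the slice counts positions from 0 and leaves out the end position. The sketch below prints only the row labels of each slice, which makes it easier to see which observations were selected.

```python
import pandas as pd

cars = pd.DataFrame(
    {"country": ["United States", "Australia", "Japan", "India", "Russia", "Morocco", "Egypt"],
     "cars_per_cap": [809, 731, 588, 18, 200, 70, 45]},
    index=["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"])

first_four = cars[0:4]        # positions 0, 1, 2, 3
fifth_and_sixth = cars[4:6]   # positions 4 and 5

print(first_four.index.tolist())
print(fifth_and_sixth.index.tolist())
```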

You can also use loc and iloc to perform just about any data selection operation. loc is label-based, which means that you specify rows and columns by their row and column labels. iloc is integer-position based, so you specify rows and columns by their integer position, counting from 0.


        import pandas as pd

        data = {"country": ["United States", "Australia", "Japan", "India", "Russia", "Morocco", "Egypt"],
                "cars_per_cap": [809, 731, 588, 18, 200, 70, 45],
                "drives_right": [True, False, False, False, False, True, True]}

        cars = pd.DataFrame(data)
        cars.index = ["US", "AUS", "JAP", "IN", "RU", "MOR", "EG"]

        # Print out observation for Japan (third row, position 2)
        print(cars.iloc[2])

        # Print out observations for Australia and Egypt
        print(cars.loc[['AUS', 'EG']])
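
Both loc and iloc also accept a second argument that selects columns. The sketch below, using a shortened three-row version of the data, picks the same cell once by label and once by position, and then selects several rows and columns at once.

```python
import pandas as pd

cars = pd.DataFrame(
    {"country": ["United States", "Australia", "Japan"],
     "cars_per_cap": [809, 731, 588],
     "drives_right": [True, False, False]},
    index=["US", "AUS", "JAP"])

# loc: row label first, then column label
by_label = cars.loc["JAP", "cars_per_cap"]

# iloc: row position first, then column position (both start at 0)
by_position = cars.iloc[2, 1]

# Lists select several rows and columns at once
subset = cars.loc[["US", "JAP"], ["country", "drives_right"]]

print(by_label, by_position)
print(subset)
```

Here both single-cell lookups return the cars_per_cap value for Japan, because the label "JAP" sits at row position 2 and cars_per_cap is the column at position 1.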
