Jumping into Pandas: Introduction To Pandas

Pandas is a powerful Python package that is essential when working with data. It allows for easy manipulation and analysis of structured data, such as the contents of spreadsheets and databases.

In this article, you will learn what Pandas is and what you need to get started with it. We will explore the features and data structures that make Pandas such a valuable tool for working with data.

What is Pandas?

Pandas is a high-level data manipulation tool originally developed by Wes McKinney. It is an open-source Python library that gives users a way to analyze and manipulate structured data.

Pandas helps you make sense of your data. It can provide important summary statistics such as the mean, median, minimum, and maximum, and it helps you understand how your data is distributed and how different variables relate to each other.
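As a rough illustration of those summary statistics, here is a minimal sketch. The study-hours dataset and the column names "hours_studied" and "exam_score" are invented for this example.

```python
import pandas as pd

# Hypothetical sample data, invented purely for illustration.
scores = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 80],
})

# Individual summary statistics for a single column.
mean_score = scores["exam_score"].mean()
median_score = scores["exam_score"].median()
min_score = scores["exam_score"].min()
max_score = scores["exam_score"].max()
print(mean_score, median_score, min_score, max_score)

# describe() reports count, mean, std, min, quartiles, and max in one call,
# which gives a quick picture of how each numeric column is distributed.
print(scores.describe())

# corr() measures the (Pearson) correlation between two columns,
# one simple way to look at the relationship between variables.
correlation = scores["hours_studied"].corr(scores["exam_score"])
print(correlation)
```

A correlation close to 1 suggests the two columns rise together, as they do in this made-up data.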

With Pandas, you can easily explore and analyze your data to gain deeper insights that can help inform your decision-making.

Pandas is built on top of NumPy, and its plotting features rely on Matplotlib. Learning the basics of these libraries can be helpful when learning Pandas, but it is not strictly necessary. So don't let the idea of needing to learn multiple libraries overwhelm you: you can start with Pandas and expand your knowledge as needed.

Pandas: Data Structures

There are two main data structures supported by Pandas:

  • Series

  • DataFrames

Understanding these two data structures and how to use them effectively is key to becoming proficient in Pandas and data analysis with Python.

Pandas can be installed using pip:

pip install pandas

To get started, import the Pandas library under its conventional alias, "pd":

import pandas as pd

Pandas: Series

A Pandas Series is a one-dimensional, array-like object that holds a sequence of values, usually of the same type, together with an associated set of labels known as the "index".

A Series can be seen as a single column of a table, although a single row taken from a table is also represented as a Series. A Series can be manipulated independently, and it can also be combined with other Series to create a new DataFrame.
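To make the idea of combining Series concrete, here is a small sketch that uses pd.concat with axis=1. The names and ages are sample values chosen for illustration.

```python
import pandas as pd

# Two Series with the same default index (0, 1, 2).
# The name of each Series becomes its column name after combining.
names = pd.Series(["Alice", "Bob", "Charlie"], name="Name")
ages = pd.Series([25, 32, 18], name="Age")

# axis=1 places the Series side by side as columns,
# aligning their values on the shared index.
people = pd.concat([names, ages], axis=1)
print(people)
print(people.shape)
```

This is only one way to combine Series; passing a dictionary of Series to pd.DataFrame() gives a similar result.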

Creating a Pandas Series

A Pandas Series can be created in several ways. Here are some of them:

  1. From a List:

    You can create a Pandas Series by passing a Python list to the pd.Series() constructor.

     import pandas as pd
    
     my_list = [1, 2, 3, 4, 5]
    
     my_series = pd.Series(my_list)
    
     print(my_series)
    

    This will create a Series with the values [1, 2, 3, 4, 5] and the default index 0 to 4.

  2. From a Dictionary:

    You can also create a Series from a Python dictionary by passing the dictionary to pd.Series(). The dictionary keys become the index labels.

     import pandas as pd
    
     my_dict = {'a': 1, 'b': 2, 'c': 3}
    
     my_series = pd.Series(my_dict)
    
     print(my_series)
    

    This will create a Series with the values [1, 2, 3] and the index labels ['a', 'b', 'c'].

  3. From a NumPy Array:

    You can create a Series from a NumPy array by passing the array as an argument to the Pandas Series() function. For example:

     import pandas as pd
     import numpy as np
    
     my_array = np.array([1, 2, 3, 4, 5])
    
     my_series = pd.Series(my_array)
    
     print(my_series)
    

    This will create a Series with the values [1, 2, 3, 4, 5].

  4. With Explicit Index:

    You can also specify the index labels explicitly when creating a Series. The length of the index must match the length of the data. In the list and NumPy array examples above, no index was passed, so Pandas used the default index, which runs from 0 to len(data) - 1.

     import pandas as pd
    
     my_list = [1, 2, 3, 4, 5]
    
     my_index = ['a', 'b', 'c', 'd', 'e']
    
     my_series = pd.Series(my_list, index=my_index)
    
     print(my_series)
    

    This will create a Series with the values [1, 2, 3, 4, 5] and index labels ['a', 'b', 'c', 'd', 'e'].
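As a short follow-up sketch, the explicit labels can be used to look up values, and a mismatch between the index length and the data length raises a ValueError. The variable names third_value and mismatch_error are introduced here for illustration only.

```python
import pandas as pd

my_series = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Look up a single value by its label instead of its position.
third_value = my_series['c']
print(third_value)

# Passing two labels for three values fails, because the index
# must have exactly one label per value.
try:
    pd.Series([1, 2, 3], index=['a', 'b'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
    print("ValueError:", exc)
```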

These are just some of the ways to create a Pandas Series; other approaches exist, depending on your use case.

Why Series?

Pandas Series provide a lot of functionality and advantages that make them useful in data analysis tasks, especially when working with tabular data.

Here are some of the reasons Pandas Series are used:

  1. One-dimensional structure: a Series makes it easy to work with data that is arranged in a single column or row.

  2. Label-based indexing: each value has a label, so individual elements can be accessed by name rather than only by position.

  3. Built-in methods: a Series comes with methods for data manipulation and analysis, which can save you a lot of time and effort.

  4. Integration with other Pandas data structures: Series can be combined with each other and with DataFrames to build more complex structures for data analysis.

  5. Flexible data types: a Series can hold integers, floats, strings, booleans, or arbitrary Python objects, so it suits a wide range of data.
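A few of the points in the list above can be sketched in code. The temperature readings and city names below are made-up sample data.

```python
import pandas as pd

# Hypothetical daily temperatures, labelled by weekday.
temps = pd.Series([21.5, 23.0, 19.5, 24.0], index=["Mon", "Tue", "Wed", "Thu"])

# Label-based indexing: fetch a value by its label.
print(temps["Tue"])

# Built-in methods: mean() averages the values,
# idxmax() returns the label of the largest value.
average = temps.mean()
warmest_day = temps.idxmax()
print(average, warmest_day)

# Boolean filtering keeps only the entries that satisfy a condition.
warm_days = temps[temps > 21.0]
print(warm_days)

# A Series is not limited to numbers; this one holds strings.
cities = pd.Series(["Lagos", "Nairobi"])
print(cities)
```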

Pandas: DataFrame

A Pandas DataFrame is a two-dimensional data structure that can hold heterogeneous data (each column may have a different type) and has labelled rows and columns.

A DataFrame can be seen as a collection of Series, where each Series represents a column of data. Data in a DataFrame is stored in rows and columns, similar to a SQL table or an Excel spreadsheet.

Creating a Pandas DataFrame
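The relationship between a DataFrame and its Series can be checked directly, as in this minimal sketch with sample values.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 32]})

# Selecting a single column returns a Series.
name_column = df["Name"]
print(type(name_column))

# Selecting a single row by its index label also returns a Series,
# this time labelled by the column names.
first_row = df.loc[0]
print(first_row)

# shape is (number of rows, number of columns).
print(df.shape)
```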

Like Series, DataFrames can be created in different ways:

  1. From a dictionary of lists:

    You can create a DataFrame from a Python dictionary of lists or arrays by passing the dictionary to the pd.DataFrame() constructor. Each key becomes a column name, and the list stored under that key supplies the values of that column.

     import pandas as pd
    
     data = {'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
             'Age': [25, 32, 18, 47],
             'Gender': ['F', 'M', 'M', 'M']}
    
     df = pd.DataFrame(data)
    
     print(df)
    

    This will create a DataFrame that has "Name", "Age" and "Gender" as the column names and their respective values as the rows of those columns.

  2. From a list of dictionaries:

    You can also create a DataFrame from a list of dictionaries, where each dictionary represents a row of data.

     import pandas as pd
    
     data = [{'Name': 'Alice', 'Age': 25, 'Gender': 'F'},
             {'Name': 'Bob', 'Age': 32, 'Gender': 'M'},
             {'Name': 'Charlie', 'Age': 18, 'Gender': 'M'},
             {'Name': 'Dave', 'Age': 47, 'Gender': 'M'}]
    
     df = pd.DataFrame(data)
    

    The resulting DataFrame will have columns named 'Name', 'Age', and 'Gender', with four rows of data, one for each dictionary in the list.

  3. From a CSV file:

    You can read data from a CSV file into a DataFrame using the pd.read_csv() function, passing the path of the file as an argument.

     import pandas as pd
    
     df = pd.read_csv("data.csv")
    

    The resulting DataFrame will be created from the data in the 'data.csv' file. By default, the first line of the file supplies the column names, and each following line becomes a row.

  4. From a NumPy Array:

    You can create a DataFrame from a two-dimensional NumPy array by passing the array to the pd.DataFrame() constructor. In the example below, the columns argument names the columns 'A', 'B', and 'C', and the DataFrame has three rows of data, one for each row of the NumPy array.

     import pandas as pd
     import numpy as np
    
     data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    
     df = pd.DataFrame(data, columns=['A', 'B', 'C'])
    

Why DataFrames?

DataFrames are a widely used data structure in data analysis, and they offer several advantages over other data structures.

Here are some of the reasons Pandas DataFrames are used:

  1. Tabular data: DataFrames are designed to store tabular data, which makes them a natural fit for storing and analyzing structured data such as that found in spreadsheets or databases.

  2. Labelled axes: DataFrames have labelled rows and columns, which makes it easy to select, manipulate, and analyze subsets of the data based on specific criteria.

  3. Flexibility: DataFrames can hold many types of data, including numeric, string, and boolean data. They can also represent missing or undefined values.

  4. Integration: DataFrames can be easily integrated with other data analysis and visualization tools, including Python libraries such as NumPy, Matplotlib, and Seaborn.

  5. Data manipulation: DataFrames offer powerful tools for manipulating data, including filtering, grouping, and merging, which makes it easy to extract insights and create visualizations from large datasets.
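The filtering, grouping, and merging mentioned in the list above might look like the following sketch. It reuses the sample people from earlier in this article; the cities table is hypothetical and added only to demonstrate a merge.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dave"],
    "Age": [25, 32, 18, 47],
    "Gender": ["F", "M", "M", "M"],
})

# Filtering: keep only the rows where Age is greater than 30.
over_30 = people[people["Age"] > 30]
print(over_30)

# Grouping: compute the mean age within each gender.
mean_age_by_gender = people.groupby("Gender")["Age"].mean()
print(mean_age_by_gender)

# Merging: attach a city to each person by matching on the Name column.
# how="left" keeps every person, even those without a city.
cities = pd.DataFrame({"Name": ["Alice", "Bob"], "City": ["Lagos", "Accra"]})
merged = people.merge(cities, on="Name", how="left")
print(merged)
```

People who have no entry in the cities table end up with a missing value in the City column.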

Conclusion

In conclusion, Pandas is a powerful data manipulation tool in Python that makes data analysis faster and easier, and I am looking forward to delving deeper into its capabilities. In my upcoming articles I'll concentrate on Pandas' extensive set of data cleaning, transformation, and manipulation features, and I am excited to take you along on this journey of data exploration with Pandas.