Data Analysis with Python

Filter, sort, aggregate, and visualize data with pandas and matplotlib.

📊 Data Analysis Techniques: Filtering, Sorting, and Aggregating Data (1.5 hours)

Filtering Data

  • Basic Filtering: To filter data based on conditions, you can use boolean indexing.

# Get rows where Age is greater than 30
df_filtered = df[df['Age'] > 30]

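  • Multiple Conditions: Combine conditions with the element-wise operators & (and), | (or), and ~ (not). Wrap each condition in parentheses; Python's and/or keywords raise an error when applied to a Series.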

# Get rows where Age is greater than 30 and City is 'New York'
df_filtered = df[(df['Age'] > 30) & (df['City'] == 'New York')]
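
As a minimal sketch of the filters above, assuming a small made-up DataFrame with the same Name, Age, and City columns (the rows are illustrative, not from a real dataset), the following shows how a condition builds a boolean mask and how masks combine. The isin() call is an extra that the text above does not cover.

```python
import pandas as pd

# Illustrative data; the rows are assumptions made up for this sketch.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [25, 30, 35, 41],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York'],
})

# A comparison on a column yields a boolean Series (the mask).
older_than_30 = df['Age'] > 30

# AND: both conditions must hold.
ny_over_30 = df[older_than_30 & (df['City'] == 'New York')]

# OR: either condition may hold.
over_30_or_la = df[older_than_30 | (df['City'] == 'Los Angeles')]

# NOT: invert a mask with ~.
not_ny = df[~(df['City'] == 'New York')]

# isin() tests membership in a list of values.
coastal = df[df['City'].isin(['New York', 'Los Angeles'])]

print(ny_over_30)
```

Naming the mask (older_than_30) before indexing can make longer filters easier to read and reuse.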

Sorting Data

  • Sorting by Columns: Sort the DataFrame by a column in ascending or descending order.

df_sorted = df.sort_values(by='Age', ascending=False) # Sort by Age descending

  • Sorting by Multiple Columns:

df_sorted = df.sort_values(by=['Age', 'City'], ascending=[True, False])
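
To make the multi-column case concrete, here is a small sketch on made-up rows (an assumption for illustration) where ages tie, so the second sort key decides the order. It also shows that sort_values returns a new DataFrame and keeps the original index labels unless you reset them.

```python
import pandas as pd

# Illustrative data with tied ages so the second key matters.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [30, 25, 30, 25],
    'City': ['Chicago', 'Boston', 'New York', 'Denver'],
})

# Age ascending; ties are broken by City in descending order.
df_sorted = df.sort_values(by=['Age', 'City'], ascending=[True, False])

# The sorted frame still carries the old index labels;
# reset_index(drop=True) renumbers the rows from 0.
df_sorted = df_sorted.reset_index(drop=True)

print(df_sorted)
```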

Aggregating Data

  • Groupby: Aggregate data based on one or more columns.

grouped = df.groupby('City').agg({'Age': 'mean', 'Name': 'count'})
print(grouped)
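
One possible variation on the groupby above is pandas' named aggregation, which lets you choose the output column names. The sketch below uses made-up rows, and the names mean_age and people are arbitrary labels chosen for this example.

```python
import pandas as pd

# Illustrative data; two people per city.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],
    'Age': [20, 30, 40, 50],
    'City': ['New York', 'Chicago', 'New York', 'Chicago'],
})

# Each keyword is an output column: (input column, aggregation).
grouped = df.groupby('City').agg(
    mean_age=('Age', 'mean'),
    people=('Name', 'count'),
)

print(grouped)
```

The result is indexed by City, so a single value can be looked up with grouped.loc['Chicago', 'mean_age'].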

  • Summary Statistics: Built-in functions to calculate statistics.

df['Age'].mean() # Mean of 'Age' column
df['Age'].sum() # Sum of 'Age' column
df['Age'].max() # Maximum of 'Age' column
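
If you want several of these statistics at once, a short sketch like the following may help. It assumes a small made-up Age column; agg() with a list of names and describe() are standard pandas methods that go slightly beyond the three calls shown above.

```python
import pandas as pd

# Illustrative ages.
ages = pd.Series([25, 30, 35], name='Age')

# Compute several statistics in one call; the result is a Series
# indexed by the statistic names.
summary = ages.agg(['mean', 'sum', 'max'])

# describe() reports count, mean, std, min, quartiles, and max.
stats = ages.describe()

print(summary)
print(stats)
```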

📈 Data Visualization Basics with Matplotlib

Introduction to matplotlib

matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. It's often used for visualizing the results of data analysis. The most common visualizations are line plots, scatter plots, bar plots, and histograms.

Basic Plotting with matplotlib

  • Simple Line Plot:

import matplotlib.pyplot as plt
plt.plot(df['Name'], df['Age'])
plt.xlabel('Name')
plt.ylabel('Age')
plt.title('Name vs Age')
plt.show()

  • Bar Plot:

df.groupby('City')['Age'].mean().plot(kind='bar')
plt.xlabel('City')
plt.ylabel('Average Age')
plt.title('Average Age by City')
plt.show()

  • Histogram:

df['Age'].plot(kind='hist', bins=10)
plt.xlabel('Age')
plt.title('Age Distribution')
plt.show()

  • Scatter Plot:

df.plot(kind='scatter', x='City', y='Age')
plt.title('City vs Age')
plt.show()
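
The plots above each draw on the current figure. As an alternative sketch, the version below places a histogram and a scatter plot side by side using explicit Axes objects. It rests on some assumptions: the data is made up, Salary is a hypothetical numeric column that does not appear elsewhere in this material, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # Assumption: headless backend so the sketch runs without a display.
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data; 'Salary' is a hypothetical column added for this sketch.
df = pd.DataFrame({
    'Age': [25, 30, 35, 41, 52],
    'Salary': [40000, 52000, 61000, 70000, 88000],
})

# One figure with two side-by-side Axes.
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of a single numeric column.
df['Age'].plot(kind='hist', bins=5, ax=ax_hist)
ax_hist.set_xlabel('Age')
ax_hist.set_title('Age Distribution')

# Scatter plot of two numeric columns.
df.plot(kind='scatter', x='Age', y='Salary', ax=ax_scatter)
ax_scatter.set_title('Age vs Salary')

fig.tight_layout()
```

In an interactive session you would drop the matplotlib.use('Agg') line and finish with plt.show().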

Customizing Plots

  • Adding Labels and Title: plt.xlabel('City'), plt.ylabel('Average Age'), plt.title('Average Age by City')
  • Color and Style: df['Age'].plot(kind='line', color='green', linestyle='--', linewidth=2)
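
The customizations above can be combined on a single Axes object. The following is a sketch under stated assumptions: the rows are made up, and the Agg backend is chosen only so the example runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # Assumption: headless backend so the sketch runs without a display.
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data.
df = pd.DataFrame({
    'Age': [25, 30, 35, 41],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York'],
})

avg_age = df.groupby('City')['Age'].mean()

# Create the figure and Axes explicitly, then pass the Axes to pandas.
fig, ax = plt.subplots(figsize=(6, 4))
avg_age.plot(kind='bar', ax=ax, color='green', edgecolor='black')

# Labels and title are set on the Axes rather than through plt.
ax.set_xlabel('City')
ax.set_ylabel('Average Age')
ax.set_title('Average Age by City')
fig.tight_layout()
```

From here, fig.savefig() with a path of your choice writes the figure to disk, and plt.show() displays it in an interactive session.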

Best Practices for Working with Data

  1. Data Cleaning: Handle missing values (df.fillna()), remove duplicates (df.drop_duplicates()), and deal with outliers.
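  2. Efficient Data Access: Pass chunksize to pd.read_csv() to read large CSV files in chunks instead of loading the whole file into memory at once.
  3. Handling Data Types: Ensure correct column data types (df['Age'] = df['Age'].astype(int)). This conversion raises an error if the column still contains missing values, so fill or drop them first.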
  4. Documentation: Clearly document code and reasoning behind transformations or computations.
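
A rough sketch of the first three practices follows. The data is made up for illustration, filling with the median is just one possible strategy, and io.StringIO stands in for a large CSV file on disk so the example stays self-contained.

```python
import io
import pandas as pd

# Illustrative raw data with a duplicated row and a missing Age.
raw = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie'],
    'Age': [25.0, 31.0, 31.0, None],
})

# 1. Cleaning: drop the repeated row, then fill the missing Age.
clean = raw.drop_duplicates()
clean = clean.fillna({'Age': clean['Age'].median()})

# 3. Data types: the cast is safe now that no missing values remain.
clean['Age'] = clean['Age'].astype(int)

# 2. Chunked reading: process a CSV two rows at a time.
csv_text = "Name,Age\nAlice,25\nBob,30\nCharlie,35\nDana,41\nEve,52\n"
total_age = 0
row_count = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Each chunk is an ordinary DataFrame.
    total_age += chunk['Age'].sum()
    row_count += len(chunk)

mean_age = total_age / row_count
print(clean)
print(mean_age)
```

Accumulating a running sum and count is one way to compute a statistic over a file that is too large to hold in memory.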

By completing these exercises and working through these concepts, learners gain a solid foundation in handling and analyzing data with pandas and matplotlib. These skills are essential for data-driven tasks such as cleaning, analyzing, and visualizing data.

📊 Introduction to Pandas: Creating, Reading, and Manipulating DataFrames

Objectives:

  • Use Python libraries to handle and analyze data.
  • Learn to perform basic data manipulation with pandas.
  • Understand data analysis techniques such as filtering, sorting, and aggregating data.
  • Gain insight into data visualization using matplotlib.

What is pandas?

pandas is an open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is primarily used for working with tabular data in the form of DataFrames. It allows you to efficiently manipulate, clean, and analyze structured data.

  • Series: A one-dimensional labeled array that can hold any data type (integers, strings, etc.).
  • DataFrame: A two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
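
To illustrate how the two structures relate, here is a minimal sketch with made-up values. It builds a labeled Series and a DataFrame that share an index, then shows that selecting one DataFrame column yields a Series.

```python
import pandas as pd

# A Series: one-dimensional values with labels.
ages = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'], name='Age')

# A DataFrame: columns of possibly different types sharing one index.
df = pd.DataFrame(
    {
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago'],
    },
    index=['Alice', 'Bob', 'Charlie'],
)

print(ages['Bob'])   # Label-based lookup on a Series
print(df.shape)      # (rows, columns)
print(df.dtypes)     # One dtype per column

# Selecting a single column returns a Series.
age_column = df['Age']
```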

Key Operations in pandas:

import pandas as pd

Creating DataFrames:

You can create a DataFrame from in-memory Python objects, such as a dictionary of lists, or by loading data from formats like CSV, Excel, or SQL databases.

# Creating DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
}
df = pd.DataFrame(data)
print(df)

Reading Data:

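pandas provides reader functions for many sources, such as pd.read_csv() for CSV files, pd.read_excel() for Excel workbooks, and pd.read_sql() for SQL queries.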

# Read data from a CSV file
df = pd.read_csv('data.csv')
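
Since data.csv is not provided here, the sketch below reads the same kind of content from an in-memory string instead. The CSV text is made up, and the usecols and dtype arguments are optional read_csv parameters shown only as examples of controlling what gets loaded.

```python
import io
import pandas as pd

# Illustrative CSV content; io.StringIO stands in for a file on disk.
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
Charlie,35,Chicago
"""

# Load only two columns and state the dtype of Age explicitly.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['Name', 'Age'],
    dtype={'Age': 'int64'},
)

print(df.head())  # First rows of the DataFrame
df.info()         # Column names, non-null counts, and dtypes
```

With a real file, the first argument would simply be the file path.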

Manipulating DataFrames:

  • Selecting Columns: df['Age'] # Selects the Age column
  • Filtering Rows: df[df['Age'] > 30] # Returns rows where Age > 30
  • Adding/Removing Columns:

df['Country'] = 'USA' # Adds a new column; the single value is repeated for every row
df = df.drop(columns='Country') # Drops the 'Country' column
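
The operations in this section can be tried end to end with a sketch like the one below, which uses the same three illustrative rows as the dictionary example. The .loc and assign() calls are standard pandas additions not covered in the list above; they return new objects and leave the original DataFrame unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Select several columns at once with a list of names.
subset = df[['Name', 'Age']]

# .loc selects rows by condition and a column by label in one step.
names_over_28 = df.loc[df['Age'] > 28, 'Name']

# A scalar is broadcast to every row, so this works for any row count.
with_country = df.assign(Country='USA')

# drop(columns=...) returns a new DataFrame without that column.
without_country = with_country.drop(columns='Country')

print(with_country)
```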