Data Science with Python

NumPy, Pandas, and Machine Learning

Introduction to Data Science

  • Data Science combines statistics, programming, and domain expertise
  • Python is one of the most popular languages for data science
  • Rich ecosystem of libraries for data manipulation and analysis
  • Used in research, business intelligence, machine learning, and AI
  • Skills applicable across many industries

Key Python libraries:

  • NumPy: Numerical computing
  • Pandas: Data manipulation and analysis
  • Matplotlib: Data visualization
  • Scikit-learn: Machine learning

NumPy: Numerical Computing

NumPy provides powerful array operations:

import numpy as np

# Create arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2], [3, 4], [5, 6]])

print(arr1)           # [1 2 3 4 5]
print(arr2.shape)     # (3, 2)
print(arr2.dtype)     # int64 on most platforms (may be int32 on some Windows setups)

Array operations are vectorized, so they apply to every element without an explicit Python loop:

# Element-wise operations
result = arr1 * 2     # [2 4 6 8 10]
squared = arr1 ** 2   # [1 4 9 16 25]

# Mathematical functions
means = np.mean(arr1)    # 3.0
sums = np.sum(arr1)      # 15
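
To make the idea of vectorization concrete, here is a minimal sketch comparing an explicit loop with the vectorized form, plus a small broadcasting example. The array names (`values`, `grid`, `row_offsets`) are made up for illustration.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

# Explicit Python loop: one multiplication per iteration
doubled_loop = np.array([v * 2 for v in values])

# Vectorized form: the same arithmetic expressed on the whole array
doubled_vec = values * 2

# Broadcasting: the 1-D array is applied to each row of the 2-D array
grid = np.array([[1, 2, 3], [4, 5, 6]])
row_offsets = np.array([10, 20, 30])
shifted = grid + row_offsets

print(np.array_equal(doubled_loop, doubled_vec))
print(shifted)
```

Both forms produce the same values; the vectorized one is usually shorter and faster because the loop runs in compiled code.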

NumPy Array Creation

Different ways to create arrays:

# From lists
arr_from_list = np.array([1, 2, 3, 4])

# Zeros and ones
zeros = np.zeros((3, 4))      # 3x4 array of zeros
ones = np.ones((2, 3))        # 2x3 array of ones

# Range arrays
range_arr = np.arange(0, 10, 2)    # [0 2 4 6 8]
linspace_arr = np.linspace(0, 1, 5)  # [0. 0.25 0.5 0.75 1.]

# Random arrays
random_arr = np.random.random((2, 3))  # Uniform random values in [0, 1)
normal_arr = np.random.normal(0, 1, (3, 3))  # Normal distribution
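
Random arrays differ on every run unless a seed is fixed. The sketch below uses NumPy's `default_rng` generator as one way to get reproducible values; the seed 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# A seeded generator produces the same sequence on every run
rng = np.random.default_rng(seed=42)

uniform_arr = rng.random((2, 3))        # floats in [0, 1)
normal_arr = rng.normal(0, 1, (3, 3))   # mean 0, standard deviation 1
dice_rolls = rng.integers(1, 7, size=5) # integers from 1 to 6 (upper bound excluded)

# A second generator with the same seed repeats the first draw exactly
rng_again = np.random.default_rng(seed=42)
is_reproducible = np.array_equal(uniform_arr, rng_again.random((2, 3)))

print(is_reproducible)
```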

Array Indexing and Slicing

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

# Basic indexing
print(arr[0, 1])      # 2 (first row, second column)
print(arr[1])         # [5 6 7 8] (entire second row)

# Slicing
print(arr[:2, 1:3])   # First 2 rows, columns 1-2
print(arr[::2, ::2])  # Every other row and column

# Boolean indexing
mask = arr > 6
print(arr[mask])      # [7 8 9 10 11 12]
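
One detail worth knowing when indexing: a basic slice is a view onto the original array, while boolean indexing returns a copy. The following sketch illustrates the difference on the same 3x4 layout (rebuilt here with `np.arange` so the example is self-contained).

```python
import numpy as np

arr = np.arange(1, 13).reshape(3, 4)

# Basic slicing returns a view: writing to it changes the original
view = arr[:2, 1:3]
view[0, 0] = 99            # this also sets arr[0, 1]

# Boolean indexing returns a copy: writing to it leaves the original alone
picked = arr[arr > 6]
picked[0] = -1

# Boolean masks can also be used for assignment (done on a copy here)
clipped = arr.copy()
clipped[clipped > 10] = 10

print(arr[0, 1])
print(np.shares_memory(view, arr), np.shares_memory(picked, arr))
```

Call `.copy()` on a slice when the original must stay unchanged.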

Pandas: Data Manipulation

Pandas provides DataFrame for structured data:

import pandas as pd

# Create DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Paris'],
    'Salary': [70000, 80000, 90000, 75000]
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age      City  Salary
0    Alice   25  New York   70000
1      Bob   30    London   80000
2  Charlie   35     Tokyo   90000
3    Diana   28     Paris   75000

DataFrame Operations

Basic DataFrame operations:

# Basic info
print(df.shape)          # (4, 4)
print(list(df.columns))  # ['Name', 'Age', 'City', 'Salary']
df.info()                # Prints data types and memory usage (returns None)

# Selecting columns
names = df['Name']              # Single column (Series)
subset = df[['Name', 'Age']]    # Multiple columns (DataFrame)

# Selecting rows
first_row = df.iloc[0]          # By position
alice_data = df[df['Name'] == 'Alice']  # By condition

# Statistics
print(df['Age'].mean())         # 29.5
print(df['Salary'].max())       # 90000
print(df.describe())            # Summary statistics
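
Row selection by position (`iloc`) has a label-based counterpart, `loc`. Here is a small sketch of both, rebuilding the same DataFrame so it runs on its own; indexing by `Name` is an assumption made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Paris'],
    'Salary': [70000, 80000, 90000, 75000],
})

# Position-based: first two rows, first two columns (end is excluded)
first_two = df.iloc[:2, :2]

# Label-based: use Name as the index, then look up by label
by_name = df.set_index('Name')
bob_salary = by_name.loc['Bob', 'Salary']

# Label slices include BOTH endpoints, unlike position slices
label_slice = by_name.loc['Alice':'Charlie', ['Age', 'City']]

print(first_two)
print(bob_salary)
print(label_slice)
```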

Data Filtering and Selection

# Filtering data
high_earners = df[df['Salary'] > 75000]
young_people = df[df['Age'] < 30]

# Multiple conditions
young_high_earners = df[(df['Age'] < 30) & (df['Salary'] > 70000)]

# String operations
tokyo_people = df[df['City'].str.contains('Tokyo')]

# Sorting
sorted_by_age = df.sort_values('Age')
sorted_by_multiple = df.sort_values(['City', 'Age'], ascending=[True, False])
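
A few other filtering idioms are often handy alongside boolean masks. This sketch shows `isin`, `between`, and `query` on the same sample data; it is one possible style, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Paris'],
    'Salary': [70000, 80000, 90000, 75000],
})

# Membership test against a list of values
in_europe = df[df['City'].isin(['London', 'Paris'])]

# Range test, inclusive at both ends by default
mid_age = df[df['Age'].between(28, 30)]

# query() expresses multiple conditions as a string
young_high_earners = df.query('Age < 30 and Salary > 70000')

print(in_europe['Name'].tolist())
print(mid_age['Name'].tolist())
print(young_high_earners['Name'].tolist())
```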

Data Cleaning

Handle missing data:

# Create data with missing values
data_with_nulls = {
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8],
    'C': [9, 10, 11, None]
}

df_nulls = pd.DataFrame(data_with_nulls)

# Check for missing values
print(df_nulls.isnull().sum())

# Drop rows with any null values
df_clean = df_nulls.dropna()

# Fill missing values
df_filled = df_nulls.fillna(0)              # Fill with 0
df_mean_filled = df_nulls.fillna(df_nulls.mean())  # Fill with mean
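
Dropping or filling everything the same way is not always appropriate. The sketch below shows two finer-grained options: dropping rows only when a specific column is missing, and filling each column with its own value. The choice of median, 0, and mean per column is arbitrary and only for illustration.

```python
import pandas as pd

df_nulls = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8],
    'C': [9, 10, 11, None],
})

# Drop a row only if column 'A' is missing
df_a_present = df_nulls.dropna(subset=['A'])

# Fill each column with a different value via a dict
df_custom = df_nulls.fillna({
    'A': df_nulls['A'].median(),  # median of the non-missing values
    'B': 0,
    'C': df_nulls['C'].mean(),    # mean of the non-missing values
})

print(len(df_a_present))
print(df_custom)
```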

GroupBy Operations

Group data for analysis:

# Sample sales data
sales_data = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'Sales': [100, 150, 120, 180, 110, 160],
    'Quantity': [10, 15, 12, 18, 11, 16]
})

# Group by product (select the numeric columns so Region strings are not summed)
by_product = sales_data.groupby('Product')[['Sales', 'Quantity']].sum()
print(by_product)

# Group by multiple columns
by_product_region = sales_data.groupby(['Product', 'Region']).mean()

# Apply multiple functions
summary = sales_data.groupby('Product').agg({
    'Sales': ['sum', 'mean'],
    'Quantity': ['sum', 'max']
})
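
Passing a dict of lists to `agg` produces two-level column names. As an alternative, pandas supports named aggregation, which gives flat, readable column names. A minimal sketch on the same sales data (the output column names are made up here):

```python
import pandas as pd

sales_data = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'Sales': [100, 150, 120, 180, 110, 160],
    'Quantity': [10, 15, 12, 18, 11, 16],
})

# Each keyword is an output column: (source column, aggregation function)
summary_flat = sales_data.groupby('Product').agg(
    total_sales=('Sales', 'sum'),
    avg_sales=('Sales', 'mean'),
    max_quantity=('Quantity', 'max'),
)

print(summary_flat)
```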

Data Visualization with Matplotlib

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Configure matplotlib for slides
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['figure.dpi'] = 100
plt.style.use('dark_background')  # Better for dark theme

# Sample data  
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Sample DataFrame for other plots
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'], 
    'Age': [25, 30, 35, 28],
    'Salary': [70000, 80000, 90000, 75000]
})

# Create plots
plt.figure(figsize=(10, 6))

# Line plot
plt.subplot(2, 2, 1)
plt.plot(x, y1, label='sin(x)')
plt.plot(x, y2, label='cos(x)')
plt.title('Trigonometric Functions')
plt.legend()

# Scatter plot
plt.subplot(2, 2, 2)
plt.scatter(df['Age'], df['Salary'])
plt.xlabel('Age')
plt.ylabel('Salary')
plt.title('Age vs Salary')

# Bar plot
plt.subplot(2, 2, 3)
df.set_index('Name')['Salary'].plot(kind='bar')
plt.title('Salary by Person')

# Histogram
plt.subplot(2, 2, 4)
plt.hist(np.random.normal(100, 15, 1000), bins=30)
plt.title('Normal Distribution')

plt.tight_layout()
plt.show()
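
The same kind of figure can also be built with Matplotlib's object-oriented interface, where each subplot is an explicit `Axes` object rather than the "current" plot. This is a sketch of that style with two panels; saving to an in-memory buffer instead of calling `plt.show()` is a choice made here so the example runs without a display.

```python
import io

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# plt.subplots returns the figure and an array of Axes objects
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot(x, np.sin(x), label='sin(x)')
axes[0].plot(x, np.cos(x), label='cos(x)')
axes[0].set_title('Trigonometric Functions')
axes[0].legend()

rng = np.random.default_rng(0)
axes[1].hist(rng.normal(100, 15, 1000), bins=30)
axes[1].set_title('Normal Distribution')

fig.tight_layout()

# Render to an in-memory PNG, then release the figure
buffer = io.BytesIO()
fig.savefig(buffer, format='png')
plt.close(fig)
```

Explicit `Axes` objects tend to be easier to manage once a figure has more than a couple of panels.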

Reading and Writing Data

Common data formats:

# CSV files
df_from_csv = pd.read_csv('data.csv')
df.to_csv('output.csv', index=False)

# Excel files
df_from_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df.to_excel('output.xlsx', index=False)

# JSON files
df_from_json = pd.read_json('data.json')
df.to_json('output.json', orient='records')

# SQL databases
import sqlite3
conn = sqlite3.connect('database.db')
df_from_sql = pd.read_sql('SELECT * FROM table_name', conn)  # table_name is a placeholder
df.to_sql('new_table', conn, if_exists='replace', index=False)
conn.close()  # Release the database connection when done
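
To try these calls without creating files, the round trip can be done in memory. The sketch below uses a `StringIO` buffer for CSV and an in-memory SQLite database; the table name `people` and the tiny DataFrame are made up for the example.

```python
import io
import sqlite3

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# CSV round trip through an in-memory text buffer
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
csv_buffer.seek(0)  # rewind before reading
df_csv = pd.read_csv(csv_buffer)

# SQL round trip through an in-memory SQLite database
conn = sqlite3.connect(':memory:')
try:
    df.to_sql('people', conn, if_exists='replace', index=False)
    df_sql = pd.read_sql('SELECT * FROM people WHERE Age > 26', conn)
finally:
    conn.close()

print(df_csv)
print(df_sql)
```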

Introduction to Machine Learning

Scikit-learn for machine learning:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Sample dataset
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
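
A fitted linear regression is just a slope and an intercept, and it can help to inspect them directly. This sketch refits the same synthetic dataset (on all rows, for brevity) and checks that the manual formula matches `predict`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
model = LinearRegression().fit(X, y)

# Learned parameters: one coefficient per feature, plus an intercept
slope = model.coef_[0]
intercept = model.intercept_

# The prediction is the line equation applied to each sample
manual_pred = slope * X[:, 0] + intercept
matches = np.allclose(manual_pred, model.predict(X))

print(f"slope={slope:.2f}, intercept={intercept:.2f}, matches={matches}")
```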

Classification Example

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Generate classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
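
Accuracy alone hides which kinds of mistakes a classifier makes. As a possible follow-up, the sketch below prints a confusion matrix and the most influential features for the same kind of random forest; the choice of the top five features is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, clf.predict(X_test))

# Importances sum to 1; argsort gives feature indices, largest first
importances = clf.feature_importances_
top_features = importances.argsort()[::-1][:5]

print(cm)
print("Top feature indices:", top_features)
```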

Real-World Example: Sales Analysis

# Generate sample sales data for demonstration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Create sample sales data
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
products = ['Laptop', 'Desktop', 'Tablet', 'Phone', 'Monitor']
customers = [f'Customer_{i:03d}' for i in range(1, 101)]

sales_data = []
for _ in range(1000):
    date = np.random.choice(dates)
    product = np.random.choice(products)
    customer = np.random.choice(customers)
    quantity = np.random.randint(1, 5)
    unit_price = np.random.uniform(100, 2000)
    revenue = quantity * unit_price
    
    sales_data.append({
        'Date': date,
        'Product': product,
        'CustomerID': customer,
        'Quantity': quantity,
        'Revenue': revenue
    })

sales_df = pd.DataFrame(sales_data)
sales_df['Date'] = pd.to_datetime(sales_df['Date'])
sales_df['Month'] = sales_df['Date'].dt.month
sales_df['Year'] = sales_df['Date'].dt.year
sales_df['DayOfWeek'] = sales_df['Date'].dt.dayofweek

print(f"Generated {len(sales_df)} sales records")
print(f"Date range: {sales_df['Date'].min()} to {sales_df['Date'].max()}")

# Analysis
monthly_sales = sales_df.groupby(['Year', 'Month'])['Revenue'].sum()
product_performance = sales_df.groupby('Product').agg({
    'Revenue': 'sum',
    'Quantity': 'sum',
    'CustomerID': 'nunique'
}).rename(columns={'CustomerID': 'UniqueCustomers'})

# Configure matplotlib for dark theme
plt.style.use('dark_background')
plt.figure(figsize=(15, 10))

# Monthly sales trend
plt.subplot(2, 2, 1)
monthly_sales.plot(color='cyan', linewidth=2)
plt.title('Monthly Sales Trend', color='white')
plt.ylabel('Revenue', color='white')

# Product performance
plt.subplot(2, 2, 2)
product_performance['Revenue'].plot(kind='bar', color='lightgreen')
plt.title('Revenue by Product', color='white')
plt.xticks(rotation=45, color='white')

# Customer distribution
plt.subplot(2, 2, 3)
sales_df['CustomerID'].value_counts().head(10).plot(kind='bar', color='orange')
plt.title('Top 10 Customers by Transaction Count', color='white')

# Sales by day of week
plt.subplot(2, 2, 4)
sales_df.groupby('DayOfWeek')['Revenue'].mean().plot(kind='bar', color='lightcoral')
plt.title('Average Revenue per Transaction by Day of Week', color='white')
plt.xlabel('Day of Week (0=Monday)', color='white')

plt.tight_layout()
plt.show()
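
Grouped results like these are often easier to read as a two-way table. Here is a small sketch using `pivot_table` on a hand-written four-row sample (not the generated data above) to show revenue by month and product.

```python
import pandas as pd

# Tiny illustrative sample with the same column names as the sales data
sales = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-03', '2023-02-14']),
    'Product': ['Laptop', 'Phone', 'Laptop', 'Laptop'],
    'Revenue': [1200.0, 800.0, 1500.0, 900.0],
})
sales['Month'] = sales['Date'].dt.month

# Rows are months, columns are products, cells are summed revenue
revenue_table = sales.pivot_table(
    index='Month',
    columns='Product',
    values='Revenue',
    aggfunc='sum',
    fill_value=0,  # months with no sales of a product show 0
)

print(revenue_table)
```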

Predictive Modeling Example

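# Prepare data for modeling
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# UnitPrice is not stored in sales_df, so derive it from Revenue and Quantity
sales_df['UnitPrice'] = sales_df['Revenue'] / sales_df['Quantity']

# Feature selection (copy so later column assignments do not touch sales_df)
features = ['Quantity', 'UnitPrice', 'Month', 'DayOfWeek']
X = sales_df[features].copy()
y = sales_df['Revenue']

# Handle categorical variables (only if a Category column exists)
le = LabelEncoder()
if 'Category' in sales_df.columns:
    X['Category_encoded'] = le.fit_transform(sales_df['Category'])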

# Split and scale data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multiple models
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'SVM': SVR(kernel='rbf')
}

# Compare model performance
results = {}
for name, model in models.items():
    if name == 'SVM':
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results[name] = {'MSE': mse, 'R²': r2}
    print(f"{name}: MSE={mse:.2f}, R²={r2:.3f}")
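
The special case for the SVM exists because it needs scaled inputs. One way to avoid the branch is to wrap the scaler and the model in a scikit-learn pipeline, so every entry in the dict is fitted the same way. This is a sketch on synthetic data from `make_regression`, not on the sales data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=4, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    'Linear Regression': LinearRegression(),
    # The pipeline scales inside fit/predict, so no separate scaled arrays are needed
    'SVM (scaled)': make_pipeline(StandardScaler(), SVR(kernel='rbf')),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))
    print(f"{name}: R²={scores[name]:.3f}")
```

Because the scaler is fitted only on the training data passed to `fit`, the pipeline also guards against leaking test-set statistics into preprocessing.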

Exercise Time! 🚀

Challenge 1: Data Analysis

# Given this dataset, perform basic analysis
data = {
    'Name': ['John', 'Alice', 'Bob', 'Diana', 'Charlie'],
    'Age': [25, 30, 35, 28, 32],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
    'Salary': [70000, 80000, 90000, 75000, 72000],
    'Years_Experience': [2, 5, 8, 3, 4]
}

# Tasks:
# 1. Create DataFrame
# 2. Find average salary by department
# 3. Find correlation between age and salary
# 4. Create a simple visualization

Challenge 2: Simple Prediction

# Create a simple linear regression model to predict salary
# based on years of experience

Exercise Solutions

Data Analysis Solution:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Dataset from Challenge 1
data = {
    'Name': ['John', 'Alice', 'Bob', 'Diana', 'Charlie'],
    'Age': [25, 30, 35, 28, 32],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
    'Salary': [70000, 80000, 90000, 75000, 72000],
    'Years_Experience': [2, 5, 8, 3, 4]
}

# 1. Create DataFrame
df = pd.DataFrame(data)

# 2. Average salary by department
dept_salary = df.groupby('Department')['Salary'].mean()
print("Average Salary by Department:")
print(dept_salary)

# 3. Correlation between age and salary
correlation = df['Age'].corr(df['Salary'])
print(f"\nCorrelation between Age and Salary: {correlation:.3f}")

# 4. Visualization with dark theme
plt.style.use('dark_background')
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
dept_salary.plot(kind='bar', color='lightblue')
plt.title('Average Salary by Department', color='white')
plt.xticks(rotation=45, color='white')
plt.ylabel('Salary', color='white')

plt.subplot(1, 3, 2)
plt.scatter(df['Age'], df['Salary'], color='lightgreen', s=100, alpha=0.7)
plt.xlabel('Age', color='white')
plt.ylabel('Salary', color='white')
plt.title('Age vs Salary', color='white')

plt.subplot(1, 3, 3)
plt.scatter(df['Years_Experience'], df['Salary'], color='lightcoral', s=100, alpha=0.7)
plt.xlabel('Years Experience', color='white')
plt.ylabel('Salary', color='white')
plt.title('Experience vs Salary', color='white')

plt.tight_layout()
plt.show()
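
Simple Prediction Solution (one possible approach):

The slides give no solution for Challenge 2, so the following is only a sketch. It fits a linear regression of salary on years of experience using the Challenge 1 data; the 6-year query point is an arbitrary choice for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'Years_Experience': [2, 5, 8, 3, 4],
    'Salary': [70000, 80000, 90000, 75000, 72000],
})

# scikit-learn expects a 2-D feature table, hence the double brackets
X = df[['Years_Experience']]
y = df['Salary']

model = LinearRegression()
model.fit(X, y)

# Predict for a hypothetical employee with 6 years of experience
new_experience = pd.DataFrame({'Years_Experience': [6]})
predicted_salary = model.predict(new_experience)[0]

print(f"Salary increase per year of experience: {model.coef_[0]:.2f}")
print(f"Predicted salary at 6 years: {predicted_salary:.2f}")
```

With only five rows there is no meaningful train/test split, so treat the fit as a demonstration of the API rather than a validated model.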

Upload Your Own Data Files

Quarto Drop enables drag-and-drop file uploads:

mydata.csv
# Drag and drop your CSV file here

import pandas as pd
import matplotlib.pyplot as plt

try:
    df = pd.read_csv('mydata.csv')
    print(f"📊 Uploaded: {df.shape[0]} rows, {df.shape[1]} columns")
    print("\nFirst few rows:")
    print(df.head(3))
    
    # Quick visualization if numeric data exists
    numeric_cols = df.select_dtypes(include=['number']).columns
    if len(numeric_cols) >= 2:
        plt.figure(figsize=(8, 5))
        plt.scatter(df[numeric_cols[0]], df[numeric_cols[1]], alpha=0.6)
        plt.xlabel(numeric_cols[0])
        plt.ylabel(numeric_cols[1])
        plt.title('Your Data Visualization')
        plt.show()
        
except FileNotFoundError:
    print("📁 Drop a CSV file above to analyze it!")

Best Practices

  • Start with data exploration: Understand your data before analysis
  • Clean data thoroughly: Handle missing values and outliers
  • Use appropriate visualizations: Choose charts that tell the story
  • Validate your models: Use proper train/test splits and cross-validation
  • Document your process: Keep track of your analysis steps
  • Consider domain knowledge: Statistics + domain expertise = insights
  • Iterate and refine: Data science is an iterative process
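
The cross-validation point above can be sketched briefly with scikit-learn's `cross_val_score`. The dataset is synthetic and the choice of 5 folds is a common default, not a recommendation specific to these slides.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

# Each of the 5 folds serves once as the held-out set
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')

print("R² per fold:", scores.round(3))
print(f"Mean R²: {scores.mean():.3f} (std {scores.std():.3f})")
```

A large spread between folds suggests the score from a single train/test split should not be trusted on its own.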

Common Tools and Libraries

Essential libraries:

  • NumPy: Numerical computing foundation
  • Pandas: Data manipulation and analysis
  • Matplotlib/Seaborn: Data visualization
  • Scikit-learn: Machine learning algorithms
  • Jupyter: Interactive development environment

Advanced libraries:

  • TensorFlow/PyTorch: Deep learning
  • Statsmodels: Statistical modeling
  • Plotly: Interactive visualizations
  • Beautiful Soup: Web scraping
  • Requests: API data collection

Data Science Workflow

Typical data science workflow:

  1. Problem Definition: What are you trying to solve?
  2. Data Collection: Gather relevant data
  3. Data Exploration: Understand the data structure and patterns
  4. Data Cleaning: Handle missing values, outliers, and inconsistencies
  5. Feature Engineering: Create relevant features for modeling
  6. Model Development: Train and tune machine learning models
  7. Model Evaluation: Assess model performance
  8. Deployment: Put models into production
  9. Monitoring: Track model performance over time
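
Steps 2 through 7 of this workflow can be sketched end to end in a few lines. Everything below is illustrative: the data is synthetic, the column names `experience` and `salary` are invented, and median imputation is just one possible cleaning choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Steps 2-3: collect and explore (synthetic data stands in for a real source)
rng = np.random.default_rng(42)
data = pd.DataFrame({'experience': rng.uniform(0, 20, 200)})
data['salary'] = 40000 + 3000 * data['experience'] + rng.normal(0, 5000, 200)
data.loc[data.sample(frac=0.1, random_state=0).index, 'experience'] = np.nan
missing_count = int(data['experience'].isnull().sum())
print(f"Missing experience values: {missing_count}")

# Step 5: choose features and target, then hold out a test set
X = data[['experience']]
y = data['salary']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Steps 4 and 6: cleaning and modeling combined in one pipeline
workflow = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('model', LinearRegression()),
])
workflow.fit(X_train, y_train)

# Step 7: evaluate on data the model has not seen
r2 = r2_score(y_test, workflow.predict(X_test))
print(f"Test R²: {r2:.3f}")
```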

Summary

  • Python provides excellent tools for data science
  • NumPy enables efficient numerical computing
  • Pandas makes data manipulation and analysis easy
  • Matplotlib and Seaborn provide powerful visualization capabilities
  • Scikit-learn offers comprehensive machine learning algorithms
  • Real-world data science requires exploration, cleaning, and iteration
  • Combine statistical knowledge with programming skills for best results

Next Steps

Continue learning:

  • Practice with real datasets (Kaggle, UCI ML Repository)
  • Learn about different types of machine learning problems
  • Explore deep learning with TensorFlow or PyTorch
  • Study statistics and probability theory
  • Work on end-to-end projects
  • Join data science communities and competitions