Data Science with Python

NumPy, Pandas, and Machine Learning

Introduction to Data Science

Data Science combines statistics, programming, and domain expertise
Python is one of the most popular languages for data science
Rich ecosystem of libraries for data manipulation and analysis
Used in research, business intelligence, machine learning, and AI
Skills applicable across many industries

Key Python libraries:

NumPy: Numerical computing
Pandas: Data manipulation and analysis
Matplotlib: Data visualization
Scikit-learn: Machine learning

NumPy: Numerical Computing

NumPy provides powerful array operations:

import numpy as np

# Create arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2], [3, 4], [5, 6]])

print(arr1)           # [1 2 3 4 5]
print(arr2.shape)     # (3, 2)
print(arr2.dtype)     # int64 (platform-dependent; may be int32 on some systems)

Array operations are vectorized:

# Element-wise operations
result = arr1 * 2     # [2 4 6 8 10]
squared = arr1 ** 2   # [1 4 9 16 25]

# Mathematical functions
means = np.mean(arr1)    # 3.0
sums = np.sum(arr1)      # 15
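
To make "vectorized" concrete, the sketch below computes the same result twice: once with an explicit Python loop and once with a single array expression. It also shows broadcasting, where a smaller array is stretched to match a larger one. The names `values`, `double_with_loop`, `matrix`, and `offsets` are illustrative and not part of the slides.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

def double_with_loop(arr):
    """Double each element one at a time (hypothetical helper, for comparison only)."""
    out = np.empty_like(arr)
    for i in range(len(arr)):
        out[i] = arr[i] * 2
    return out

looped = double_with_loop(values)
vectorized = values * 2  # one expression applied to every element

print(np.array_equal(looped, vectorized))  # True

# Broadcasting: the shape-(2,) array is applied to every row of the (3, 2) matrix
matrix = np.array([[1, 2], [3, 4], [5, 6]])
offsets = np.array([10, 20])
shifted = matrix + offsets
print(shifted)
```

The vectorized form is usually both shorter and faster, because the loop runs in compiled code rather than in the Python interpreter.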

NumPy Array Creation

Different ways to create arrays:

# From lists
arr_from_list = np.array([1, 2, 3, 4])

# Zeros and ones
zeros = np.zeros((3, 4))      # 3x4 array of zeros
ones = np.ones((2, 3))        # 2x3 array of ones

# Range arrays
range_arr = np.arange(0, 10, 2)    # [0 2 4 6 8]
linspace_arr = np.linspace(0, 1, 5)  # [0. 0.25 0.5 0.75 1.]

# Random arrays
random_arr = np.random.random((2, 3))  # Random values 0-1
normal_arr = np.random.normal(0, 1, (3, 3))  # Normal distribution
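
The random arrays above change on every run. When results need to be reproducible, one option is a seeded generator created with `np.random.default_rng`. The sketch below assumes a reasonably recent NumPy; the seed value 42 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen arbitrarily for this sketch

uniform_arr = rng.random((2, 3))        # uniform values in [0, 1)
normal_arr = rng.normal(0, 1, (3, 3))   # mean 0, standard deviation 1
int_arr = rng.integers(0, 10, size=5)   # integers from 0 to 9

# A new generator with the same seed reproduces the same sequence
rng_again = np.random.default_rng(42)
print(np.array_equal(uniform_arr, rng_again.random((2, 3))))  # True
```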

Array Indexing and Slicing

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

# Basic indexing
print(arr[0, 1])      # 2 (first row, second column)
print(arr[1])         # [5 6 7 8] (entire second row)

# Slicing
print(arr[:2, 1:3])   # First 2 rows, columns 1-2
print(arr[::2, ::2])  # Every other row and column

# Boolean indexing
mask = arr > 6
print(arr[mask])      # [7 8 9 10 11 12]
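
One detail worth knowing about the indexing styles above: basic slices return views onto the original data, while boolean indexing returns a copy. The following sketch reuses the same 3x4 array to show the difference; the variable names `view` and `selected` are illustrative.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

# A basic slice is a view: writing to it changes the original array
view = arr[:2, 1:3]
view[0, 0] = 99
print(arr[0, 1])      # 99

# Boolean indexing returns a copy: the original is unaffected
selected = arr[arr > 6]
selected[0] = -1
print(arr.min())      # 1

# Assigning through a mask does modify the original in place
arr[arr > 10] = 0
print(arr)
```

Call `.copy()` on a slice when an independent array is needed.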

Pandas: Data Manipulation

Pandas provides DataFrame for structured data:

import pandas as pd

# Create DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Paris'],
    'Salary': [70000, 80000, 90000, 75000]
}

df = pd.DataFrame(data)
print(df)

Output:

      Name  Age      City  Salary
0    Alice   25  New York   70000
1      Bob   30    London   80000
2  Charlie   35     Tokyo   90000
3    Diana   28     Paris   75000

DataFrame Operations

Basic DataFrame operations:

# Basic info
print(df.shape)         # (4, 4)
print(df.columns.tolist())  # ['Name', 'Age', 'City', 'Salary']
df.info()               # Prints data types and memory usage (returns None)

# Selecting columns
names = df['Name']              # Single column (Series)
subset = df[['Name', 'Age']]    # Multiple columns (DataFrame)

# Selecting rows
first_row = df.iloc[0]          # By position
alice_data = df[df['Name'] == 'Alice']  # By condition

# Statistics
print(df['Age'].mean())         # 29.5
print(df['Salary'].max())       # 90000
print(df.describe())            # Summary statistics
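
Row selection by position (`iloc`) has a label-based counterpart, `loc`. The sketch below rebuilds the same DataFrame and sets `Name` as the index so the two can be compared side by side; `by_name` is an illustrative variable name.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Paris'],
    'Salary': [70000, 80000, 90000, 75000]
})

# Use names as the index so label- and position-based access differ visibly
by_name = df.set_index('Name')

print(by_name.loc['Bob', 'Salary'])   # label-based: 80000
print(by_name.iloc[1, 2])             # position-based (row 1, column 2): 80000

# loc slices include the end label; iloc slices exclude the end position
print(by_name.loc['Alice':'Charlie', 'Age'].tolist())   # [25, 30, 35]
print(by_name.iloc[0:2]['Age'].tolist())                # [25, 30]
```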

Data Filtering and Selection

# Filtering data
high_earners = df[df['Salary'] > 75000]
young_people = df[df['Age'] < 30]

# Multiple conditions
young_high_earners = df[(df['Age'] < 30) & (df['Salary'] > 70000)]

# String operations
tokyo_people = df[df['City'].str.contains('Tokyo')]

# Sorting
sorted_by_age = df.sort_values('Age')
sorted_by_multiple = df.sort_values(['City', 'Age'], ascending=[True, False])
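
Two further filtering idioms are often useful alongside boolean masks: `isin` for matching several values and `query` for writing conditions as a string. This is a small sketch on the same sample data; the variable names `europe` and `not_tokyo` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Paris'],
    'Salary': [70000, 80000, 90000, 75000]
})

# isin: keep rows whose value matches any item in the list
europe = df[df['City'].isin(['London', 'Paris'])]
print(europe['Name'].tolist())               # ['Bob', 'Diana']

# query: the same condition as the boolean-mask version, written as a string
young_high_earners = df.query('Age < 30 and Salary > 70000')
print(young_high_earners['Name'].tolist())   # ['Diana']

# ~ negates a mask
not_tokyo = df[~df['City'].str.contains('Tokyo')]
print(len(not_tokyo))                        # 3
```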

Data Cleaning

Handle missing data:

# Create data with missing values
data_with_nulls = {
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8],
    'C': [9, 10, 11, None]
}

df_nulls = pd.DataFrame(data_with_nulls)

# Check for missing values
print(df_nulls.isnull().sum())

# Drop rows with any null values
df_clean = df_nulls.dropna()

# Fill missing values
df_filled = df_nulls.fillna(0)              # Fill with 0
df_mean_filled = df_nulls.fillna(df_nulls.mean())  # Fill with mean
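
Dropping or filling everything at once is rarely the right choice for real data. As a sketch of a more targeted approach, `dropna` accepts a `subset` of columns and `fillna` accepts a dictionary with one fill value per column. The choice of median, zero, and mean below is purely illustrative.

```python
import pandas as pd

df_nulls = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8],
    'C': [9, 10, 11, None]
})

# Drop rows only when column A is missing
required_a = df_nulls.dropna(subset=['A'])
print(len(required_a))   # 3

# Fill each column with its own strategy
filled = df_nulls.fillna({
    'A': df_nulls['A'].median(),   # median of 1, 2, 4
    'B': 0,
    'C': df_nulls['C'].mean()      # mean of 9, 10, 11
})
print(filled)
```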

GroupBy Operations

Group data for analysis:

# Sample sales data
sales_data = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'Sales': [100, 150, 120, 180, 110, 160],
    'Quantity': [10, 15, 12, 18, 11, 16]
})

# Group by product (sum only the numeric columns; summing 'Region' would concatenate strings)
by_product = sales_data.groupby('Product')[['Sales', 'Quantity']].sum()
print(by_product)

# Group by multiple columns
by_product_region = sales_data.groupby(['Product', 'Region']).mean()

# Apply multiple functions
summary = sales_data.groupby('Product').agg({
    'Sales': ['sum', 'mean'],
    'Quantity': ['sum', 'max']
})
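
The dictionary form of `agg` above produces two-level column names such as `('Sales', 'sum')`. When flat, readable column names are preferred, pandas also supports named aggregation. The sketch below uses the same sales data; the output column names are chosen here for illustration.

```python
import pandas as pd

sales_data = pd.DataFrame({
    'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Region': ['North', 'North', 'South', 'South', 'East', 'East'],
    'Sales': [100, 150, 120, 180, 110, 160],
    'Quantity': [10, 15, 12, 18, 11, 16]
})

# Named aggregation: new_column=(source_column, function)
summary = sales_data.groupby('Product').agg(
    total_sales=('Sales', 'sum'),
    mean_sales=('Sales', 'mean'),
    total_quantity=('Quantity', 'sum'),
    max_quantity=('Quantity', 'max'),
)
print(summary)
```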

Data Visualization with Matplotlib

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Configure matplotlib for slides
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['figure.dpi'] = 100
plt.style.use('dark_background')  # Better for dark theme

# Sample data  
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

# Sample DataFrame for other plots
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'], 
    'Age': [25, 30, 35, 28],
    'Salary': [70000, 80000, 90000, 65000]
})

# Create plots
plt.figure(figsize=(10, 6))

# Line plot
plt.subplot(2, 2, 1)
plt.plot(x, y1, label='sin(x)')
plt.plot(x, y2, label='cos(x)')
plt.title('Trigonometric Functions')
plt.legend()

# Scatter plot
plt.subplot(2, 2, 2)
plt.scatter(df['Age'], df['Salary'])
plt.xlabel('Age')
plt.ylabel('Salary')
plt.title('Age vs Salary')

# Bar plot
plt.subplot(2, 2, 3)
df.set_index('Name')['Salary'].plot(kind='bar')
plt.title('Salary by Person')

# Histogram
plt.subplot(2, 2, 4)
plt.hist(np.random.normal(100, 15, 1000), bins=30)
plt.title('Normal Distribution')

plt.tight_layout()
plt.show()
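
The example above uses the `pyplot` state-machine style (`plt.subplot`, `plt.title`). Matplotlib also offers an object-oriented style in which each panel is an explicit `Axes` object, which tends to scale better as figures grow. The sketch below is one possible rewrite of two of the panels. It selects the non-interactive `Agg` backend and writes to an in-memory buffer so it runs without a display; both of those choices are assumptions made for this sketch, not part of the slides.

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
rng = np.random.default_rng(0)  # seeded so the histogram is reproducible

# One figure, two explicit Axes objects
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot(x, np.sin(x), label='sin(x)')
axes[0].plot(x, np.cos(x), label='cos(x)')
axes[0].set_title('Trigonometric Functions')
axes[0].legend()

axes[1].hist(rng.normal(100, 15, 1000), bins=30)
axes[1].set_title('Normal Distribution')

fig.tight_layout()

# Save to memory instead of calling plt.show()
buffer = io.BytesIO()
fig.savefig(buffer, format='png')
plt.close(fig)
```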

Reading and Writing Data

Common data formats:

# CSV files
df_from_csv = pd.read_csv('data.csv')
df.to_csv('output.csv', index=False)

# Excel files
df_from_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df.to_excel('output.xlsx', index=False)

# JSON files
df_from_json = pd.read_json('data.json')
df.to_json('output.json', orient='records')

# SQL databases
import sqlite3
conn = sqlite3.connect('database.db')
df_from_sql = pd.read_sql('SELECT * FROM table_name', conn)
df.to_sql('new_table', conn, if_exists='replace', index=False)
conn.close()  # Release the database file when done
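
The calls above need files on disk. To try the same read/write round trip without creating any files, the sketch below uses an in-memory text buffer for CSV and an in-memory SQLite database. The table name `people` and the tiny DataFrame are made up for illustration.

```python
import io
import sqlite3

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# CSV round trip through an in-memory buffer
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
df_csv = pd.read_csv(buffer)

# SQL round trip through an in-memory SQLite database
conn = sqlite3.connect(':memory:')
try:
    df.to_sql('people', conn, if_exists='replace', index=False)
    df_sql = pd.read_sql('SELECT * FROM people WHERE Age > 26', conn)
finally:
    conn.close()  # always release the connection

print(df_csv)
print(df_sql)
```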

Introduction to Machine Learning

Scikit-learn for machine learning:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Sample dataset
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
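
A single train/test split gives one estimate of model quality, and that estimate depends on which rows happen to land in the test set. As a sketch of a more stable alternative, k-fold cross-validation with `cross_val_score` evaluates the model on several different splits. The choice of 5 folds here is a common default, not something prescribed by the slides.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

# 5-fold cross-validation: each fold serves as the test set exactly once
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')

print(scores.round(3))
print(f"Mean R²: {scores.mean():.3f} (std {scores.std():.3f})")
```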

Classification Example

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score

# Generate classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
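
Accuracy alone hides which kinds of mistakes a classifier makes. The sketch below repeats the same setup and adds two further views: a confusion matrix, and the forest's `feature_importances_` attribute. Ranking the top five features is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Indices of the five features the forest relied on most, highest first
top_features = clf.feature_importances_.argsort()[::-1][:5]
print(top_features)
```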

Real-World Example: Sales Analysis

# Generate sample sales data for demonstration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# Create sample sales data
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
products = ['Laptop', 'Desktop', 'Tablet', 'Phone', 'Monitor']
customers = [f'Customer_{i:03d}' for i in range(1, 101)]

sales_data = []
for _ in range(1000):
    date = np.random.choice(dates)
    product = np.random.choice(products)
    customer = np.random.choice(customers)
    quantity = np.random.randint(1, 5)
    unit_price = np.random.uniform(100, 2000)
    revenue = quantity * unit_price
    
    sales_data.append({
        'Date': date,
        'Product': product,
        'CustomerID': customer,
        'Quantity': quantity,
        'Revenue': revenue
    })

sales_df = pd.DataFrame(sales_data)
sales_df['Date'] = pd.to_datetime(sales_df['Date'])
sales_df['Month'] = sales_df['Date'].dt.month
sales_df['Year'] = sales_df['Date'].dt.year
sales_df['DayOfWeek'] = sales_df['Date'].dt.dayofweek

print(f"Generated {len(sales_df)} sales records")
print(f"Date range: {sales_df['Date'].min()} to {sales_df['Date'].max()}")

# Analysis
monthly_sales = sales_df.groupby(['Year', 'Month'])['Revenue'].sum()
product_performance = sales_df.groupby('Product').agg({
    'Revenue': 'sum',
    'Quantity': 'sum',
    'CustomerID': 'nunique'
}).rename(columns={'CustomerID': 'UniqueCustomers'})

# Configure matplotlib for dark theme
plt.style.use('dark_background')
plt.figure(figsize=(15, 10))

# Monthly sales trend
plt.subplot(2, 2, 1)
monthly_sales.plot(color='cyan', linewidth=2)
plt.title('Monthly Sales Trend', color='white')
plt.ylabel('Revenue', color='white')

# Product performance
plt.subplot(2, 2, 2)
product_performance['Revenue'].plot(kind='bar', color='lightgreen')
plt.title('Revenue by Product', color='white')
plt.xticks(rotation=45, color='white')

# Customer distribution
plt.subplot(2, 2, 3)
sales_df['CustomerID'].value_counts().head(10).plot(kind='bar', color='orange')
plt.title('Top 10 Customers by Transaction Count', color='white')

# Sales by day of week
plt.subplot(2, 2, 4)
sales_df.groupby('DayOfWeek')['Revenue'].mean().plot(kind='bar', color='lightcoral')
plt.title('Average Revenue per Transaction by Day of Week', color='white')
plt.xlabel('Day of Week (0=Monday)', color='white')

plt.tight_layout()
plt.show()

Predictive Modeling Example

# Prepare data for modeling
from sklearn.preprocessing import StandardScaler, LabelEncoder

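# Feature selection
# UnitPrice is not a column of sales_df as built above; derive it from Revenue and Quantity
sales_df['UnitPrice'] = sales_df['Revenue'] / sales_df['Quantity']
features = ['Quantity', 'UnitPrice', 'Month', 'DayOfWeek']
X = sales_df[features].copy()  # copy so adding columns below does not warn
y = sales_df['Revenue']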

# Handle categorical variables
le = LabelEncoder()
if 'Category' in sales_df.columns:
    X['Category_encoded'] = le.fit_transform(sales_df['Category'])

# Split and scale data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multiple models
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'SVM': SVR(kernel='rbf')
}

# Compare model performance
results = {}
for name, model in models.items():
    if name == 'SVM':
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
    
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results[name] = {'MSE': mse, 'R²': r2}
    print(f"{name}: MSE={mse:.2f}, R²={r2:.3f}")
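
The loop above has to remember which models need scaled inputs. One way to avoid that bookkeeping is to bundle the scaler and the model into a scikit-learn pipeline, so scaling is fitted on the training data and applied automatically at prediction time. The sketch below uses synthetic data from `make_regression` as a stand-in for the sales features; that substitution is an assumption made to keep the example self-contained.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the four sales features (assumption for this sketch)
X, y = make_regression(n_samples=200, n_features=4, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline scales inputs, then fits the SVM on the scaled values
svm_pipeline = make_pipeline(StandardScaler(), SVR(kernel='rbf'))
svm_pipeline.fit(X_train, y_train)

# predict() applies the already-fitted scaler before calling the SVM
y_pred = svm_pipeline.predict(X_test)
print(f"SVM pipeline R²: {r2_score(y_test, y_pred):.3f}")
```

With this pattern, every entry in a `models` dictionary can be fitted and evaluated the same way, whether or not it needs scaling.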

Exercise Time! 🚀

Challenge 1: Data Analysis

# Given this dataset, perform basic analysis
data = {
    'Name': ['John', 'Alice', 'Bob', 'Diana', 'Charlie'],
    'Age': [25, 30, 35, 28, 32],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
    'Salary': [70000, 80000, 90000, 75000, 72000],
    'Years_Experience': [2, 5, 8, 3, 4]
}

# Tasks:
# 1. Create DataFrame
# 2. Find average salary by department
# 3. Find correlation between age and salary
# 4. Create a simple visualization

Challenge 2: Simple Prediction

# Create a simple linear regression model to predict salary
# based on years of experience

Exercise Solutions

Data Analysis Solution:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Dataset from Challenge 1
data = {
    'Name': ['John', 'Alice', 'Bob', 'Diana', 'Charlie'],
    'Age': [25, 30, 35, 28, 32],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
    'Salary': [70000, 80000, 90000, 75000, 72000],
    'Years_Experience': [2, 5, 8, 3, 4]
}

# 1. Create DataFrame
df = pd.DataFrame(data)

# 2. Average salary by department
dept_salary = df.groupby('Department')['Salary'].mean()
print("Average Salary by Department:")
print(dept_salary)

# 3. Correlation between age and salary
correlation = df['Age'].corr(df['Salary'])
print(f"\nCorrelation between Age and Salary: {correlation:.3f}")

# 4. Visualization with dark theme
plt.style.use('dark_background')
plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
dept_salary.plot(kind='bar', color='lightblue')
plt.title('Average Salary by Department', color='white')
plt.xticks(rotation=45, color='white')
plt.ylabel('Salary', color='white')

plt.subplot(1, 3, 2)
plt.scatter(df['Age'], df['Salary'], color='lightgreen', s=100, alpha=0.7)
plt.xlabel('Age', color='white')
plt.ylabel('Salary', color='white')
plt.title('Age vs Salary', color='white')

plt.subplot(1, 3, 3)
plt.scatter(df['Years_Experience'], df['Salary'], color='lightcoral', s=100, alpha=0.7)
plt.xlabel('Years Experience', color='white')
plt.ylabel('Salary', color='white')
plt.title('Experience vs Salary', color='white')

plt.tight_layout()
plt.show()
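
Simple Prediction Solution:

Challenge 2 asks for a linear regression that predicts salary from years of experience. The following is one possible solution sketch using the Challenge 1 dataset. With only five rows there is no meaningful train/test split, so the model is fitted on all of the data; the query value of 6 years is an arbitrary example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'Years_Experience': [2, 5, 8, 3, 4],
    'Salary': [70000, 80000, 90000, 75000, 72000]
})

X = df[['Years_Experience']]   # double brackets keep X two-dimensional
y = df['Salary']

model = LinearRegression()
model.fit(X, y)

print(f"Salary increase per year of experience: {model.coef_[0]:.2f}")
print(f"Intercept: {model.intercept_:.2f}")

# Predict the salary for a hypothetical person with 6 years of experience
new_data = pd.DataFrame({'Years_Experience': [6]})
predicted = model.predict(new_data)[0]
print(f"Predicted salary at 6 years: {predicted:.2f}")
```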

Upload Your Own Data Files

Quarto Drop enables drag-and-drop file uploads:

mydata.csv

# Drag and drop your CSV file here

import pandas as pd
import matplotlib.pyplot as plt

try:
    df = pd.read_csv('mydata.csv')
    print(f"📊 Uploaded: {df.shape[0]} rows, {df.shape[1]} columns")
    print("\nFirst few rows:")
    print(df.head(3))
    
    # Quick visualization if numeric data exists
    numeric_cols = df.select_dtypes(include=['number']).columns
    if len(numeric_cols) >= 2:
        plt.figure(figsize=(8, 5))
        plt.scatter(df[numeric_cols[0]], df[numeric_cols[1]], alpha=0.6)
        plt.xlabel(numeric_cols[0])
        plt.ylabel(numeric_cols[1])
        plt.title('Your Data Visualization')
        plt.show()
        
except FileNotFoundError:
    print("📁 Drop a CSV file above to analyze it!")

Best Practices

Start with data exploration: Understand your data before analysis
Clean data thoroughly: Handle missing values and outliers
Use appropriate visualizations: Choose charts that tell the story
Validate your models: Use proper train/test splits and cross-validation
Document your process: Keep track of your analysis steps
Consider domain knowledge: Statistics + domain expertise = insights
Iterate and refine: Data science is an iterative process
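
As one way to act on the advice about outliers, the sketch below flags values using the common 1.5 × IQR rule. The salary figures are invented for illustration, and the 1.5 multiplier is a convention rather than a fixed requirement.

```python
import pandas as pd

# Invented salaries with one obviously extreme value
salaries = pd.Series([70000, 80000, 90000, 75000, 72000, 500000])

q1 = salaries.quantile(0.25)
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = salaries[(salaries < lower) | (salaries > upper)]

print(outliers.tolist())   # [500000]
```

Whether a flagged value should be removed, capped, or kept is a judgment call that depends on domain knowledge.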

Common Tools and Libraries

Essential libraries:

NumPy: Numerical computing foundation
Pandas: Data manipulation and analysis
Matplotlib/Seaborn: Data visualization
Scikit-learn: Machine learning algorithms
Jupyter: Interactive development environment

Advanced libraries:

TensorFlow/PyTorch: Deep learning
Statsmodels: Statistical modeling
Plotly: Interactive visualizations
Beautiful Soup: Web scraping
Requests: API data collection

Data Science Workflow

Typical data science workflow:

Problem Definition: What are you trying to solve?
Data Collection: Gather relevant data
Data Exploration: Understand the data structure and patterns
Data Cleaning: Handle missing values, outliers, and inconsistencies
Feature Engineering: Create relevant features for modeling
Model Development: Train and tune machine learning models
Model Evaluation: Assess model performance
Deployment: Put models into production
Monitoring: Track model performance over time

Summary

Python provides excellent tools for data science
NumPy enables efficient numerical computing
Pandas makes data manipulation and analysis easy
Matplotlib and Seaborn provide powerful visualization capabilities
Scikit-learn offers comprehensive machine learning algorithms
Real-world data science requires exploration, cleaning, and iteration
Combine statistical knowledge with programming skills for best results

Next Steps

Continue learning:

Practice with real datasets (Kaggle, UCI ML Repository)
Learn about different types of machine learning problems
Explore deep learning with TensorFlow or PyTorch
Study statistics and probability theory
Work on end-to-end projects
Join data science communities and competitions