# Create data with missing values
data_with_nulls = {
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8],
    'C': [9, 10, 11, None]
}
df_nulls = pd.DataFrame(data_with_nulls)

# Check for missing values
print(df_nulls.isnull().sum())

# Drop rows with any null values
df_clean = df_nulls.dropna()

# Fill missing values
df_filled = df_nulls.fillna(0)  # Fill with 0
df_mean_filled = df_nulls.fillna(df_nulls.mean())  # Fill with mean
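The trade-off between dropping and filling is easier to see with numbers. The sketch below reuses the same toy data; the variable names `rows_kept` and `nulls_remaining` are introduced here only for illustration. Because each of the three missing values sits in a different row, dropping rows discards most of the data, while a mean fill keeps every row.

```python
import pandas as pd

# Same toy data as above: each column has exactly one missing value,
# and each missing value sits in a different row
df_nulls = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8],
    'C': [9, 10, 11, None],
})

# dropna() removes every row that contains at least one null
df_clean = df_nulls.dropna()
rows_kept = len(df_clean)

# fillna(df.mean()) fills each column with that column's own mean
# (the mean is computed from the non-null values only)
df_mean_filled = df_nulls.fillna(df_nulls.mean())
nulls_remaining = int(df_mean_filled.isnull().sum().sum())

print(f"Rows before: {len(df_nulls)}, rows after dropna: {rows_kept}")
print(f"Nulls remaining after mean fill: {nulls_remaining}")
print(df_mean_filled.round(2))
```

On a dataset this small, `dropna()` keeps only the first row, so filling is usually the safer default here. On real data the right choice depends on how much is missing and why.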
# Given this dataset, perform basic analysis
data = {
    'Name': ['John', 'Alice', 'Bob', 'Diana', 'Charlie'],
    'Age': [25, 30, 35, 28, 32],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
    'Salary': [70000, 80000, 90000, 75000, 72000],
    'Years_Experience': [2, 5, 8, 3, 4]
}

# Tasks:
# 1. Create DataFrame
# 2. Find average salary by department
# 3. Find correlation between age and salary
# 4. Create a simple visualization
Challenge 2: Simple Prediction
# Create a simple linear regression model to predict salary
# based on years of experience
Exercise Solutions
Data Analysis Solution:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Sample data (same dataset as the exercise)
data = {
    'Name': ['John', 'Alice', 'Bob', 'Diana', 'Charlie'],
    'Age': [25, 30, 35, 28, 32],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
    'Salary': [70000, 80000, 90000, 75000, 72000],
    'Years_Experience': [2, 5, 8, 3, 4]
}

# 1. Create DataFrame
df = pd.DataFrame(data)

# 2. Average salary by department
dept_salary = df.groupby('Department')['Salary'].mean()
print("Average Salary by Department:")
print(dept_salary)

# 3. Correlation between age and salary
correlation = df['Age'].corr(df['Salary'])
print(f"\nCorrelation between Age and Salary: {correlation:.3f}")

# 4. Visualization with dark theme
plt.style.use('dark_background')
plt.figure(figsize=(12, 4))

# Panel 1: average salary per department
plt.subplot(1, 3, 1)
dept_salary.plot(kind='bar', color='lightblue')
plt.title('Average Salary by Department', color='white')
plt.xticks(rotation=45, color='white')
plt.ylabel('Salary', color='white')

# Panel 2: age against salary
plt.subplot(1, 3, 2)
plt.scatter(df['Age'], df['Salary'], color='lightgreen', s=100, alpha=0.7)
plt.xlabel('Age', color='white')
plt.ylabel('Salary', color='white')
plt.title('Age vs Salary', color='white')

# Panel 3: years of experience against salary
plt.subplot(1, 3, 3)
plt.scatter(df['Years_Experience'], df['Salary'], color='lightcoral', s=100, alpha=0.7)
plt.xlabel('Years Experience', color='white')
plt.ylabel('Salary', color='white')
plt.title('Experience vs Salary', color='white')

plt.tight_layout()
plt.show()
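Simple Prediction Solution (sketch):

One possible approach to Challenge 2 is an ordinary least-squares line fitted with `np.polyfit`. This is a minimal sketch, not the only valid answer: the helper name `predict_salary` and the choice of NumPy over a dedicated machine learning library are assumptions made here to keep the example dependency-light. With only five rows, the fitted numbers are illustrative rather than reliable.

```python
import numpy as np

# Years of experience and salary from the exercise dataset
years_experience = np.array([2, 5, 8, 3, 4], dtype=float)
salary = np.array([70000, 80000, 90000, 75000, 72000], dtype=float)

# Fit salary = slope * years + intercept by ordinary least squares
slope, intercept = np.polyfit(years_experience, salary, deg=1)


def predict_salary(years):
    """Predict salary from years of experience using the fitted line.

    Hypothetical helper introduced for this sketch.
    """
    return slope * np.asarray(years, dtype=float) + intercept


# R^2 measures how much of the salary variance the line explains
predicted = predict_salary(years_experience)
ss_res = np.sum((salary - predicted) ** 2)
ss_tot = np.sum((salary - salary.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"Slope (salary per extra year): {slope:.2f}")
print(f"Intercept: {intercept:.2f}")
print(f"R^2 on the training data: {r_squared:.3f}")
print(f"Predicted salary at 6 years: {float(predict_salary(6)):.2f}")
```

The R^2 reported here is computed on the same rows used for fitting, so it overstates how well the line would predict unseen employees.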
Upload Your Own Data Files
Quarto Drop enables drag-and-drop file uploads:
mydata.csv
# Drag and drop your CSV file here
import pandas as pd
import matplotlib.pyplot as plt

try:
    df = pd.read_csv('mydata.csv')
    print(f"📊 Uploaded: {df.shape[0]} rows, {df.shape[1]} columns")
    print("\nFirst few rows:")
    print(df.head(3))

    # Quick visualization if numeric data exists
    numeric_cols = df.select_dtypes(include=['number']).columns
    if len(numeric_cols) >= 2:
        plt.figure(figsize=(8, 5))
        plt.scatter(df[numeric_cols[0]], df[numeric_cols[1]], alpha=0.6)
        plt.xlabel(numeric_cols[0])
        plt.ylabel(numeric_cols[1])
        plt.title('Your Data Visualization')
        plt.show()
except FileNotFoundError:
    print("📁 Drop a CSV file above to analyze it!")
Best Practices
Start with data exploration: Understand your data before analysis
Clean data thoroughly: Handle missing values and outliers
Use appropriate visualizations: Choose charts that tell the story
Validate your models: Use proper train/test splits and cross-validation
Document your process: Keep track of your analysis steps
Real-world data science requires exploration, cleaning, and iteration
Combine statistical knowledge with programming skills for best results
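The validation advice above can be sketched on the exercise data. With only five rows, a single train/test split would leave too little to train on, so this example swaps in leave-one-out cross-validation: each row is held out once, a line is fitted on the other four, and the held-out row is predicted. The function name `leave_one_out_errors` is hypothetical, and the use of `np.polyfit` as the model is an assumption made for brevity.

```python
import numpy as np

# Years of experience and salary from the exercise dataset
years_experience = np.array([2, 5, 8, 3, 4], dtype=float)
salary = np.array([70000, 80000, 90000, 75000, 72000], dtype=float)


def leave_one_out_errors(x, y):
    """Return the absolute prediction error for each held-out point.

    Hypothetical helper: fits a straight line on all points except one,
    then predicts the point that was left out.
    """
    errors = []
    indices = np.arange(len(x))
    for i in indices:
        train_mask = indices != i  # every row except row i
        slope, intercept = np.polyfit(x[train_mask], y[train_mask], deg=1)
        prediction = slope * x[i] + intercept
        errors.append(abs(y[i] - prediction))
    return np.array(errors)


loo_errors = leave_one_out_errors(years_experience, salary)
mean_abs_error = loo_errors.mean()

for years, err in zip(years_experience, loo_errors):
    print(f"Held out {years:.0f} years: absolute error {err:,.0f}")
print(f"Mean absolute error across folds: {mean_abs_error:,.0f}")
```

Because every prediction is made on a row the model never saw, the mean absolute error gives a more honest picture than a score computed on the training data. On larger datasets a conventional train/test split or k-fold cross-validation is usually the more practical choice.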
Next Steps
Continue learning:
- Practice with real datasets (Kaggle, UCI ML Repository)
- Learn about different types of machine learning problems
- Explore deep learning with TensorFlow or PyTorch
- Study statistics and probability theory
- Work on end-to-end projects
- Join data science communities and competitions
Project ideas:
- Analyze your personal data (fitness, spending, etc.)
- Predict stock prices or sports outcomes
- Build recommendation systems
- Create data dashboards
- Contribute to open-source data science projects