Exploratory analysis and predictive modeling projects built with Python, structured datasets, statistical comparisons, and clear visual communication.
01 / Classification
Breast Cancer Logistic Regression Project
A binary classification workflow that predicts whether a breast tumor is malignant or benign from diagnostic measurements.
The script audits the Wisconsin Diagnostic Breast Cancer dataset, converts the diagnosis into a numeric target, separates 30 measured features from identifiers, and creates an 80/20 train-test split. It standardizes the inputs before fitting a logistic regression model, evaluates train and test accuracy, and inspects coefficient magnitude to identify the measurements with the strongest influence.
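The steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: it assumes scikit-learn is installed and uses its bundled copy of the Wisconsin Diagnostic dataset (`load_breast_cancer`) in place of the UCI download, so feature names differ from the ones shown on this page. The `random_state` value and `max_iter` setting are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled copy stands in for the UCI file.
data = load_breast_cancer()
X = data.data  # 569 rows, 30 measured features, no ID column

# scikit-learn encodes malignant as 0; flip it so malignant = 1.
y = (data.target == 0).astype(int)

# 80/20 train-test split (random_state is an arbitrary choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_std, y_train)

train_acc = model.score(X_train_std, y_train)
test_acc = model.score(X_test_std, y_test)

# Rank features by absolute coefficient size on the standardized inputs.
coefs = model.coef_[0]
order = np.argsort(np.abs(coefs))[::-1]
top_features = [(str(data.feature_names[i]), float(coefs[i])) for i in order[:3]]

print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
print(top_features)
```

Because the inputs are standardized first, the coefficient magnitudes are on a comparable scale, which is what makes ranking them meaningful. Exact scores and rankings depend on the split and solver settings, so they will not necessarily match the figures reported here.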
Test accuracy: 99.1% (113 of 114 test cases correct), reproduced from the referenced UCI dataset and repository pipeline.
69 benign correct
0 false malignant
1 false benign
44 malignant correct
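As a quick check on the four counts above, the standard summary metrics can be derived directly from them. The sketch below treats malignant as the positive class, which is an assumption about how the project frames the problem; the variable names are illustrative.

```python
# Counts from the test-set results above (malignant = positive class).
true_negative = 69   # benign correct
false_positive = 0   # false malignant
false_negative = 1   # false benign
true_positive = 44   # malignant correct

total = true_negative + false_positive + false_negative + true_positive

accuracy = (true_positive + true_negative) / total
recall = true_positive / (true_positive + false_negative)       # sensitivity
precision = true_positive / (true_positive + false_positive)
specificity = true_negative / (true_negative + false_positive)

print(total)                  # 114
print(f"{accuracy:.3f}")      # 0.991
print(f"{recall:.3f}")        # 0.978
print(f"{precision:.3f}")     # 1.000
print(f"{specificity:.3f}")   # 1.000
```

The single error is a malignant tumor classified as benign, so recall is the metric that falls below 1.0.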
radius2: 1.20
texture3: 1.11
concave points1: 1.08
02 / Exploratory analysis
GDP Project
An exploration of the relationship between national GDP and life expectancy across six countries from 2000 through 2015.
The notebook uses scatter plots, histograms, line charts, grouped country views, and average comparisons to examine economic growth and life-expectancy trends. It finds a positive within-country relationship between GDP and life expectancy, highlights China's substantial GDP growth, and identifies Zimbabwe as having the largest increase in life expectancy during the period.
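One way to make the within-country comparison concrete is to group records by country, then compute a GDP and life-expectancy correlation and a start-to-end change for each group. The sketch below uses only the standard library; the rows are made-up placeholder values for two hypothetical countries ("A" and "B"), and the field names are assumptions rather than the notebook's actual columns.

```python
from collections import defaultdict
from math import sqrt

# Placeholder rows for illustration only; not the project's data.
rows = [
    {"country": "A", "year": 2000, "gdp": 1.0, "life_expectancy": 70.0},
    {"country": "A", "year": 2008, "gdp": 2.0, "life_expectancy": 71.0},
    {"country": "A", "year": 2015, "gdp": 4.0, "life_expectancy": 72.0},
    {"country": "B", "year": 2000, "gdp": 0.5, "life_expectancy": 46.0},
    {"country": "B", "year": 2008, "gdp": 0.4, "life_expectancy": 50.0},
    {"country": "B", "year": 2015, "gdp": 0.9, "life_expectancy": 60.0},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Group the records by country.
by_country = defaultdict(list)
for row in rows:
    by_country[row["country"]].append(row)

summary = {}
for country, records in by_country.items():
    records.sort(key=lambda r: r["year"])
    gdp = [r["gdp"] for r in records]
    life = [r["life_expectancy"] for r in records]
    summary[country] = {
        "gdp_life_corr": pearson(gdp, life),
        "life_gain": life[-1] - life[0],  # change from first to last year
    }

largest_gain = max(summary, key=lambda c: summary[c]["life_gain"])
print(summary)
print(largest_gain)
```

Computing the correlation per country rather than across the pooled data keeps large differences in GDP scale between countries from dominating the result.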
03 / Regression
A regression study connecting professional tennis performance statistics with wins, losses, rankings, and prize winnings.
The script explores 17 performance variables against four outcomes, generates repeated scatter-plot comparisons, and tests single-feature, two-feature, and multiple-feature linear regressions. It also separates seasons to compare yearly winnings models. Match volume and break-point activity show the strongest relationship with wins, while the full feature set produces a strong model for winnings.
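The comparison of single-feature, two-feature, and multiple-feature models can be sketched as a loop over feature subsets, scoring each on a held-out test set. The example below assumes scikit-learn and NumPy are installed and uses synthetic data generated on the spot, since the tennis dataset's column names are not given here; the feature indices, coefficients, and seeds are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: three features, target depends on all of them.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Column subsets to compare (indices into X).
feature_sets = {
    "one feature": [0],
    "two features": [0, 1],
    "all features": [0, 1, 2],
}

scores = {}
for name, cols in feature_sets.items():
    model = LinearRegression().fit(X_train[:, cols], y_train)
    # score() returns R² on the data it is given; here, the test set.
    scores[name] = model.score(X_test[:, cols], y_test)

for name, r2 in scores.items():
    print(f"{name}: test R² = {r2:.3f}")
```

Scoring on held-out rows rather than the training rows is what makes the R² values comparable across models with different numbers of features.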
Multi-feature winnings model: R² = 0.838 (reproduced test score)
04 / Cohort comparison
Medical Insurance Project
A foundational Python analysis of U.S. medical insurance charges across location, smoking status, family size, and demographic groups.
The notebook manually parses CSV records into lists and dictionaries, then builds reusable functions for grouped calculations. It compares average charges by region, identifies the highest and lowest regional averages, and isolates people without children to quantify the cost difference between smokers and non-smokers.
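A minimal sketch of that approach, using only the standard library, might look like the following. The column names (`region`, `smoker`, `children`, `charges`) and the sample rows are assumptions made up for illustration; the notebook's actual file and helper names may differ.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows for illustration; not the real dataset.
SAMPLE_CSV = """region,smoker,children,charges
northeast,yes,0,30000
northeast,no,0,5000
southwest,no,2,4000
southwest,yes,0,20000
southwest,no,0,3000
"""


def load_records(text):
    """Parse CSV text into a list of dictionaries with typed values."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "region": row["region"],
            "smoker": row["smoker"] == "yes",
            "children": int(row["children"]),
            "charges": float(row["charges"]),
        })
    return records


def average_charges_by(records, key):
    """Return the mean charge for each distinct value of `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["charges"])
    return {group: sum(vals) / len(vals) for group, vals in groups.items()}


records = load_records(SAMPLE_CSV)

# Regional averages, plus the highest and lowest regions.
by_region = average_charges_by(records, "region")
highest_region = max(by_region, key=by_region.get)
lowest_region = min(by_region, key=by_region.get)

# Restrict to people without children, then compare smokers and non-smokers.
childless = [r for r in records if r["children"] == 0]
by_smoker = average_charges_by(childless, "smoker")
smoker_gap = by_smoker[True] - by_smoker[False]

print(by_region)
print(highest_region, lowest_region)
print(smoker_gap)
```

Keeping the grouping logic in one function means the same helper serves both the regional comparison and the smoker comparison.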