Data Analysis

01 / Classification

Breast Cancer Logistic Regression Project

A binary classification workflow that predicts whether a breast tumor is malignant or benign from diagnostic measurements.

The script audits the Wisconsin Diagnostic Breast Cancer dataset, converts the diagnosis into a numeric target, separates 30 measured features from identifiers, and creates an 80/20 train-test split. It standardizes the inputs before fitting a logistic regression model, evaluates train and test accuracy, and inspects coefficient magnitude to identify the measurements with the strongest influence.
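The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's script: it assumes scikit-learn's bundled copy of the Wisconsin dataset (`load_breast_cancer`) in place of the UCI download, and the `random_state`, `stratify`, and `max_iter` settings are illustrative choices, so the resulting accuracy and coefficient ranking need not match the figures reported here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for the UCI file used by the project.
data = load_breast_cancer()
X = data.data  # 569 rows x 30 measured features, no identifier column

# The bundled labels use 0 = malignant, 1 = benign; recode so 1 = malignant.
y = (data.target == 0).astype(int)

# 80/20 train-test split (seed and stratification are illustrative choices).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

train_accuracy = model.score(X_train_scaled, y_train)
test_accuracy = model.score(X_test_scaled, y_test)

# Rank features by absolute coefficient size on the standardized inputs.
ranked = sorted(
    zip(data.feature_names, model.coef_[0]),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
top_three = ranked[:3]

print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
for name, coefficient in top_three:
    print(f"{name}: {coefficient:+.2f}")
```

Fitting the scaler on the training split alone keeps test-set statistics out of the model, and standardizing first is what makes the coefficient magnitudes comparable across features.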

Python
pandas
scikit-learn
Logistic regression
Feature scaling

View project on GitHub

Model result 99.1%

Test accuracy reproduced from the referenced UCI dataset and repository pipeline.

69 benign correct

0 false malignant

1 false benign

44 malignant correct
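As a quick check, the four counts above reproduce the reported test accuracy. The sketch below treats malignant as the positive class, which is an assumption about how the project encodes the target.

```python
# Counts taken from the summary above; malignant is assumed to be the positive class.
benign_correct = 69      # true negatives
false_malignant = 0      # false positives
false_benign = 1         # false negatives
malignant_correct = 44   # true positives

total = benign_correct + false_malignant + false_benign + malignant_correct
accuracy = (benign_correct + malignant_correct) / total

# Of the tumors flagged malignant, how many really were.
precision = malignant_correct / (malignant_correct + false_malignant)
# Of the truly malignant tumors, how many were caught.
recall = malignant_correct / (malignant_correct + false_benign)

print(f"test rows: {total}")
print(f"accuracy:  {accuracy:.1%}")
print(f"precision: {precision:.1%}")
print(f"recall:    {recall:.1%}")
```

The single error is a malignant tumor classified as benign, so it lowers recall rather than precision.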

radius2: 1.20

texture3: 1.11

concave points1: 1.08

02 / Exploratory analysis

GDP Project

An exploration of the relationship between national GDP and life expectancy across six countries from 2000 through 2015.

The notebook uses scatter plots, histograms, line charts, grouped country views, and average comparisons to examine economic growth and life-expectancy trends. It finds a positive within-country relationship between GDP and life expectancy, highlights China's GDP growth of 813% over the period, and identifies Zimbabwe as having the largest increase in life expectancy, at 14.7 years.
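The start-to-end comparisons behind figures like these can be sketched in plain Python. The rows below are hypothetical placeholder values, not the project's data, and the notebook itself works in pandas rather than with the dictionaries used here.

```python
from collections import defaultdict

# Hypothetical rows: (country, year, gdp, life_expectancy). Not the project's data.
rows = [
    ("Country A", 2000, 100.0, 70.0),
    ("Country A", 2015, 250.0, 74.0),
    ("Country B", 2000, 40.0, 50.0),
    ("Country B", 2015, 60.0, 62.5),
]

# Group the observations by country.
by_country = defaultdict(list)
for country, year, gdp, life in rows:
    by_country[country].append((year, gdp, life))

summary = {}
for country, records in by_country.items():
    records.sort()  # tuples sort by year first
    _, gdp_start, life_start = records[0]
    _, gdp_end, life_end = records[-1]
    summary[country] = {
        "gdp_growth_pct": (gdp_end - gdp_start) / gdp_start * 100,
        "life_gain_years": life_end - life_start,
    }

# Country with the largest life-expectancy increase over the period.
largest_life_gain = max(summary, key=lambda c: summary[c]["life_gain_years"])

for country, stats in summary.items():
    print(country, stats)
print("largest life-expectancy gain:", largest_life_gain)
```

Comparing each country against its own starting point is what supports a within-country reading of the GDP and life-expectancy relationship, since absolute GDP levels differ widely between countries.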

Python
pandas
Seaborn
Matplotlib
Time-series comparison

View project on GitHub

Life expectancy 2000–2015

6 countries

96 observations

+813% China GDP

+14.7 years Zimbabwe life expectancy

03 / Regression modeling

Tennis Aces Project

A regression study connecting professional tennis performance statistics with wins, losses, rankings, and prize winnings.

The script explores 17 performance variables against four outcomes, generates a scatter plot for each feature-outcome pair, and tests single-feature, two-feature, and multiple-feature linear regressions. It also separates seasons to compare yearly winnings models. Match volume (service and return games played) and break-point activity correlate most strongly with wins, at roughly .92 to .93, while the full feature set yields a winnings model with a test R² of .838.
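To make the correlation and single-feature regression steps concrete, here is a plain-Python sketch. The project itself uses scikit-learn's linear regression; the helper functions and the five data points below are hypothetical and only illustrate the calculations.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - residual / total


# Hypothetical values, not taken from the tennis dataset.
service_games = [10, 20, 30, 40, 50]
wins = [1, 3, 4, 7, 8]

r = pearson_r(service_games, wins)
slope, intercept = fit_line(service_games, wins)
r2 = r_squared(service_games, wins, slope, intercept)

print(f"r = {r:.3f}, slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r2:.3f}")
```

For a single feature, R² on the fitting data equals the squared correlation, which is why the correlation ranking is a reasonable guide to which one-feature models will fit best. A score on held-out test data, like the one reported for the winnings model, can differ from that in-sample value.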

Python
pandas
Seaborn
scikit-learn
Linear regression

View project on GitHub

Correlation with wins 1,721 players

Service games: .929

Return games: .928

Break opportunities: .923

Break points faced: .883

Aces: .825

Multi-feature winnings model R²: .838 (reproduced test score)

04 / Cohort comparison

Medical Insurance Project

A foundational Python analysis of U.S. medical insurance charges across location, smoking status, family size, and demographic groups.

The notebook manually parses CSV records into lists and dictionaries, then builds reusable functions for grouped calculations. It compares average charges by region, identifies the highest and lowest regional averages, and isolates people without children to quantify the cost difference between smokers and non-smokers.
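A minimal sketch of this approach is shown below. The column names follow the commonly distributed insurance CSV layout and are an assumption here, the four sample rows are made up, and `load_records` and `average_by` are hypothetical helper names rather than the notebook's own functions.

```python
import csv
import io

# Made-up sample rows; column names are assumed, not confirmed by the project.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,25.0,0,yes,southeast,30000.0
40,male,28.0,0,no,southeast,8000.0
50,female,31.0,2,no,northwest,10000.0
35,male,27.0,0,no,northwest,6000.0
"""


def load_records(text):
    """Parse CSV text into a list of dictionaries with numeric fields converted."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        row["children"] = int(row["children"])
        row["charges"] = float(row["charges"])
        records.append(row)
    return records


def average_by(records, key):
    """Average charges for each distinct value of the given field."""
    totals = {}
    for record in records:
        total, count = totals.get(record[key], (0.0, 0))
        totals[record[key]] = (total + record["charges"], count + 1)
    return {group: total / count for group, (total, count) in totals.items()}


records = load_records(SAMPLE_CSV)

# Average charge per region, plus the highest and lowest regions.
region_avg = average_by(records, "region")
highest_region = max(region_avg, key=region_avg.get)
lowest_region = min(region_avg, key=region_avg.get)

# Restrict to people without children, then compare smokers and non-smokers.
no_children = [r for r in records if r["children"] == 0]
smoker_avg = average_by(no_children, "smoker")
difference = smoker_avg["yes"] - smoker_avg["no"]

print(region_avg, highest_region, lowest_region)
print(smoker_avg, difference)
```

Because `average_by` takes the grouping field as a parameter, the same function serves both the regional comparison and the smoker comparison.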

Python
CSV parsing
Dictionaries
Functions
Cohort analysis

View project on GitHub

Average regional charge (1,338 records)

$12.3k SW

$14.7k SE

$12.4k NW

$13.4k NE

Smoker, no children: $31,341

Non-smoker, no children: $7,612

Average difference: $23,729
