This project represents a focused effort to conduct an in-depth Exploratory Data Analysis (EDA) on a vital Breast Cancer Diagnosis dataset. My goal is to move beyond surface-level statistics and truly understand the intricate characteristics that distinguish benign (non-cancerous) from malignant (cancerous) tumors, utilizing features derived from digitized images of fine needle aspirates (FNA). Through meticulous data inspection, cleaning, and compelling visualizations, I aim to uncover critical patterns and relationships. These insights are not merely academic; they are the foundational steps in building a robust predictive model, a tool that holds the potential to contribute to more accurate and timely breast cancer diagnoses. This project is a personal commitment to leveraging data science for impactful contributions in healthcare.
This project is part of my data science portfolio
- Python (Jupyter Notebook)
- pandas and numpy for data wrangling
- matplotlib and seaborn for visualization
- GitHub for portfolio showcase
This analysis was guided by the following nine scientific/research questions, aimed at uncovering data patterns and relationships relevant to breast cancer diagnosis:
- What is the Diagnosis Distribution of the Tumor Types (M = Malignant, B = Benign)?
- What can you deduce from the descriptive statistics (mean, standard deviation, and skewness) for the 'mean' features in the breast cancer dataset?
- What can you deduce from the descriptive statistics (mean, standard deviation, and skewness) for the 'worst' features in the breast cancer dataset?
- How do the distributions of the top 6 mean features differ between benign and malignant tumors?
- How do the distributions of various breast cancer features (such as radius, texture, perimeter, etc.) differ between benign and malignant tumors?
- How are the various features correlated with each other and with the diagnosis?
- What are the shapes and ranges of the distributions for key features when separated by diagnosis?
- How do pairs of features relate to each other, and does the relationship differ based on the diagnosis?
- How do the correlations between pairs of features differ between malignant and benign tumors?
Each question was explored using descriptive statistics, visualizations, and correlation analyses to uncover clinically meaningful insights.
-
Data Inspection
- Dataset contains 569 samples and 33 features (including ID and diagnosis).
- One column (
Unnamed: 32) contained no values and was dropped. - Identified datatype mismatches across numeric and categorical fields.
-
Data Cleaning
- Removed redundant/unnecessary columns.
- Checked for and handled missing/empty values.
-
Exploratory Visualizations
- Class distribution between benign and malignant tumors.
- Correlation heatmap of all features.
- Boxplots of critical features across diagnosis classes.
- Pairplots of selected features to study separability.
- And so on.
- Class Distribution: The dataset is slightly imbalanced but still workable for modeling.
- Correlated Features: Strong relationships exist among radius, perimeter, and area features, which may influence feature selection.
- Scaling Needs: Features have widely varying ranges, highlighting the need for scaling before predictive modeling.
- Feature Separability: Certain features (e.g.,
concave points_worst,area_worst) strongly separate benign vs malignant tumors.
| Class Distribution | Correlation Heatmap |
|---|---|
![]() |
![]() |
| Boxplot of Features | Pairplot of Key Features |
|---|---|
![]() |
![]() |
EDA_Project2.ipynb: Jupyter notebook with full exploratory analysisImages/: Folder containing exported plots used in READMEData/: Folder with sample CSVs (dataset uploaded here)
You can access or download it here:
👉 Breast Cancer Wisconsin (Diagnostic) Dataset on Kaggle
Based on the Exploratory Data Analysis of the Breast Cancer Diagnosis dataset, some of the key findings and implications for building a predictive model are as follows:
- The dataset exhibits class imbalance, with significantly more benign cases than malignant ones, which will require consideration during model training (e.g., using techniques like stratification, resampling, or adjusting class weights).
- Many features show distinct distributions and significantly different mean and median values between benign and malignant tumors, particularly features related to the size and irregularity of cell nuclei (radius, perimeter, area, concave points, and their 'worst' counterparts). These features are likely to be strong predictors of diagnosis.
- High correlations exist among many features, especially within the 'mean' and 'worst' groups, suggesting potential multicollinearity. This might necessitate feature selection or dimensionality reduction techniques to improve model stability and interpretability.
- The differential correlation analysis revealed that the relationships between some feature pairs change significantly between benign and malignant tumors, providing further discriminatory information that a model could leverage.
- The presence of outliers, especially in malignant cases for certain features, indicates data variability and might require robust preprocessing steps or models less sensitive to extreme values.
- The clear separation observed in pairwise plots for several feature combinations reinforces their potential as powerful discriminators for classifying tumor type.
Overall, the EDA suggests that a predictive model should focus on features related to cell morphology and nuclear characteristics, while also addressing class imbalance and potential multicollinearity to achieve accurate and reliable breast cancer diagnosis prediction.
- Build baseline ML classifiers (Logistic Regression, Random Forest, XGBoost).
Adebayo Fashina
📍 Toronto, Canada | LinkedIn



