Salary Classification Web Application (Flask + Machine Learning)

A machine learning web application that predicts whether a person's annual salary is above $50K or at most $50K, based on demographic, education, and employment attributes.

The model is trained using Scikit-learn pipelines and deployed locally using Flask, allowing users to input their information through a web interface and receive real-time predictions.


Project Overview

This project demonstrates an end-to-end machine learning workflow, including:

  • Data exploration and visualization
  • Feature preprocessing
  • Machine learning model training
  • Hyperparameter tuning
  • Model comparison
  • Pipeline serialization
  • Web deployment using Flask
  • Interactive user interface for predictions

The final deployed model is a tuned Random Forest Classifier integrated within a preprocessing pipeline.


Problem Statement

Predict whether an individual's annual salary exceeds $50K based on demographic and employment attributes.

This is a binary classification problem where the target variable is:

salary ∈ {<=50K, >50K}


Machine Learning Pipeline

The project implements a Scikit-learn pipeline to ensure consistent preprocessing during both training and prediction.

Pipeline Components

  1. Feature separation
    • Numerical features
    • Categorical features
  2. Numerical preprocessing
    • StandardScaler
  3. Categorical preprocessing
    • OneHotEncoder
  4. Feature transformation
    • ColumnTransformer
  5. Model training
    • RandomForestClassifier
  6. Pipeline serialization
    • Saved using joblib
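
The components above can be sketched roughly as follows. This is a minimal illustration, not the project's actual training code: the numeric/categorical split is inferred from the Dataset Features table, and `build_pipeline`, `save_and_reload`, the step names, and the hyperparameters are hypothetical.

```python
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column split assumed from the Dataset Features table; the project's
# actual lists may differ.
NUMERIC = ["age", "fnlwgt", "education_num", "capital_gain",
           "capital_loss", "hours_per_week"]
CATEGORICAL = ["workclass", "education", "marital_status", "occupation",
               "relationship", "race", "sex", "native_country"]


def build_pipeline(numeric=NUMERIC, categorical=CATEGORICAL, random_state=42):
    """Return an unfitted preprocessing + Random Forest pipeline."""
    preprocessor = ColumnTransformer(
        transformers=[
            # Scale numeric columns to zero mean and unit variance.
            ("num", StandardScaler(), list(numeric)),
            # handle_unknown="ignore" is an assumption: it keeps prediction
            # from failing on categories never seen during training.
            ("cat", OneHotEncoder(handle_unknown="ignore"), list(categorical)),
        ]
    )
    return Pipeline(
        steps=[
            ("preprocess", preprocessor),
            ("model", RandomForestClassifier(n_estimators=100,
                                             random_state=random_state)),
        ]
    )


def save_and_reload(pipeline, path):
    """Serialize a fitted pipeline with joblib and load it back."""
    joblib.dump(pipeline, path)
    return joblib.load(path)
```

Because preprocessing and model live in one object, the reloaded pipeline can be given raw feature columns at prediction time and applies the same scaling and encoding it learned during training.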

Dataset Features

| Feature        | Description              |
|----------------|--------------------------|
| age            | Age of the individual    |
| workclass      | Type of employment       |
| fnlwgt         | Final sampling weight    |
| education      | Highest education level  |
| education_num  | Total years of education |
| marital_status | Marital status           |
| occupation     | Job occupation           |
| relationship   | Family relationship      |
| race           | Race category            |
| sex            | Gender                   |
| capital_gain   | Income from investments  |
| capital_loss   | Loss from investments    |
| hours_per_week | Working hours per week   |
| native_country | Country of origin        |

Target Variable

salary: <=50K or >50K


Model Comparison

Two machine learning models were evaluated:

| Model               | Description                    |
|---------------------|--------------------------------|
| Logistic Regression | Linear baseline classifier     |
| Random Forest       | Ensemble tree-based classifier |

Evaluation Metrics

  • Accuracy
  • Precision
  • Recall
  • F1 Score
  • ROC-AUC
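
For reference, the definitions behind these metrics can be written out without any dependencies. This is only a sketch for illustration; in practice `sklearn.metrics` provides equivalent functions, and the helper names `classification_metrics` and `roc_auc` are hypothetical. The positive class is assumed to be `>50K`.

```python
def classification_metrics(y_true, y_pred, positive=">50K"):
    """Compute accuracy, precision, recall, and F1 from label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores, positive=">50K"):
    """ROC-AUC as the chance a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Accuracy alone can be misleading when one class is much more common than the other, which is one reason to report precision, recall, F1, and ROC-AUC alongside it.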

Final Model

The tuned Random Forest model achieved the best performance and was selected for deployment.
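
A comparison of this kind might look like the sketch below. It is an assumption-laden illustration: the README does not state the search method, the parameter grid, or the scoring metric, so `GridSearchCV`, `PARAM_GRID`, `compare_models`, and the F1 scoring are all hypothetical choices, and synthetic data stands in for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Hypothetical search space; the README does not list the tuned parameters.
PARAM_GRID = {"n_estimators": [50, 100], "max_depth": [None, 5]}


def compare_models(X, y, cv=3, scoring="f1"):
    """Cross-validate a logistic baseline and grid-search a Random Forest."""
    baseline = LogisticRegression(max_iter=1000)
    baseline_score = cross_val_score(baseline, X, y, cv=cv,
                                     scoring=scoring).mean()

    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          PARAM_GRID, cv=cv, scoring=scoring)
    search.fit(X, y)

    return {
        "logistic_regression": float(baseline_score),
        "random_forest": float(search.best_score_),
        "best_params": search.best_params_,
    }


# Synthetic stand-in data so the sketch runs without the real dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     random_state=0)
```

Calling `compare_models(X_demo, y_demo)` returns the mean cross-validated score of each model together with the best grid point found for the forest.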


Project Structure

Salary-Classification-Flask-App/
│
├── training.py
├── salary_classification_app.py
│       Flask application that loads the trained ML pipeline
│       and serves predictions through a web interface.
│
├── model_artifacts/
│   │   Saved machine learning artifacts.
│   │
│   ├── random_forest_tuned.pkl
│   │       Final trained pipeline containing preprocessing + model.
│   │
│   └── random_forest_tuned_pickle.pkl
│           Alternate serialized model object.
│
├── templates/
│   │   HTML templates rendered by Flask using Jinja2.
│   │
│   ├── index.html
│   │       User interface form for entering input features.
│   │
│   └── model_results.html
│           Displays salary prediction results.
│
├── static/
│   │   Static assets used by the web interface.
│   │
│   └── style.css
│           CSS styling for the application layout.
│
├── notebooks/
│   │   Jupyter notebooks used during experimentation.
│   │
│   └── salary_classification_pipeline.ipynb
│           Data exploration, visualization, model training,
│           and pipeline serialization.
│
├── requirements.txt
│       Python dependencies required to run the project.
│
└── README.md


Application Workflow

User Input (Web Form)
        │
        ▼
Flask Server
        │
        ▼
Input Converted to Pandas DataFrame
        │
        ▼
Saved ML Pipeline (Preprocessing + Random Forest)
        │
        ▼
Prediction
        │
        ▼
Render Result Page
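
The request flow above could be wired up along these lines. This is a hedged sketch rather than the contents of `salary_classification_app.py`: the `/predict` route, the `create_app` factory, `form_to_frame`, and the inline result template are hypothetical, and the form field names are assumed to match the Dataset Features table.

```python
import pandas as pd
from flask import Flask, render_template_string, request

# Field names assumed from the Dataset Features table.
NUMERIC_FIELDS = ["age", "fnlwgt", "education_num", "capital_gain",
                  "capital_loss", "hours_per_week"]
CATEGORICAL_FIELDS = ["workclass", "education", "marital_status",
                      "occupation", "relationship", "race", "sex",
                      "native_country"]

# Inline stand-in for a result template such as model_results.html.
RESULT_TEMPLATE = (
    "Predicted Salary: {{ label }} "
    "Probability: {{ '%.2f'|format(probability) }}%"
)


def form_to_frame(form):
    """Convert submitted form fields into a one-row DataFrame."""
    row = {name: float(form[name]) for name in NUMERIC_FIELDS}
    row.update({name: form[name] for name in CATEGORICAL_FIELDS})
    return pd.DataFrame([row])


def create_app(pipeline):
    """Build a Flask app around an already loaded, fitted pipeline."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        frame = form_to_frame(request.form)
        label = pipeline.predict(frame)[0]
        # Report the largest class probability as a percentage.
        probability = float(pipeline.predict_proba(frame)[0].max()) * 100
        return render_template_string(RESULT_TEMPLATE, label=label,
                                      probability=probability)

    return app
```

Under these assumptions the server could be started with something like `create_app(joblib.load("model_artifacts/random_forest_tuned.pkl")).run()`. Passing the pipeline into a factory also makes the route easy to exercise with Flask's test client and a stub model.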


Installation

Clone the Repository

git clone https://github.com/chonzadaniel/salary-classification-flask-app.git
cd salary-classification-flask-app


Install Dependencies

pip install -r requirements.txt


Running the Application

Start the Flask server:

python salary_classification_app.py

Open your browser and navigate to:

http://127.0.0.1:5000/

Enter the required information and click Predict Salary.


Example Prediction Output

Predicted Salary: >50K
Probability: 82.47%


Technologies Used

Backend

  • Python
  • Flask

Machine Learning

  • Scikit-learn
  • RandomForestClassifier
  • LogisticRegression
  • Pipeline
  • ColumnTransformer

Data Processing

  • Pandas
  • NumPy

Visualization

  • Matplotlib
  • Seaborn

Frontend

  • HTML5
  • CSS3
  • Jinja2 Templates

Future Improvements

Possible enhancements include:

  • Deploying the application on AWS / Render / Heroku
  • Containerizing the application using Docker
  • Adding input validation
  • Implementing feature importance visualization
  • Integrating SHAP explainability
  • Creating a REST API endpoint

Author

Emmanuel Daniel Chonza

Data Scientist | Monitoring & Evaluation Expert | Generative AI Enthusiast

GitHub:
https://github.com/chonzadaniel


License

This project is licensed under the MIT License.
