Risk stratifier

What is this package?

Risk stratifier is a Python package that wraps some of the standard approaches the Intelligence Function takes to risk stratification.

Installation

To use this package, install it from this repository via your terminal:

pip install "git+https://github.com/SNEE-ICS/risk_stratifier.git"

How to use this package?

Before you check out the documentation, it is recommended that you review the guiding principles of this package in the section below.

Can your data be used with this package? See the documentation for the validate_binary_y_and_X() function.

Handling missingness in numeric variables transparently: see the documentation for the add_numeric_missingness_indicators() function.
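
As a quick orientation, a minimal sketch of how these helpers might be called is below. The import path, argument order and exact signatures are assumptions for illustration; check the linked documentation for the real interfaces.

```python
# A minimal sketch; the import path and exact signatures are assumptions -
# check the package documentation for the real interfaces.
import pandas as pd

from risk_stratifier import add_numeric_missingness_indicators, validate_binary_y_and_X

df = pd.DataFrame({
    "age": [54, 61, None, 70],
    "bmi": [27.1, None, 31.4, 24.9],
    "event": [0, 1, 0, 1],
})

X = df[["age", "bmi"]]
y = df["event"]

# Check that the outcome is binary and the feature matrix meets expectations.
validate_binary_y_and_X(y, X)

# Add indicator columns flagging which numeric values were missing.
X = add_numeric_missingness_indicators(X)
```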

Documentation on premade modelling pipelines and cross-validation will be added soon.

The guiding principles of this package

Probabilities are more important than class prediction

In the context of an ICB, it is rare that a predictive model will be used to directly predict whether an event happens or not. This is because direct classification is somewhat difficult to interpret from a clinical perspective.

Furthermore, it is often difficult, both for the analyst and the organisation, to assess the cost of different types of error. For example, is incorrectly predicting that an individual will get cancer less costly than incorrectly predicting that an individual won't? Intuitively, maybe yes - but by how much?

In some cases, class prediction could conceivably be considered as automating clinical decision making. This would be inappropriate in almost all settings where the Intelligence Function conducts predictive modelling.

For these reasons, this package instead focuses on generating probability predictions.

Predicted probabilities must be well calibrated

Given the focus on risk, predicted probabilities must be well calibrated. This means that the probabilities generated by a model must reflect the likelihood of the event occurring.

Simplifying slightly, this means that if 10 patients are all predicted to have a 10% risk of cancer incidence, it is expected that roughly 1 in 10 of those patients will receive a cancer diagnosis.

Many machine learning algorithms do produce a "probability" score. However, these scores very often should not be interpreted as probability predictions. A common example is Support Vector Machines, which often produce probabilities clustered at 0 and 1 and are poorly calibrated as a consequence.
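
As an illustration of what a calibration check looks like in practice, the sketch below uses scikit-learn's calibration_curve to compare binned predicted probabilities against observed event rates. scikit-learn is assumed here for illustration; the package's own reliability plots may be produced differently.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Bin the predictions and compare the mean predicted probability in each bin
# with the observed event rate; a well-calibrated model tracks the diagonal.
observed, predicted = calibration_curve(y_test, probs, n_bins=10)
for p, o in zip(predicted, observed):
    print(f"predicted {p:.2f} -> observed {o:.2f}")
```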

This leads to several critical modelling choices that are consistently used throughout the package:

  1. Brier score, log-loss, ROC AUC and reliability plots are used to assess model performance. All four provide insight into a model's performance through the lens of probability calibration (see the sketch after this list).
  2. Log-loss is the loss function optimized during training (it is naturally suited to this, as it represents a convex optimization problem).
  3. Hyperparameter tuning selects for the best Brier score. This is often preferable to log-loss when the models under consideration can initially be very poorly calibrated.
  4. Up/down/synthetic sampling techniques are avoided entirely, as these degrade calibration most of the time.
  5. Data partitioning is stratified by the dependent variable as standard.
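
To illustrate points 1 and 5, the sketch below scores held-out probability predictions with the Brier score, log-loss and ROC AUC, using a partition stratified by the dependent variable. It uses scikit-learn directly as an assumption; the package's own wrappers may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Point 5: the partition is stratified by the dependent variable.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# Point 1: all metrics are computed on probabilities, not class labels.
print("Brier score:", brier_score_loss(y_test, probs))
print("Log-loss:", log_loss(y_test, probs))
print("ROC AUC:", roc_auc_score(y_test, probs))
```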

Avoiding leakage is critical

Leakage, in this context, occurs when information gleaned from assessment on a test set influences model development. Being able to state performance on unseen data is critical for the adoption and use of the insights generated.

Nested cross-validation is used as standard in this package when assessing the performance and calibration of a candidate model.
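
The package provides its own nested cross-validation function; the sketch below only illustrates the general idea using scikit-learn, with hyperparameter tuning confined to the inner folds and performance estimated on outer folds the tuning never sees.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyperparameter tuning, which never sees the outer test folds.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_brier_score",
    cv=inner,
)

# Outer loop: performance is estimated on folds that played no part in tuning.
scores = cross_val_score(search, X, y, cv=outer, scoring="neg_brier_score")
print("Mean Brier score:", -scores.mean())
```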

Being practical about computation and data usage

The Intelligence Function does not typically have access to large amounts of compute power. The cost of increasing compute power can be hard to justify in the context that the Intelligence Function operates in.

Whilst the datasets that are available might be considered large, prior experience modelling in this context has revealed that the choice to partition data for model assessment must be made prudently.

Ideally, we could incorporate both recalibration and conformal prediction into our modelling pipeline.

Recalibration would permit a wider range of performant models that also produce valid probability predictions. However, these methods require additional partitioning (which may reduce the performance of the final model) and introduce additional computational requirements.
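
Purely to illustrate the trade-off (this is not something the package currently does), recalibration via Platt scaling with scikit-learn's CalibratedClassifierCV might look like the sketch below; note the extra cross-validated fitting it introduces.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Platt scaling ("sigmoid") fits a recalibration mapping on held-out folds,
# which is exactly the extra partitioning and compute described above.
calibrated = CalibratedClassifierCV(LinearSVC(max_iter=5000), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
probs = calibrated.predict_proba(X_test)[:, 1]
```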

Conformal prediction provides guaranteed uncertainty intervals. However, like recalibration, this requires additional data partitioning and computation.

What is in development?

The binary nested cross-validation function is complete and tested; documentation needs to be written.

Pre-prepared modelling pipelines are written but remain untested (a rough sketch of what one might look like follows the list below). These include:

  • lasso regression
  • ridge regression
  • xgboost
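
As a rough indication only, a lasso-style pipeline built with scikit-learn could look like the sketch below. The pipelines actually included in the package are untested and their interfaces may differ.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Lasso-style (L1-penalised) logistic regression; features are standardised
# because the penalty acts directly on the coefficient magnitudes.
lasso_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(penalty="l1", solver="liblinear", max_iter=1000)),
])
```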

There is a desire to also write a cat_boost pipeline. Given the large volumes of categorical data in healthcare, it is a strong candidate for further development.

Integration with the premade modelling pipeline and binary nested cross-validation will require testing.

A function that fits the final model on the full dataset, and then stores and logs it, must still be written.
