Risk stratifier is a Python package that wraps some of the standard approaches the Intelligence Function takes to risk stratification.
To use this package, you can install it from this repository from your terminal:
pip install "git+https://github.com/SNEE-ICS/risk_stratifier.git"
Before you check out the documentation, it is recommended that you review the guiding principles of this package in the section below.
- Can your data be used with this package? See the documentation for the validate_binary_y_and_X() function.
- Handling missingness in numeric variables transparently: see the documentation for add_numeric_missingness_indicators().
- Documentation on premade modelling pipelines and cross-validation is to be added soon...
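For orientation, a hypothetical usage sketch of the two functions named above is shown below. The import path, argument order and return values are assumptions inferred from the function names; defer to the linked documentation for the actual API.

```python
# Hypothetical usage sketch - the import path, argument order and return values
# are assumptions inferred from the function names; check the documentation.
import pandas as pd

from risk_stratifier import add_numeric_missingness_indicators, validate_binary_y_and_X

X = pd.DataFrame({"age": [71, 64, None], "egfr": [55.0, None, 80.0]})
y = pd.Series([1, 0, 0])

# Assumed behaviour: raises (or reports) if y is not binary or X is unsuitable.
validate_binary_y_and_X(y, X)

# Assumed behaviour: returns X with an extra 0/1 indicator column for each
# numeric variable containing missing values, keeping missingness visible.
X = add_numeric_missingness_indicators(X)
```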
In the context of an ICB, it is rare that a predictive model will be used to directly predict whether an event happens or not. This is because direct classification is somewhat difficult to interpret from a clinical perspective.
Furthermore, it is often difficult, both for the analyst and the organisation, to assess the cost of different types of error. For example, is incorrectly predicting that an individual will get cancer less costly than incorrectly predicting that an individual won't get cancer? Intuitively, maybe yes - but by how much?
In some cases, class prediction could conceivably be considered as automating clinical decision making. This would be inappropriate in almost all settings where the Intelligence Function is conducting predictive modelling.
For these reasons, this package instead focuses on generating probability predictions.
Given the focus on risk, predicted probabilities must be well calibrated. This means that the probabilities generated by a model must reflect the likelihood of the event occurring.
Simplifying slightly, this means that if 10 patients are all predicted to have a 10% risk of cancer incidence, we expect roughly 1 in 10 of those patients to receive a cancer diagnosis.
Many machine learning algorithms do produce a "probability" score. However, these scores very often should not be interpreted as probability predictions. A common example is Support Vector Machines, which often produce probabilities clustered at 0 and 1 and are poorly calibrated as a consequence.
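As an illustration of this point, the sketch below (synthetic data, not package code) compares a logistic regression's predicted probabilities with an SVM decision score squashed into [0, 1] and treated as a probability, using scikit-learn's calibration_curve to produce the numbers behind a reliability plot.

```python
# Illustration only: inspecting calibration with scikit-learn's calibration_curve.
# None of this is risk_stratifier code; it just demonstrates the reliability idea above.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)

# A usually well-calibrated baseline: logistic regression's predicted probabilities.
lr_prob = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# A rescaled SVM decision score treated as a probability - typically poorly calibrated.
svm_score = SVC().fit(X_train, y_train).decision_function(X_test)
svm_prob = (svm_score - svm_score.min()) / (svm_score.max() - svm_score.min())

for name, prob in [("logistic", lr_prob), ("svm (rescaled score)", svm_prob)]:
    # Within each bin of predicted risk, the observed event rate should match
    # the mean predicted risk if the model is well calibrated.
    observed, predicted = calibration_curve(y_test, prob, n_bins=10)
    print(name, np.round(predicted, 2), np.round(observed, 2))
```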
This leads to several critical modelling choices that are consistently used throughout the package:
- Brier score, log-loss, ROC AUC and reliability plots are used to assess model performance. All four provide insight into a model's performance through the lens of probability calibration (they are illustrated in the sketch after this list).
- Log-loss is the loss function optimized during training (it is naturally suited to this, as it represents a convex optimization problem).
- Hyperparameter tuning selects the model with the best Brier score. This is often preferable to log-loss when the models under consideration can initially be very poorly calibrated.
- Up-sampling, down-sampling and synthetic sampling techniques are avoided entirely, as they degrade calibration most of the time.
- Data partitioning is stratified by the dependent variable as standard.
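The scikit-learn building blocks behind these choices are sketched below on synthetic data; the dataset, model and parameter grid are placeholders rather than anything the package prescribes.

```python
# Sketch of the evaluation and tuning choices listed above, using scikit-learn.
# Synthetic data, model and parameter grid are placeholders, not risk_stratifier defaults.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9], random_state=0)

# Partitioning is stratified by the dependent variable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Logistic regression optimizes log-loss during training; hyperparameters are
# tuned on the Brier score ("neg_brier_score" is scikit-learn's built-in scorer name).
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_brier_score",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

prob = search.predict_proba(X_test)[:, 1]
print("Brier score:", brier_score_loss(y_test, prob))
print("Log-loss:   ", log_loss(y_test, prob))
print("ROC AUC:    ", roc_auc_score(y_test, prob))
# The data behind a reliability plot: observed event rate per bin of predicted risk.
print(calibration_curve(y_test, prob, n_bins=10))
```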
Leakage, in this context, occurs when information gleaned from assessment on a test set influences model development. Being able to state performance on unseen data is critical for the adoption and use of the insights generated.
Nested cross-validation is used as standard in this package when assessing the performance and calibration of a candidate model.
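The package's own binary nested cross-validation function is not reproduced here; conceptually it follows the standard pattern of an inner tuning loop wrapped in an outer assessment loop, sketched below with scikit-learn on synthetic data (the model and grid are placeholders).

```python
# Sketch of nested cross-validation: the inner loop tunes hyperparameters,
# the outer loop estimates performance on data the tuned model never saw.
# Synthetic data, model and grid are illustrative, not risk_stratifier defaults.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: select hyperparameters on the Brier score.
tuned = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_brier_score",
    cv=inner,
)

# Outer loop: each fold's test data is untouched by the tuning above,
# so the scores give an honest estimate of out-of-sample performance.
scores = cross_val_score(tuned, X, y, scoring="neg_brier_score", cv=outer)
print("Mean Brier score:", -scores.mean())
```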
The Intelligence Function does not typically have access to large amounts of compute power. The cost of increasing compute power can be hard to justify in the context that the Intelligence Function operates in.
Whilst the datasets that are available might be considered large, prior experience of modelling in this context has shown that decisions about how to partition data for model assessment must be made prudently.
Ideally, we could incorporate both recalibration and conformal prediction into our modelling pipeline.
Recalibration would permit a wider range of performant models that also produce valid probability predictions. However, these methods require additional partitioning (which may reduce the performance of the final model) and introduce additional computational requirements.
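For reference, post-hoc recalibration in scikit-learn looks roughly like the sketch below (it is not part of this package); the extra cross-validation splits used for calibration are exactly the additional partitioning and compute cost described above.

```python
# Sketch of post-hoc recalibration with scikit-learn's CalibratedClassifierCV.
# The extra cv splits used for calibration illustrate the additional data
# partitioning (and compute) cost discussed above. Not risk_stratifier code.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

raw = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Isotonic recalibration: each of the 5 folds is held out in turn to fit the calibrator.
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=0), method="isotonic", cv=5
).fit(X_train, y_train)

for name, model in [("raw", raw), ("recalibrated", calibrated)]:
    print(name, brier_score_loss(y_test, model.predict_proba(X_test)[:, 1]))
```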
Conformal prediction provides guaranteed uncertainty intervals. However, like recalibration this requires additional data partitioning and computation.
The binary nested cross-validation function is complete and tested; documentation needs to be written.
Pre-prepared modelling pipelines are written but remain untested. These include:
- lasso regression
- ridge regression
- xgboost
There is also a desire to write a cat_boost pipeline; given the large volumes of categorical data in healthcare, it is a strong candidate for further development.
Integration between the premade modelling pipelines and binary nested cross-validation will require testing.
A function that fits the final model on the full dataset, then stores and logs it, must still be written.
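That function does not exist yet; a generic fit-and-persist pattern along the lines sketched below (not necessarily what the package will adopt) illustrates the intent.

```python
# Generic fit-and-persist pattern - not the package's eventual implementation.
# The model, file name and logging details are placeholders.
import logging

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

logging.basicConfig(level=logging.INFO)

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Refit the chosen model on the full dataset once assessment is complete...
final_model = LogisticRegression(max_iter=1000).fit(X, y)

# ...then store the fitted object and log what was done.
joblib.dump(final_model, "final_model.joblib")
logging.info("Fitted final model on %d rows and saved to final_model.joblib", len(y))
```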