
Covarianced sampling #338

Open
DmitriyValetov wants to merge 20 commits into SALib:main from
DmitriyValetov:covarianced_sampling

@DmitriyValetov

Good day!

I use SALib daily and sometimes need to generate correlated samples.
I know that there are methods aimed at problems with correlated
parameters, but to make comparison calculations simpler to code,
I propose these changes.

  1. distrs may now be of string type, to signal that a multivariate distribution is being modelled.
  2. Added one multivariate distribution: a multivariate normal, where the covariance matrix is applied after
    Sobol sequence sampling by multiplying by the Cholesky decomposition of the covariance matrix.

I have added a notebook example demonstrating this functionality to the examples directory.
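
For readers following along, here is a minimal sketch of the Cholesky step described above. It is not the code from this PR: the function name `correlate_normal` is hypothetical, and plain pseudo-random standard normals from NumPy stand in for the transformed Sobol points that the PR uses.

```python
import numpy as np


def correlate_normal(z, mean, cov):
    """Map independent standard-normal columns to draws from N(mean, cov).

    Hypothetical helper for illustration; not part of SALib.
    """
    # Lower-triangular factor L such that L @ L.T == cov.
    chol = np.linalg.cholesky(cov)
    # Each row z_i becomes mean + L @ z_i, which has covariance L @ L.T.
    return mean + z @ chol.T


rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0])
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])

# Assumption: pseudo-random normals replace the Sobol-based points here.
z = rng.standard_normal((20000, 2))
x = correlate_normal(z, mean, cov)

sample_mean = x.mean(axis=0)
sample_cov = np.cov(x, rowvar=False)
```

The sample mean and covariance should approach `mean` and `cov` as the number of rows grows. The same transform applies to any matrix of independent standard-normal columns, whatever generated them.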

@ConnectedSystems
Member

Thanks for your PR @DmitriyValetov

I have made a few minor comments and changes for your review.

The notebook is welcome, however I would like to place the notebook in a separate notebooks folder inside the examples directory, with a more informative filename (perhaps something like 'cholesky_correlation_sampling').

I would also like to see:

  1. More informative comments in the notebook, and
  2. Tests to be provided to ensure future changes do not break your contribution.

I can help with number 2 if you'd like.

@DmitriyValetov
Author

I will give it a try. So we need two tests: one for the correlation step and one for a correlated problem with analytical results?

@ConnectedSystems
Member

I will give it a try. So we need two tests: one for the correlation step and one for a correlated problem with analytical results?

Yes, we can also re-use the example given in your notebook as a high-level check to ensure expected covariance and mean values are being generated (or at least approximately so):

https://numpy.org/doc/stable/reference/generated/numpy.testing.assert_almost_equal.html

You can add the tests to this file I think.
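
A high-level check of that kind might look like the sketch below. The sampler `make_correlated_sample` and the test name are hypothetical stand-ins, not SALib functions; only `numpy.testing.assert_almost_equal` is taken from the link above.

```python
import numpy as np
from numpy.testing import assert_almost_equal


def make_correlated_sample(n, mean, cov, seed=0):
    # Hypothetical stand-in for the sampler under test.
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, len(mean)))
    return mean + z @ np.linalg.cholesky(cov).T


def test_sample_matches_target_moments():
    mean = np.array([0.0, 5.0])
    cov = np.array([[2.0, -0.5],
                    [-0.5, 1.0]])
    x = make_correlated_sample(50_000, mean, cov)
    # decimal=1 leaves room for the sampling error expected at this size.
    assert_almost_equal(x.mean(axis=0), mean, decimal=1)
    assert_almost_equal(np.cov(x, rowvar=False), cov, decimal=1)


test_sample_matches_target_moments()
```

A loose `decimal` is deliberate: the check is approximate by nature, so the tolerance has to cover sampling error rather than floating-point error.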

@DmitriyValetov
Author

Added a test for applying covariance to a Saltelli sample.
I looked for a while for an example test problem for the high-level check, but I failed to reproduce such examples directly by applying the Cholesky transformation to the resulting sample.

I have also come across several papers like: https://www.sciencedirect.com/science/article/abs/pii/S1364815215300153
There, Sobol is used "as is" after a Rosenblatt transformation, which converts the correlated variables to another set of independent variables. After the Sobol analysis of those variables, there is "a back mapping" to assign S1 and ST to the original variables.

I found an implementation of the independent Sobol version here: https://gitlab.com/CEMRACS17/shapley-effects. This library has errors (and has been dead for several years), but the independent Sobol part is OK. I have adopted this implementation, with small changes, in a library at my job.
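
To illustrate the idea of a Rosenblatt transformation, here is a minimal sketch for the bivariate normal case only, where the conditional distribution is known in closed form. It is not taken from the linked paper or from this PR; the function name is hypothetical.

```python
import math

import numpy as np

# Standard normal CDF built from math.erf, vectorised over arrays.
_std_normal_cdf = np.vectorize(
    lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
)


def rosenblatt_bivariate_normal(x, mean, cov):
    """Map correlated bivariate-normal rows to independent uniforms on [0, 1]."""
    s1 = math.sqrt(cov[0, 0])
    s2 = math.sqrt(cov[1, 1])
    rho = cov[0, 1] / (s1 * s2)
    # First coordinate: standardise with its marginal distribution.
    z1 = (x[:, 0] - mean[0]) / s1
    # Second coordinate: standardise with its distribution conditional on x1,
    # which is normal with mean mu2 + rho*s2*z1 and std s2*sqrt(1 - rho^2).
    z2 = (x[:, 1] - mean[1] - rho * s2 * z1) / (s2 * math.sqrt(1.0 - rho**2))
    return np.column_stack([_std_normal_cdf(z1), _std_normal_cdf(z2)])


rng = np.random.default_rng(1)
mean = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.8],
                [0.8, 1.5]])
x = rng.multivariate_normal(mean, cov, size=20000)
u = rosenblatt_bivariate_normal(x, mean, cov)

corr_before = np.corrcoef(x, rowvar=False)[0, 1]
corr_after = np.corrcoef(u, rowvar=False)[0, 1]
```

The correlation between the two columns should be clearly positive before the transformation and close to zero after it. Other joint distributions need their own conditional CDFs, which this sketch does not cover.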

@DmitriyValetov
Author

What about adding Sobol independent and full indices (they are specialized for problems with correlated inputs), plus the Shapley effects code from that library, shapley-effects (adapted, of course)?

(It is dead, but has relatively fresh issues that nobody is handling, so there is a need for it. I also need it sometimes.)

I can try to convert it to the sample-analyze format and add it to SALib.

@ConnectedSystems
Member

Yes, sounds good. We just need to be mindful of any licensing issues but you seem to have considered this already.
I can review and provide any additional suggestions, just let me know when ready.

Thanks for contributing!

@ConnectedSystems
Member

Just adding a reference to a related issue (#193)

@DmitriyValetov
Author

Shapley & Sobol methods added with tests and examples.
References to the source papers are in the corresponding notebooks in examples.

@DmitriyValetov
Author

Well, Shapley is a little unstable.

@jameswoodcock

Hi @DmitriyValetov. Thank you for implementing the covarianced sampling methods in SALib, it's a really useful addition to the library! I have been using the Shapley method for a model with correlated inputs. One of my input variables is categorical, and I was wondering whether you are aware of any issues with using categorical input variables to calculate Shapley effects. I couldn't find anything in the literature.

I've included the categorical distribution by adding an OpenTURNS UserDefined distribution to distrs.py, with the weights set to the probability of each category (https://openturns.github.io/openturns/latest/user_manual/_generated/openturns.UserDefined.html), and this seems to give sensible answers. Do you know if there's a better way of doing this, or does this seem like a sensible approach?

Thank you!

@DmitriyValetov
Author

DmitriyValetov commented Mar 13, 2021

Hi @jameswoodcock. I have never analyzed data with categorical features using SALib or similar methods. But there is another way that also uses the Shapley method, though with a different approach. Have you heard about the Shap package (https://github.com/slundberg/shap)? It is usually used to interpret boosting methods. You can analyze the data as is this way:

  1. Prepare the dataset (real tabular data, or data generated from your OpenTURNS joint distribution) as a pandas.DataFrame.
  2. Train one of the boosting methods: catboost or lightgbm (I prefer catboost. By the way, xgboost couldn't work with categorical data the last time I touched it.).
  3. Apply Shap to the trained model.
  4. Obtain Shap values, in a data-driven way.

My friend from bioinformatics usually goes this way.

Also, if this model-based approach is convenient, give https://github.com/oegedijk/explainerdashboard a try.

@jameswoodcock

@DmitriyValetov thank you for your reply. I hadn't heard about the Shap package before, it looks really useful. I'll give it a try!

@mschrader15

mschrader15 commented Jul 8, 2022

Hey all, is there a reason why this hasn't been merged into main?

A paper was recently published that builds on the aforementioned https://www.sciencedirect.com/science/article/abs/pii/S1364815215300153

@ConnectedSystems
Member

Hi @mschrader15

The current implementation has dependencies on a few big external packages. We're currently looking into how best to reduce/remove these dependencies so that SALib continues to be relatively self-contained.

I'm not able to do this very quickly as I, and other maintainers here, don't have much time available currently. But if you're willing to look under the hood we'd welcome any contribution.

@mschrader15

mschrader15 commented Jul 8, 2022

Totally understand. I really just wanted to make sure that it wasn't due to implementation errors. I forked and merged on my own branch for a time critical use case (no offense meant btw, @DmitriyValetov, this was a big effort and really helpful to my research!).

I've contacted the authors of https://www.sciencedirect.com/science/article/pii/S0307904X21002122 to see if they would be interested in sharing code / supporting integration into SALib. If so, I'll organize a PR (eventually)

@willu47 willu47 changed the base branch from master to main July 12, 2022 06:33
@tupui
Member

tupui commented Aug 11, 2022

Regarding the dependency on OpenTURNS: I had a look, and it seems to me that it is mainly used for copulas to sample distributions.

Copulas are also present in statsmodels (they will not get into SciPy, since that was rejected, which motivated the work in statsmodels), but that would add an extra dependency, as SALib does not have statsmodels either.

I am not sure it would really be in scope for the library to pull copulas in here. Otherwise we would need another way to sample multivariate distributions. I am checking, but what we added to SciPy to sample arbitrary distributions is just 1D.

I would maybe suggest doing this in 2 steps. First, add Shapley by itself, as it does not really seem to require copulas (the normal copula is just the classical multivariate normal distribution). Then think about what to do with the correlated version of Sobol'. (A first step could be to only support multivariate normal distributions.)
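
As an illustration of the normal copula mentioned above, the following sketch draws correlated normals, maps them to dependent uniforms with the standard normal CDF, and then applies inverse CDFs of two marginals. It uses NumPy only; the function name and the chosen marginals are assumptions for the example, not anything from SALib, OpenTURNS, or statsmodels.

```python
import math

import numpy as np


def gaussian_copula_uniforms(n, corr, rng):
    """Draw n rows of dependent uniforms from a Gaussian copula (illustrative)."""
    # Correlated standard normals via the Cholesky factor of the correlation matrix.
    z = rng.standard_normal((n, corr.shape[0])) @ np.linalg.cholesky(corr).T
    # The standard normal CDF maps each column to a uniform on [0, 1].
    phi = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))
    return phi(z)


rng = np.random.default_rng(2)
corr = np.array([[1.0, 0.7],
                 [0.7, 1.0]])
u = gaussian_copula_uniforms(20000, corr, rng)

# Arbitrary marginals via their inverse CDFs (chosen only for illustration).
x_uniform = 2.0 + 3.0 * u[:, 0]        # Uniform(2, 5)
x_expon = -np.log1p(-u[:, 1]) / 0.5    # Exponential with rate 0.5 (mean 2)

dependence = np.corrcoef(x_uniform, x_expon)[0, 1]
```

With identity marginal transforms replaced by normal inverse CDFs, this reduces to sampling a multivariate normal, which is why supporting only that case is a natural first step.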

@ConnectedSystems
Member

ConnectedSystems commented Sep 24, 2022

Regarding copulas @tupui do you think the Copulas package would be a sufficient alternative?

Seems like it would be a more lightweight dependency than statsmodels.

Otherwise, I agree that a first step would be to support multivariate normal distributions.

@tupui
Member

tupui commented Sep 26, 2022

Regarding copulas @tupui do you think the Copulas package would be a sufficient alternative?

Seems like it would be a more lightweight dependency than statsmodels.

It would be enough in terms of features sure, but it's still an additional dependency.

In practice, I am wondering if we could not just go with 1 or 2 Copulas and if that's the case, we might as well just add these ourselves (I can).

Another possibility is to have clear instructions on how to use external libraries to generate a correlated sample. It could be argued that sampling is not really the responsibility of SALib.

@tupui
Member

tupui commented Mar 17, 2023

FYI, the Copulas package is no longer an option, as they changed to an incompatible license.

https://github.com/sdv-dev/Copulas/blob/master/LICENSE

@ConnectedSystems
Member

Apologies @DmitriyValetov , the contribution here is really valuable, I got waylaid with finishing my PhD.

If you don't mind I will try splitting off your contribution into smaller Pull Requests starting with the Shapley method.
This could be done over the next month or so.

@kka1996

kka1996 commented May 28, 2024

Hi everyone,

Has the method for covarianced sampling been merged into main now? I did not find the method in the docs.

@tupui
Member

tupui commented May 28, 2024

@kka1996 this merge request is still open, so no, the method is not yet available.


6 participants