
Should have \sum_n W_nk = 1 error with pymbar3.0.5 while analysis is working with pymbar3.0.3 #419

hannahbaumann opened this issue Nov 25, 2020 · 30 comments

@hannahbaumann

Hi,

I'm analyzing free energy calculations (Gromacs-based) using alchemlyb. The analysis fails with the following error:
Warning: Should have \sum_n W_nk = 1. Actual column sum for state 0 was 0.835621. 21 other columns have similar problems.
The pymbar version here was 3.0.5

Running the same input files/script in a different environment with pymbar==3.0.3 didn't raise this error, the script was able to calculate the free energy difference.
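For context, the warning refers to a basic MBAR invariant: the N x K weight matrix must have every column sum to 1 once the free energies have converged. Here is a minimal numpy sketch of that kind of check (an illustration only, not pymbar's actual code; the function name and tolerance are made up):

```python
import numpy as np

def column_sum_check(W, tol=1e-4):
    """Return column sums of an N x K MBAR weight matrix and a mask of
    columns that violate sum_n W_nk = 1 beyond the tolerance."""
    col_sums = W.sum(axis=0)
    bad = np.abs(col_sums - 1.0) > tol
    return col_sums, bad

# Toy 4x2 weight matrix whose columns sum to 1 (the converged case):
W_ok = np.full((4, 2), 0.25)
sums, bad = column_sum_check(W_ok)
print(sums, bad.any())   # [1. 1.] False

# Scale one column to mimic unconverged free energies:
W_bad = W_ok.copy()
W_bad[:, 0] *= 0.8       # column 0 now sums to 0.8, like the 0.835621 above
sums, bad = column_sum_check(W_bad)
print(bad)               # [ True False]
```

A column sum well below 1 (such as the 0.835621 reported here) is therefore a symptom of the solver stopping before the free energies converged, not of bad input data per se.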

The issue seems to be similar to other issues, see e.g. #320

Here is a folder with the .xvg files from one of the runs, a python script for the analysis, and the dependencies in the two different environments:
https://drive.google.com/drive/folders/1lKncHmRj38OY8H2YjsnYHdIpEcvxKKpX?usp=sharing
Let me know if you need more information!

@davidlmobley

Tagging @mrshirts

@jaimergp
Member

This is the diff between two versions: 3.0.3...3.0.5

This particular change in the default tolerance might be interesting to analyze.

Otherwise, the standard approach here is to run git bisect run ... between the two tags and let the script's exit code reveal the offending commit. For this you'll need a local copy of the repo and a fresh environment where pymbar is installed in editable mode. Roughly:

# get pymbar env file for the 3.0.x branch
% wget https://github.com/choderalab/pymbar/raw/3.0_branch/devtools/conda-envs/test_env.yaml
% conda env create -n pymbar-dev -f test_env.yaml
% conda activate pymbar-dev
% git clone https://github.com/choderalab/pymbar
% cd pymbar
% git checkout 3.0.3  # start with the working state
% pip install -e .    # editable install tracks each bisect checkout
% git bisect start 3.0.5 3.0.3  # bad tag first, then the known-good tag
% git bisect run python your_script.py  # non-zero exit marks a commit bad

@xiki-tempula

This is similar to the issue that I got in #407

@jaimergp
Member

jaimergp commented Dec 1, 2020

Ok, I think I have found the issue.

History is not linear at that point because both #295 and #293 were being developed in parallel back then. The dict API change makes it a bit more complicated to test, because the provided script breaks between the merge of #293 (released in 3.0.4) and 3.0.5.

Anyway, the error should lie in this diff: fe9ea53...fe57313

Hope that helps debug the actual issue!

@mikemhenry
Contributor

Also running into this issue when following the openmm alchemical free energy tutorial

@rsdefever

I'm also having this issue where I get the error with 3.0.5 but not 3.0.3. It doesn't happen for all of my systems.

In case it helps with debugging, I have attached a minimal working example with energy distributions for which 3.0.3 gives a result and 3.0.5 fails with the Should have \sum_n W_nk = 1 error. The attached zip file contains three things: u_kn.txt, N_k.txt, and a short python script that loads the matrices and calls pymbar.

Thanks in advance!

pymbar_debug.zip

@jchodera
Member

@mrshirts Do you have the resources to look into this right now? I think this must have been introduced in the overhaul.

@mrshirts
Collaborator

Let me take a look over the weekend (Mar 13-15) and see what I can identify.

@mrshirts
Collaborator

I don't know if this is the problem every time, but I suspect that for the majority of 3.0.3 -> 3.0.5 issues the default solver switched (which is good - the "adaptive" solver is much more consistent) while the default maximum number of iterations stayed too low (250), so MBAR doesn't run to completion.

It's actually a bit of a pain to increase the maximum number of iterations in 3.0; the way to do it is to invoke MBAR like this:

    solver_options = {"maximum_iterations": 10000, "verbose": True}
    solver_protocol = {"method": "adaptive", "options": solver_options}
    mbar = MBAR(u_kn, N_k, solver_protocol=(solver_protocol,))

Let me know how many of the issues are addressed with this change. We will look at better ways of passing in parameters for 4.0.

@rsdefever

rsdefever commented Mar 22, 2021

This helped for some cases that were failing in 3.0.5. However, I still have several cases that result in the Warning: Should have \sum_n W_nk = 1 error in 3.0.5 but seem to converge in 3.0.3. Interestingly, the cases that fail require far fewer solver iterations. When I switch the solver in 3.0.5 to BFGS, the warning goes away.

If it would be helpful, I can pull together some more MWEs.

Thanks!

@mrshirts
Collaborator

Yes, I would very much like to see those examples.

@rsdefever

rsdefever commented Mar 22, 2021

Attached is another MWE. The zip file contains u_kn.txt, N_k.txt, and test.py, a short python script that loads the matrices and calls pymbar. The script has blocks for 3.0.3, 3.0.5 + adaptive, and 3.0.5 + BFGS. At least for me, 3.0.3 and 3.0.5 + BFGS work, while 3.0.5 + adaptive results in the \sum_n W_nk error. Let me know if you have any questions. Thanks!

pymbar_debug_solver.zip

@mrshirts
Collaborator

OK, looking into this; adaptive is terminating too soon. However, it does look like one CAN use BFGS with 3.0.5, so at least the correct results can be obtained without changing the code itself! Better automated handling is, of course, preferable when possible.
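For readers curious why BFGS is a reasonable alternative here: the MBAR equations are also the stationarity conditions of a convex objective whose gradient components are N_k (sum_n W_nk - 1), so a tightly minimized solution satisfies the column-sum invariant by construction. A self-contained toy sketch (made-up harmonic data, not the attached MWE, and not pymbar's implementation) using scipy's BFGS:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: three harmonic states with force constants 1, 4, 9 centered
# at x = 0, sampled exactly. Their reduced free energies relative to
# state 0 are analytically 0.5*ln(4) and 0.5*ln(9).
rng = np.random.default_rng(1)
kf = np.array([1.0, 4.0, 9.0])
N_k = np.array([100, 100, 100])
x = np.concatenate([rng.normal(0.0, 1.0 / np.sqrt(k), n) for k, n in zip(kf, N_k)])
u_kn = 0.5 * kf[:, None] * x[None, :] ** 2     # K x N reduced potentials

def mbar_objective(f_k):
    # F(f) = sum_n log sum_k N_k exp(f_k - u_kn) - sum_k N_k f_k
    a = np.log(N_k)[:, None] + f_k[:, None] - u_kn
    m = a.max(axis=0)
    log_denom = m + np.log(np.exp(a - m).sum(axis=0))
    return log_denom.sum() - (N_k * f_k).sum()

res = minimize(mbar_objective, np.zeros(len(N_k)), method="BFGS")
f_k = res.x - res.x[0]                         # gauge: report against state 0
print(f_k)                                     # close to [0, 0.693, 1.099]

# Weight matrix from the solution; every column should sum to ~1,
# which is the invariant the error message in this thread checks.
a = np.log(N_k)[:, None] + f_k[:, None] - u_kn
m = a.max(axis=0)
log_denom = m + np.log(np.exp(a - m).sum(axis=0))
W = np.exp(f_k[:, None] - u_kn - log_denom[None, :]).T
print(W.sum(axis=0))
```

This is only a sketch of the general technique; the hard datasets in this thread are precisely the ones where a given minimizer may stall before the gradient (and hence the column sums) is small enough.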

@jchodera
Member

Why not have it fall back to a more robust solver if it fails to converge with the requested solver?
This robust fallback behavior can be the default, but you can let people disable it if they want to live dangerously.

We should always provide robust results by default if at all possible, and force people to specify optional arguments if they want to try something specific.

@mrshirts
Collaborator

Why not have it fall back to a more robust solver if it fails to converge with the requested solver?

Adaptive was previously the more robust solver. This is the first case where BFGS worked and adaptive didn't (which is why we reintroduced adaptive and made it the default in 3.0.5 - MOST of the issues were caused by improper handling of the minimum number of iterations). Basically, we need to assemble a set of hard problems, see what the best overall strategy may be, and provide clearer details about other options to take when it fails. In this particular case, self-consistent iteration is getting stuck with the free energies not changing between iterations, apparently because of rounding errors.

Going through all the cases and finding a robust scheme is probably going to have to wait until classes are over for me (last week in April), at which point getting a full 4.0 release is a high priority (90% done, except for things like this).
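To illustrate the self-consistent iteration being discussed, here is a plain-numpy sketch on easy toy data (not pymbar's implementation; on well-overlapping states like these it converges quickly and will not reproduce the rounding-error stall). Each sweep updates f_k from the current weights, and at a fixed point every column of W sums to 1, which is exactly the property the error message checks:

```python
import numpy as np

def logsumexp(a, axis):
    m = a.max(axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.exp(a - m).sum(axis=axis))

# Toy data: three overlapping harmonic states (made-up parameters).
rng = np.random.default_rng(0)
mu = np.array([0.0, 0.5, 1.0])
N_k = np.array([200, 200, 200])
x = np.concatenate([rng.normal(m, 1.0, n) for m, n in zip(mu, N_k)])
u_kn = 0.5 * (x[None, :] - mu[:, None]) ** 2   # K x N reduced potentials

f_k = np.zeros(len(mu))
for iteration in range(10000):
    # log denominator per sample: log sum_j N_j exp(f_j - u_jn)
    log_denom = logsumexp(np.log(N_k)[:, None] + f_k[:, None] - u_kn, axis=0)
    # self-consistent update: f_k = -log sum_n exp(-u_kn) / denom_n
    f_new = -logsumexp(-u_kn - log_denom[None, :], axis=1)
    f_new -= f_new[0]                           # gauge: fix f_0 = 0
    delta = np.abs(f_new - f_k).max()
    f_k = f_new
    if delta < 1e-12:                           # converged
        break

log_denom = logsumexp(np.log(N_k)[:, None] + f_k[:, None] - u_kn, axis=0)
W = np.exp(f_k[:, None] - u_kn - log_denom[None, :]).T   # N x K weights
print(iteration, W.sum(axis=0))   # at convergence each column sums to ~1
```

The stall described above corresponds to delta reaching exactly zero (the update no longer changes f_k in floating point) before the weights are actually normalized to within tolerance.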

@jchodera
Member

Based on what you said above, wouldn't simply falling back to adaptive---and then to BFGS if that fails---work in all the cases we know about?

@mrshirts
Collaborator

Probably? But I'd rather get everything incorporated into 4.0 and understand WHY these cases are failing in different places. It will just take some undivided hours to get everything together - and to update the documentation, etc.

@xiki-tempula

@rsdefever Thanks for the example - I'm interested in reproducing this problem, but the link you provided seems to be broken. Could I have the files, please? Thank you.

@rsdefever

@xiki-tempula could you check again? I was able to download the zip from the link just now.

@xiki-tempula

@rsdefever Thank you. It seems that changing browsers allowed me to download it. Would you mind if I put this file in https://github.com/alchemistry/alchemtest to check the stability of the estimator?

@rsdefever

@xiki-tempula that is absolutely ok.

@zhang-ivy

@jchodera @mrshirts What's the status of this issue? Will there be a bug fix release soon? I just ran into this issue as well and fixed it by installing from master (I think PR #425 fixes the issue for the adaptive solver).

(Note: I was having trouble installing pymbar 3.0.3 in my existing environment because it uses python 3.8.11, which is why I installed from master instead)

@mrshirts
Collaborator

Semester is over, so I'm planning on revisiting things over the next few weeks for a FINAL 4.0 release; I should have time to play around with some options. I have the two data sets, so should be able to test with both of those. Maybe a target of Jan 10th?

@mikemhenry
Contributor

@mrshirts Just wanted to give this issue a nudge :)

@mrshirts
Collaborator

Hah, thanks. It was a crazy break (Colorado wildfires, car crash, illness). Still on my radar, but may be another couple of weeks.

@sef43

sef43 commented Jul 14, 2023

I was getting this error on the OpenMM Alchemical free energy tutorial, but it has now gone away after fixing a bug in the simulation that would have been incorrectly sampling states: openmm/openmm-cookbook#28

@ijpulidos
Contributor

We are hitting the same issue with pymbar 4 for our perses GPU CI tests.

I have been able to reproduce it locally, and if I use pymbar 3 the tests run just fine. We probably need to dive into this in more detail and maybe come up with an easy-to-reproduce example, since these tests take at least 20 minutes on a decent GPU.

@mrshirts
Collaborator

There are about 20 different optimizers for MBAR now (adaptive or any of the scipy ones), so it would be good to know which options are failing, and whether they fail because they reach the maximum number of iterations or through some other failure mode.

Some data sets are going to be numerically impossible to converge, so perhaps we need a better reporting framework for when a given dataset has problems - this particular error is uninformative.

@ijpulidos
Contributor

I hope this can help debug the problem. I'm attaching a .tar.gz package with the required .nc files and a python script to reproduce the problem. Just copy the .nc files and the python script into a directory and run python process_with_pymbar.py

The output with pymbar 4.0.x is as follows:

Warning on use of the timeseries module: If the inherent timescales of the system are long compared to those being analyzed, this statistical inefficiency may be an underestimate.  The estimate presumes the use of many statistically independent samples.  Tests should be performed to assess whether this condition is satisfied.   Be cautious in the interpretation of the data.

****** PyMBAR will use 64-bit JAX! *******
* JAX is currently set to 32-bit bitsize *
* which is its default.                  *
*                                        *
* PyMBAR requires 64-bit mode and WILL   *
* enable JAX's 64-bit mode when called.  *
*                                        *
* This MAY cause problems with other     *
* Uses of JAX in the same code.          *
******************************************

Warning: The openmmtools.multistate API is experimental and may change in future releases
Warning: The openmmtools.multistate API is experimental and may change in future releases

******* JAX 64-bit mode is now on! *******
*     JAX is now set to 64-bit mode!     *
*   This MAY cause problems with other   *
*      uses of JAX in the same code.     *
******************************************

Failed to reach a solution to within tolerance with hybr: trying next method
WARNING: Did not converge to within specified tolerance.
max_delta = 0.000000e+00, tol = 1.000000e-12, maximum_iterations = 10000, iterations completed = 9999
Failed to reach a solution to within tolerance with adaptive: trying next method
No solution found to within tolerance.
The solution with the smallest gradient 5.107602e+02 norm is hybr
Please exercise caution with this solution and consider alternative methods or a different tolerance.
Traceback (most recent call last):
  File "/home/user/mambaforge/envs/perses-dev/lib/python3.10/site-packages/openmmtools/multistate/multistateanalyzer.py", line 1923, in _compute_free_energy
    Deltaf_ij, dDeltaf_ij = self.mbar.getFreeEnergyDifferences()
AttributeError: 'MBAR' object has no attribute 'getFreeEnergyDifferences'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/workdir/debugging/pymbar/process_with_pymbar.py", line 10, in <module>
    f_ij, df_ij = analyzer.get_free_energy()
  File "/home/user/mambaforge/envs/perses-dev/lib/python3.10/site-packages/openmmtools/multistate/multistateanalyzer.py", line 1965, in get_free_energy
    self._compute_free_energy()
  File "/home/user/mambaforge/envs/perses-dev/lib/python3.10/site-packages/openmmtools/multistate/multistateanalyzer.py", line 1926, in _compute_free_energy
    results = self.mbar.compute_free_energy_differences()
  File "/home/user/mambaforge/envs/perses-dev/lib/python3.10/site-packages/pymbar/mbar.py", line 698, in compute_free_energy_differences
    Theta_ij = self._computeAsymptoticCovarianceMatrix(
  File "/home/user/mambaforge/envs/perses-dev/lib/python3.10/site-packages/pymbar/mbar.py", line 1793, in _computeAsymptoticCovarianceMatrix
    check_w_normalized(W, N_k)
  File "/home/user/mambaforge/envs/perses-dev/lib/python3.10/site-packages/pymbar/utils.py", line 371, in check_w_normalized
    raise ParameterError(
pymbar.utils.ParameterError: Warning: Should have \sum_n W_nk = 1. Actual column sum for state 0 was 0.000000. 14 other columns have similar problems. 
This generally indicates the free energies are not converged.

whereas with pymbar 3.1.1 you shouldn't get any errors. I tried comparing the relevant code paths between the two versions and couldn't see any difference that would explain the error. I hope this helps with the debugging.

pymbar4-test.tar.gz

@pitmanme

pitmanme commented Feb 20, 2024

This is more for users: if anybody is running into this issue, one option is to downgrade pymbar. If you want to continue with pymbar 4.0, you can also set the solver to robust to avoid this bug:

    analyzer = MultiStateSamplerAnalyzer(reporter, max_n_iterations=n_iterations, analysis_kwargs={"solver_protocol": "robust"})
