
Parallel optimization of RHS Noises #159

Closed

aferragu opened this issue Aug 4, 2018 · 3 comments
aferragu (Contributor) commented Aug 4, 2018

Hi Oscar:

We are working with @rodripor on a hydro-thermal scheduling model. The main difference between our model and your examples is that we have two timescales (hours and days), so every stage has 24 times as many variables, to account for demand and generation throughout the day, with a daily water-balance constraint. The days are the stages of our problem formulation.

Today we started to analyze the parallel version of your implementation, with the Asynchronous() solver. I refactored our code to use the asynchronous version (which was extremely easy thanks to your implementation :-)) and ran it on an AWS c5.18xlarge instance (72 cores, 144 GB RAM). However, the performance gains were not that big (compared to my computer), and I would like to understand why and hint towards a possible solution.

If I understand your code correctly, the Asynchronous() solver performs several forward-backward passes of the algorithm in parallel and shares the cuts generated among them. However, if I understand the SDDP algorithm correctly, in each stage of each backward pass you have to optimize over every possible noise realization and average the results to get the cut. So, if the number of possible noise outcomes is large, every backward pass has a lot to do in each stage, and this work is not parallelized (please correct me if anything I said is wrong!).

In our case, we have 365 stages and approximately 100 noise realizations (it is a bit more complicated due to the Markov states, but never mind). It seems to me that, in our case, it would be better to perform a single backward pass, but in every stage of that pass parallelize the optimization over the noise realizations. The pass would be faster, and the cuts would be available for the next iteration.
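Something like this toy sketch (not SDDP.jl code; solve_noise is just a stand-in for the real stage subproblem solve): distribute the per-noise solves of a single stage across workers with pmap, then average the results into a cut.

using Distributed
addprocs(4)  # workers to fan the noise solves across

@everywhere function solve_noise(noise::Float64)
    # stand-in for solving the stage subproblem with this RHS noise;
    # returns (objective value, dual vector)
    objective = 1.0 + noise^2
    duals = [2noise, -noise]
    return objective, duals
end

noises = randn(100)  # ~100 realizations per stage, as in our model
probabilities = fill(1 / length(noises), length(noises))

results = pmap(solve_noise, noises)  # one subproblem solve per worker task

# probability-weighted average of objectives and duals gives the cut
intercept = sum(p * first(r) for (p, r) in zip(probabilities, results))
gradient  = sum(p .* last(r) for (p, r) in zip(probabilities, results))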

A more clever thing to do (daydreaming now) would be to split the available cores: if you have n noise realizations and M > n cores, then parallelize the n RHS-noise optimizations and also do M/n forward-backward passes in parallel, with n cores devoted to each one.

Let me know if you think this is feasible (or plain wrong :-)), and point us in the right direction, and maybe we can contribute with some of this.

odow (Owner) commented Aug 4, 2018

Everything you said is correct :)

I chose the asynchronous version because it was easy to implement. You can easily think up much more ambitious parallelization schemes (as you have done).

The main issue with the asynchronous version is that with lots of cores, you end up doing lots of iterations with poor approximations of the value function. (Consider the first iteration. Each core does an iteration with no value information, so you discover 72 useless cuts.) The second issue is that the master process collecting the cuts and sharing them back out gets a bit overloaded.

> Let me know if you think this is feasible (or plain wrong :-)),

Sounds fantastic!

> point us in the right direction, and maybe we can contribute with some of this.

A basic outline would be:

Step 1. Define a new subtype of SDDPSolveType just like

struct Asynchronous <: SDDPSolveType
    slaves::Vector{Int} # pids of the slave processors
    step::Float64       # number of iterations before introducing another slave
end
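For the scheme discussed here, a hypothetical subtype (the name and fields are illustrative, not part of SDDP.jl) might be:

struct ParallelNoises <: SDDPSolveType
    slaves::Vector{Int} # pids of the workers that solve the noise realizations
end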

Step 2. Overload the solve method using your new parallel type just like

function JuMP.solve(async::Asynchronous, m::SDDPModel{T}, settings::Settings=Settings()) where T

Write a new solve method using liberal copy and paste until you get something working. Then we can clean it up and integrate it more cleanly.
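A skeleton for the new method, assuming the hypothetical ParallelNoises type sketched above, might look like:

function JuMP.solve(p::ParallelNoises, m::SDDPModel{T},
                    settings::Settings = Settings()) where T
    # forward pass on the master; in the backward pass, fan the noise
    # solves out across p.slaves, add the averaged cut, and check
    # convergence (copy the details from the Asynchronous implementation)
end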

You should be able to pull in much of the existing machinery. In fact, you should be able to take most of the serial code, and modify this function:

function solvesubproblem!(::Type{BackwardPass}, m::SDDPModel, sp::JuMP.Model, incoming_probability::Float64=1.0)
    ex = ext(sp)
    if hasnoises(sp)
        # solve the subproblem once per noise realization, recording the
        # objective, probability, and duals needed to construct the cut
        for i in 1:length(ex.noiseprobability)
            setnoise!(sp, ex.noises[i])
            JuMPsolve(BackwardPass, m, sp)
            push!(m.storage.objective, getobjectivevalue(sp))
            push!(m.storage.noise, i)
            push!(m.storage.probability, incoming_probability * ex.noiseprobability[i])
            push!(m.storage.modifiedprobability, incoming_probability * ex.noiseprobability[i])
            push!(m.storage.markov, ex.markovstate)
            push!(m.storage.duals, zeros(nstates(sp)))
            saveduals!(m.storage.duals[end], sp)
        end
    else
        # no noises: a single solve suffices
        JuMPsolve(BackwardPass, m, sp)
        push!(m.storage.objective, getobjectivevalue(sp))
        push!(m.storage.noise, 0)
        push!(m.storage.probability, incoming_probability)
        push!(m.storage.modifiedprobability, incoming_probability)
        push!(m.storage.markov, ex.markovstate)
        push!(m.storage.duals, zeros(nstates(sp)))
        saveduals!(m.storage.duals[end], sp)
    end
end
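For example, the noise loop above might be fanned out like this. A rough sketch only: solve_noise_on_worker is a hypothetical helper, not an SDDP.jl function, and it assumes every worker already holds its own synchronized copy of the subproblems, so only indices and results cross process boundaries, never JuMP models.

function parallel_backward_noises!(m::SDDPModel, sp::JuMP.Model, stage::Int,
                                   incoming_probability::Float64 = 1.0)
    ex = ext(sp)
    markov = ex.markovstate
    # each worker solves noise realization i on its local copy of the
    # (stage, markov) subproblem and sends back only (objective, duals)
    results = pmap(i -> solve_noise_on_worker(stage, markov, i),
                   1:length(ex.noiseprobability))
    for (i, (objective, duals)) in enumerate(results)
        push!(m.storage.objective, objective)
        push!(m.storage.noise, i)
        push!(m.storage.probability, incoming_probability * ex.noiseprobability[i])
        push!(m.storage.modifiedprobability, incoming_probability * ex.noiseprobability[i])
        push!(m.storage.markov, markov)
        push!(m.storage.duals, duals)
    end
end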

Some things to consider:

  • You can't serialize JuMP models between cores. However, you don't want to rebuild them each time either, so you really want to associate models with cores and keep them there (see the sketch after this list).
  • My approach was to build full models on all the cores and keep them all in sync. (That hurts our memory requirement, though. Side note: the new version of JuMP will enable us to halve our memory requirement.)
  • It's actually really, really fast to solve repeated RHSs in sequence, so it's probably faster to just parallelize the realizations within a Markov state.
  • Since we're a long way from the metal (SDDP.jl > JuMP > Gurobi.jl > C), the overhead might be quite large. This might be a lot of work for little benefit. (But I have no idea of the magnitude.)
  • Before you embark on this, have a look at Improve JuMP speed #62. In fact, just go through JuMPfunctions and comment out anything you don't need. This can give an improvement.
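On the first point, one pattern (a sketch; build_subproblems is a stand-in for the real model construction) is to build the models once on each worker and keep them in a worker-local store, so later remote calls reuse them in place:

using Distributed
addprocs(2)

@everywhere begin
    # worker-local store: (stage, markov) => model; models never leave
    # the process they were built on
    const LOCAL_MODELS = Dict{Tuple{Int,Int},Any}()

    # stand-in for building the real JuMP subproblems on this worker
    build_subproblems() = Dict((t, 1) => "model for stage $t" for t in 1:3)

    function ensure_models!()
        isempty(LOCAL_MODELS) && merge!(LOCAL_MODELS, build_subproblems())
        return nothing
    end
end

# build once per worker; subsequent remote calls reuse LOCAL_MODELS
for w in workers()
    remotecall_wait(ensure_models!, w)
end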

It would be super great if you wanted to attack this. If you need any help, just develop everything on a branch, submit a PR, and we can discuss it there!

aferragu (Contributor, Author) commented Aug 4, 2018

> The main issue with the asynchronous version is that with lots of cores, you end up doing lots of iterations with poor approximations of the value function. (Consider the first iteration. Each core does an iteration with no value information, so you discover 72 useless cuts.)

That's exactly why I started looking into this. In fact, why do you use step < 1 then? It seems to me that step = 1 would lead to a first pass, then add another slave and do two parallel passes, then four, and so on exponentially until all cores are in use... or maybe just unleash all the cores only after the first pass.

As for the PR, let me look into this a little, since I'm not an expert in Julia, but it would be great if we could help. Do I have to fork the repository to create a branch, or can I just branch it locally and then submit the PR?

odow (Owner) commented Jan 26, 2021

Closing because I never implemented this, and the above hints are no longer relevant. In addition, the overhead (in my developer time) of maintaining another hard-to-test parallelization scheme probably outweighs the benefits.

odow closed this as completed on Jan 26, 2021