DEVPROD-8323 Add query timeout to evergreen db client #8148
DEVPROD-8323
Description
The jobs attached to this ticket all appear to have timed out after running excessively long queries against the MCI cluster. For this job, for example, the 2-hour runtime is explained by two consecutive queries to the tasks collection, each of which ran for 60 minutes before the socket was closed. These queries stick out as big outliers in the DB cluster query insights. From what I can tell (DB cluster logs don't go back that far), the same is true for the other linked examples in the ticket.
I dug for a while, and the 1-hour timeout doesn't appear to be related to anything in our DB client or Kanopy settings. It is most likely configured somewhere in our network infrastructure that isn't visible to us.
We could add more `MaxTime` values to the jobs that were getting stuck, but I think as an initial step we can just set a max query time for the MCI cluster and see if that resolves the issue. Looking at DB activity for the cluster, 5 minutes seems like a good conservative limit, but I'm fine to tweak this if there are concerns about any workflows that might run a query that long. This seems to have been effective in #7879.