Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED #9899

Merged
merged 8 commits into from
Feb 6, 2020
Merged

dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED #9899

merged 8 commits into from
Feb 6, 2020

Conversation

junr03
Copy link
Member

@junr03 junr03 commented Jan 31, 2020

Description: this PR adds logic to the DnsResolverImpl to destroy and re-initialize its c-ares channel under certain circumstances. A better option would require work in c-ares c-ares/c-ares#301.
Risk Level: med changes in low-level DNS resolution.
Testing: unit tests

Fixes #4543

Signed-off-by: Jose Nino jnino@lyft.com

Jose Nino added 2 commits January 31, 2020 14:01
wip
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
@junr03
Copy link
Member Author

junr03 commented Jan 31, 2020

opening as draft until I write tests monday.

cc @htuch as the original writer.

Signed-off-by: Jose Nino <jnino@lyft.com>
fmt
Signed-off-by: Jose Nino <jnino@lyft.com>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for investigating and fixing this issue. Very excited to see this finally understood and resolved.

source/common/network/dns_impl.cc Outdated Show resolved Hide resolved
source/common/network/dns_impl.cc Outdated Show resolved Hide resolved
source/common/network/dns_impl.cc Show resolved Hide resolved
source/common/network/dns_impl.h Outdated Show resolved Hide resolved
@mattklein123 mattklein123 self-assigned this Jan 31, 2020
@junr03
Copy link
Member Author

junr03 commented Feb 1, 2020

Whoops did not mean for you to review my half baked code. Thanks for the comments though, I agree with all of them @mattklein123. Will update and write tests Monday.

Jose Nino added 3 commits January 31, 2020 16:51
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
Copy link
Member Author

@junr03 junr03 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattklein123 updated with your comments

// The channel cannot be destroyed and reinitialized here because that leads to a c-ares
// segfault.
if (status == ARES_ECONNREFUSED) {
parent_.dirty_channel_ = true;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this for a bit and debated between returning here, vs. allowing execution to continue and eventually call the callback_.

I ended up going for allowing the callback for a couple reasons:

  1. Ease of testing, as having the callback allows for nice event loop triggers and a way to test assertions on the address_list.
  2. It preserves previous behavior where every time resolution completed, whether successfully or not (with the exception of destruction), the callback_ was called. Obviously after fallbacks were exhausted, and if the query was not cancelled.

wdyt?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree we need to call the callback. As you already pointed out this is related to the other issue, so let's just fix the callbacks to expose errors in a future change.

@junr03 junr03 marked this pull request as ready for review February 5, 2020 18:41
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work investigating this and fixing! Very excited to see this finally understood.

// The channel cannot be destroyed and reinitialized here because that leads to a c-ares
// segfault.
if (status == ARES_ECONNREFUSED) {
parent_.dirty_channel_ = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree we need to call the callback. As you already pointed out this is related to the other issue, so let's just fix the callbacks to expose errors in a future change.

Comment on lines +613 to +615
// If that flag needs to be set, or c-ares changes its handling this test will need to be updated
// to create another condition where c-ares invokes onAresGetAddrInfoCallback with status ==
// ARES_ECONNREFUSED.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this is a little fragile but I think it's OK for now. Nice work figuring out a way to test this!

@junr03 junr03 merged commit a8a43cb into envoyproxy:master Feb 6, 2020
rebello95 added a commit to envoyproxy/envoy-mobile that referenced this pull request Feb 6, 2020
Bumping to include:
- `dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED`: envoyproxy/envoy#9899. This should resolve the issues we've been seeing with Envoy Mobile hanging on startup and never properly issuing requests if started in the offline state on iOS
- `gzip: add force load factory declaration`: envoyproxy/envoy#9942. This will allow us to use the gzip filter with Envoy Mobile

Signed-off-by: Michael Rebello <me@michaelrebello.com>
rebello95 added a commit to envoyproxy/envoy-mobile that referenced this pull request Feb 10, 2020
Bumping to include the following fixes:
- `dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED`: envoyproxy/envoy#9899. This should resolve the issues we've been seeing with Envoy Mobile hanging on startup and never properly issuing requests if started in the offline state on iOS. Fixes #672, though more improvements to DNS resolution will be investigated in #673
- `gzip: add force load factory declaration`: envoyproxy/envoy#9942. This will allow us to use the gzip filter with Envoy Mobile
- `api listener: add shutdown method and call during server termination`: envoyproxy/envoy#9959. Fixes #667. Fixes #674

Includes the following PRs to fix iOS liveliness tests as well:
- `metrics service: force link v2 config`: envoyproxy/envoy#9875
- `config: remove ApiTypeOracle assert`: envoyproxy/envoy#9973

Also contains necessary updates to accommodate [this change](envoyproxy/envoy@dea4eb0).

Signed-off-by: Michael Rebello <me@michaelrebello.com>
junr03 added a commit that referenced this pull request Feb 17, 2020
Description:PendingResolutions get destroyed when complete or when c-ares sent ARES_EDESTRUCTION. Prior to #9899 ARES_EDESTRUCTION only happened when the resolver was destroyed. However, after #9899 the channel, and thus PendingResolutions can be destroyed, without the callback target being aware. This leads to potential use after free issues. This PR calls the resolve callback when the status received is ARES_EDESTRUCTION.
Risk Level: med
Testing: added test that reproduced segfault and passed after fix.

Signed-off-by: Jose Nino <jnino@lyft.com>
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Bumping to include the following fixes:
- `dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED`: #9899. This should resolve the issues we've been seeing with Envoy Mobile hanging on startup and never properly issuing requests if started in the offline state on iOS. Fixes envoyproxy/envoy-mobile#672, though more improvements to DNS resolution will be investigated in envoyproxy/envoy-mobile#673
- `gzip: add force load factory declaration`: #9942. This will allow us to use the gzip filter with Envoy Mobile
- `api listener: add shutdown method and call during server termination`: #9959. Fixes envoyproxy/envoy-mobile#667. Fixes envoyproxy/envoy-mobile#674

Includes the following PRs to fix iOS liveliness tests as well:
- `metrics service: force link v2 config`: #9875
- `config: remove ApiTypeOracle assert`: #9973

Also contains necessary updates to accommodate [this change](dea4eb0).

Signed-off-by: Michael Rebello <me@michaelrebello.com>
Signed-off-by: JP Simard <jp@jpsim.com>
jpsim pushed a commit that referenced this pull request Nov 29, 2022
Bumping to include the following fixes:
- `dns: destroy/reinitialize c-ares channel on ARES_ECONNREFUSED`: #9899. This should resolve the issues we've been seeing with Envoy Mobile hanging on startup and never properly issuing requests if started in the offline state on iOS. Fixes envoyproxy/envoy-mobile#672, though more improvements to DNS resolution will be investigated in envoyproxy/envoy-mobile#673
- `gzip: add force load factory declaration`: #9942. This will allow us to use the gzip filter with Envoy Mobile
- `api listener: add shutdown method and call during server termination`: #9959. Fixes envoyproxy/envoy-mobile#667. Fixes envoyproxy/envoy-mobile#674

Includes the following PRs to fix iOS liveliness tests as well:
- `metrics service: force link v2 config`: #9875
- `config: remove ApiTypeOracle assert`: #9973

Also contains necessary updates to accommodate [this change](dea4eb0).

Signed-off-by: Michael Rebello <me@michaelrebello.com>
Signed-off-by: JP Simard <jp@jpsim.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dns: DnsResolverImpl keeps using a "broken" c-ares channel
2 participants