Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: fix ~InstanceImpl() segfault #4244

Merged
merged 1 commit into from
Aug 30, 2018

Conversation

ambuc
Copy link
Contributor

@ambuc ambuc commented Aug 24, 2018

Description: This fixes a rare but repeatedly observed segfault, described in detail in the comments of InstanceImpl::~InstanceImpl().

TL;DR; the InstanceImpl's listener manager instance can potentially contain a target with a callback which, at destruction time (see RdsRouteConfigSubscription::~RdsRouteConfigSubscription) goes to unregister it from the top-level InitManager, which has already been freed.

In practice, I see this segfault a few times a week, and only ever at teardown (inside ~MainCommonBase), which could be why it hasn't been crashing anyone.

After a medium amount of trying, I couldn't figure out how to reproduce this in a unit or integration test. A unit test for //test/server:server_test would need to stand up a real or mocked ListenerManager which owns some target with a callback to server.initManager(), which seems very complex (unless I'm overlooking something and there's in-repo precedent for this sort of thing).

Risk Level: Medium -- this passes all tests, but what fixes one uncaught bug might cause another.
Testing: Al tests pass, no tests changed.
Docs Changes: N/A
Release Notes: N/A

Signed-off-by: James Buckland <jbuckland@google.com>
Copy link
Contributor

@dnoe dnoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice detective work on this and the comment you've left a fantastic documentation trail which I really appreciate.

I'd love to figure out a way that we can add a regression test for this but it does look complex. Any thoughts from other maintainers on an approach that could work?

@ggreenway
Copy link
Contributor

It seems like an integration test that does a shutdown like this may catch this in the ASAN/TSAN builds.

ggreenway
ggreenway previously approved these changes Aug 24, 2018
@htuch
Copy link
Member

htuch commented Aug 24, 2018

I'd like @lizan to also take a look before we consider this finalized, he has the deep knowledge on RDS shutdown from recent PRs and reviews.

@htuch htuch requested a review from lizan August 24, 2018 18:35
@htuch
Copy link
Member

htuch commented Aug 24, 2018

Otherwise LGTM, thanks for the detective work. I'm surprised we can't create a unit (rather than full system integration) test to mock this up reasonably though.

Copy link
Member

@lizan lizan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch @ambuc!

// Destruct the ListenerManager explicitly, before InstanceImpl's local init_manager_ is
// destructed.
//
// The ListenerManager's DestinationPortsMap contains FilterChainSharedPtrs. There is a rare race
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how this is a RARE race, can you elaborate a bit more? Particularly, when RdsRouteConfigSubscription get registered to top-level InitManager?

  • FilterChains contains HttpConnectionManager is pretty common
  • HttpConnectionManager contains RdsRouteConfigProvider is pretty common too
  • Any dynamic RdsRouteConfigProvider contains a RdsRouteConfigSubscription

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the circumstances are common but the resulting segfault is rare.

I'm not 100% on how memory works here under the hood, but my personal theory is that when the subscription goes to unregister itself from the list of targets_ inside InitManagerImpl, targets_ has just been freed but not not overwritten (?) so this operation often succeeds. @mrice32 might have more color here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes use-after-free is always rare without ASAN, does that mean you can have a config which always trigger use-after-free with ASAN?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting thought, the answer is surely yes. I'll do some digging and report back.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once you have that, we could have some sort of test here. Could be an integration test with shutdown or sth similar.

// RdsRouteConfigSubscription is an Init::Target, ~RdsRouteConfigSubscription triggers a callback
// set at initialization, which goes to unregister it from the top-level InitManager, which has
// already been destructed (use-after-free) causing a segfault.
listener_manager_.reset();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I think this did fix for the case described above, can we move init_manager_ before listener_manager or potentially anything may contain it, in member variable list? InitManagerImpl doesn't seems to have anything depends on others, so it could moved before other stuff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to move the private member variables around, but without a test it's easy to imagine this accidentally getting undone in the future. A manual .reset() seems like a better solution for that fear.

@lizan
Copy link
Member

lizan commented Aug 30, 2018

Was looking around a way for test, but it was hard to reproduce with minimal config, filed #4302 to track tests. Lets merge this to resolve the real issue first.

Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ambuc and @lizan.

@htuch htuch merged commit aa4481e into envoyproxy:master Aug 30, 2018
@ambuc ambuc deleted the list-remove-crash branch September 5, 2018 14:04
rshriram pushed a commit to rshriram/envoy that referenced this pull request Oct 30, 2018
Pulling the following changes from github.com/envoyproxy/envoy:

f936fc6 ssl: serialize accesses to SSL socket factory contexts (envoyproxy#4345)
e34dcd6 Fix crash in tcp_proxy (envoyproxy#4323)
ae6a252 router: fix matching when all domains have wildcards (envoyproxy#4326)
aa06142 test: Stop fake_upstream methods from accidentally succeeding (envoyproxy#4232)
5d73187 rbac: update the authenticated.user to a StringMatcher. (envoyproxy#4250)
c6bfc7d time: Event::TimeSystem abstraction to make it feasible to inject time with simulated timers (envoyproxy#4257)
752483e Fixing the fix (envoyproxy#4333)
83487f6 tls: update BoringSSL to ab36a84b (3497). (envoyproxy#4338)
7bc210e test: fixing interactions between waitFor and ignore_spurious_events (envoyproxy#4309)
69474b3 admin: order stats in clusters json admin (envoyproxy#4306)
2d155f9 ppc64le build (envoyproxy#4183)
07efc6d fix static initialization fiasco problem (envoyproxy#4314)
0b7e3b5 test: Remove declared but undefined class methods (envoyproxy#4297)
1485a13 lua: make sure resetting dynamic metadata wrapper when request info is marked dead
d243cd6 test: set to zero when start_time exceeds limit (envoyproxy#4328)
0a1e92a test: fix heap use-after-free in ~IntegrationTestServer. (envoyproxy#4319)
cddc732 CONTRIBUTING: Document 'kick-ci' trick. (envoyproxy#4335)
f13ef24 docs: remove reference to deprecated value field (envoyproxy#4322)
e947a27 router: minor doc fixes in stream idle timeout (envoyproxy#4329)
0c2e998 tcp-proxy: fixing a TCP proxy bug where we attempted to readDisable a closed connection (envoyproxy#4296)
00ffe44 utility: fix strftime overflow handling. (envoyproxy#4321)
af1183c Re-enable TcpProxySslIntegrationTest and make the tests pass again. (envoyproxy#4318)
3553461 fuzz: fix H2 codec fuzzer post envoyproxy#4262. (envoyproxy#4311)
42f6048 Proto string issue fix (envoyproxy#4320)
9c492a0 Support Envoy to fetch secrets using SDS service. (envoyproxy#4256)
a857219 ratelimit: revert `revert rate limit failure mode config` and add tests (envoyproxy#4303)
1d34172 dns: fix exception unsafe behavior in c-ares callbacks. (envoyproxy#4307)
1212423 alts: add gRPC TSI socket (envoyproxy#4153)
f0363ae fuzz: detect client-side resets in H2 codec fuzzer. (envoyproxy#4300)
01aa3f8 test: hopefully deflaking echo integration test (envoyproxy#4304)
1fc0f4b ratelimit: link legacy proto when message is being used (envoyproxy#4308)
aa4481e fix rare List::remove(&target) segfault (envoyproxy#4244)
89e0f23 headers: fixing fast fail of size-validation (envoyproxy#4269)
97eba59 build: bump googletest version. (envoyproxy#4293)
0057e22 fuzz: avoid false positives in HCM fuzzer. (envoyproxy#4262)
9d094e5 Revert ac0bd74 (envoyproxy#4295)
ddb28a4 Add validation context provider (envoyproxy#4264)
3b47cba added histogram latency information to Hystrix dashboard stream (envoyproxy#3986)
cf87d50 docs: update SNI FAQ. (envoyproxy#4285)
f952033 config: fix update empty stat for eds (envoyproxy#4276)
329e591 router: Add ability of custom headers to rely on per-request data (envoyproxy#4219)
68d20b4  thrift: refactor build files and imports (envoyproxy#4271)
5fa8192 access_log: log requested_server_name in tcp proxy (envoyproxy#4144)
fa45bb4 fuzz: libc++ clocks don't like nanos. (envoyproxy#4282)
53f8944 stats: add symbol table for future stat name encoding (envoyproxy#3927)
c987b42 test infra: Remove timeSource() from the ClusterManager api (envoyproxy#4247)
cd171d9 websocket: tunneling websockets (and upgrades in general) over H2 (envoyproxy#4188)
b9dc5d9 router: disallow :path/host rewriting in request_headers_to_add. (envoyproxy#4220)
0c91011 network: skip socket options and source address for UDS client connections (envoyproxy#4252)
da1857d build: fixing a downstream compile error by noting explicit fallthrough (envoyproxy#4265)
9857cfe fuzz: cleanup per-test environment after each fuzz case. (envoyproxy#4253)
52beb06 test: Wrap proto string in std::string before comparison (envoyproxy#4238)
f5e219e extensions/thrift_proxy: Add header matching to thrift router (envoyproxy#4239)
c9ce5d2 fuzz: track read_disable_count bidirectionally in codec_impl_fuzz_test. (envoyproxy#4260)
35103b3 fuzz: use nanoseconds for SystemTime in RequestInfo. (envoyproxy#4255)
ba6ba98 fuzz: make runtime root hermetic in server_fuzz_test. (envoyproxy#4258)
b0a9014 time: Add 'format' test to ensure no one directly instantiates Prod*Time from source. (envoyproxy#4248)
8567460 access_log: support beginning of epoch in START_TIME. (envoyproxy#4254)
28d5f41 proto: unify envoy_proto_library/api_proto_library. (envoyproxy#4233)
f7d3cb6 http: fix allocation bug introduced in envoyproxy#4211. (envoyproxy#4245)

Fixes istio/istio#8310 (once pulled into istio/istio).

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
htuch pushed a commit that referenced this pull request Nov 2, 2018
This fixes a crash during early shutdown that occurs in the InstanceImpl destructor when it frees the ListenerManagerImpl object. PR #4244 tried to fix this but we're still seeing this periodically. In the crash scenario, the RdsRouteConfigSubscription destructor calls the initialize_callback_ function which ends up calling the InitManager initialize lambda created in the RunHelper constructor. That lambda has a call to workers_start_cb which is protected by a check on the RunHelpers shutdown_ field. However at the point this lambda is executed the RunHelper object has already been destructed (in InstanceImpl::run) so the shutdown_ field is dangling and can contain arbitrary content. In this case we ended up falling through and calling the workers_start_cb function which called InstanceImpl::startWorkers which caused the crash because InstanceImpl::listener_manager_ was null. This fix avoids the crash by making the lambda passed to the InitManager::initialize method not bind to the RunHelper object since it may not exist at the time the lambda is executed. Instead keep the shutdown state in the InstanceImpl object itself.

Risk Level: low
Testing: unit testing

Signed-off-by: Elisha Ziskind <eziskind@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants