Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HTTP] High level design: Remove exceptions in codecs #10878

Closed
8 of 9 tasks
asraa opened this issue Apr 21, 2020 · 3 comments · Fixed by #14381
Closed
8 of 9 tasks

[HTTP] High level design: Remove exceptions in codecs #10878

asraa opened this issue Apr 21, 2020 · 3 comments · Fixed by #14381
Assignees
Labels
area/http no stalebot Disables stalebot from closing an issue
Milestone

Comments

@asraa
Copy link
Contributor

asraa commented Apr 21, 2020

This is a tracking issue and design overview for removing exceptions:

Title: Remove exceptions in Envoy's HTTP/1 and HTTP/2 codecs

Objective:
Remove exceptions in the presence of untrusted traffic, specifically on codec errors in the data plane with careful attention to codec callbacks and resumption points to ensure the control flow is preserved. This moderate risk change that will include runtime feature flag protection.

Background:
There are performance and security concerns around the use of C++ exceptions in the presence of untrusted traffic. Most simply, exceptions on the data plane are error-prone and offer the potential for denial of service attacks through resource attacks on exception paths and query-of-deaths resulting from uncaught exceptions (#10475). The pattern of throwing exceptions across a language boundary in http_parser and nghttp2’s C callbacks should generally be avoided.

Finally, Envoy deployments at scale are potentially vulnerable to NetSpectre attacks. While it comes with a performance cost, compiling the Envoy binary with Speculative Load Hardening (SLH) is a mitigation technique against Spectre v1 style exploits for TLS deployments. However, protecting exceptions is considerably harder and more complex than normal code, and SLH will not support exceptions. While Envoy’s data plane and control plane are compiled as a single binary, we can narrow our concern to code paths that can be trained with real executions, and then misspeculated with hostile data. This limits the attack surface and scope of the removal to Envoy’s data plane, and specifically, when dispatching incoming connection data to codecs.

Envoy’s HTTP/1 and HTTP/2 codec implementations make use of exceptions in codec callbacks for control flow. These exceptions, defined in exception.h, may catch non-recoverable HTTP protocol errors, client side errors, buffer flooding, and premature responses in the case of HTTP/1. The HTTP/1 codec contains 13 exception callsites, while the HTTP/2 codec implementation contains 7. Although the number of callsites are low, careful attention to the callbacks and resumption points is required to ensure that control flow is preserved.

Design Overview:
See the proof of concept: #10484
Our core design idea is to replace the use of exceptions for control flow with callback error codes and local variables that hold information about the error that occurred during the callback. Error status will have four Envoy-specific types mirroring the subclasses of exceptions. These error statuses will be handled in the HTTP Connection Manager and codec client where a connection dispatches incoming connection data to the codecs.

A local boolean dispatching_ in both codecs’ connections will be used to indicate a state under a dispatch call. Since we will only handle error statuses during a call to dispatch, it is only safe to replace the exceptions with an error status while we are dispatching. We will use debug ASSERTs to validate that the error statuses are set in a dispatching context. If some exceptional circumstance arises and the codec method throws outside of a dispatching context, we will instead RELEASE_ASSERT as this would have been an uncaught exception. In places where the input may be from filters, for example, RequestEncoder::encodeHeaders, proper error handling should be added to guard against intermediate filters causing errors.

To facilitate incremental progress for the change, the try/catch block around dispatch will remain while the HCM and codec client handles dispatch error statuses. The HCM and codec client will set a local status variable that will be set with the return status from dispatch or with the corresponding exception:

Envoy::Http::Status status;
try {
status = codec_->dispatch(data);
// Exception removal is still in migration. Soon we won't need to catch these exceptions, as
// they'll be propagated through the callbacks and returned from dispatch.
} catch (CodecProtocolException& e) {
status = Envoy::Http::codecProtocolError(e.what());
} catch (PrematureResponseException& e) {
status = Envoy::Http::prematureResponseError(e.what(), e.responseCode());
}

Simultaneous handling of statuses and exceptions allows users to opt-out of the change during the deprecation period by using a runtime override that swaps out forked legacy versions of the codecs. To maintain the legacy codecs during the deprecation period, format rules run on presubmit that compare the legacy and new codecs to an expected “golden” diff that is checked in to Envoy. If a developer makes a codec change, they must ensure that they port the change to the legacy codec, or update the “golden” diff to reflect a change.

The runtime feature flag envoy.reloadable_features.new_codec_behavior and legacy versions will remain for a short deprecation period of about 4-6 weeks while canary testing and fuzzing provide confidence in the change. Fuzz testing will detect any violations to the assumption that codec methods are only called in a dispatching context by crashing on debug ASSERTs placed around throw sites. A simple self-differential fuzz target will also be checked-in to compare legacy and new codec behavior when dispatching incoming connection data. The error status type and message of the new codec’s dispatch should always match the exception type and message returned by the legacy codec’s. We will monitor any crashes caused by RELEASE_ASSERTs where exceptions were thrown outside dispatching while canary testing the change.

Tasks

@yanavlasov @envoyproxy/security-team

@mattklein123 mattklein123 added area/http no stalebot Disables stalebot from closing an issue labels Apr 22, 2020
@mattklein123 mattklein123 added this to the 1.15.0 milestone Apr 22, 2020
mattklein123 pushed a commit that referenced this issue May 5, 2020
(This is the merge-able portion of #10484. It does not include the behavior changes in H/1. Only the necessary changes are there for H/1 and H/2 exception removal to happen separately.)

Description: This introduces a new return value, `Envoy::Http::Status` for `Connection::dispatch`. Currently, dispatch will always return an OK status. This facilitates the migration to remove exceptions in codecs and replace them with error statuses. The HCM and codec client can now handle statuses returned from dispatch, although they will continue to support exceptions (by translating them to the corresponding status) until legacy codecs are deprecated. 

Following this, PRs will stage exception removal in H/1 and H/2 codecs, codecs will be forked to have a legacy version, and a runtime override will allow users to continue to use legacy codecs during the migration. 

See issue #10878 for a full detailed plan and overview.

Risk: Medium-high. A no-op, but touches sensitive code.
Testing: All tests pass, this is a no-op
Issue: #10878

Signed-off-by: Asra Ali <asraa@google.com>
jmarantz pushed a commit that referenced this issue May 14, 2020
Commit Message: Remove exceptions from HTTP/1 codec callbacks. Replaces with http_parser exit codes that indicate failure. codec_status_ propagates the error.

Additional Description:

I know the diff is slightly messy but the principles I abided by were: Replace throw with setting codec_status_, immediately return and propagate return up to callback with an error exit code, always ASSERT(dispatching_) in the body of the method that throws, always ASSERT(codec_status_.ok()) before setting the codec status.
The remaining exception is in encodeHeaders, which I will need to replace with ENVOY_BUG
I audited for throws in the includes for this file and did not find anything used in the codec_impl, but I will need to do another pass.
This is just part 1 of my HTTP/1 PRs. Part 2 is exception to error handling for encodeHeaders and any other utility functions. This is just a PR to stage.
Testing: Tests pass, codec_impl_fuzz_test has been running for a few minutes.
Risk level: Medium, this should do nothing but is a codec behavior change.
Issues: #10878

Signed-off-by: Asra Ali asraa@google.com
@mattklein123 mattklein123 modified the milestones: 1.15.0, 1.16.0 Jun 24, 2020
asraa pushed a commit that referenced this issue Sep 25, 2020
Fixed crashing fuzz test in Http Health Check Fuzz
The http utility function getResponseStatus() throws an exception. This problem falls within the high level design issue of #10878. I will follow up after finishing gRPC health check fuzzing to remove expections in http utility functions.
Signed-off-by: Zach <zasweq@google.com>
@mattklein123
Copy link
Member

@asraa @yanavlasov what is the plan for cleaning up the old codecs? Should we do it before 1.16 ships or right after? I guess right after?

@mattklein123 mattklein123 modified the milestones: 1.16.0, 1.17.0 Oct 7, 2020
@asraa
Copy link
Contributor Author

asraa commented Nov 30, 2020

Just updating: the codec default was new right after 1.16 (merged on 10/16). If people are running the code in staging for a few weeks and haven't reported any crashes, maybe we can clean up before the 1.17 release?

@mattklein123
Copy link
Member

@asraa yeah IMO I would just clean it up. We can always revert if an issue comes up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/http no stalebot Disables stalebot from closing an issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants