Flaky test package: test/xds #6914

Open
zasweq opened this issue Jan 10, 2024 · 10 comments · Fixed by #7411
Labels
Area: xDS (includes everything xDS related, including LB policies used with xDS), P2, Type: Testing

Comments

@zasweq
Contributor

zasweq commented Jan 10, 2024

Alongside #6913 and #6912, I have run the test/xds suite on master since I added tests to it for my xDS Server fix #6889. I have encountered numerous flakes on g3, particularly those outlined in custom LB tests for distribution #6601. However, I have seen almost every client-side and server-side xDS test flake with a context timeout for an RPC that was expected to proceed. Each has different logs/events preceding its timeout, but every test seems susceptible to timing out. The flakes are generally rare, but because the suite contains so many tests, you can successfully trigger one by running the full test suite enough times. My initial hunch is that some synchronization is needed, or that something gets stuck in the management server/testing xDS Client flow. This also manifests in rare flakes for my xDS Server fix, where I expect something like an error that represents Accept and Close, and I get a context timeout instead.

@arvindbr8
Member

arvindbr8 commented Jan 10, 2024

Another one for TestServerSideXDS_WithValidAndInvalidSecurityConfiguration: https://github.com/grpc/grpc-go/actions/runs/7480796959/job/20361025267?pr=6916

@zasweq
Contributor Author

zasweq commented Jan 22, 2024

@zasweq
Contributor Author

zasweq commented Apr 3, 2024

@arjan-bal
Contributor

@zasweq I investigated this, and the problem seems to be that the xDS management server gets stuck while writing to this buffered channel.

In the logs of failing runs for TestServerSideXDS_WithValidAndInvalidSecurityConfiguration, I noticed that the resource snapshot update request is sent to the xDS management server before the xDS client is able to connect to the xDS server. This somehow results in more than one Listener request being sent to the xDS server, and these requests get stuck waiting to write to the buffered channel.

This seems to be a problem with the test and not the implementation. Adding a 50 ms sleep after starting both servers did get rid of the flakiness in TestServerSideXDS_WithValidAndInvalidSecurityConfiguration.

@zasweq
Contributor Author

zasweq commented Jul 15, 2024

Ah nice thank you for figuring this out!

@zasweq
Contributor Author

zasweq commented Jul 23, 2024

You mentioned this solved the test, but not the flakes in the full package. This was my flaky test in this PR so thanks for fixing this: https://github.com/grpc/grpc-go/actions/runs/10050840269/job/27779434995?pr=7434 :).

@zasweq reopened this Jul 23, 2024
@eshitachandwani added the Area: xDS label Sep 4, 2024