Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GC] Fixed flaky GC stats end-to-end test #21919

Merged
merged 1 commit into from
Jul 17, 2024

Conversation

agarwal-navin
Copy link
Contributor

@agarwal-navin agarwal-navin commented Jul 16, 2024

Bug and root-cause

A GC test that validates deleted stats is flaky. The test uses sweep timeout of 200 ms. It does the following in sequence:

  1. Runs GC wtih every thing referenced and validates that there unreferenced stats are 0.
  2. Unreferences one data store and one blob, runs GC and validates they show as unreferenced.
  3. Unreferences another data store and another blob, runs GC and validates they show as unreferenced.
  4. Wait for sweep timeout, runs GC so that GC delete op is sent.
  5. Run GC again and validate that the unreferenced data stores and blobs show up as deleted.

If 200 ms have passed before step 3 is run, the unreferenced data store and blob will become sweep ready and will show up as deleted in GC stats. There was a change few months back that changed the logic to add sweep ready objects as deleted even when sweep is enabled (#19524) and the test probably became flaky then.

Fix

Broke the test above into 2:

  1. The first one only validates unreferenced stats (steps 1 to 3 above).
  2. The second one that only validates sweep ready and deleted stats. It unreferences objects, waits for sweep timeout so that nodes are sweep ready and validates that they show as deleted. It then runs GC again so that the GC sweep op has round tripped and these nodes are deleted.
    Since the test goes from unreferenced -> wait for sweep timeout -> validation, there is no intermediate GC run like bfore where it can unexpectedly become sweep ready.

AB#8350

@agarwal-navin agarwal-navin requested review from markfields, a team, pragya91, jatgarg, tyler-cai-microsoft, kian-thompson and rajatch-ff and removed request for a team July 16, 2024 23:01
@github-actions github-actions bot added area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch labels Jul 16, 2024
@agarwal-navin agarwal-navin changed the title [GC'] Fixed flaky GC stats end-to-end test [GC] Fixed flaky GC stats end-to-end test Jul 16, 2024
@msfluid-bot
Copy link
Collaborator

@fluid-example/bundle-size-tests: +245 Bytes
Metric NameBaseline SizeCompare SizeSize Diff
aqueduct.js 457.6 KB 457.63 KB +35 Bytes
azureClient.js 555.31 KB 555.35 KB +49 Bytes
connectionState.js 680 Bytes 680 Bytes No change
containerRuntime.js 258.68 KB 258.7 KB +14 Bytes
fluidFramework.js 391.58 KB 391.59 KB +14 Bytes
loader.js 134.02 KB 134.03 KB +14 Bytes
map.js 42.17 KB 42.18 KB +7 Bytes
matrix.js 145.68 KB 145.69 KB +7 Bytes
odspClient.js 523.45 KB 523.5 KB +49 Bytes
odspDriver.js 97.55 KB 97.57 KB +21 Bytes
odspPrefetchSnapshot.js 42.61 KB 42.62 KB +14 Bytes
sharedString.js 162.69 KB 162.7 KB +7 Bytes
sharedTree.js 382.09 KB 382.1 KB +7 Bytes
Total Size 3.27 MB 3.27 MB +245 Bytes

Baseline commit: 3f00c6d

Generated by 🚫 dangerJS against 9f46ec7

@agarwal-navin agarwal-navin merged commit a0bed77 into microsoft:main Jul 17, 2024
36 checks passed
RishhiB pushed a commit to RishhiB/FluidFramework-1 that referenced this pull request Jul 18, 2024
## Bug and root-cause

A GC test that validates deleted stats is flaky. The test uses sweep
timeout of 200 ms. It does the following in sequence:
1. Runs GC wtih every thing referenced and validates that there
unreferenced stats are 0.
2. Unreferences one data store and one blob, runs GC and validates they
show as unreferenced.
3. Unreferences another data store and another blob, runs GC and
validates they show as unreferenced.
4. Wait for sweep timeout, runs GC so that GC delete op is sent.
5. Run GC again and validate that the unreferenced data stores and blobs
show up as deleted.

If 200 ms have passed before step 3 is run, the unreferenced data store
and blob will become sweep ready and will show up as deleted in GC
stats. There was a change few months back that changed the logic to add
sweep ready objects as deleted even when sweep is enabled
(microsoft#19524) and the test
probably became flaky then.

## Fix
Broke the test above into 2:
1. The first one only validates unreferenced stats (steps 1 to 3 above).
2. The second one that only validates sweep ready and deleted stats. It
unreferences objects, waits for sweep timeout so that nodes are sweep
ready and validates that they show as deleted. It then runs GC again so
that the GC sweep op has round tripped and these nodes are deleted.
Since the test goes from unreferenced -> wait for sweep timeout ->
validation, there is no intermediate GC run like bfore where it can
unexpectedly become sweep ready.


[AB#8350](https://dev.azure.com/fluidframework/235294da-091d-4c29-84fc-cdfc3d90890b/_workitems/edit/8350)
@agarwal-navin agarwal-navin deleted the gcStatFlakyTest branch August 23, 2024 18:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants