Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize representation of runfiles in compact execution log #23321

Closed
wants to merge 51 commits into from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Aug 16, 2024

Runfiles trees are now represented with a custom RunfilesTree message in the compact execution log. This allows using InputSets to representing all artifacts staged at canonical locations, with only symlinks and root symlinks stored flattened and with explicit runfiles paths.

Since runfile paths can collide, this change makes it necessary to preserve the order of elements in an InputSet. The previous representation as repeated ID fields for each type (file, symlink, directory) made this impossible, so the representation has been modified to reference all direct entry IDs in a single repeated field. Since this also reduces the potential for type mismatches between the ID field type and the referenced message type, all other typed IDs are replaced with untyped ID fields. By slightly tweaking the way IDs are generated for nested entries and not emitting IDs for entries that are never referenced (e.g. Spawns), IDs are now consecutive, which simplifies the (possibly concurrent) bookkeeping for consumers by allowing them to use an array to store the entries.

Progress on #18643.

RELNOTES: The compact execution log now stores runfiles in a more compact representation that should reduce the memory overhead and log output size, in particular for test spawns. This change required breaking changes to the (experimental) log format.

@fmeum fmeum changed the title Runfiles spawn log Optimize representation of runfiles in compact execution log Aug 19, 2024
@fmeum fmeum marked this pull request as ready for review August 21, 2024 10:50
@fmeum fmeum requested review from a team as code owners August 21, 2024 10:50
@fmeum fmeum requested review from aranguyen and removed request for a team and aranguyen August 21, 2024 10:50
@github-actions github-actions bot added team-Performance Issues for Performance teams team-Configurability platforms, toolchains, cquery, select(), config transitions team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Aug 21, 2024
@fmeum fmeum removed team-Configurability platforms, toolchains, cquery, select(), config transitions team-Remote-Exec Issues and PRs for the Execution (Remote) team labels Aug 21, 2024
@fmeum fmeum requested a review from tjgq August 21, 2024 10:59
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 22, 2024

I resolved the conflict and subsequently fixed the existing tests after the merge of the SymlinkAction PR.

src/main/protobuf/spawn.proto Outdated Show resolved Hide resolved
src/main/protobuf/spawn.proto Outdated Show resolved Hide resolved
src/main/protobuf/spawn.proto Outdated Show resolved Hide resolved
src/main/protobuf/spawn.proto Outdated Show resolved Hide resolved
src/main/protobuf/spawn.proto Show resolved Hide resolved
@fmeum fmeum force-pushed the runfiles-spawn-log branch 2 times, most recently from 94f75fc to dfe1deb Compare August 30, 2024 13:12
@fmeum
Copy link
Collaborator Author

fmeum commented Aug 30, 2024

While adding more tests, I noticed that there is a pathological case that the new representation currently doesn't handle: If multiple artifacts at canonical locations collide, then the last in nested set order wins. Runfiles always use postorder, which I can almost emulate, but the distinction by type in InputSet loses some information about the order. For example, a directory might have come before a file, but InputSet doesn't allow me to see that.

@tjgq Since entry IDs are globally unique anyway, what do you think of merging file_ids, directory_ids and unresolved_symlink_ids into a single field that preserves the order?

@tjgq
Copy link
Contributor

tjgq commented Aug 30, 2024

While adding more tests, I noticed that there is a pathological case that the new representation currently doesn't handle: If multiple artifacts at canonical locations collide, then the last in nested set order wins. Runfiles always use postorder, which I can almost emulate, but the distinction by type in InputSet loses some information about the order. For example, a directory might have come before a file, but InputSet doesn't allow me to see that.

@tjgq Since entry IDs are globally unique anyway, what do you think of merging file_ids, directory_ids and unresolved_symlink_ids into a single field that preserves the order?

Oof, it's a disruptive change for any tools already consuming the compact format, but that's exactly why I wanted the format to remain experimental until Bazel 8 is released. Let's do it.

@fmeum fmeum force-pushed the runfiles-spawn-log branch 2 times, most recently from 7cb21e0 to 147ffc2 Compare September 2, 2024 08:45
@fmeum fmeum requested a review from tjgq September 2, 2024 12:55
@fmeum
Copy link
Collaborator Author

fmeum commented Sep 2, 2024

@tjgq I added a bunch of tests and addressed your comments. Since we were already making breaking changes to the format, I slightly tweaked how we represent IDs also in other parts of the protocol. I can revert that part of the change if you prefer.

@fmeum fmeum requested a review from tjgq September 17, 2024 22:49
Copy link
Contributor

@tjgq tjgq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this amazing contribution. I'll get started on the import.

@fmeum
Copy link
Collaborator Author

fmeum commented Sep 19, 2024

@bazel-io fork 7.4.0

@tjgq
Copy link
Contributor

tjgq commented Sep 19, 2024

I'm making the following changes on import:

  • Use new field ID for InputSet.input_ids and Output.output_id instead of reusing an existing ID (makes it easier to write a polyglot tool, as discussed out of band)
  • Replace occurrences of bazel in tests with TestConstants.PRODUCT_NAME (which is blaze internally)

@fmeum fmeum deleted the runfiles-spawn-log branch September 20, 2024 10:26
iancha1992 pushed a commit that referenced this pull request Sep 20, 2024
Runfiles trees are now represented with a custom `RunfilesTree` message in the compact execution log. This allows using `InputSet`s to representing all artifacts staged at canonical locations, with only symlinks and root symlinks stored flattened and with explicit runfiles paths.

Since runfile paths can collide, this change makes it necessary to preserve the order of elements in an `InputSet`. The previous representation as repeated ID fields for each type (file, symlink, directory) made this impossible, so the representation has been modified to reference all direct entry IDs in a single repeated field. Since this also reduces the potential for type mismatches between the ID field type and the referenced message type, all other typed IDs are replaced with untyped ID fields. By slightly tweaking the way IDs are generated for nested entries and not emitting IDs for entries that are never referenced (e.g. `Spawn`s), IDs are now consecutive, which simplifies the (possibly concurrent) bookkeeping for consumers by allowing them to use an array to store the entries.

Progress on #18643.

RELNOTES: The compact execution log now stores runfiles in a more compact representation that should reduce the memory overhead and log output size, in particular for test spawns. This change required breaking changes to the (experimental) log format.

Closes #23321.

PiperOrigin-RevId: 676773599
Change-Id: I010653681ffa44557142bf25009e9178b5d68515
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-user-response Awaiting a response from the author team-Performance Issues for Performance teams
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants