
[sui-proxy] Allow sui bridge validator metrics #19076

Merged
johnjmartin merged 7 commits into MystenLabs:main on Sep 13, 2024

Conversation

johnjmartin
Contributor

Description

Introduces the ability for sui-proxy to accept metrics that are pushed to it from sui-bridge. It works by looking up the `http_rest_url` for each bridge validator in the on-chain metadata, and querying each endpoint for the metrics pub key.
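As a rough illustration of the lookup described above, here is a self-contained sketch. The `/metrics_pub_key` path, the helper names, and the hex encoding are assumptions for illustration only, not the actual sui-bridge API:

```rust
/// Hypothetical helper: given a bridge validator's `http_rest_url` from the
/// on-chain metadata, build the endpoint the proxy would query for that
/// node's metrics public key. The path below is an assumed example.
pub fn metrics_key_url(http_rest_url: &str) -> String {
    format!("{}/metrics_pub_key", http_rest_url.trim_end_matches('/'))
}

/// Decode a hex-encoded key string (as such an endpoint might plausibly
/// return) into raw bytes, rejecting odd-length or non-hex input.
pub fn decode_hex_key(s: &str) -> Option<Vec<u8>> {
    if s.len() % 2 != 0 {
        return None;
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}
```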

Depends on #18877.

Test plan

Tested in testnet


Comment on lines -82 to -83
/// friendly ip address we may see in metrics
pub p2p_address: String,
Contributor

why? don't we want this for sui-node?

Contributor Author

AFAICT, it's not used anywhere in the proxy. The downside of keeping it is that it forces us to set meaningless defaults for the static-peers feature (as well as for the bridge validators). For example, this is a snippet of what our prod config looks like right now:

static-peers:
  pub-keys:
    - name: ewr-mysten-ssfn1
      p2p-address: 0.0.0.1
      peer-id: c7bf6cb93ca8fdda655c47ebb85ace28e6931464564332bf63e27e90199c50ee
    - name: ewr-mysten-ssfn2
      p2p-address: 0.0.0.2
      peer-id: 3227f8a05f0faa1a197c075d31135a366a1c6f3d4872cb8af66c14dea3e0eb66
    - name: lhr-mysten-ssfn
      p2p-address: 0.0.0.3
      peer-id: c619a5e0f8f36eac45118c1f8bda28f0f508e2839042781f1d4a9818043f732c

Contributor Author

cc @suiwombat for thoughts too

Contributor

If it's not used for anything then I agree we should remove it

}

/// is_private makes a decent guess at determining whether an addr is publicly routable.
pub fn is_private(addr: IpAddr) -> bool {
Contributor

we don't need this check anymore?

Contributor Author

It goes with removing the p2p_address (#19076 (comment))
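For context, here is a self-contained sketch of what a check like the removed `is_private` helper typically looks like, using only the standard library. This is an illustrative reconstruction, not the actual sui-proxy code:

```rust
use std::net::IpAddr;

/// Rough guess at whether `addr` is NOT publicly routable: loopback,
/// unspecified, RFC 1918 private ranges, link-local, or IPv6 unique-local.
pub fn is_private(addr: IpAddr) -> bool {
    match addr {
        IpAddr::V4(v4) => {
            v4.is_loopback() || v4.is_private() || v4.is_link_local() || v4.is_unspecified()
        }
        IpAddr::V6(v6) => {
            let seg = v6.segments();
            v6.is_loopback()
                || v6.is_unspecified()
                || (seg[0] & 0xfe00) == 0xfc00 // fc00::/7 unique-local
                || (seg[0] & 0xffc0) == 0xfe80 // fe80::/10 link-local
        }
    }
}
```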

Contributor

@suiwombat suiwombat left a comment

I'd prefer to see this approach instead. It's much safer, isolates these systems from each other, and if one fails, it won't take the other out.

#19173

Comment on lines 264 to 289
let bridge_validators_summary =
    match Self::get_bridge_validators(rpc_url.to_owned()).await {
        Ok(summary) => summary,
        Err(error) => {
            JSON_RPC_STATE
                .with_label_values(&["update_bridge_peer_count", "failed"])
                .inc();
            error!("unable to refresh sui bridge peer list: {error}");
            continue;
        }
    };
// iff both the `sui_validators` and `bridge_validators` lists are up to date, then clear and update the allow list
{
    let sui_validators = extract(sui_validators_summary);
    let bridge_validators = extract_bridge(bridge_validators_summary).await;

    let mut allow = nodes.write().unwrap();
    allow.clear();
    allow.extend(sui_validators);
    allow.extend(bridge_validators);

    info!("{} peers managed to make it on the allow list", allow.len());
    JSON_RPC_STATE
        .with_label_values(&["update_peer_count", "success"])
        .inc_by(allow.len() as f64);
}
Contributor

I'd really prefer NOT to have these here; they should instead be in their own route handlers. It's dangerous, in my opinion, to mix these together. Why? Because if either get_bridge_validators or get_validators fails, we don't update any of the peers with which we'll communicate. We double our failure modes in a hot-path, critical function. Super bad.

Contributor Author

It's only possible for the set of bridge or sui validators to change once per day (epoch change), while we poll for updates to the system state every 60s. I am comfortable with the risk that if our public rpc goes down for longer than a day, then we won't display metrics for newly joined validators

Contributor

@suiwombat suiwombat Sep 6, 2024

If the service restarts, you're dead in the water in that case. This is a dangerous implementation; I'd implore you not to do it this way.

Contributor Author

How would you suggest we recover in the case where the service restarts and we're unable to communicate with the backing fullnode?

Contributor

So I don't think we necessarily need to separate these into their own "route handlers" (this isn't even a route handler; it's the background task responsible for updating the allowed peers), but I do think this could be improved to be a bit more resilient to failures, as well as a bit clearer about the source of each of the peers.

If the "AllowedPeers" set carried a bit more information about the source of each peer (static, SuiValidator, SuiBridgeNode, WalrusNode, etc.), then we could handle the updates to each set independently of each other, so that a failure to update one set doesn't impact an update to another set. This tagging could be done either by tagging each key, or by having a separate set entirely for each type of peer (it doesn't matter which; the implementation is just going to be slightly different). Then we could have the update logic essentially be:

loop {
    update_sui_validator_set();
    update_sui_bridge_set();
    update_walrus_validator_set();
    // ... others
}

where on each tick we unconditionally try to update each type and a failure in one doesn't impact updates in the others.
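One way to sketch the tagged-set idea from this comment is below. The type and enum names are hypothetical, not the actual sui-proxy types; the point is that refreshing one source never clobbers entries owned by another:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical tag recording where each allowed peer key came from.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum PeerSource {
    Static,
    SuiValidator,
    SuiBridgeNode,
}

/// Shared allow list: peer key -> source tag.
pub type AllowedPeers = Arc<RwLock<HashMap<String, PeerSource>>>;

/// Replace only the entries belonging to `source`, leaving entries from
/// other sources untouched, so a failed refresh of one set (e.g. bridge
/// nodes) cannot wipe out another (e.g. sui validators).
pub fn replace_source(allow: &AllowedPeers, source: PeerSource, fresh: Vec<String>) {
    let mut allow = allow.write().unwrap();
    allow.retain(|_, s| *s != source);
    for key in fresh {
        allow.insert(key, source);
    }
}
```

Each `update_*_set()` call in the loop above would then fetch its own peer list and call `replace_source` for its tag only.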

Contributor

The original design with the allowers was to enable the expansion of other routes to accommodate use cases like what John is doing here. I have to disagree with @bmwill on this. There isn't a legitimate reason to mix everything into one update loop, and it is dangerous/flakey to do so.

The sui validators have a route they hit, with middleware they use and update loops that they depend on. The bridge should likewise have its own route, use the same middleware where appropriate, and its own update loops. They definitely should not be integrated into the existing one. I'm beating a dead horse here, but it's a poor design to do so; it leads to confusing code and needlessly mixes the concerns of two different data flows.

The loop outlined here is even worse than what is already in this PR, because it goes from one isolated update loop to three (this PR adds 1, making 2). This is exactly what shouldn't happen. Please just make a route for these and keep this logic isolated to their respective parts.

Contributor

the original design with the allowers was to enable the expansion of other routes to accommodate use cases like what john is doing here.

If you're referring to the TLS verifier we have that allows connections based on the Allower, then yes, John is doing exactly as expected to extend the set of peers that we allow to connect. There can only be one allower per network endpoint, so unless we have a full, complete deployment of this for each "type" (which I don't think is the case), this is exactly how we should extend it. As far as which http "route" we expect metrics to be pushed to, I think that's a completely orthogonal discussion.

let mut allow = nodes.write().unwrap();
allow.clear();
allow.extend(sui_validators);
allow.extend(bridge_validators);
Contributor
For updating the sui validator set I do think it makes sense to maybe nuke the whole thing and replace it wholesale. But for the bridge nodes, since we have to do this awkward fetching of their pubkeys as an additional step (which I still don't like; I think we should force this info to be encoded on-chain — @longbowlu, can we do this instead?), I think we probably want a map from bridge node -> pubkey, where we only remove an entry if the bridge node is no longer in the committee, and update it when we're able to successfully reach that bridge node to fetch its key. If a node is still part of the committee but we happen to fail fetching its pubkey on one tick, we likely don't want to remove it from the set, as that could result in losing a small window of data from that node.
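The retention rule suggested here can be sketched as follows. This is a minimal illustration with hypothetical names (string keys standing in for node identities and pubkeys), not the sui-proxy implementation:

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical cache of bridge-node metrics pubkeys, keyed by node identity.
/// On each tick: drop nodes that left the committee, overwrite keys we could
/// re-fetch, and keep the stale key when a fetch fails for a current member,
/// so one transient fetch failure doesn't drop that node's metrics.
pub fn refresh_bridge_keys(
    cache: &mut HashMap<String, String>,
    committee: &HashSet<String>,
    fetched: &HashMap<String, Option<String>>, // None = fetch failed this tick
) {
    // Remove only nodes that are no longer in the committee.
    cache.retain(|node, _| committee.contains(node));
    for node in committee {
        if let Some(Some(key)) = fetched.get(node) {
            cache.insert(node.clone(), key.clone());
        }
        // Fetch failed or absent: keep the existing entry, if any.
    }
}
```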

Contributor

@bmwill bmwill left a comment

Thanks for making these changes! lgtm

@johnjmartin johnjmartin merged commit 3fd7845 into MystenLabs:main Sep 13, 2024
44 checks passed
@johnjmartin johnjmartin deleted the allow-bridge-metrics branch September 13, 2024 22:10
suiwombat pushed a commit that referenced this pull request Sep 16, 2024