[sui-proxy] Allow sui bridge validator metrics #19076
Conversation
/// friendly ip address we may see in metrics
pub p2p_address: String,
why? don't we want this for sui-node?
afaict, it's not used anywhere in the proxy. The downside of keeping it is it forces us to set meaningless defaults for the static peers feature (as well as for the bridge validators). For example, this is a snippet of what our prod config looks like right now:
static-peers:
pub-keys:
- name: ewr-mysten-ssfn1
p2p-address: 0.0.0.1
peer-id: c7bf6cb93ca8fdda655c47ebb85ace28e6931464564332bf63e27e90199c50ee
- name: ewr-mysten-ssfn2
p2p-address: 0.0.0.2
peer-id: 3227f8a05f0faa1a197c075d31135a366a1c6f3d4872cb8af66c14dea3e0eb66
- name: lhr-mysten-ssfn
p2p-address: 0.0.0.3
peer-id: c619a5e0f8f36eac45118c1f8bda28f0f508e2839042781f1d4a9818043f732c
cc @suiwombat for thoughts too
If it's not used for anything then I agree we should remove it
}

/// is_private makes a decent guess at determining if an addr is publicly routable.
pub fn is_private(addr: IpAddr) -> bool {
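For context, a minimal sketch (not the actual sui-proxy implementation) of what such a check might look like using only stable std helpers; the unstable `IpAddr::is_global` would be more thorough:

```rust
use std::net::IpAddr;

// Best-effort guess at whether an address is NOT publicly routable.
pub fn is_private(addr: IpAddr) -> bool {
    match addr {
        IpAddr::V4(v4) => {
            // RFC 1918 ranges, loopback, link-local, and 0.0.0.0.
            v4.is_private() || v4.is_loopback() || v4.is_link_local() || v4.is_unspecified()
        }
        // IPv6 unique-local/link-local helpers are unstable in std, so
        // this only catches the obvious non-routable cases.
        IpAddr::V6(v6) => v6.is_loopback() || v6.is_unspecified(),
    }
}
```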
we don't need this check anymore?
It goes with removing the p2p_address (#19076 (comment))
I'd prefer to see this approach instead. It's much safer, isolates these systems from each other, and if one fails, it won't take the other out.
crates/sui-proxy/src/peers.rs
Outdated
let bridge_validators_summary =
    match Self::get_bridge_validators(rpc_url.to_owned()).await {
        Ok(summary) => summary,
        Err(error) => {
            JSON_RPC_STATE
                .with_label_values(&["update_bridge_peer_count", "failed"])
                .inc();
            error!("unable to refresh sui bridge peer list: {error}");
            continue;
        }
    };
// iff both the `sui_validators` and `bridge_validators` lists are up to date, then clear and update the allow list
{
    let sui_validators = extract(sui_validators_summary);
    let bridge_validators = extract_bridge(bridge_validators_summary).await;

    let mut allow = nodes.write().unwrap();
    allow.clear();
    allow.extend(sui_validators);
    allow.extend(bridge_validators);

    info!("{} peers managed to make it on the allow list", allow.len());
    JSON_RPC_STATE
        .with_label_values(&["update_peer_count", "success"])
        .inc_by(allow.len() as f64);
}
I'd really prefer NOT to have these here; they should be in their own route handlers. It's dangerous, in my opinion, to mix these together. Why? Because if either get_bridge_validators or get_validators fails, we don't update any of the peers with which we'll communicate. We double our failure modes in a hot-path, critical function. Super bad.
It's only possible for the set of bridge or sui validators to change once per day (at epoch change), while we poll for updates to the system state every 60s. I am comfortable with the risk that, if our public rpc goes down for longer than a day, we won't display metrics for newly joined validators.
If the service restarts, you're dead in the water in that case. This is a dangerous implementation; I'd implore you not to do it this way.
How would you suggest we recover in the case where the service restarts and we're unable to communicate with the backing fullnode?
So I don't think we necessarily need to separate these into their own "route handlers"; this isn't even a route handler, it's the background task responsible for updating the allowed peers. But I do think this could be improved to be a bit more resilient to failures, as well as a bit clearer about the source of each peer.
If the "AllowedPeers" set carried a bit more information about the source of each peer (static, SuiValidator, SuiBridgeNode, WalrusNode, etc.), then we could handle the updates to each set independently, so that a failure to update one set doesn't impact an update to another. This tagging could be done either by tagging each key or by having a separate set entirely for each type of peer (it doesn't matter which; the implementation is just going to be slightly different). Then the update logic could essentially be:
loop {
    update_sui_validator_set();
    update_sui_bridge_set();
    update_walrus_validator_set();
    // ... others
}
where on each tick we unconditionally try to update each type and a failure in one doesn't impact updates in the others.
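A sketch of how that per-source update might look; `PeerSource` and `AllowedPeers` are hypothetical names, and peers are reduced to string keys for brevity:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical tags for where a peer came from.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum PeerSource {
    Static,
    SuiValidator,
    SuiBridgeNode,
}

// One set per source: a failed refresh of one source leaves the
// other sources' entries untouched.
#[derive(Default)]
struct AllowedPeers {
    by_source: HashMap<PeerSource, HashSet<String>>,
}

impl AllowedPeers {
    // Replace only this source's set on success; on failure, keep
    // the stale set rather than clearing it.
    fn update(&mut self, source: PeerSource, peers: Result<HashSet<String>, String>) {
        match peers {
            Ok(set) => {
                self.by_source.insert(source, set);
            }
            Err(e) => {
                eprintln!("failed to refresh {source:?} peers: {e}");
            }
        }
    }

    // Allow-list check scans every source's set.
    fn contains(&self, key: &str) -> bool {
        self.by_source.values().any(|s| s.contains(key))
    }
}
```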
The original design with the allowers was to enable the expansion of other routes to accommodate use cases like what John is doing here. I have to disagree with @bmwill on this: there isn't a legitimate reason to mix everything into one update loop, and it is dangerous/flaky to do so.
The sui validators have a route they hit, with middleware they use and update loops that they depend on. The bridge should likewise have its own route, use the same middleware when appropriate, and its own update loops. They definitely should not be integrated into the existing ones. I'm beating a dead horse here, but it's a poor design; it leads to confusing code and needlessly mixes the concerns of two different data flows.
The loop outlined here is even worse than what is already in this PR, because it goes from one isolated update loop to adding three (this PR adds 1, making 2). This is exactly what shouldn't happen. Please just make a route for these and keep this logic isolated to its respective parts.
the original design with the allowers was to enable the expansion of other routes to accommodate use cases like what john is doing here.

If you're referring to the TLS verifier we have that allows connections based on the Allower, then yes, John is doing exactly as expected to extend the sets of peers that we allow to connect. There can only be one allower per network endpoint, so unless we have a full, complete deployment of this for each "type" (which I don't think is the case), this is exactly how we should extend it. As far as which http "route" we expect metrics to be pushed to, I think that's a completely orthogonal discussion.
crates/sui-proxy/src/peers.rs
Outdated
let mut allow = nodes.write().unwrap();
allow.clear();
allow.extend(sui_validators);
allow.extend(bridge_validators);
For updating the sui validator set I do think it makes sense to maybe nuke the whole thing and replace it wholesale. But for the bridge node, since we have to do this awkward fetching of their pubkeys as an additional step (which I still don't like; I think we should force this info to be encoded onchain, @longbowlu, can we do this instead?), I think we probably want a map from bridge node -> pubkey, where we only remove an entry if the bridge node is no longer in the committee, and update an entry if we're able to successfully reach that bridge node to fetch its key. If the node is still part of the committee but we happen to fail fetching its pubkey on one tick, we likely don't want to remove it from the set, as that could result in losing some small window of data from that node.
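A hedged sketch of that update policy (names and types are illustrative; the real code would fetch the metrics pubkey over HTTP from each bridge node):

```rust
use std::collections::HashMap;

// Drop an entry only when the node has left the committee; keep the
// stale pubkey when a fetch fails on this tick.
fn update_bridge_keys(
    known: &mut HashMap<String, String>,    // bridge node name -> metrics pubkey
    committee: &[String],                   // current committee members
    fetch: impl Fn(&str) -> Option<String>, // best-effort pubkey fetch
) {
    // Remove nodes that are no longer in the committee.
    known.retain(|name, _| committee.iter().any(|c| c == name));
    for name in committee {
        if let Some(key) = fetch(name) {
            known.insert(name.clone(), key);
        }
        // Fetch failed: keep whatever key we already had (if any), so we
        // don't lose a window of data from a still-active node.
    }
}
```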
Thanks for making these changes! lgtm
Description
Introduces the ability for sui proxy to accept metrics that are pushed to it from sui-bridge. Works by looking up the http_rest_url for each proxy in the on-chain metadata, and querying each endpoint for the metrics pub key.

Depends on #18877.

Test plan
Tested in testnet