[sui-proxy] Allow sui bridge validator metrics #19076
Conversation
/// friendly ip address we may see in metrics
pub p2p_address: String,
why? don't we want this for sui-node?
afaict, it's not used anywhere in the proxy. The downside of keeping it is it forces us to set meaningless defaults for the static peers feature (as well as for the bridge validators). For example, this is a snippet of what our prod config looks like right now:
static-peers:
pub-keys:
- name: ewr-mysten-ssfn1
p2p-address: 0.0.0.1
peer-id: c7bf6cb93ca8fdda655c47ebb85ace28e6931464564332bf63e27e90199c50ee
- name: ewr-mysten-ssfn2
p2p-address: 0.0.0.2
peer-id: 3227f8a05f0faa1a197c075d31135a366a1c6f3d4872cb8af66c14dea3e0eb66
- name: lhr-mysten-ssfn
p2p-address: 0.0.0.3
peer-id: c619a5e0f8f36eac45118c1f8bda28f0f508e2839042781f1d4a9818043f732c
cc @suiwombat for thoughts too
If it's not used for anything then I agree we should remove it
}

/// is_private makes a decent guess at determining if an addr is publicly routable.
pub fn is_private(addr: IpAddr) -> bool {
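For context, a minimal sketch (not the actual sui-proxy implementation) of what such a check might look like using only stable std helpers; the unstable `IpAddr::is_global` would be more thorough:

```rust
use std::net::IpAddr;

// Best-effort guess at whether an address is NOT publicly routable.
pub fn is_private(addr: IpAddr) -> bool {
    match addr {
        IpAddr::V4(v4) => {
            // RFC 1918 ranges, loopback, link-local, and 0.0.0.0.
            v4.is_private() || v4.is_loopback() || v4.is_link_local() || v4.is_unspecified()
        }
        // IPv6 unique-local/link-local helpers are unstable in std, so
        // this only catches the obvious non-routable cases.
        IpAddr::V6(v6) => v6.is_loopback() || v6.is_unspecified(),
    }
}
```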
we don't need this check anymore?
It goes with removing the p2p_address (#19076 (comment))
I'd prefer to see this approach instead. It's much safer, isolates these systems from each other, and if one fails, it won't take the other out.
crates/sui-proxy/src/peers.rs
Outdated
let bridge_validators_summary =
    match Self::get_bridge_validators(rpc_url.to_owned()).await {
        Ok(summary) => summary,
        Err(error) => {
            JSON_RPC_STATE
                .with_label_values(&["update_bridge_peer_count", "failed"])
                .inc();
            error!("unable to refresh sui bridge peer list: {error}");
            continue;
        }
    };
// iff both the `sui_validators` and `bridge_validators` lists are up to date, then clear and update the allow list
{
    let sui_validators = extract(sui_validators_summary);
    let bridge_validators = extract_bridge(bridge_validators_summary).await;

    let mut allow = nodes.write().unwrap();
    allow.clear();
    allow.extend(sui_validators);
    allow.extend(bridge_validators);

    info!("{} peers managed to make it on the allow list", allow.len());
    JSON_RPC_STATE
        .with_label_values(&["update_peer_count", "success"])
        .inc_by(allow.len() as f64);
}
I'd really prefer NOT to have these here; they should be in their own route handlers. It's dangerous, in my opinion, to mix these together. Why? Because if either get_bridge_validators or get_validators fails, we don't update any of the peers with which we'll communicate. We double our failure modes in a hot-path, critical function. Super bad.
It's only possible for the set of bridge or sui validators to change once per day (at epoch change), while we poll for updates to the system state every 60s. I am comfortable with the risk that, if our public rpc goes down for longer than a day, we won't display metrics for newly joined validators.
If the service restarts, you're dead in the water in that case. This is a dangerous implementation; I'd implore you not to do it this way.
How would you suggest we recover in the case where the service restarts and we're unable to communicate with the backing fullnode?
So I don't think we necessarily need to separate these into their own "route handlers"; this isn't even a route handler, it's the background task responsible for updating the allowed peers. But I do think this could be improved to be a bit more resilient to failures, as well as a bit clearer about the source of each peer.
If the "AllowedPeers" set carried a bit more information about the source of each peer (static, SuiValidator, SuiBridgeNode, WalrusNode, etc.), then we could handle the updates to each set independently, so that a failure to update one set doesn't impact an update to another. This tagging could be done either by tagging each key or by having a separate set entirely for each type of peer (it doesn't matter which; the implementation is just going to be slightly different). Then the update logic could essentially be:
loop {
    update_sui_validator_set();
    update_sui_bridge_set();
    update_walrus_validator_set();
    // ... others
}
where on each tick we unconditionally try to update each type and a failure in one doesn't impact updates in the others.
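A sketch of how that per-source update might look; `PeerSource` and `AllowedPeers` are hypothetical names, and peers are reduced to string keys for brevity:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical tags for where a peer came from.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum PeerSource {
    Static,
    SuiValidator,
    SuiBridgeNode,
}

// One set per source: a failed refresh of one source leaves the
// other sources' entries untouched.
#[derive(Default)]
struct AllowedPeers {
    by_source: HashMap<PeerSource, HashSet<String>>,
}

impl AllowedPeers {
    // Replace only this source's set on success; on failure, keep
    // the stale set rather than clearing it.
    fn update(&mut self, source: PeerSource, peers: Result<HashSet<String>, String>) {
        match peers {
            Ok(set) => {
                self.by_source.insert(source, set);
            }
            Err(e) => {
                eprintln!("failed to refresh {source:?} peers: {e}");
            }
        }
    }

    // Allow-list check scans every source's set.
    fn contains(&self, key: &str) -> bool {
        self.by_source.values().any(|s| s.contains(key))
    }
}
```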
The original design with the allowers was to enable the expansion of other routes to accommodate use cases like what John is doing here. I have to disagree with @bmwill on this: there isn't a legitimate reason to mix everything into one update loop, and it is dangerous/flaky to do so.
The sui validators have a route they hit, with middleware they use and update loops that they depend on. The bridge should likewise have its own route, use the same middleware when appropriate, and its own update loops. They definitely should not be integrated into the existing ones. I'm beating a dead horse here, but it's a poor design; it leads to confusing code and needlessly mixes the concerns of two different data flows.
The loop outlined here is even worse than what is already in this PR, because it goes from one isolated update loop to adding three (this PR adds 1, making 2). This is exactly what shouldn't happen. Please just make a route for these and keep this logic isolated to its respective parts.
the original design with the allowers was to enable the expansion of other routes to accommodate use cases like what john is doing here.

If you're referring to the TLS verifier we have that allows connections based on the Allower, then yes, John is doing exactly as expected to extend the sets of peers that we allow to connect. There can only be one allower per network endpoint, so unless we have a full, complete deployment of this for each "type" (which I don't think is the case), this is exactly how we should extend it. As far as which http "route" we expect metrics to be pushed to, I think that's a completely orthogonal discussion.
crates/sui-proxy/src/peers.rs
Outdated
let mut allow = nodes.write().unwrap();
allow.clear();
allow.extend(sui_validators);
allow.extend(bridge_validators);
For updating the sui validator set I do think it makes sense to maybe nuke the whole thing and replace it wholesale. But for the bridge node, since we have to do this awkward fetching of their pubkeys as an additional step (which I still don't like; I think we should force this info to be encoded onchain, @longbowlu, can we do this instead?), I think we probably want a map from bridge node -> pubkey, where we only remove an entry if the bridge node is no longer in the committee, and update an entry if we're able to successfully reach that bridge node to fetch its key. If the node is still part of the committee but we happen to fail fetching its pubkey on one tick, we likely don't want to remove it from the set, as that could result in losing some small window of data from that node.
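A hedged sketch of that update policy (names and types are illustrative; the real code would fetch the metrics pubkey over HTTP from each bridge node):

```rust
use std::collections::HashMap;

// Drop an entry only when the node has left the committee; keep the
// stale pubkey when a fetch fails on this tick.
fn update_bridge_keys(
    known: &mut HashMap<String, String>,    // bridge node name -> metrics pubkey
    committee: &[String],                   // current committee members
    fetch: impl Fn(&str) -> Option<String>, // best-effort pubkey fetch
) {
    // Remove nodes that are no longer in the committee.
    known.retain(|name, _| committee.iter().any(|c| c == name));
    for name in committee {
        if let Some(key) = fetch(name) {
            known.insert(name.clone(), key);
        }
        // Fetch failed: keep whatever key we already had (if any), so we
        // don't lose a window of data from a still-active node.
    }
}
```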
Thanks for making these changes! lgtm
Description
Introduces the ability for sui proxy to accept metrics that are pushed to it from sui-bridge. Works by looking up the http_rest_url for each proxy in the on-chain metadata, and querying each endpoint for the metrics pub key.

Depends on #18877.

Test plan
Tested in testnet