
It is impossible to set a custom polling frequency in DNS resolver #1663

Closed
kop opened this issue Nov 10, 2017 · 29 comments
Labels
P2, Status: Working As Intended, Type: Feature (new features or improvements in behavior)

Comments

@kop

kop commented Nov 10, 2017

Please answer these questions before submitting your issue.

What version of gRPC are you using?

1.7.2

What version of Go are you using (go version)?

go1.9.2 darwin/amd64

What did you do?

I was trying out the new Load Balancing APIs for gRPC services running in K8S.

What did you expect to see?

I expected to see an option to configure the default DNS resolver and set a custom frequency for polling the DNS server.

What did you see instead?

The resolver has a hardcoded polling interval of 30 minutes, which is unusable in dynamic environments (like K8S).

@menghanl
Contributor

menghanl commented Nov 10, 2017

Right, with the current code there's no way to set a custom polling frequency for DNS.
An easy workaround would be to add a new function dns.NewBuilderWithFreq(freq time.Duration) (similar to this). Users could create a new DNS builder with it and register it to override the default DNS builder.
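
A minimal sketch of how that workaround would be used, assuming a dns.NewBuilderWithFreq constructor is added to the resolver/dns package as proposed (hypothetical API, not part of a released version):

package setup

import (
    "time"

    "google.golang.org/grpc/resolver"
    "google.golang.org/grpc/resolver/dns"
)

func init() {
    // Hypothetical: NewBuilderWithFreq is the proposed constructor, not
    // (yet) part of the released API. Registering a builder for the "dns"
    // scheme overrides the default DNS builder.
    resolver.Register(dns.NewBuilderWithFreq(30 * time.Second))
}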

Another problem here is, it's very hard to find the perfect polling frequency.

I'm working on a change to make the resolver re-resolve the name whenever a connection goes down. With this, we plan to remove the polling frequency from the resolver.
But the resolver won't re-resolve if all connections are working. Does this work for K8S?

@kop
Author

kop commented Nov 10, 2017

An easy workaround would be to add a new function dns.NewBuilderWithFreq(freq time.Duration)

Yes, this sounds good.

I'm working on a change to make resolver re-resolve the name whenever a connection goes down.
But the resolver won't re-resolve if all connections are working. Does this work for K8S?

I'm not sure it will work great. I mean, it will handle cases when a pod is destroyed/recreated, but it will not handle cases when the service is scaled up and new pods are ready to serve requests, will it?

I think polling frequency should stay. It would be possible to set a much longer polling interval if targets are re-evaluated on every disconnect, but we still need a mechanism to discover new instances of the service.

@menghanl
Contributor

it will not handle cases when the service is scaled up and new pods are ready to serve requests

IMO a better solution for this would be to push updates instead of polling. But that's out of the scope of DNS.

cc @ejona86 for more inputs.

@kop
Author

kop commented Nov 10, 2017

IMO a better solution for this would be to push the updates instead of polling.

Totally agreed.

But that's out of the scope of DNS.

DNS is the native service discovery mechanism in K8S, and I think it would be really great if it worked with gRPC out of the box :)

@ejona86
Member

ejona86 commented Nov 10, 2017

I mean, it will handle cases when a pod is destroyed/recreated, but it will not handle cases when the service is scaled up and new pods are ready to serve requests, will it?

The solution for this is MAX_CONNECTION_AGE. This has the advantage that the service owner controls the configuration. (It also works with L4 proxies where the address doesn't change.)

@dfawley added the P2 and Type: Feature labels Nov 16, 2017
@trusch

trusch commented Nov 29, 2017

DNS in k8s only returns the virtual IP of the service, not all the IPs of the replicas. This will not work when reusing a single client conn.

I saw a k8s resolver implementation that resolves natively via the k8s API.

@kop
Author

kop commented Nov 29, 2017

@trusch But we can use Headless Services: they return multiple A records, which should solve the problem.

@ejona86
Member

ejona86 commented Nov 29, 2017

DNS in k8s only returns the virtual IP of the service, not all the IPs of the replicas. This will not work when reusing a single client conn.

@trusch, what do you mean by "not work?" It should "work" as in function, in a basic sense. It would also load balance multiple clients to multiple servers and rebalance the clients over time (via MAX_CONNECTION_AGE). But I would agree it does not help distributing load from a single client to multiple servers. Is this what you mean?

Our (internal-discussion only?) proposed solution there is to create multiple ClientConns and round-robin over them. To make this easier, we could enhance grpc to create the multiple ClientConns behind a single ClientConn so it becomes mostly transparent. The extra connections would not be guaranteed to get different backends, but that is a normal property of L4 LBs.

The discussions/work to do this stalled, but it can be re-opened if necessary. Either being okay with what you have (because, say, you don't have a single client causing a substantial amount of load) or exposing IPs to your clients to avoid the L4 LB sounds superior in this case.
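
A rough sketch of that multiple-ClientConn idea using only the public grpc-go API (multiConn, dialMulti, and Pick are made-up names for illustration, not a proposed gRPC API):

package multiconn

import (
    "sync/atomic"

    "google.golang.org/grpc"
)

// multiConn holds several ClientConns to the same L4-load-balanced
// target and rotates across them, one per call.
type multiConn struct {
    conns []*grpc.ClientConn
    next  uint64
}

func dialMulti(target string, n int, opts ...grpc.DialOption) (*multiConn, error) {
    mc := &multiConn{}
    for i := 0; i < n; i++ {
        cc, err := grpc.Dial(target, opts...)
        if err != nil {
            mc.Close()
            return nil, err
        }
        mc.conns = append(mc.conns, cc)
    }
    return mc, nil
}

// Pick returns the next ClientConn round-robin; call it per RPC.
func (m *multiConn) Pick() *grpc.ClientConn {
    i := atomic.AddUint64(&m.next, 1)
    return m.conns[i%uint64(len(m.conns))]
}

func (m *multiConn) Close() {
    for _, cc := range m.conns {
        cc.Close()
    }
}

As noted above, the extra connections aren't guaranteed to reach different backends; that's a normal property of L4 LBs.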

@trusch

trusch commented Nov 30, 2017

@ejona86 Yes, I have a setup with gateways and workers and would like to have one ClientConn to the workers that load-balances on a per-call basis. I think using headless services as @kop suggested, together with the new DNS resolver (in combination with MAX_CONNECTION_AGE), will do the job for me.

@dfawley
Member

dfawley commented Dec 14, 2017

Resolving as "working as intended", since we don't intend to add support for custom polling frequencies in the default DNS resolver. Please let us know if you don't think that's appropriate.

@HannesMvW

I agree with the OP; we'd benefit from being able to set the DNS polling interval shorter than 30 minutes.
What is the alternative method to alert clients when new instances have been added behind a Headless service?

@ejona86
Member

ejona86 commented Jan 16, 2018

@HannesMvW, see #1663 (comment). The API is keepalive.ServerParameters.MaxConnectionAge.

@HannesMvW

Thanks @ejona86 - how do I access this API? I'm using the "grpc" package from a Go application.

@ejona86
Member

ejona86 commented Jan 17, 2018

@HannesMvW, use grpc.KeepaliveParams() to create a ServerOption that's passed to grpc.NewServer().
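
For example (the durations here are illustrative, not recommendations):

package server

import (
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/keepalive"
)

func newServer() *grpc.Server {
    return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
        // Gracefully close connections older than this; clients then
        // reconnect (and re-resolve DNS), picking up new backends.
        MaxConnectionAge: 5 * time.Minute,
        // Allow in-flight RPCs this long to finish before a hard close.
        MaxConnectionAgeGrace: 30 * time.Second,
    }))
}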

@HannesMvW

HannesMvW commented Jan 18, 2018

Thanks, @ejona86. My apologies for not being more specific - I'd like my gRPC clients (using client side load balancing) to refresh the list of servers (listed for a Kubernetes service name), via DNS, more frequently than every 30 minutes.

Of course, instead of polling, a prettier solution would be to somehow subscribe to DNS changes, but I haven't yet figured out if or how this is possible with gRPC.

PS. Thanks to your help I got MaxConnectionAge to work, and then realized it keeps killing my gRPC connections, which is not what I intended. I'd like to keep server connections up throughout the lifetime of the server, but also make use of new servers as quickly as possible (which I guess requires DNS polling).

@ejona86
Member

ejona86 commented Jan 18, 2018

@HannesMvW, when the server shuts down the connection due to age, the client will re-resolve DNS. This is very natural with pick-first load balancing. With round-robin it is slightly awkward, but it doesn't seem severe enough to warrant determining yet-another-lb-solution. It's also only a problem when using DNS as other naming systems tend to provide update notifications.

DNS polling is in general a bad idea. In the presence of round-robin DNS servers it would cause actual harm. High-frequency DNS polling can put quite a bit of load on a critical system, which is dangerous. It is also client-side configuration, which generally cannot be updated quickly. I'm right now hoping to kill the 30-minute polling, as Go is the only implementation that does it and there should not be a need.

The k8s documentation itself talks about avoiding these same problems.

I'd like to keep server connections up throughout the lifetime of the server

What harm are you seeing from the re-creation of connections? I know of several possible issues, but I'm interested in what you're experiencing.

@HannesMvW

HannesMvW commented Jan 18, 2018

Great, thanks @ejona86 for your feedback! I understand polling in general is a bad idea. In this case "a few minutes" should be a good enough interval, since we'd want the gRPC client-side load balancing to pick up new servers reasonably quickly.

My application is a small number of client-side load balancers working against a huge set of servers, possibly in the hundreds or thousands. (This is why I want to avoid tearing down gRPC connections unnecessarily.) When scaling up the number of servers, I'd like to see them go into use a bit more quickly than 30 minutes...

I had a look at the gRPC source code, and I see no solution short of writing my own load balancer builder... all just to change "30m0s" to something like "3m0s". Or is there an easier way?

Or, perhaps we can also hook into dockerd to get notifications when new servers are added.

--

On a side note, I've noticed gRPC DNS for service "foo" does this:

  1. Query the DNS SRV record for "_grpclb._tcp.foo"
  2. Query DNS for each target returned from this query (think hundreds of DNS queries here...)
  3. Query DNS for the A record for "foo" (this returns the complete set of server IP addresses)
  4. Query DNS for the TXT record for "foo" (in hope of getting a service config; not supported by K8s)

I'd love to skip steps 1, 2, and 4, since 3 gives us the complete list of servers we're looking for.
Am I misconfiguring/misusing gRPC client-side load balancing somehow?
Although related to the thread above, I realize this is probably better off on a different thread.
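
For reference, a rough standard-library equivalent of the four lookups above (an illustration, not the resolver's actual code):

package dnsdemo

import "net"

func lookups() {
    // Step 1: SRV query for _grpclb._tcp.foo (grpclb balancer addresses).
    _, srvs, _ := net.LookupSRV("grpclb", "tcp", "foo")
    // Step 2: resolve each SRV target (one query per balancer).
    for _, s := range srvs {
        net.LookupHost(s.Target)
    }
    // Step 3: A/AAAA query for foo (the backend addresses).
    net.LookupHost("foo")
    // Step 4: TXT query for foo (service config, if any).
    net.LookupTXT("foo")
}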

@ejona86
Member

ejona86 commented Jan 18, 2018

Ah, I see. Your case is quite pathological. I'll try to have a quick conversation with some other devs today about this.

Worst case, you could write your own name resolver, "my-better-dns", that does exactly what you want. Most of the complexity of the current resolver is to do all those things you don't like anyway, so this probably isn't that bad. (Not to say it should be the solution, just that it isn't a bad "worst case.")

(1) shouldn't return any results in your case, so (2) will be a no-op. Most of the time we wouldn't expect hundreds of LB servers; instead you'd rely on a few virtual IPs or the like for large scale-out. Pick-first is used to connect to the LB.

(4) shouldn't return any results in your case either.

Steps 1, 2, and 4 aren't really intended to be configurable, and it's unfortunate they are always done even when unnecessary, but this is sort of the state of the world. There aren't many options other than "use better service discovery." There's quite a lot in the pipeline to add to service config, so in the future I hope you'd be interested in it, although the "not supported by K8s" part would need a resolution of some sort.

@ejona86
Member

ejona86 commented Jan 18, 2018

Discussed with @dfawley. @HannesMvW, we think writing your own name resolver is the way to go. Should be pretty easy and you can configure it exactly like you want.

You're actually deep enough into this that you may want "serious client-side load balancing." That would mean "gRPC LB." Unfortunately that still lacks a server and so has a high barrier to entry, but it allows you to scale more seriously and adapt to load and changes. I expect you'd be happy enough with making your own resolver, but I thought I should mention that your case is getting into this other realm.
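
A minimal sketch of such a resolver against the resolver.Builder / resolver.Resolver interfaces (the "my-better-dns" scheme and the 3-minute interval are made up for illustration; the interface details have shifted across grpc-go versions, and this targets a recent one):

package betterdns

import (
    "context"
    "net"
    "time"

    "google.golang.org/grpc/resolver"
)

type builder struct{ freq time.Duration }

func (b *builder) Scheme() string { return "my-better-dns" }

func (b *builder) Build(target resolver.Target, cc resolver.ClientConn, opts resolver.BuildOptions) (resolver.Resolver, error) {
    ctx, cancel := context.WithCancel(context.Background())
    r := &pollingResolver{target: target, cc: cc, cancel: cancel}
    go r.watch(ctx, b.freq)
    return r, nil
}

type pollingResolver struct {
    target resolver.Target
    cc     resolver.ClientConn
    cancel context.CancelFunc
}

// watch re-resolves immediately, then on every tick until Close.
func (r *pollingResolver) watch(ctx context.Context, freq time.Duration) {
    ticker := time.NewTicker(freq)
    defer ticker.Stop()
    for {
        r.resolve()
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
        }
    }
}

// resolve does a plain A/AAAA lookup and pushes the result to gRPC,
// skipping the SRV and TXT lookups entirely.
func (r *pollingResolver) resolve() {
    host, port, err := net.SplitHostPort(r.target.Endpoint())
    if err != nil {
        return
    }
    ips, err := net.LookupHost(host)
    if err != nil {
        return
    }
    addrs := make([]resolver.Address, 0, len(ips))
    for _, ip := range ips {
        addrs = append(addrs, resolver.Address{Addr: net.JoinHostPort(ip, port)})
    }
    r.cc.UpdateState(resolver.State{Addresses: addrs})
}

func (r *pollingResolver) ResolveNow(resolver.ResolveNowOptions) { r.resolve() }
func (r *pollingResolver) Close()                                { r.cancel() }

func init() {
    // Registering makes "my-better-dns:///host:port" targets usable.
    resolver.Register(&builder{freq: 3 * time.Minute})
}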

@HannesMvW

Thanks so much for looking into this. Yes, I'm looking for really, really high performance. Currently I've opted not to name my service ports 'grpclb', so the 1st lookup fails, which, yes, skips the 2nd lookup, leaving only the 3rd. The 4th always fails. (I've seen requests for K8s support for TXT records, for gRPC service configs, which I would also be interested in trying.)

@raliste

raliste commented Jun 21, 2018

Just out of curiosity, why is the Go implementation different from the Java implementation? Scaling replicas up with grpc-go fails miserably, as the subconn map is not properly updated. The Java implementation works flawlessly.

Our main issues are:

  • When all subconns are in transient failure, no resolution is requested, so we depend on the 30 minute interval.

@menghanl
Contributor

@raliste Are you using an old version of gRPC? We do try to re-resolve whenever a subconn goes into transient failure.

@raliste

raliste commented Jun 21, 2018

Thank you for your answer @menghanl

We were actually testing with an old version. After upgrading we saw good resolving, but it appears that when all subconns go into transient failure, no re-resolve is done.

@raliste

raliste commented Jun 21, 2018

grpc-java did implement re-resolution when all subconns go into transient failure.

grpc/grpc-java#1591

@menghanl
Contributor

@raliste I'm not sure I understand the problem.
The current behavior is: whenever a subconn goes into transient failure, the resolver does a re-resolve. This should cover the case where all subconns go into transient failure. Can you explain in more detail?

@raliste

raliste commented Jun 22, 2018

@menghanl Thank you. The actual problem is that when all subconns are in transient failure, no re-resolve is performed. I understand that a re-resolve is done whenever a subconn goes into transient failure, but in our case, when all subconns go into transient failure (at once), no re-resolve is performed. The subconn map is only updated after 30 minutes.

rpc error: code = Unavailable desc = all SubConns are in TransientFailure

I think the problem is that you do re-resolve (one time), but you don't keep re-resolving until a new subconn is retrieved.

@menghanl
Contributor

@raliste When a subconn is in transient failure, it will keep retrying. So the state will go back to connecting, then transient failure again, and a re-resolve will happen.

If you still have the problem, can you give a bit more information about what you did? Like what's your environment setup? And are you using the default dns resolver?
It would also be good to file a new issue instead of discussing here. Thanks!

@elvizlai
Contributor

elvizlai commented Jul 2, 2018

@menghanl It looks like this doesn't work well in k8s, because sometimes the client comes up before the server. In that case the DNS record may be empty, and things only start working after 30 minutes.

apiVersion: v1
kind: Namespace
metadata:
  name: test

---

apiVersion: v1
kind: Service
metadata:
  namespace: test
  name: server
spec:
  clusterIP: None
  ports:
    - name: grpc
      port: 1234
  selector:
    app: server

---

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: server
  namespace: test
spec:
  replicas: 5
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: server
        ver: v1
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: server
        image: "sdrzlyz/istio-demo-server:v1"
        imagePullPolicy: IfNotPresent
        ports:
        - name: grpc-port
          containerPort: 1234
        readinessProbe:
          tcpSocket:
            port: grpc-port
          initialDelaySeconds: 5
          periodSeconds: 10
        livenessProbe:
          tcpSocket:
            port: grpc-port
          initialDelaySeconds: 15
          periodSeconds: 20

---

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: client
  namespace: test
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: client
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: client
        image: "sdrzlyz/istio-demo-client"
        imagePullPolicy: Always
        env:
        - name: SADDR
          value: "dns:///server.test.svc.cluster.local:1234"

@menghanl
Contributor

menghanl commented Jul 2, 2018

@elvizlai What you described is tracked by #1795. Will have a fix shortly.
