consul-template renders empty data during consul server raft election/change #1131

Closed
vaLski opened this issue Aug 6, 2018 · 4 comments
Labels
bug consul Related to the Consul integration

vaLski commented Aug 6, 2018

During a raft leader re-election on the Consul servers (caused by a cluster outage, a routine restart of the Consul servers, etc.):

  • consul-template, using ls /path/to/keyprefix inside a ctmpl file,
  • renders blank data
  • instead of rendering the keys under that key prefix
  • the issue happens when using stale queries

I know that, according to the docs, consul-template should not render anything if it receives an error from the Consul server. In this case, however, it seems that it is not receiving an error but rather a valid empty response from the servers, which leads to blank data being rendered to the template destination.

I also experienced a very similar and very painful issue with consul-replicate and stale queries. Under a similar scenario (raft leader change/sync), the following data center received blank data from the parent, which led to all KVs being erased on the follower. The issue is not fixed yet; I mitigated it by reconfiguring consul-replicate to never use stale queries. A detailed report is available at hashicorp/consul-replicate#82

Recently the same issue also happened with consul-template when used with stale queries. It only happens during raft re-election, outage recovery, etc. I suspect that it is more related to Consul, which sends valid but blank answers to the long-polling KV queries during a server outage or raft change. I also reported it to the Consul project, but they closed it as not Consul-related: hashicorp/consul#3975

I am not sure whether this is strictly a Consul server bug or an issue with consul-template and consul-replicate, which presumably share the same buggy code, especially when using stale queries. Unfortunately this issue is really, really bad: it leads to data loss or to incorrect/blank configuration files being rendered to the template destination. Any suggestions will be greatly appreciated.

Consul Template version

consul-template v0.18.1 (9c62737)
consul-template v0.19.5 (57b6c71)

Configuration

max_stale = "5m"

part of ctmpl file

allow_from = {{range $index, $kv := ls "/pub/server/mastermachines/pub/addr"}}{{if ne $index 0}} {{end}}{{$kv.Key}}{{end}}

Command

consul-template -log-level info \
  -kill-signal SIGTERM -reload-signal SIGHUP \
  -vault-renew-token=false \
  -max-stale 10m \
  -config=/etc/consul-template/configs \
  -config=/etc/consul-template/templates.hcl

Expected behavior

It should render like this:

allow_from = 1.1.1.1 2.2.2.2 3.3.3.3

Actual behavior

allow_from =
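
The blank line is what the template itself produces whenever ls yields an empty list without an error. Below is a minimal sketch using Go's standard text/template package; the kvPair type, the render helper, and the stubbed ls function are hypothetical stand-ins for illustration, not consul-template's own implementation.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// kvPair is a hypothetical stand-in for the key/value pairs that ls yields.
type kvPair struct {
	Key   string
	Value string
}

// tmplText mirrors the allow_from line from the report.
const tmplText = `allow_from = {{range $index, $kv := ls "/pub/server/mastermachines/pub/addr"}}{{if ne $index 0}} {{end}}{{$kv.Key}}{{end}}`

// render executes the template with a stubbed ls that returns the given
// pairs regardless of the prefix it is asked for.
func render(pairs []kvPair) (string, error) {
	funcs := template.FuncMap{
		"ls": func(prefix string) []kvPair { return pairs },
	}
	t, err := template.New("allow_from").Funcs(funcs).Parse(tmplText)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, nil); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	full, _ := render([]kvPair{{Key: "1.1.1.1"}, {Key: "2.2.2.2"}, {Key: "3.3.3.3"}})
	empty, _ := render(nil)
	// An empty list is not an error for range, so the second render
	// succeeds and leaves only the "allow_from = " prefix.
	fmt.Printf("%q\n%q\n", full, empty)
}
```

In this sketch the empty case renders successfully, so nothing at the template level can tell "the prefix is really empty" apart from "the server answered with nothing".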

Steps to reproduce

  1. Create several records:
consul kv put /pub/server/mastermachines/pub/addr/1.1.1.1 $(date +%s)
consul kv put /pub/server/mastermachines/pub/addr/2.2.2.2 $(date +%s)
  2. Create a ctmpl file that uses the prefix as follows:
allow_from = {{range $index, $kv := ls "/pub/server/mastermachines/pub/addr"}}{{if ne $index 0}} {{end}}{{$kv.Key}}{{end}}
  3. Configure consul-template to use stale queries and start it as described above.
  4. Start killing/restarting Consul server nodes so that you constantly trigger Consul server re-elections.
  5. At some point, instead of rendering the IPs, the template destination will end up with a blank allow_from line.


vaLski commented Aug 7, 2018

I managed to successfully reproduce this issue in an isolated environment:

  • 2 consul servers with preloaded data in the KV store
  • 1 consul agent with consul-template installed on it that is using range ls queries
  • continuously but sequentially reading and writing random KV to/from the KV store on the agent node with 10 second sleep between iterations
  • 100ms backoff and 1s max backoff for consul-template
  • script that is forcing consul servers to be killed and then re-spawned by the supervisor
while true; do
  for i in sof2 sof3; do
    ssh "${i}" 'perpctl k consul' & # like killall -s SIGKILL consul
  done
  sleep 15
done
  • a simple script that constantly removes the template output and forces consul-template to reload
file=/path/to/template/output
while true; do
	if [[ ! -f "${file}" ]]; then
		killall -HUP consul-template >> /dev/null 2>&1
		sleep 1
		continue
	fi
	missing=0
	sum=$(md5sum "${file}" | awk '{print $1}')
	if [[ "${sum}" != '4944b8f85bbc9bb4b2e3cbaa830f463e' ]]; then
		echo "$(date +%s) - $(date) - ${file} sum differs ${sum}"
		exit 1
	fi
	unlink "${file}"
	sleep 0.2
done

vaLski commented Aug 8, 2018

2018-08-08 08:50:09.550011 2018/08/08 08:50:09.549983 [DEBUG] (runner) VALDEBUG: lsFunc kv.list(pub/team/adminteam/ssh/key)%!(EXTRA string={})
2018-08-08 08:50:09.550015 2018/08/08 08:50:09.549990 [DEBUG] (runner) VALDEBUG: lsFunc CASE1  []
*SNIP*
2018-08-08 08:50:09.551388 2018/08/08 08:50:09.551374 [INFO] (runner) rendered "/etc/consul-template/templates.ctmpl/_usr_local_1h_etc_lifesigns.conf.ctmpl" => "/usr/local/1h/etc/lifesigns.conf"
  • Instead, it should look like this (taken from an earlier successful run):
2018-08-08 08:49:58.687991 2018/08/08 08:49:58.662164 [DEBUG] (runner) VALDEBUG: lsFunc kv.list(pub/team/adminteam/ssh/key)%!(EXTRA string={})
2018-08-08 08:49:58.687993 2018/08/08 08:49:58.662189 [DEBUG] (runner) VALDEBUG: lsFunc CASE1  [
2018-08-08 08:49:58.687995   {
2018-08-08 08:49:58.688000     "Path": "pub/team/adminteam/ssh/key/id_rsa.pub",
2018-08-08 08:49:58.688001     "Key": "id_rsa.pub",
2018-08-08 08:49:58.688008     "Value": "ssh-rsa RSA_KEY_HERE"
2018-08-08 08:49:58.688011     "CreateIndex": 13316742,
2018-08-08 08:49:58.688013     "ModifyIndex": 16713907,
2018-08-08 08:49:58.688015     "LockIndex": 0,
2018-08-08 08:49:58.688016     "Flags": 0,
2018-08-08 08:49:58.688017     "Session": ""
2018-08-08 08:49:58.688019   }
2018-08-08 08:49:58.688020 ]
  • I traced it to this part of the consul-template code, where result is returned as an empty [] because the received value is empty. append is never called in this case because there are no items to iterate over:
        // Only return non-empty top-level keys
        if value, ok := b.Recall(d); ok {
            for _, pair := range value.([]*dep.KeyPair) {
                if pair.Key != "" && !strings.Contains(pair.Key, "/") {
                    result = append(result, pair)
                }
            }
            valStr, err := json.MarshalIndent(result, "", "  ")
            if err != nil {
                log.Println(err)
            }
            log.Printf("[DEBUG] (runner) VALDEBUG: lsFunc CASE1  %s", string(valStr))
            return result, nil
        }
  • I am sure that the KV prefix monitored with range ls contains data. For reasons still unknown to me, 1 out of N times it returns a blank list instead of the actual list, which leads to the conditions described in this bug.
  • I am still not sure whether it is a consul-template bug or the way Consul responds to KV queries in certain states during failover.
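
To make the traced behavior concrete, here is a small self-contained sketch of the same filtering condition. The keyPair type is a simplified, hypothetical stand-in for dep.KeyPair, and topLevel is an illustrative helper name, not a function from consul-template.

```go
package main

import (
	"fmt"
	"strings"
)

// keyPair is a simplified, hypothetical stand-in for dep.KeyPair.
type keyPair struct {
	Key   string
	Value string
}

// topLevel keeps only non-empty keys that contain no "/" separator,
// the same condition as the loop quoted above.
func topLevel(pairs []*keyPair) []*keyPair {
	result := []*keyPair{}
	for _, pair := range pairs {
		if pair.Key != "" && !strings.Contains(pair.Key, "/") {
			result = append(result, pair)
		}
	}
	return result
}

func main() {
	populated := topLevel([]*keyPair{
		{Key: "id_rsa.pub", Value: "ssh-rsa RSA_KEY_HERE"},
		{Key: "nested/key"},
	})
	// With an empty input the loop body never runs, so the caller gets
	// an empty slice and no signal that anything went wrong.
	empty := topLevel([]*keyPair{})
	fmt.Println(len(populated), len(empty))
}
```

The empty result here looks exactly like the result for a prefix that really has no keys, which matches the `[]` seen in the debug log.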

vaLski commented Aug 14, 2018

I confirm that I can reproduce this with the latest Consul version, 1.2.2, with raft version 3 and the following autopilot settings:

{
	"autopilot": {
		"cleanup_dead_servers": true,
		"last_contact_threshold": "2000ms",
		"max_trailing_logs": 500,
		"server_stabilization_time": "60s"
	}
}

vaLski added a commit to vaLski/consul-template that referenced this issue Aug 15, 2018
safels and safetree behave exactly like the native ls and tree, with one exception: they will *refuse* to render the template if the KV prefix query returns blank/empty data.

This is especially useful for rendering mission-critical files that do not tolerate ls/tree KV queries returning blank data.

safels and safetree work in stale mode just like their ancestors, but we get extra safety on top.

safels and safetree were born as an attempt to mitigate the issues described here:

  hashicorp#1131
  hashicorp/consul#3975
  hashicorp/consul-replicate#82
freddygv pushed a commit to hashicorp/consul that referenced this issue Aug 23, 2018
@pierresouchay

Fixed by hashicorp/consul#4554

eikenb added the bug and consul (Related to the Consul integration) labels Jun 14, 2019
eikenb closed this as completed Jun 14, 2019
eikenb pushed a commit that referenced this issue Sep 10, 2019
safels and safetree behave exactly like the native ls and tree, with one exception: they will *refuse* to render the template if the KV prefix query returns blank/empty data.

This is especially useful for rendering mission-critical files that do not tolerate ls/tree KV queries returning blank data.

safels and safetree work in stale mode just like their ancestors, but we get extra safety on top.

safels and safetree were born as an attempt to mitigate the issues described here:

  #1131
  hashicorp/consul#3975
  hashicorp/consul-replicate#82