
notify: replace unfiltered with filtered alerts #595

Merged: 1 commit into prometheus:master on Jan 9, 2017

Conversation

@brancz (Member) commented Jan 4, 2017

Fixes #591

We were actually testing the len(res) == 0 condition in the tests, so we didn't notice the problem, as we always filtered out all alerts in the tests.

@fabxc @brian-brazil @alexsomesan

@ddewaele you mentioned in #591 that you would volunteer to test a fix. I'm pretty confident it works correctly now, but it won't hurt to try it in your environment.

@ddewaele commented Jan 4, 2017

Testing now....

Again using the following script:

curl -XPOST -d'[{"labels":{"alertname":"a1"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
curl -XPOST -d'[{"labels":{"alertname":"a2"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
curl -XPOST -d'[{"labels":{"alertname":"a3"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
curl -XPOST -d'[{"labels":{"alertname":"b1"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
curl -XPOST -d'[{"labels":{"alertname":"b2"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; sleep 15
curl -XPOST -d'[{"labels":{"alertname":"b3"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date ; echo "done... now looping b3 for keepAlive"
while true; do  curl -XPOST -d'[{"labels":{"alertname":"b3"}}]' http://localhost:9093/alertmanager/api/v1/alerts; date  ; sleep 15; done
echo "done"

Webhook output:

2017-01-04 16:29:46.807  payload {receiver=team-ixortalk-mails, status=firing, alerts=[{status=firing, labels={alertname=a1}, annotations={}, startsAt=2017-01-04T16:29:41.79736345+01:00, endsAt=0001-01-01T00:00:00Z, generatorURL=}], groupLabels={}, commonLabels={alertname=a1}, commonAnnotations={}, externalURL=http://localhost:9093/alertmanager, version=3, groupKey=14695981039346656037}
2017-01-04 16:31:46.832  payload {receiver=team-ixortalk-mails, status=firing, alerts=[{status=firing, labels={alertname=a3}, annotations={}, startsAt=2017-01-04T16:30:11.853243215+01:00, endsAt=0001-01-01T00:00:00Z, generatorURL=}, {status=firing, labels={alertname=b1}, annotations={}, startsAt=2017-01-04T16:30:26.880155317+01:00, endsAt=0001-01-01T00:00:00Z, generatorURL=}, {status=firing, labels={alertname=b2}, annotations={}, startsAt=2017-01-04T16:30:41.909161289+01:00, endsAt=0001-01-01T00:00:00Z, generatorURL=}, {status=firing, labels={alertname=b3}, annotations={}, startsAt=2017-01-04T16:30:56.94334743+01:00, endsAt=0001-01-01T00:00:00Z, generatorURL=}, {status=firing, labels={alertname=a2}, annotations={}, startsAt=2017-01-04T16:29:56.824101384+01:00, endsAt=0001-01-01T00:00:00Z, generatorURL=}], groupLabels={}, commonLabels={}, commonAnnotations={}, externalURL=http://localhost:9093/alertmanager, version=3, groupKey=14695981039346656037}
2017-01-04 16:33:46.818  payload {receiver=team-ixortalk-mails, status=firing, alerts=[{status=firing, labels={alertname=b3}, annotations={}, startsAt=2017-01-04T16:30:56.94334743+01:00, endsAt=0001-01-01T00:00:00Z, generatorURL=}], groupLabels={}, commonLabels={alertname=b3}, commonAnnotations={}, externalURL=http://localhost:9093/alertmanager, version=3, groupKey=14695981039346656037}
2017-01-04 16:35:46.822  payload {receiver=team-ixortalk-mails, status=firing, alerts=[{status=firing, labels={alertname=b3}, annotations={}, startsAt=2017-01-04T16:30:56.94334743+01:00, endsAt=0001-01-01T00:00:00Z, generatorURL=}], groupLabels={}, commonLabels={alertname=b3}, commonAnnotations={}, externalURL=http://localhost:9093/alertmanager, version=3, groupKey=14695981039346656037}

I don't get the RESOLVED sections anymore, but I might have found another issue.

Some questions:

  • It was already clear in the second webhook that b3 was firing, yet it is repeated in the third and fourth webhook calls.
  • In the second webhook, b3 has startsAt=2017-01-04T16:30:26.880155317+01:00, while in the third webhook b3 has startsAt=2017-01-04T16:30:56.94334743+01:00. (The script was sending b3 alerts every 15 seconds. Why did startsAt change?)
  • Why were the last 2 webhooks posted? Webhooks 3 and 4 seem identical.

@brancz (Member, Author) commented Jan 4, 2017

I am assuming this was produced with the config pasted in #591. In that case this is the correct behavior.

  1. It is repeated because the alert is still firing against the Alertmanager and your group_interval is set to 2 minutes, meaning that as long as an alert is still firing, the notification is resent at a 2-minute interval.
  2. The startsAt actually did not change, I'm guessing you looked at a different alertname than "b3" in the second log line.
  3. 3 and 4 are intentionally identical, as they are notifying you about the same problem that has not been resolved.

@brancz
Copy link
Member Author

brancz commented Jan 4, 2017

Actually, rethinking that, I think you might be right: the repeat_interval is set to 24h, yet the notification seems to be resent based on the group_interval instead.
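
A rough, hypothetical sketch of how those two settings could interact (the names and logic below are illustrative assumptions, not Alertmanager's actual code): a group whose content changed since the last notification is re-sent on the next group_interval flush, while an unchanged group only gets re-sent after repeat_interval.

	// Hypothetical sketch, not Alertmanager's actual code.
	package main

	import (
		"fmt"
		"time"
	)

	// shouldNotify decides whether a flush of the alert group should produce a notification.
	func shouldNotify(changedSinceLastNotify bool, lastNotify, now time.Time, repeatInterval time.Duration) bool {
		if changedSinceLastNotify {
			// Something in the group changed since the last notification: notify on this flush.
			return true
		}
		// Unchanged group: only re-send once repeat_interval has elapsed.
		return now.Sub(lastNotify) >= repeatInterval
	}

	func main() {
		now := time.Now()
		last := now.Add(-2 * time.Minute) // last notification two minutes ago

		fmt.Println(shouldNotify(true, last, now, 24*time.Hour))  // true: group content changed
		fmt.Println(shouldNotify(false, last, now, 24*time.Hour)) // false: unchanged and 24h not elapsed
	}

Under that model, a repeat_interval of 24h should suppress re-sends of an unchanged group, which points at the group "changing" between flushes as the real trigger.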

@ddewaele commented Jan 4, 2017

Correct... I guess Alertmanager is sending out notifications when it thinks the alert content has changed within a group, within the group_interval. In my test case this is what's causing the 4 notifications (where I think it should only be sending 2).

So although the fix here removes the resolved items from the alert list, I think they are still taken into account when deciding whether Alertmanager should send a notification (because they are part of the group hash).

This comes from the needsUpdate function in notify.go:

	if !bytes.Equal(entry.GroupHash, hash) {
		return true, nil
	}

I guess the hash alone will not be sufficient to determine whether a notification should be sent.

If all alerts (including the resolved ones) within a group are hashed, it could explain the 4 notifications:

My results:
Notification 1

 a1[7a091fb][active]

Notification 2

a2[4ad4244][active]
a3[7732e17][active]
b1[99488d8][active]
b2[4a429ad][active]
b3[3a21090][active]
a1[7a091fb][resolved]

Notification 3

b3[3a21090][active]
a2[4ad4244][resolved]
a3[7732e17][resolved]
b1[99488d8][resolved]
b2[4a429ad][resolved]

Notification 4

b3[3a21090][active]

So from Alertmanager's perspective, the content has changed each time (a rough sketch of this hashing behavior follows below).

On the 5th run it will again see b3[3a21090][active] and will not send a notification.
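
A rough sketch of that hashing behavior (hypothetical names and types, not Alertmanager's actual code): if the group hash covers every alert in the group, including resolved ones, the hash changes whenever alerts get resolved or garbage-collected, even though the set of firing alerts stays the same, so each flush looks "new".

	// Hypothetical sketch, not Alertmanager's actual code.
	package main

	import (
		"crypto/sha256"
		"fmt"
		"sort"
	)

	type alert struct {
		name     string
		resolved bool
	}

	// groupHash hashes the names and statuses of all alerts in the group,
	// resolved ones included.
	func groupHash(alerts []alert) [sha256.Size]byte {
		keys := make([]string, 0, len(alerts))
		for _, a := range alerts {
			keys = append(keys, fmt.Sprintf("%s/%t", a.name, a.resolved))
		}
		sort.Strings(keys)

		h := sha256.New()
		for _, k := range keys {
			h.Write([]byte(k))
		}
		var sum [sha256.Size]byte
		copy(sum[:], h.Sum(nil))
		return sum
	}

	func main() {
		// Notification 3: b3 still firing, a2..b2 just resolved.
		n3 := []alert{{"b3", false}, {"a2", true}, {"a3", true}, {"b1", true}, {"b2", true}}
		// Notification 4: the resolved alerts are gone, only b3 remains firing.
		n4 := []alert{{"b3", false}}

		// The hashes differ, so both flushes look "changed" and both notify,
		// even though the firing set (just b3) never changed.
		fmt.Println(groupHash(n3) == groupHash(n4)) // false
	}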

@ddewaele commented Jan 6, 2017

Hi @brancz / @fabxc / @brian-brazil: any thoughts on how Alertmanager should handle this particular test case? What would be the expected outcome here? Only 2 notifications (the first 2) instead of the actual 4 that we are getting?

@brian-brazil (Contributor) commented:
Per the docs it should just be 1 & 2. There's an argument for 1, 2 & 3, but I think that would end up spammy.

@ddewaele commented Jan 6, 2017

@brian-brazil: agreed... should this also be fixed in this PR, or would a separate one be better?

This PR will now have the side effect of duplicate emails (before this fix they weren't considered duplicates because they also included the resolved alerts).

The hash calculation was probably done for a reason (strictness / performance / ...), so I don't know how big an impact a full fix would have. But it seems to be isolated to the shouldUpdate call, where we would just need to check whether there are any new active alerts that haven't been sent previously (a rough sketch follows below).
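
A rough sketch of the kind of check being suggested (hypothetical names and types, not the actual Alertmanager implementation): instead of comparing a hash over all alerts, only notify when at least one firing alert was not part of the previous notification.

	// Hypothetical sketch of the suggested check, not Alertmanager's code.
	type alert struct {
		fingerprint string
		resolved    bool
	}

	// hasNewFiringAlerts reports whether any currently firing alert has not
	// been included in a previous notification.
	func hasNewFiringAlerts(previouslyNotified map[string]bool, current []alert) bool {
		for _, a := range current {
			if !a.resolved && !previouslyNotified[a.fingerprint] {
				return true
			}
		}
		return false
	}

In the test above, such a check would keep notifications 1 and 2 and skip 3 and 4, since b3 had already been notified about in notification 2.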

@brian-brazil (Contributor) commented:
It'd be good to fix it completely, but this PR brings us closer to correctness.

I imagine the full fix will be fairly involved.

@brancz (Member, Author) commented Jan 6, 2017

Agreed. As this fixes one behavior, we should have this reviewed and then look at the other problem separately.

@ddewaele commented Jan 6, 2017

OK... I'll create a separate issue for the spammy emails.

@fabxc (Contributor) commented Jan 9, 2017

👍

@fabxc merged commit 45bd4fd into prometheus:master on Jan 9, 2017
@brancz deleted the fix-send-resolved branch on January 9, 2017 at 11:27
Merging this pull request may close: "Notifications for both firing and resolved items doubts"