[00:25:25] FIRING: SystemdUnitFailed: benthos@ncredir.service on ncredir2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:35:25] RESOLVED: SystemdUnitFailed: benthos@ncredir.service on ncredir2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:00:44] 06Traffic, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016#9776559 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `install7001.wikimedia.org` - install7001.wikimedia.org (**PASS**) - Downti...
[08:23:05] 06Traffic: An alert for "reduced availability for job ncredir in ops@codfw" fired even tho graphs look healthy - https://phabricator.wikimedia.org/T364354#9776599 (10Vgutierrez) a:03Vgutierrez that's right, this is a leftover from the migration from mtail to benthos on ncredir. We will take care of it ASAP.
[08:34:40] FIRING: [3x] VarnishHighThreadCount: Varnish's thread count on cp5019:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[08:39:40] FIRING: [3x] VarnishHighThreadCount: Varnish's thread count on cp5019:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[08:44:40] FIRING: [4x] VarnishHighThreadCount: Varnish's thread count on cp5019:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[08:49:40] FIRING: [4x] VarnishHighThreadCount: Varnish's thread count on cp5019:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:04:40] RESOLVED: [2x] VarnishHighThreadCount: Varnish's thread count on cp5019:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[09:29:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on cp2033:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:34:48] FIRING: [5x] PuppetZeroResources: Puppet has failed generate resources on cp1104:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:39:48] FIRING: [7x] PuppetZeroResources: Puppet has failed generate resources on cp1104:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:44:48] FIRING: [10x] PuppetZeroResources: Puppet has failed generate resources on cp1104:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:49:48] FIRING: [13x] PuppetZeroResources: Puppet has failed generate resources on cp1102:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:50:03] FIRING: [13x] PuppetZeroResources: Puppet has failed generate resources on cp1102:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:54:48] FIRING: [15x] PuppetZeroResources: Puppet has failed generate resources on cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[09:56:46] 06Traffic, 06Data-Persistence, 06SRE, 10SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9776836 (10Ladsgroup) We already have numbers for those and they look not great for the switch: see T360589 and T211661#8377883
[09:58:12] godog: ^^ why are we getting alerts right now when the latest failed report is from ~12 minutes ago according to puppetboard?
[09:59:48] FIRING: [16x] PuppetZeroResources: Puppet has failed generate resources on cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:00:41] vgutierrez: checking
[10:01:46] issue was fixed by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1028574
[10:04:04] vgutierrez: yeah, looks like a trickle of alerts as individual hosts failed puppet, starting at 09:29 and continuing as more hosts failed puppet, I'm guessing racing with the change; should equally resolve at the next puppet run or forced puppet run
[10:04:34] yeah
[10:04:48] FIRING: [16x] PuppetZeroResources: Puppet has failed generate resources on cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:05:16] what's confusing is getting new alerts here at 12:04:48 CEST when the last puppet failed run was at 11:44:50 CEST
[10:05:52] I read that and it looks like puppet just failed on cp1100 (and that's not the case at all)
[10:07:25] I get it yeah, since there are multiple alerts grouped (the [15x] for example) the 'summary' annotation displayed is only for one alert of the group
[10:07:47] one of the alerts of the group that is, you get it
[10:07:51] sure
[10:08:00] but no alert happened at 12:04:48 CEST though
[10:08:58] checking
[10:09:15] are we spooling alerts to avoid issues on the IRC bot?
[10:09:48] FIRING: [15x] PuppetZeroResources: Puppet has failed generate resources on cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:11:32] in a sense yeah, alerts get grouped when they are coming in at the "same" time hence why you see the [15x] like above, in a slow trickle like above you can see the alert group getting more and more alerts and now it is going to get less as things recover
[10:12:27] FWIW we have some additional preso/material to explain this behaviour better
[10:12:38] in the pipeline that is
[10:12:57] we need to improve somehow the IRC alerting
[10:13:24] karma shows the latest one as 38 minutes ago, and here we are getting pinged at the moment
[10:13:40] and we don't get any kind of recover here
[10:14:05] *recovery
[10:14:48] FIRING: [13x] PuppetZeroResources: Puppet has failed generate resources on cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:16:13] indeed I get what you are saying, it is confusing in this case since the notification says FIRING though the number of alerts in the group is going down and thus effectively recovering
[10:16:57] I'm looking for a task we had to track this, which I'm not finding rn
[10:17:11] 13x here, 0 on karma BTW
[10:19:48] RESOLVED: [9x] PuppetZeroResources: Puppet has failed generate resources on cp1100:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:20:42] heh yeah now the resolved notification for the whole group
[10:22:01] ok task being https://phabricator.wikimedia.org/T356994 I'll edit the description to mention what we just talked about
[10:25:18] godog: thx <3
[10:28:17] 06Traffic, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016#9776921 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by filippo@cumin1002 for hosts: `prometheus7001.magru.wmnet` - prometheus7001.magru.wmnet (**WARN**) -...
[10:46:02] 06Traffic, 06Data-Engineering, 10Observability-Logging: Benthos loses messages when under high load - https://phabricator.wikimedia.org/T364379 (10Fabfur) 03NEW
[11:56:11] 06Traffic: Update fifo_log_demux puppet module to support new parameters - https://phabricator.wikimedia.org/T364383 (10Fabfur) 03NEW
[12:56:00] 06Traffic, 13Patch-For-Review: An alert for "reduced availability for job ncredir in ops@codfw" fired even tho graphs look healthy - https://phabricator.wikimedia.org/T364354#9777278 (10Vgutierrez) 05Open→03Resolved https://gerrit.wikimedia.org/r/1028818 removed the prometheus jobs, alert should go awa...
[12:57:29] 06Traffic: Remove mtail leftovers on ncredir puppetization and instances - https://phabricator.wikimedia.org/T364385 (10Vgutierrez) 03NEW
[13:41:01] 06Traffic, 06Infrastructure-Foundations, 13Patch-For-Review: Q4:magru VM tracking task - https://phabricator.wikimedia.org/T364016#9777459 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by filippo@cumin1002 for hosts: `prometheus7001.magru.wmnet` - prometheus7001.magru.wmnet (**WARN**) -...
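(Editor's aside on the grouping behaviour explained at 10:11:32–10:16:13: the [15x]/[13x] counts in the IRC notifications reflect how many alerts Alertmanager currently holds in one notification group, so a "FIRING" line can repeat while the group is actually shrinking. A minimal sketch of inspecting those groups via the Alertmanager v2 API is below; the URL is a placeholder, and this is illustrative only, not the production IRC relay or its configuration.)

```python
# Sketch: list current Alertmanager alert groups and their sizes.
# ALERTMANAGER_URL is a placeholder, not the production endpoint.
import requests

ALERTMANAGER_URL = "http://alertmanager.example.org:9093"


def alert_group_sizes(url: str = ALERTMANAGER_URL) -> None:
    """Print each alert group and how many alerts it currently contains."""
    resp = requests.get(f"{url}/api/v2/alerts/groups", timeout=10)
    resp.raise_for_status()
    for group in resp.json():
        name = group["labels"].get("alertname", "<ungrouped>")
        print(f"{name}: {len(group['alerts'])} alert(s)")


if __name__ == "__main__":
    alert_group_sizes()
```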
[14:43:30] FIRING: HAProxyRestarted: HAProxy server restarted on cp7014:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/gQblbjtnk/haproxy-drilldown?orgId=1&var-site=magru%20prometheus/ops&var-instance=cp7014&viewPanel=10 - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [14:44:08] er [14:46:02] that's a bit odd since I see nothing in the journal [14:46:51] aaah, probably because prometheus7001 is being reimaged [14:48:29] sukhe: alert seems accurate: NRestarts=1 [14:48:30] FIRING: [16x] HAProxyRestarted: HAProxy server restarted on cp7001:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [14:49:40] yeah so this might be the one from April 30 that is being reported now [14:49:46] Apr 30 14:38:41 cp7001 update-ocsp-all[1260]: 139972199183680:error:2008F002:BIO routines:BIO_lookup_ex:system lib:../crypto/bio/b_addr.c:730:Temporary failure in name resolution [14:49:50] Apr 30 14:38:41 cp7001 update-ocsp-all[1225]: OCSP update failed for /etc/update-ocsp.d/digicert-2023-ecdsa-unified.conf [14:49:58] Apr 30 14:38:42 cp7001 systemd[1]: haproxy.service: Scheduled restart job, restart counter is at 1. [14:50:08] yeah [14:50:15] guessing we should restart haproxy in magru to clear these up [14:50:19] indeed [14:50:21] on it [14:54:53] 06Traffic, 06Data-Persistence, 06SRE, 10SRE-swift-storage, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9777706 (10Nosferattus) @Ladsgroup: Please excuse me if I'm wrong, but I don't see how those statistics are related to what I suggested. I read those stat... [14:55:33] cool, should be cleared [14:58:30] RESOLVED: [16x] HAProxyRestarted: HAProxy server restarted on cp7001:9100 - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://alerts.wikimedia.org/?q=alertname%3DHAProxyRestarted [15:11:31] lol [15:59:48] 06Traffic, 06MW-Interfaces-Team, 06serviceops: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400 (10daniel) 03NEW [16:00:26] 06Traffic, 06MW-Interfaces-Team, 06serviceops: map the /api/ prefix to /w/rest.php - https://phabricator.wikimedia.org/T364400#9778004 (10daniel) [19:12:48] 10Acme-chief: acmechief: add support for providing files with they private key before the public key - https://phabricator.wikimedia.org/T364424 (10jhathaway) 03NEW
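(Editor's aside on the NRestarts check mentioned at 14:48:29: systemd keeps a per-unit restart counter, which is what confirmed the HAProxyRestarted alert was accurate before the magru haproxy instances were restarted to clear it. A minimal sketch of that check, assuming local shell access to a cp host; the unit name comes from the conversation, everything else is illustrative.)

```python
# Sketch: read systemd's NRestarts counter for a unit, as checked at 14:48:29.
# Assumes local shell access on the host; purely illustrative.
import subprocess


def unit_restart_count(unit: str = "haproxy.service") -> int:
    """Return systemd's NRestarts counter for the given unit."""
    out = subprocess.run(
        ["systemctl", "show", "-p", "NRestarts", "--value", unit],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())


if __name__ == "__main__":
    print(unit_restart_count())
```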