[01:25:03] FIRING: PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:29:40] FIRING: SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.34.service on moss-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:25:03] FIRING: PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:29:40] FIRING: SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.34.service on moss-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:00:34] what is up with all these alerts and the ones in the -feed channel? They've been going on for days :-/
[06:18:45] I am going to switch s4 codfw master
[06:57:14] btullis: We've had db1208 and dbstore1009 alerting for mysqld-prometheus-exporter for like a week, it is probably because of https://phabricator.wikimedia.org/T371049 (I don't know if there's more to it - I just came back from holidays). Can you please silence those alerts?
[08:16:08] marostegui: I'll get to the ones on ceph/swift nodes later this morning. We do have a ticket open with observability about the noise (which I think has some workarounds including disabling the lot)
[08:16:27] yeah, I think I've commented on that ticket once or twice
[08:16:47] thanks though Emperor
[08:20:49] my expectation is that it's disk failures (but I've not yet looked)
[08:23:44] marostegui: I think that ben is out this week
[08:24:10] volans: Thanks, I will silence them myself then
[08:24:42] I can too if you want :) I left them to not get forgotten in case no one replied to the task
[08:24:46] p.s. welcome back
[08:25:45] done :)
[09:25:03] FIRING: PuppetFailure: Puppet has failed on ms-be1056:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:29:40] FIRING: SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.34.service on moss-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:14:25] RESOLVED: SystemdUnitFailed: ceph-59ea825c-2a67-11ef-9c1c-bc97e1bbace4@osd.34.service on moss-be2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:19:27] T371234 for the Ceph disk failure
[10:19:28] T371234: Disk (sdk) failed on moss-be2002 - https://phabricator.wikimedia.org/T371234
[10:27:37] The ms-be disk was sort-of auto-flagged as T371192 which I've updated to hopefully be useful to the DC team; I'll silence the alert shortly.
[10:27:38] T371192: Disk (sdh) failed on ms-be1056 - https://phabricator.wikimedia.org/T371192
[10:28:17] Emperor: do you know why the automatic task for broken disk was not created?
[10:41:24] volans: which? T371192 was auto-created, it's just that the content wasn't very useful
[10:41:25] T371192: Disk (sdh) failed on ms-be1056 - https://phabricator.wikimedia.org/T371192
[10:42:07] I meant for moss-be2002 (T371234)
[10:42:07] T371234: Disk (sdk) failed on moss-be2002 - https://phabricator.wikimedia.org/T371234
[10:42:58] Oh, that's a JBOD system. In any case, it's my experience that task auto-creation is the exception rather than the rule because typically the RAID controller hasn't actually noticed the disk is bad in a useful manner
[12:26:17] FYI (cc kwakuofori) the gas company will show up to debug/change the gas meter anytime today between 12 and 16 local. Murphy's law dictates that because they didn't show up during my earlier meeting nor at lunch, they will clearly show up during the team meeting :D I might disappear suddenly...
[12:37:11] https://phabricator.wikimedia.org/T371250 it works :P
[12:38:22] Amir1: Great, I will disable mine then