[07:06:52] btullis: I created https://phabricator.wikimedia.org/T392980 but not sure if the tags are correct
[07:11:12] marostegui: Great, thanks. I'll pick that up.
[07:11:18] thanks!
[08:04:15] Can I get a +1 to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140121 please? it crashed again overnight despite dc-ops best efforts, so I think the time has come to fail it out of the rings
[08:05:33] thanks :)
[08:05:37] np
[08:40:29] Similarly (but it'll take a while before I can merge it) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140130 to do the remove-once-drained of this node, please?
[08:41:25] FIRING: SystemdUnitFailed: swift_ring_manager.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:53:02] ^-- that was me, I stopped the running one so I could push through my immediate-only change
[08:56:25] RESOLVED: SystemdUnitFailed: swift_ring_manager.service on ms-fe1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:48:05] regarding https://phabricator.wikimedia.org/T392627 (Puppet connection failure on db1178) it looks like the issue is entirely on the host side and not on the Puppet API, do we want to investigate further what happened on the host?
[09:51:25] FIRING: SystemdUnitFailed: puppet-agent-timer.service on ms-be1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:09:45] ^-- that has resolved I think; the host is having some timeouts, which I hope isn't the sign of another ms-be node in eqiad losing its disk controller
[10:11:25] RESOLVED: SystemdUnitFailed: puppet-agent-timer.service on ms-be1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:46:25] FIRING: SystemdUnitFailed: puppet-agent-timer.service on ms-be1063:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:52:19] o.O
[10:54:10] (xfs_admin timeout)
[10:55:56] re-run went fine
[11:01:25] RESOLVED: SystemdUnitFailed: puppet-agent-timer.service on ms-be1063:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:42:56] Sorry for more code review, but: could I get a +1 for https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140206 please? changes for the new thanos-be nodes so they (hopefully!) install properly
[16:41:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2162:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:51:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-mysqld-exporter.service on db2162:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:38:48] FIRING: PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:58:48] FIRING: [2x] PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:38:48] FIRING: [2x] PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:48:48] RESOLVED: [2x] PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:38:48] FIRING: PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:58:48] FIRING: [2x] PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:14:25] FIRING: SystemdUnitFailed: puppet-agent-timer.service on ms-be1063:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:34:25] RESOLVED: SystemdUnitFailed: puppet-agent-timer.service on ms-be1063:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed