[02:58:48] FIRING: [2x] PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[05:13:25] FIRING: SystemdUnitFailed: puppet-agent-timer.service on ms-be1063:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:33:25] RESOLVED: SystemdUnitFailed: puppet-agent-timer.service on ms-be1063:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:58:48] FIRING: [2x] PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:45:08] hi folks, I'm still looking for a review of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140130 please (remove ms-be1060 from rings entirely once drained)
[10:58:48] FIRING: [2x] PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:07:21] ^-- expected?
[11:29:17] Emperor: it looks related to the gerrit outage as the repos are trying to be checked out
[11:29:27] sobanski: ^ is everything supposed to be fixed?
[11:43:11] That was the assessment yesterday
[11:47:09] What repository is it checking out?
[11:48:09] sobanski: https://puppetboard.wikimedia.org/report/db1155.eqiad.wmnet/1e4c6d2ce0537395c52a7620e0d6c403eb72a9b2
[11:48:37] Sadly, I no longer have access to puppetboard
[11:48:58] Which I may have to rethink
[11:51:44] /Stage[main]/Profile::Wmcs::Db::Scriptconfig/Git::Clone[operations/mediawiki-config]/Exec[git_pull_operations/mediawiki-config]/returns is the thing that's failing
[11:51:57] '/usr/bin/git pull --recurse-submodules --quiet' returned 128 instead of one of [0]
[11:54:46] So that's indeed the problematic repo from yesterday
[11:57:48] running git pull of that repo from my laptop does work
[11:59:19] Same here
[12:00:14] Could you try from db1155?
[12:30:30] so pace hashar's comment on that change, it looks like it _did_ get applied to some versions of that repo and then deployed onto some hosts. All of which will presumably be similarly stuck now
[12:32:39] sobanski: this probably warrants a ticket to track; I //guess// the correct answer is to discard the erroneous commit on the impacted hosts and re-pull.
[12:33:07] (but I'm no expert on the impact to the hosts of doing that, rather than the git surgery, IYSWIM)
[12:34:22] I captured it in the task we still have open for yesterday's issue. I'm meeting with Tyler later today so we'll chat about it
[12:35:07] thanks :)
[12:41:49] sobanski: you have the task handy?
[12:42:12] https://phabricator.wikimedia.org/T393034
[12:48:29] Thanks
[13:18:25] FIRING: SystemdUnitFailed: puppet-agent-timer.service on ms-be1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:38:25] RESOLVED: SystemdUnitFailed: puppet-agent-timer.service on ms-be1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:58:49] FIRING: [2x] PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:42:22] o/ any objections to me migrating purge_parsercache_p1 to Kubernetes ~nowish and I'll review how it behaved tomorrow morning (and revert if necessary)? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1139422
[15:42:54] based on how things have gone with migrations so far the odds of it behaving differently are very low
[15:43:04] but I am aware it's later on a thursday :)
[15:50:43] marostegui / federico3 : there are four db* servers and 5 clouddb* servers that need some git-fettling to unbreak puppet on them (see T393034 for details). I presume we should do this? Do you want to, would you like me to, should we be getting someone else to...?
[15:50:44] T393034: Investigate out of date refs following gerrit switchover - https://phabricator.wikimedia.org/T393034
[18:33:49] FIRING: [3x] PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[22:33:49] FIRING: [3x] PuppetFailure: Puppet has failed on db1155:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure