[00:00:33] <Amir1>	 kamila_: thank you for writing it
[00:00:47] <kamila_>	 of course
[00:01:11] <kamila_>	 (actually, I should change it to say a few hours, I misread)
[00:04:35] <kamila_>	 also, I assume we should make an incident doc? if so, I'm going to go ahead
[00:12:19] <thcipriani>	 I can make a patch to output an error and return for namespaceDupes for the time being for wmf/1.42.0-wmf.3 if that's helpful?
[00:14:57] <kamila_>	 +1 to that
[00:18:03] <icinga-wm>	 PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:18:19] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 #page on db2171 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:19:22] <wikibugs>	 (03CR) 10Gergő Tisza: Generalize Meta/Commons exceptions for CentralAuth cookie handling (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza)
[00:19:37] <kamila_>	 thcipriani: do you know when you started the script? I am assuming approx 21:16, is that correct?
[00:20:37] <Kizule>	 I think it was later, since there was config patch related to math.
[00:20:51] <Kizule>	 Which was deployed before executing the script.
[00:20:53] <Kizule>	 Let me check.
[00:22:10] <thcipriani>	 kamila_: started ~20:54, hung ~20:59, killed ~21:32, replag started 22:25
[00:22:19] <kamila_>	 thanks!
[00:22:39] <wikibugs>	 (03CR) 10Gergő Tisza: mobile: Add MobileUrlCallback (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969401 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza)
[00:23:19] <kamila_>	 Kizule: fyi, we're using UTC time :-)
[00:23:21] <wikibugs>	 (03PS11) 10Bartosz Dziewoński: Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza)
[00:23:40] <Kizule>	 kamila_: I know, I tried to refer to it.
[00:23:41] <wikibugs>	 (03CR) 10Bartosz Dziewoński: Generalize Meta/Commons exceptions for CentralAuth cookie handling (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza)
[00:24:01] <kamila_>	 ok, sorry and thanks
[00:24:08] <kamila_>	 (timezones are hard :D)
[00:24:30] <Kizule>	 kamila_: No problem. How is replication going?
[00:24:38] <thcipriani>	 oh good. phan is flagging that I made code unreachable. Of course this is what I was trying to do.
[00:25:12] <Kizule>	 btw, I think it should go only in the wmf.3 branch, not in master, right?
[00:25:31] <Kizule>	 And then cherry-pick it to wmf.4 next week?
[00:25:56] <Kizule>	 We don't want to make it fatal for non-WMF wikis which are downloading from the master branch.
[00:25:58] <thcipriani>	 or merge to master for now, cherry pick to wmf.3 and deploy it. That way no one has to remember next week.
[00:25:59] <kamila_>	 Kizule: I'm not a database person, so I don't know much beyond the RECOVERY messages you see here 
[00:26:21] <kamila_>	 but it's going :D but there are quite a few shards left still
[00:26:44] <Kizule>	 You've already answered, even if you are not a database person.
[00:26:45] <Kizule>	 :D
[00:26:54] <Kizule>	 That's what I wanted to hear, that it's going.
[00:27:07] <kamila_>	 yeah
[00:27:38] <kamila_>	 if I'm reading things correctly, we have 4 "major" ones left
[00:28:06] <kamila_>	 out of 10
[00:30:37] <icinga-wm>	 RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:51] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 #page on db2123 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:36:32] <wikibugs>	 (03PS1) 10Thcipriani: Disable namespaceDupes.php for now [core] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/971287 (https://phabricator.wikimedia.org/T350443)
[00:39:04] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/970839
[00:39:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/970839 (owner: 10TrainBranchBot)
[00:40:11] <Kizule>	 Great to see that it's recovering even more. :)
[00:40:49] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 #page on db2111 is OK: OK slave_sql_lag Replication lag: 0.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:44:37] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:46:42] <Amir1>	 now eqiad replicas are choking on it, it'll recover in an hour or so
[00:50:18] <Amir1>	 Time: 188 on db1200 🫠
[00:50:30] <kamila_>	 ugh -_-
[00:50:35] <kamila_>	 it's quite late over here, OK if I disappear?
[00:50:49] <kamila_>	 or is there anything else I can help with?
[00:50:52] <Amir1>	 yeah yeah
[00:50:55] <Amir1>	 go rest
[00:51:17] <kamila_>	 ok, fingers crossed that things will finish in a reasonable amount of time
[00:51:22] <kamila_>	 o/
[00:51:29] <Amir1>	 yeah, codfw has fully recovered 
[00:51:35] <kamila_>	 yep
[00:52:53] <thcipriani>	 alright. I'm going to deploy this wmf.3 change for the time being, so we stop anyone from running namespaceDupes.
[00:53:54] <Amir1>	 thcipriani: I +2'ed the master patch, thank you for making those
[00:53:57] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Disable namespaceDupes.php for now [core] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/971287 (https://phabricator.wikimedia.org/T350443) (owner: 10Thcipriani)
[00:54:41] <thcipriani>	 sure thing
[00:55:39] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1183 is OK: OK slave_sql_lag Replication lag: 0.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:57:39] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/970839 (owner: 10TrainBranchBot)
[00:59:19] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1144 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 824.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:59:45] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1213 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 850.89 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:59:49] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1161 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 856.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:59:53] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1185 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 858.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:00:01] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1130 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 868.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:00:09] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1210 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 875.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:00:20] <Kizule>	 This is complaining about eqiad now, right?
[01:00:29] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1230 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 895.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:00:31] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s5 on db1200 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 897.60 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:01:32] <thcipriani>	 yes, these are all in eqiad, server names begin with a 1 instead of a 2, e.g., db1200
[01:04:37] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[01:06:18] <Kizule>	 thcipriani: Thanks for letting me know. Happy to learn more. :)
[01:06:24] <Kizule>	 I'm sorry for causing this mess.
[01:09:20] <thcipriani>	 people don't cause messes. failed systems cause messes.
[01:10:12] <thcipriani>	 it used to be much easier to break things, the more we broke things the stronger our systems have become over time.
[01:10:37] <thcipriani>	 one more thing to make harder to break.
[01:11:56] <wikibugs>	 (03Merged) 10jenkins-bot: Disable namespaceDupes.php for now [core] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/971287 (https://phabricator.wikimedia.org/T350443) (owner: 10Thcipriani)
[01:13:05] <logmsgbot>	 !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:971287|Disable namespaceDupes.php for now (T350443)]]
[01:13:09] <stashbot>	 T350443: namespaceDupes.php doesn't have limit on write queries - https://phabricator.wikimedia.org/T350443
[01:14:25] <logmsgbot>	 !log thcipriani@deploy2002 thcipriani: Backport for [[gerrit:971287|Disable namespaceDupes.php for now (T350443)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[01:17:15] <Kizule>	 thcipriani: You are actually right, so I'll look at this as a "good thing".
[01:17:25] <Kizule>	 Good thing that we discovered this before it was too late.
[01:18:11] <logmsgbot>	 !log thcipriani@deploy2002 thcipriani: Continuing with sync
[01:19:53] <Kizule>	 I've unscheduled my task from the window.
[01:23:34] <logmsgbot>	 !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:971287|Disable namespaceDupes.php for now (T350443)]] (duration: 10m 29s)
[01:23:50] <stashbot>	 T350443: namespaceDupes.php doesn't have limit on write queries - https://phabricator.wikimedia.org/T350443
[01:24:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:27:06] <thcipriani>	 alright, with that deploy I'm stepping away. I'll continue to watch page lag and feel bad, but nothing else I can think to do at the moment.
[01:34:19] <kamila_>	 thcipriani: didn't you _just_ say it's not people's fault? ;-) don't feel bad, if anything, you might be eligible for a sticker, since you did fix it :-D
[01:34:52] <Kizule>	 Do I get a sticker for discovering this as well? Or does it only apply to fixing? ;)
[01:35:31] <kamila_>	 There's an "I broke Wikipedia and then I fixed it" sticker :-D
[01:36:06] <kamila_>	 But if you want generic wiki stickers, I have too many :-D
[01:37:16] <Kizule>	 I think I'm entitled to that one, since this wouldn't have happened if I hadn't asked for the script to be run in the first place. ;)
[01:40:36] <kamila_>	 That can be arranged :-D
[01:45:12] <jinxer-wm>	 (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[01:45:38] <Kizule>	 kamila_: That would be great! ;)
[01:45:49] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew)
[01:53:05] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1216 is OK: OK slave_sql_lag Replication lag: 0.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:01:37] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1210 is OK: OK slave_sql_lag Replication lag: 58.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:02:45] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1185 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:03:43] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:04:05] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1161 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:04:17] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:04:19] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:04:29] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1154 is OK: OK slave_sql_lag Replication lag: 0.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:07:47] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1144 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:10:21] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1230 is OK: OK slave_sql_lag Replication lag: 59.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:12:25] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1213 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:15:57] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on dbstore1003 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:21:07] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.371 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:22:23] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:29] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1145 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:34:01] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1200 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:38:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:42:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.186 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:44:49] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:54:11] <Kizule>	 db1130:9104 went up to 2 hours; I haven't seen any other replica take that long.
[02:57:35] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.197 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:58:53] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:03:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:17:15] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.376 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:18:33] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:32:43] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.397 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:34:01] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:42:37] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.283 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:43:53] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:51:17] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:59:17] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s5 on db1130 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[04:49:37] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:09:37] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[05:27:39] <wikibugs>	 (03PS1) 10Andrea Denisse: pontoon: Set profile::base::additional_purged_packages to be empty [puppet] - 10https://gerrit.wikimedia.org/r/971345 (https://phabricator.wikimedia.org/T347665)
[05:29:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] pontoon: Set profile::base::additional_purged_packages to be empty [puppet] - 10https://gerrit.wikimedia.org/r/971345 (https://phabricator.wikimedia.org/T347665) (owner: 10Andrea Denisse)
[05:33:17] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.365 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:35:59] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.136 second response time https://wikitech.wikimedia.org/wiki/Swift
[05:45:27] <jinxer-wm>	 (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231103T0600)
[06:29:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:29:25] <wikibugs>	 (03PS1) 10Marostegui: install_server: Do not reimage db1233 [puppet] - 10https://gerrit.wikimedia.org/r/971347
[06:34:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:36:50] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1233 [puppet] - 10https://gerrit.wikimedia.org/r/971347 (owner: 10Marostegui)
[06:38:24] <wikibugs>	 (03PS1) 10Marostegui: production-parsercache.sql.erb: Minor comment [puppet] - 10https://gerrit.wikimedia.org/r/971350
[06:41:32] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] production-parsercache.sql.erb: Minor comment [puppet] - 10https://gerrit.wikimedia.org/r/971350 (owner: 10Marostegui)
[06:54:46] <wikibugs>	 (03PS2) 10Andrea Denisse: pontoon: Set additional_purged_packages to be empty [puppet] - 10https://gerrit.wikimedia.org/r/971345 (https://phabricator.wikimedia.org/T347665)
[06:55:59] <wikibugs>	 (03PS3) 10Andrea Denisse: pontoon: Set additional_purged_packages to be empty [puppet] - 10https://gerrit.wikimedia.org/r/971345 (https://phabricator.wikimedia.org/T347665)
[07:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231103T0700)
[07:24:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:33:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] pontoon: Set additional_purged_packages to be empty [puppet] - 10https://gerrit.wikimedia.org/r/971345 (https://phabricator.wikimedia.org/T347665) (owner: 10Andrea Denisse)
[07:34:19] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+2] pontoon: Set additional_purged_packages to be empty [puppet] - 10https://gerrit.wikimedia.org/r/971345 (https://phabricator.wikimedia.org/T347665) (owner: 10Andrea Denisse)
[07:45:30] <wikibugs>	 (03PS1) 10Filippo Giunchedi: alertmanager: route o11y alerts [puppet] - 10https://gerrit.wikimedia.org/r/971357
[07:47:40] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/971357 (owner: 10Filippo Giunchedi)
[07:51:17] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:01:15] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "I know this is a massive patch, sorry :-) hopefully the PCC should make reviewing easier, since no hosts have any changes to system resour" [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah)
[08:02:37] <wikibugs>	 (03CR) 10Majavah: "profile::pontoon::provider::cloud_vps already installs a different DHCP client, does that mean that with this patch there will be two?" [puppet] - 10https://gerrit.wikimedia.org/r/971345 (https://phabricator.wikimedia.org/T347665) (owner: 10Andrea Denisse)
[08:04:51] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to root for dzahn - https://phabricator.wikimedia.org/T350435 (10MoritzMuehlenhoff) No need for an access request, you can simply make a revert of your original patch to drop your access? If you changed the SSH key you can simply send it to a colleague via an ou...
[08:07:11] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.189 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:08:27] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:10:14] <wikibugs>	 (03PS1) 10Muehlenhoff: Extend MOU for aitolkyn [puppet] - 10https://gerrit.wikimedia.org/r/971386
[08:12:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Extend MOU for aitolkyn [puppet] - 10https://gerrit.wikimedia.org/r/971386 (owner: 10Muehlenhoff)
[08:13:25] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to `discovery.processed_external_sparql_query` for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T350426 (10karapayneWMDE) Request approved by myself
[08:14:03] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.367 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:14:07] <wikibugs>	 (03PS2) 10Elukey: changeprop: set num_workers to zero [deployment-charts] - 10https://gerrit.wikimedia.org/r/971225 (https://phabricator.wikimedia.org/T348950)
[08:15:21] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:21:03] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove access for cec [puppet] - 10https://gerrit.wikimedia.org/r/971389
[08:23:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove access for cec [puppet] - 10https://gerrit.wikimedia.org/r/971389 (owner: 10Muehlenhoff)
[08:34:06] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:36:29] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to `discovery.processed_external_sparql_query` for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T350426 (10JMeybohm) 05Open→03Stalled From the conversation in slack it seems unclear how this should be solved. @ottomata and @mpopov suggested changing t...
[08:40:41] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:41:04] <jynus>	 ^ jayme :-)
[08:41:17] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2035 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[08:43:36] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:48:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Failover to testreduce1002 [dns] - 10https://gerrit.wikimedia.org/r/971392
[08:48:30] <wikibugs>	 (03PS2) 10Muehlenhoff: Failover to testreduce1002 [dns] - 10https://gerrit.wikimedia.org/r/971392
[08:52:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Failover to testreduce1002 [dns] - 10https://gerrit.wikimedia.org/r/971392 (owner: 10Muehlenhoff)
[08:54:37] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[08:58:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:59:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] profile::ci::php Also add the icu67 component following what was done for prod [puppet] - 10https://gerrit.wikimedia.org/r/971195 (https://phabricator.wikimedia.org/T345561) (owner: 10Muehlenhoff)
[09:04:27] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.279 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:05:45] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Swift
[09:10:48] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Revert "admin: remove old ssh key from user dzahn" [puppet] - 10https://gerrit.wikimedia.org/r/971167 (owner: 10Dzahn)
[09:11:36] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to root for dzahn - https://phabricator.wikimedia.org/T350435 (10JMeybohm) 05Open→03Resolved a:03JMeybohm I did merge your revert. Welcome back!
[09:14:37] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[09:40:55] <wikibugs>	 10SRE: Set nofail for raid0 recipes - https://phabricator.wikimedia.org/T350461 (10fgiunchedi)
[09:45:27] <jinxer-wm>	 (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[09:55:14] <wikibugs>	 (03PS1) 10Muehlenhoff: Revert "Failover to testreduce1002" [dns] - 10https://gerrit.wikimedia.org/r/971397
[09:57:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Revert "Failover to testreduce1002" [dns] - 10https://gerrit.wikimedia.org/r/971397 (owner: 10Muehlenhoff)
[09:59:10] <Emperor>	 !log roll-restart swift frontends
[09:59:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:46] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe
[10:02:47] <wikibugs>	 10SRE-swift-storage, 10API Platform, 10Commons, 10MediaWiki-File-management, and 4 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Ammarpad)
[10:03:33] <wikibugs>	 10SRE-swift-storage, 10API Platform, 10Commons, 10MediaWiki-File-management, and 4 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Soda)
[10:08:41] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe
[10:29:57] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] "looking good" [puppet] - 10https://gerrit.wikimedia.org/r/957720 (owner: 10Majavah)
[10:34:23] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1011.eqiad.wmnet with OS bookworm
[10:36:04] <wikibugs>	 (PS2) Majavah: acme_chief: Make http_proxy optional [puppet] - https://gerrit.wikimedia.org/r/957720
[10:36:06] <wikibugs>	 (PS2) Majavah: acme_chief: remove backwards compat [puppet] - https://gerrit.wikimedia.org/r/957721
[10:37:58] <wikibugs>	 (PS3) Majavah: acme_chief: Make http_proxy optional [puppet] - https://gerrit.wikimedia.org/r/957720
[10:38:00] <wikibugs>	 (PS3) Majavah: acme_chief: remove backwards compat [puppet] - https://gerrit.wikimedia.org/r/957721
[10:41:27] <wikibugs>	 (CR) Majavah: [C: +2] acme_chief: Make http_proxy optional (1 comment) [puppet] - https://gerrit.wikimedia.org/r/957720 (owner: Majavah)
[10:50:29] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1011.eqiad.wmnet with reason: host reimage
[10:53:00] <wikibugs>	 (PS1) Physikerwelt: mathoid: update version [deployment-charts] - https://gerrit.wikimedia.org/r/971400 (https://phabricator.wikimedia.org/T350004)
[10:53:14] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1011.eqiad.wmnet with reason: host reimage
[10:58:39] <wikibugs>	 (PS1) Majavah: hieradata: cloudgw: drop nfs-maps [puppet] - https://gerrit.wikimedia.org/r/971401 (https://phabricator.wikimedia.org/T350259)
[11:07:54] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1011.eqiad.wmnet with OS bookworm
[11:08:44] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bookworm
[11:12:03] <wikibugs>	 SRE, SRE-Unowned: Set nofail for raid0 recipes - https://phabricator.wikimedia.org/T350461 (JMeybohm)
[11:13:08] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bookworm
[11:21:29] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage
[11:24:13] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage
[11:26:06] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/970844
[11:26:08] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/970845
[11:29:33] <wikibugs>	 (CR) EoghanGaffney: "This change is ready for review." (7 comments) [puppet] - https://gerrit.wikimedia.org/r/971187 (https://phabricator.wikimedia.org/T347593) (owner: EoghanGaffney)
[11:30:10] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1010.eqiad.wmnet with reason: host reimage
[11:33:20] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1010.eqiad.wmnet with reason: host reimage
[11:44:47] <wikibugs>	 (CR) Daniel Kinzler: [C: +1] "Looks good to me technically." [mediawiki-config] - https://gerrit.wikimedia.org/r/971244 (https://phabricator.wikimedia.org/T311620) (owner: Physikerwelt)
[11:49:19] <logmsgbot>	 !log jynus@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1010.eqiad.wmnet with OS bookworm
[11:49:41] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bookworm
[11:51:17] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:54:46] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1006.eqiad.wmnet with OS bookworm
[12:01:18] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:01:44] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:03:59] <wikibugs>	 (PS1) Majavah: interface: attempt to resolve ordering issues with tagged interfaces [puppet] - https://gerrit.wikimedia.org/r/971406
[12:04:47] <wikibugs>	 (PS1) Hnowlan: wikifeeds: bump image [deployment-charts] - https://gerrit.wikimedia.org/r/971407 (https://phabricator.wikimedia.org/T349517)
[12:08:31] <wikibugs>	 (PS5) Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - https://gerrit.wikimedia.org/r/969324
[12:10:48] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:11:48] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:11:58] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:12:26] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:13:13] <wikibugs>	 (PS1) Jbond: resolvconf: add nameservr_ips [puppet] - https://gerrit.wikimedia.org/r/971409 (https://phabricator.wikimedia.org/T350008)
[12:13:15] <wikibugs>	 (PS1) Jbond: dynamicproxy: update to pull ips from proile::resolving [puppet] - https://gerrit.wikimedia.org/r/971410 (https://phabricator.wikimedia.org/T350008)
[12:13:40] <wikibugs>	 (CR) Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft (1 comment) [puppet] - https://gerrit.wikimedia.org/r/969324 (owner: Muehlenhoff)
[12:15:07] <wikibugs>	 (CR) CI reject: [V: -1] resolvconf: add nameservr_ips [puppet] - https://gerrit.wikimedia.org/r/971409 (https://phabricator.wikimedia.org/T350008) (owner: Jbond)
[12:15:30] <wikibugs>	 (CR) CI reject: [V: -1] dynamicproxy: update to pull ips from proile::resolving [puppet] - https://gerrit.wikimedia.org/r/971410 (https://phabricator.wikimedia.org/T350008) (owner: Jbond)
[12:17:07] <wikibugs>	 (PS6) Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - https://gerrit.wikimedia.org/r/969324
[12:17:27] <wikibugs>	 (CR) Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft (1 comment) [puppet] - https://gerrit.wikimedia.org/r/969324 (owner: Muehlenhoff)
[12:17:28] <logmsgbot>	 !log jynus@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host backup1010.eqiad.wmnet with OS bookworm
[12:17:50] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bookworm
[12:23:18] <wikibugs>	 (PS1) Jbond: acme_chief::cloud: update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971411 (https://phabricator.wikimedia.org/T350008)
[12:23:53] <wikibugs>	 (CR) CI reject: [V: -1] acme_chief::cloud: update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971411 (https://phabricator.wikimedia.org/T350008) (owner: Jbond)
[12:27:09] <wikibugs>	 (PS1) Muehlenhoff: mailman: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/971412
[12:27:10] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Netbox PuppetDB Import Script Failing for cloudnet2006 - https://phabricator.wikimedia.org/T350479 (cmooney) p:Triage→Medium
[12:27:56] <wikibugs>	 (PS1) Jbond: toolforge::docker::registry: update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971413 (https://phabricator.wikimedia.org/T350008)
[12:29:50] <wikibugs>	 (CR) CI reject: [V: -1] toolforge::docker::registry: update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971413 (https://phabricator.wikimedia.org/T350008) (owner: Jbond)
[12:30:53] <wikibugs>	 (PS1) Jbond: toolforge::legacy_redirector: don't use the nameservers global [puppet] - https://gerrit.wikimedia.org/r/971414 (https://phabricator.wikimedia.org/T350008)
[12:32:34] <wikibugs>	 (PS1) Jbond: toolforge::static: don't use the nameservers global [puppet] - https://gerrit.wikimedia.org/r/971415 (https://phabricator.wikimedia.org/T350008)
[12:32:43] <wikibugs>	 (CR) CI reject: [V: -1] toolforge::legacy_redirector: don't use the nameservers global [puppet] - https://gerrit.wikimedia.org/r/971414 (https://phabricator.wikimedia.org/T350008) (owner: Jbond)
[12:34:23] <wikibugs>	 (PS1) Jbond: labs::ores::redisproxy: drop unused role [puppet] - https://gerrit.wikimedia.org/r/971417 (https://phabricator.wikimedia.org/T350008)
[12:34:51] <wikibugs>	 (CR) CI reject: [V: -1] toolforge::static: don't use the nameservers global [puppet] - https://gerrit.wikimedia.org/r/971415 (https://phabricator.wikimedia.org/T350008) (owner: Jbond)
[12:35:03] <wikibugs>	 (PS1) Majavah: openstack: neutron: remove unnecessary refreshonly [puppet] - https://gerrit.wikimedia.org/r/971418
[12:36:00] <wikibugs>	 (PS1) Jbond: scap::target:  update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971419 (https://phabricator.wikimedia.org/T350008)
[12:37:20] <wikibugs>	 (CR) FNegri: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/971418 (owner: Majavah)
[12:38:23] <wikibugs>	 (CR) CI reject: [V: -1] scap::target:  update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971419 (https://phabricator.wikimedia.org/T350008) (owner: Jbond)
[12:39:13] <wikibugs>	 (PS1) Jbond: wikilabels::db_proxy: update to use dnsquery::a [puppet] - https://gerrit.wikimedia.org/r/971422 (https://phabricator.wikimedia.org/T350008)
[12:39:15] <wikibugs>	 (PS1) Jbond: realm.pp: drop namservers global [puppet] - https://gerrit.wikimedia.org/r/971423 (https://phabricator.wikimedia.org/T350008)
[12:42:01] <wikibugs>	 (PS1) Muehlenhoff: profile::openstack::base::puppetmaster::frontend: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/971447
[12:42:16] <wikibugs>	 (PS2) Muehlenhoff: profile::openstack::base::puppetmaster::frontend: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/971447
[12:42:16] <logmsgbot>	 !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1010.eqiad.wmnet with reason: host reimage
[12:42:27] <wikibugs>	 (CR) Jbond: "not tested but lgtm" [puppet] - https://gerrit.wikimedia.org/r/969324 (owner: Muehlenhoff)
[12:44:21] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:45:15] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1010.eqiad.wmnet with reason: host reimage
[12:48:04] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudnet1006.eqiad.wmnet
[12:54:16] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1006.eqiad.wmnet
[12:54:48] <wikibugs>	 (CR) FNegri: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/971162 (owner: Majavah)
[12:56:52] <wikibugs>	 (PS1) Muehlenhoff: profile::openstack::base::designate::service: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/971449
[12:57:19] <wikibugs>	 (CR) Majavah: [V: +1 C: +2] P:openstack::base: fix project_grants ordering [puppet] - https://gerrit.wikimedia.org/r/971162 (owner: Majavah)
[12:57:41] <wikibugs>	 (PS1) EoghanGaffney: [apt-staging] Add apt_staging role to new staging vm [puppet] - https://gerrit.wikimedia.org/r/971450 (https://phabricator.wikimedia.org/T347004)
[12:58:27] <wikibugs>	 (PS2) Majavah: openstack: neutron: remove unnecessary refreshonly [puppet] - https://gerrit.wikimedia.org/r/971418
[12:58:32] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/971450 (https://phabricator.wikimedia.org/T347004) (owner: EoghanGaffney)
[12:58:45] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:59:04] <wikibugs>	 (CR) Majavah: [C: +2] openstack: neutron: remove unnecessary refreshonly [puppet] - https://gerrit.wikimedia.org/r/971418 (owner: Majavah)
[12:59:19] <wikibugs>	 (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/313/console" [puppet] - https://gerrit.wikimedia.org/r/971450 (https://phabricator.wikimedia.org/T347004) (owner: EoghanGaffney)
[12:59:37] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[12:59:57] <wikibugs>	 (PS2) Majavah: interface: attempt to resolve ordering issues with tagged interfaces [puppet] - https://gerrit.wikimedia.org/r/971406
[13:00:12] <wikibugs>	 (CR) Majavah: [V: +1] "PCC: https://puppet-compiler.wmflabs.org/output/971406/312/" [puppet] - https://gerrit.wikimedia.org/r/971406 (owner: Majavah)
[13:00:44] <logmsgbot>	 !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1010.eqiad.wmnet with OS bookworm
[13:01:33] <wikibugs>	 (CR) Majavah: [C: +1] "thanks!" [puppet] - https://gerrit.wikimedia.org/r/971447 (owner: Muehlenhoff)
[13:05:22] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/971449 (owner: Muehlenhoff)
[13:10:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: setup in progress
[13:10:56] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: setup in progress
[13:34:16] <wikibugs>	 (CR) Elukey: [C: +1] labs::ores::redisproxy: drop unused role [puppet] - https://gerrit.wikimedia.org/r/971417 (https://phabricator.wikimedia.org/T350008) (owner: Jbond)
[13:35:59] <wikibugs>	 (CR) Herron: [C: +1] alertmanager: route o11y alerts [puppet] - https://gerrit.wikimedia.org/r/971357 (owner: Filippo Giunchedi)
[13:36:31] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/971455 (owner: Muehlenhoff)
[13:43:26] <wikibugs>	 (CR) Jelto: [V: +1] "This change removes the old runner registration workflow and uses the new v4 api workflow. Unfortunately some of the settings can no longe" [puppet] - https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) (owner: Jelto)
[13:45:27] <jinxer-wm>	 (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[13:48:07] <wikibugs>	 (CR) Kamila Součková: [C: +1] rest-gateway: change how AQS URLs enforce wikimedia.org domain [deployment-charts] - https://gerrit.wikimedia.org/r/971456 (https://phabricator.wikimedia.org/T348731) (owner: Hnowlan)
[13:50:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudelastic1005.wikimedia.org
[13:56:39] <wikibugs>	 SRE, ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (Jclark-ctr) Open→Resolved
[13:56:43] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (Jclark-ctr)
[14:02:58] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudelastic1005.wikimedia.org
[14:03:26] <icinga-wm>	 PROBLEM - Check systemd state on cloudelastic1005 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:04:46] <inflatador>	 ^^ Not sure about this one, I'm not seeing any failed units
[14:06:16] <icinga-wm>	 RECOVERY - Check systemd state on cloudelastic1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:42] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - https://gerrit.wikimedia.org/r/969324 (owner: Muehlenhoff)
[14:16:59] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 (cmooney) p:Triage→Medium
[14:22:20] <wikibugs>	 (PS1) Muehlenhoff: grafana: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/971458
[14:27:07] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm
[14:32:34] <wikibugs>	 (CR) Hnowlan: [C: +1] "Good find!" [deployment-charts] - https://gerrit.wikimedia.org/r/971225 (https://phabricator.wikimedia.org/T348950) (owner: Elukey)
[14:36:54] <wikibugs>	 (CR) Hnowlan: [C: +1] changeprop: allow to define Kafka settings for Job Queues [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: Elukey)
[14:38:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:40:46] <topranks>	 !log adding irb interface in private1-a-codfw vlan to ssw1-a1-codfw T347191
[14:40:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:40:50] <stashbot>	 T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191
[14:43:36] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[14:44:57] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr)
[14:46:07] <wikibugs>	 (PS1) Jbond: realm.pp: drop $other_site global [puppet] - https://gerrit.wikimedia.org/r/971461 (https://phabricator.wikimedia.org/T350008)
[14:49:07] <wikibugs>	 (CR) EoghanGaffney: [C: +2] [apt-staging] Add apt_staging role to new staging vm [puppet] - https://gerrit.wikimedia.org/r/971450 (https://phabricator.wikimedia.org/T347004) (owner: EoghanGaffney)
[14:50:05] <topranks>	 !log moving cr1-codfw <-> ssw1-a1-codfw EBGP session to private1-b-codfw IPs T347191
[14:50:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:10] <stashbot>	 T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191
[14:50:56] <icinga-wm>	 PROBLEM - BGP status on ssw1-a1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:51:07] <wikibugs>	 (CR) Hnowlan: Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) (owner: BPirkle)
[14:51:14] <topranks>	 ^^ that's me, acking for now 
[14:51:15] <wikibugs>	 (PS1) Jbond: realm.pp: drop use_puppetdb global [puppet] - https://gerrit.wikimedia.org/r/971463 (https://phabricator.wikimedia.org/T350008)
[14:51:17] <wikibugs>	 (PS1) Jbond: realm.pp: remove old comments [puppet] - https://gerrit.wikimedia.org/r/971464 (https://phabricator.wikimedia.org/T350008)
[14:51:31] <sukhe>	 topranks: wait, you can ACK BGP alerts? please share the wisdom!
[14:51:45] <sukhe>	 ah ACK, I misread as silence, ok :) 
[14:52:14] <topranks>	 nah just regular ACK in alertmanager or icinga 
[14:52:27] <sukhe>	 I misread that because we usually don't ACK it so my mind rushed to "silence"
[14:53:40] <wikibugs>	 (PS1) DLynch: DiscussionTools visual enhancements on pages with __NEWSECTIONLINK__ [mediawiki-config] - https://gerrit.wikimedia.org/r/971465 (https://phabricator.wikimedia.org/T331635)
[14:53:46] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:01] <wikibugs>	 (CR) Ahmon Dancy: "LGTM" [puppet] - https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) (owner: Jelto)
[14:55:08] <wikibugs>	 (PS6) DLynch: Turn off DiscussionTools A/B test, and enable features on those wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/954920 (https://phabricator.wikimedia.org/T341491) (owner: Esanders)
[14:55:36] <wikibugs>	 (PS1) Cwhite: remove loki image [docker-images/production-images] - https://gerrit.wikimedia.org/r/971427 (https://phabricator.wikimedia.org/T350366)
[14:59:19] <wikibugs>	 (PS1) Hnowlan: jobqueue: increase concurrency for thumbnailrender job [deployment-charts] - https://gerrit.wikimedia.org/r/971467
[15:01:08] <wikibugs>	 (PS3) Jelto: gitlab_runner: Migrate to new runner registration scheme [puppet] - https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951)
[15:01:54] <wikibugs>	 (CR) Jelto: gitlab_runner: Migrate to new runner registration scheme (1 comment) [puppet] - https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) (owner: Jelto)
[15:04:40] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] gitlab_runner: Migrate to new runner registration scheme [puppet] - https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) (owner: Jelto)
[15:05:03] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage
[15:05:43] <wikibugs>	 (PS1) Jbond: sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008)
[15:06:17] <wikibugs>	 (CR) CI reject: [V: -1] sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: Jbond)
[15:08:17] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage
[15:09:10] <logmsgbot>	 !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4052.ulsfo.wmnet with OS bookworm
[15:10:49] <wikibugs>	 (PS1) Jbond: realm.pp: drop wikimail_smarthost global [puppet] - https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008)
[15:12:13] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] grafana: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/971458 (owner: Muehlenhoff)
[15:14:52] <icinga-wm>	 RECOVERY - BGP status on ssw1-a1-codfw.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:19:07] <wikibugs>	 (PS1) Jbond: airflow: convert to pull mail_smarthosts from hiera [puppet] - https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008)
[15:19:09] <wikibugs>	 (PS1) Jbond: realm: drop mail_smarthost global [puppet] - https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008)
[15:20:23] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db2188.codfw.wmnet with reason: reimage via T343674
[15:20:27] <stashbot>	 T343674: Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674
[15:20:37] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db2188.codfw.wmnet with reason: reimage via T343674
[15:20:51] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.reimage for host db2188.codfw.wmnet with OS bookworm
[15:22:11] <wikibugs>	 (PS1) Jbond: realm.pp: drop ntp_peers [puppet] - https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008)
[15:24:11] <wikibugs>	 SRE, SRE-Unowned: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492 (fgiunchedi)
[15:26:02] <wikibugs>	 SRE, SRE-Unowned, User-fgiunchedi: Set nofail for raid0 recipes - https://phabricator.wikimedia.org/T350461 (fgiunchedi)
[15:26:11] <wikibugs>	 SRE, SRE-Unowned, User-fgiunchedi: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492 (fgiunchedi)
[15:29:01] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/971412 (owner: Muehlenhoff)
[15:31:41] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:33:45] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:36:15] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2010.codfw.wmnet with OS bookworm
[15:36:15] <wikibugs>	 (CR) Eevans: [C: +1] cassandra: Avoid Ferm-specific syntax and simplify analytics access [puppet] - https://gerrit.wikimedia.org/r/970799 (owner: Muehlenhoff)
[15:38:21] <wikibugs>	 (CR) Bking: [C: +2] search: simplify flink parallelism configuration [alerts] - https://gerrit.wikimedia.org/r/961020 (https://phabricator.wikimedia.org/T346456) (owner: DCausse)
[15:38:22] <wikibugs>	 SRE, serviceops-radar, Patch-For-Review: ICU transition towards ICU 67 - https://phabricator.wikimedia.org/T329491 (Jdforrester-WMF)
[15:39:02] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2188.codfw.wmnet with reason: host reimage
[15:39:30] <wikibugs>	 SRE, Infrastructure-Foundations, User-fgiunchedi: Set nofail for raid0 recipes - https://phabricator.wikimedia.org/T350461 (MoritzMuehlenhoff)
[15:42:07] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2188.codfw.wmnet with reason: host reimage
[15:51:17] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:57:14] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2188.codfw.wmnet with OS bookworm
[15:57:49] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:58:41] <wikibugs>	 (CR) EoghanGaffney: [C: +1] gitlab_runner: Migrate to new runner registration scheme [puppet] - https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) (owner: Jelto)
[15:59:05] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:03:03] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to `discovery.processed_external_sparql_query` for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T350426 (EBernhardson) This dataset is derived from `event.wdqs_external_sparql_query` which is probably considered PII, as a direct log of queries issued ag...
[16:16:45] <logmsgbot>	 !log jynus@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup2010.codfw.wmnet with OS bookworm
[16:17:45] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2010.codfw.wmnet with OS bookworm
[16:22:50] <wikibugs>	 (PS1) EoghanGaffney: [apt-staging] Add dns names for apt-staging.wm.o and discovery.w [dns] - https://gerrit.wikimedia.org/r/971486
[16:30:07] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Andrew)
[16:31:01] <wikibugs>	 (Abandoned) Filippo Giunchedi: sre: add check for inodes free [alerts] - https://gerrit.wikimedia.org/r/904675 (https://phabricator.wikimedia.org/T332764) (owner: Filippo Giunchedi)
[16:34:29] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:34:32] <wikibugs>	 (PS1) Cathal Mooney: Remove specific TTL values from server BGP groups [homer/public] - https://gerrit.wikimedia.org/r/971488 (https://phabricator.wikimedia.org/T350488)
[16:45:45] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:46:47] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2011.codfw.wmnet with OS bookworm
[16:51:46] <wikibugs>	 (PS1) Cathal Mooney: Change Bird multihop command to use default system TTL [puppet] - https://gerrit.wikimedia.org/r/971490 (https://phabricator.wikimedia.org/T350488)
[16:53:29] <wikibugs>	 (PS1) FNegri: P:openstack:codfw1dev enable prom exporter [puppet] - https://gerrit.wikimedia.org/r/971491 (https://phabricator.wikimedia.org/T350154)
[16:55:41] <wikibugs>	 (CR) CI reject: [V: -1] P:openstack:codfw1dev enable prom exporter [puppet] - https://gerrit.wikimedia.org/r/971491 (https://phabricator.wikimedia.org/T350154) (owner: FNegri)
[17:04:37] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[17:08:07] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[17:08:17] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3128: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[17:08:23] <icinga-wm>	 PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS
[17:08:31] <icinga-wm>	 PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[17:08:39] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3124: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[17:09:23] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3120: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[17:09:35] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3126: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[17:09:39] <wikibugs>	 (03PS1) 10Cathal Mooney: Block incoming packets on the edge for CR loopbacks on TCP 179 [homer/public] - 10https://gerrit.wikimedia.org/r/971498 (https://phabricator.wikimedia.org/T350488)
[17:09:45] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3123 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3123: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[17:10:17] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3121: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[17:10:17] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3125: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[17:10:39] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3122: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[17:11:07] <taavi>	 sukhe: ^ expired downtime?
[17:12:27] <icinga-wm>	 PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3127: Connection refused https://wikitech.wikimedia.org/wiki/Varnish
[17:15:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 (10cmooney)
[17:15:47] <brett>	 Yeah, cp4052 isn't pooled so nothing to worry about
[17:15:51] <brett>	 I'll fix it
[17:17:47] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 29 days, 4:00:00 on cp4052.ulsfo.wmnet with reason: testing instance
[17:18:02] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 29 days, 4:00:00 on cp4052.ulsfo.wmnet with reason: testing instance
[17:23:35] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to root for dzahn - https://phabricator.wikimedia.org/T350435 (10Dzahn) >>! In T350435#9304170, @MoritzMuehlenhoff wrote: > No need for an access request, you can simply make a revert of your original patch to drop your access?  I did make the revert and was told...
[17:24:37] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[17:27:37] <wikibugs>	 (03PS1) 10Ahmon Dancy: Halve profile::gitlab::runner::buildkitd_gckeepstorage [puppet] - 10https://gerrit.wikimedia.org/r/971502 (https://phabricator.wikimedia.org/T350478)
[17:27:57] <sukhe>	 thanks taavi and brett
[17:28:02] <sukhe>	 yes expired downtime
[17:45:27] <jinxer-wm>	 (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[18:02:11] <wikibugs>	 (03CR) 10Tchanders: [WIP] ipoid: Set an initialImport cron job (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967245 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan)
[18:08:47] <logmsgbot>	 !log jynus@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2011.codfw.wmnet with OS bookworm
[18:10:41] <wikibugs>	 (03CR) 10Dzahn: "thanks! confirmed I have prod ssh access again" [puppet] - 10https://gerrit.wikimedia.org/r/971167 (owner: 10Dzahn)
[18:11:01] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:12:27] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:16:51] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:17:48] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: eqiad: Connect IC-374549 - https://phabricator.wikimedia.org/T350504 (10RobH)
[18:17:56] <wikibugs>	 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: eqiad: Connect IC-374549 - https://phabricator.wikimedia.org/T350504 (10RobH)
[18:18:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10collaboration-services: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (10Dzahn)  ` [mx1001:/var/log/exim4] $ grep -ri "otrs@ticke...
[18:18:42] <wikibugs>	 (03PS4) 10Cathal Mooney: Adjust reimage cookbook config for DHCP binding clear workaround [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421)
[18:19:41] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:23:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Adjust reimage cookbook config for DHCP binding clear workaround [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney)
[18:23:57] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:25:21] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:43:47] <wikibugs>	 (03PS1) 10Eevans: cassandra: password for mediawiki_services_mobileapps role [labs/private] - 10https://gerrit.wikimedia.org/r/971504 (https://phabricator.wikimedia.org/T348993)
[18:53:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:04:45] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:05:53] <wikibugs>	 (03PS1) 10Ahmon Dancy: docker::gc: Add timeout parameter [puppet] - 10https://gerrit.wikimedia.org/r/971514 (https://phabricator.wikimedia.org/T350478)
[19:06:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] docker::gc: Add timeout parameter [puppet] - 10https://gerrit.wikimedia.org/r/971514 (https://phabricator.wikimedia.org/T350478) (owner: 10Ahmon Dancy)
[19:07:18] <wikibugs>	 (03PS2) 10Ahmon Dancy: docker::gc: Add timeout parameter [puppet] - 10https://gerrit.wikimedia.org/r/971514 (https://phabricator.wikimedia.org/T350478)
[19:10:11] <wikibugs>	 (03CR) 10Ahmon Dancy: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971514 (https://phabricator.wikimedia.org/T350478) (owner: 10Ahmon Dancy)
[19:14:33] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] Phabricator monthly email: Remove Differential user activity stats [puppet] - 10https://gerrit.wikimedia.org/r/969430 (https://phabricator.wikimedia.org/T324131) (owner: 10Aklapper)
[19:15:59] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:51:17] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:59:46] <wikibugs>	 (03PS1) 10Superpes15: [bnwikisource] Change the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971517 (https://phabricator.wikimedia.org/T350482)
[20:01:06] <quiddity>	 Hi. I'm writing Tech News, and I'm not sure how to summarize yesterday's issue with s5. -- I think it's something like this, and I'd appreciate any corrections/improvements (whilst keeping it very simple and easily translatable), or just tell me the number of hours and a thumbs-up: -- 
[20:01:06] <quiddity>	 "Last week, there was a problem displaying some recent edits on a few wikis[link to s5], for XX?? hours. The edits were saved but not immediately shown. This was due to a database problem."
[20:06:41] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:08:05] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:38:24] <wikibugs>	 (03PS1) 10Superpes15: [plwiki] Add 'abusefilter-log-private' flag  to sysop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971518 (https://phabricator.wikimedia.org/T350509)
[20:41:14] <wikibugs>	 (03PS2) 10Superpes15: [plwiki] Add 'abusefilter-log-private' flag  to sysops [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971518 (https://phabricator.wikimedia.org/T350509)
[20:41:32] <wikibugs>	 (03PS2) 10Superpes15: [bnwikisource] Change the wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971517 (https://phabricator.wikimedia.org/T350482)
[20:48:51] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:53:05] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] wikistats:wikia: pause updates while changes are made to table [puppet] - 10https://gerrit.wikimedia.org/r/971526 (https://phabricator.wikimedia.org/T215534) (owner: 10RhinosF1)
[21:00:09] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:01:46] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/971514 (https://phabricator.wikimedia.org/T350478) (owner: 10Ahmon Dancy)
[21:04:23] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:09:37] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[21:15:43] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:29:37] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[21:45:27] <jinxer-wm>	 (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity
[21:55:10] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] alertmanager: route o11y alerts [puppet] - 10https://gerrit.wikimedia.org/r/971357 (owner: 10Filippo Giunchedi)
[22:53:46] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:51:17] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure