[00:00:33] kamila_: thank you for writing it [00:00:47] of course [00:01:11] (actually, I should change it to say a few hours, I misread) [00:04:35] also, I assume we should make an incident doc? if so, I'm going to go ahead [00:12:19] I can make a patch to output an error and return for namespaceDupes for the time being for wmf/1.42.0-wmf.3 if that's helpful? [00:14:57] +1 to that [00:18:03] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:18:19] RECOVERY - MariaDB Replica Lag: s5 #page on db2171 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:19:22] (CR) Gergő Tisza: Generalize Meta/Commons exceptions for CentralAuth cookie handling (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: Gergő Tisza) [00:19:37] thcipriani: do you know when you started the script? I am assuming approx 21:16, is that correct? [00:20:37] I think it was later, since there was a config patch related to math. [00:20:51] Which was deployed before executing the script. [00:20:53] Let me check. [00:22:10] kamila_: started ~20:54, hung ~20:59, killed ~21:32, replag started 22:25 [00:22:19] thanks! [00:22:39] (CR) Gergő Tisza: mobile: Add MobileUrlCallback (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/969401 (https://phabricator.wikimedia.org/T257852) (owner: Gergő Tisza) [00:23:19] Kizule: fyi, we're using UTC time :-) [00:23:21] (PS11) Bartosz Dziewoński: Generalize Meta/Commons exceptions for CentralAuth cookie handling [mediawiki-config] - https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: Gergő Tisza) [00:23:40] kamila_: I know, I tried to refer to it. 
[00:23:41] (CR) Bartosz Dziewoński: Generalize Meta/Commons exceptions for CentralAuth cookie handling (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: Gergő Tisza) [00:24:01] ok, sorry and thanks [00:24:08] (timezones are hard :D) [00:24:30] kamila_: No problem. How is replication going? [00:24:38] oh good. phan is flagging that I made code unreachable. Of course this is what I was trying to do. [00:25:12] btw, I think it should go only in the wmf.3 branch, not in master, right? [00:25:31] And then next week to cherry-pick in wmf.4? [00:25:56] We don't want to make it fatal for non-WMF wikis which are downloading wikis from the master branch. [00:25:58] or merge to master for now, cherry pick to wmf.3 and deploy it. That way no one has to remember next week. [00:25:59] Kizule: I'm not a database person, so I don't know much beyond the RECOVERY messages you see here [00:26:21] but it's going :D but there are quite a few shards left still [00:26:44] You've already answered, even if you are not a database person. [00:26:45] :D [00:26:54] That's what I wanted to hear, that it's going. 
[00:27:07] yeah [00:27:38] if I'm reading things correctly, we have 4 "major" ones left [00:28:06] out of 10 [00:30:37] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:33:51] RECOVERY - MariaDB Replica Lag: s5 #page on db2123 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:36:32] (PS1) Thcipriani: Disable namespaceDupes.php for now [core] (wmf/1.42.0-wmf.3) - https://gerrit.wikimedia.org/r/971287 (https://phabricator.wikimedia.org/T350443) [00:39:04] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/970839 [00:39:10] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/970839 (owner: TrainBranchBot) [00:40:11] Great to see that it's recovering even more. :) [00:40:49] RECOVERY - MariaDB Replica Lag: s5 #page on db2111 is OK: OK slave_sql_lag Replication lag: 0.48 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:44:37] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [00:46:42] now eqiad replicas are choking on it, it'll recover in an hour or so [00:50:18] Time: 188 on db1200 🫠 [00:50:30] ugh -_- [00:50:35] it's quite late over here, OK if I disappear? [00:50:49] or is there anything else I can help with? [00:50:52] yeah yeah [00:50:55] go rest [00:51:17] ok, fingers crossed that things will finish in a reasonable amount of time [00:51:22] o/ [00:51:29] yeah, codfw has fully recovered [00:51:35] yep [00:52:53] alright. 
I'm going to deploy this wmf.3 change for the time being, so we stop anyone from running namespaceDupes. [00:53:54] thcipriani: I +2'ed the master patch, thank you for making those [00:53:57] (CR) Thcipriani: [C: +2] Disable namespaceDupes.php for now [core] (wmf/1.42.0-wmf.3) - https://gerrit.wikimedia.org/r/971287 (https://phabricator.wikimedia.org/T350443) (owner: Thcipriani) [00:54:41] sure thing [00:55:39] RECOVERY - MariaDB Replica Lag: s5 on db1183 is OK: OK slave_sql_lag Replication lag: 0.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:57:39] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/970839 (owner: TrainBranchBot) [00:59:19] PROBLEM - MariaDB Replica Lag: s5 on db1144 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 824.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:59:45] PROBLEM - MariaDB Replica Lag: s5 on db1213 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 850.89 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:59:49] PROBLEM - MariaDB Replica Lag: s5 on db1161 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 856.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:59:53] PROBLEM - MariaDB Replica Lag: s5 on db1185 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 858.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:00:01] PROBLEM - MariaDB Replica Lag: s5 on db1130 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 868.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:00:09] PROBLEM - MariaDB Replica Lag: s5 on db1210 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 875.18 seconds 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:00:20] This is complaining about eqiad now, right? [01:00:29] PROBLEM - MariaDB Replica Lag: s5 on db1230 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 895.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:00:31] PROBLEM - MariaDB Replica Lag: s5 on db1200 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 897.60 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:01:32] yes, these are all in eqiad, server names begin with a 1 instead of a 2, e.g., db1200 [01:04:37] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [01:06:18] thcipriani: Thanks for letting me know. Happy to learn more. :) [01:06:24] I'm sorry for causing this mess. [01:09:20] people don't cause messes. failed systems cause messes. [01:10:12] it used to be much easier to break things, the more we broke things the stronger our systems have become over time. [01:10:37] one more thing to make harder to break. [01:11:56] (Merged) jenkins-bot: Disable namespaceDupes.php for now [core] (wmf/1.42.0-wmf.3) - https://gerrit.wikimedia.org/r/971287 (https://phabricator.wikimedia.org/T350443) (owner: Thcipriani) [01:13:05] !log thcipriani@deploy2002 Started scap: Backport for [[gerrit:971287|Disable namespaceDupes.php for now (T350443)]] [01:13:09] T350443: namespaceDupes.php doesn't have limit on write queries - https://phabricator.wikimedia.org/T350443 [01:14:25] !log thcipriani@deploy2002 thcipriani: Backport for [[gerrit:971287|Disable namespaceDupes.php for now (T350443)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [01:17:15] thcipriani: You are actually right, so I'll look at this as a "good thing". 
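The naming convention described at 01:01:32 (host numbers starting with 1 are in eqiad, those starting with 2 are in codfw) can be sketched as a small helper for reading the alerts in this log. This is only an illustration: the function name `datacenter_for`, the lookup table, and the regular expression are assumptions made for this example and are not part of any Wikimedia tooling. Only the two prefixes named in the conversation are mapped; anything else is reported as "unknown".

```python
import re

# Mapping taken from the conversation: 1xxx hosts are in eqiad, 2xxx in codfw.
# Other leading digits are deliberately left out because the log does not mention them.
DATACENTER_BY_DIGIT = {"1": "eqiad", "2": "codfw"}


def datacenter_for(hostname: str) -> str:
    """Guess the datacenter from a host name such as 'db1200' or 'db2171'."""
    # Drop a domain suffix ('backup1011.eqiad.wmnet') or a port ('db1130:9104').
    short_name = hostname.split(".")[0].split(":")[0]
    # Expect a lowercase name (letters and hyphens) followed by a four-digit number.
    match = re.fullmatch(r"[a-z][a-z-]*(\d)\d{3}", short_name)
    if match is None:
        return "unknown"
    return DATACENTER_BY_DIGIT.get(match.group(1), "unknown")


if __name__ == "__main__":
    for host in ("db2171", "db1200", "clouddb1021", "ms-fe2013", "db1130:9104"):
        print(host, datacenter_for(host))
```

Applied to the alerts above, the db21xx hosts that recovered first map to codfw, and the db11xx and db12xx hosts that alerted from 00:59 onward map to eqiad, which matches the answer given in the channel.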
[01:17:25] Good thing that we discovered this before it was too late. [01:18:11] !log thcipriani@deploy2002 thcipriani: Continuing with sync [01:19:53] I've unscheduled my task from the window. [01:23:34] !log thcipriani@deploy2002 Finished scap: Backport for [[gerrit:971287|Disable namespaceDupes.php for now (T350443)]] (duration: 10m 29s) [01:23:50] T350443: namespaceDupes.php doesn't have limit on write queries - https://phabricator.wikimedia.org/T350443 [01:24:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [01:27:06] alright, with that deploy I'm stepping away. I'll continue to watch page lag and feel bad, but nothing else I can think to do at the moment. [01:34:19] thcipriani: didn't you _just_ say it's not people's fault? ;-) don't feel bad, if anything, you might be eligible for a sticker, since you did fix it :-D [01:34:52] Do I get a sticker for discovering this as well? Or does it only apply to fixing? ;) [01:35:31] There's an "I broke Wikipedia and then I fixed it" sticker :-D [01:36:06] But if you want generic wiki stickers, I have too many :-D [01:37:16] I think I'm entitled to that one, since this wouldn't have happened if I hadn't asked for the script to be run in the first place. ;) [01:40:36] That can be arranged :-D [01:45:12] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [01:45:38] kamila_: That would be great! 
;) [01:45:49] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) [01:53:05] RECOVERY - MariaDB Replica Lag: s5 on db1216 is OK: OK slave_sql_lag Replication lag: 0.46 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:01:37] RECOVERY - MariaDB Replica Lag: s5 on db1210 is OK: OK slave_sql_lag Replication lag: 58.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:02:45] RECOVERY - MariaDB Replica Lag: s5 on db1185 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:03:43] RECOVERY - MariaDB Replica Lag: s5 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:04:05] RECOVERY - MariaDB Replica Lag: s5 on db1161 is OK: OK slave_sql_lag Replication lag: 0.20 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:04:17] RECOVERY - MariaDB Replica Lag: s5 on clouddb1016 is OK: OK slave_sql_lag Replication lag: 0.34 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:04:19] RECOVERY - MariaDB Replica Lag: s5 on clouddb1020 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:04:29] RECOVERY - MariaDB Replica Lag: s5 on db1154 is OK: OK slave_sql_lag Replication lag: 0.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:07:47] RECOVERY - MariaDB Replica Lag: s5 on db1144 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:10:21] RECOVERY - MariaDB Replica Lag: 
s5 on db1230 is OK: OK slave_sql_lag Replication lag: 59.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:12:25] RECOVERY - MariaDB Replica Lag: s5 on db1213 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:15:57] RECOVERY - MariaDB Replica Lag: s5 on dbstore1003 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:21:07] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.371 second response time https://wikitech.wikimedia.org/wiki/Swift [02:22:23] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Swift [02:28:29] RECOVERY - MariaDB Replica Lag: s5 on db1145 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:34:01] RECOVERY - MariaDB Replica Lag: s5 on db1200 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:42:05] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.186 second response time https://wikitech.wikimedia.org/wiki/Swift [02:44:49] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Swift [02:54:11] db1130:9104 went up to 2 hours, I haven't seen any 
other replica having so long time. [02:57:35] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.197 second response time https://wikitech.wikimedia.org/wiki/Swift [02:58:53] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Swift [03:03:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:17:15] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.376 second response time https://wikitech.wikimedia.org/wiki/Swift [03:18:33] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.132 second response time https://wikitech.wikimedia.org/wiki/Swift [03:32:43] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.397 second response time https://wikitech.wikimedia.org/wiki/Swift [03:34:01] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Swift [03:42:37] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.283 second response time https://wikitech.wikimedia.org/wiki/Swift [03:43:53] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Swift [03:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [03:59:17] RECOVERY - MariaDB Replica Lag: s5 on db1130 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:49:37] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [05:09:37] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [05:27:39] (03PS1) 10Andrea Denisse: pontoon: Set profile::base::additional_purged_packages to be empty [puppet] - 10https://gerrit.wikimedia.org/r/971345 (https://phabricator.wikimedia.org/T347665) [05:29:09] (03CR) 10CI reject: [V: 04-1] pontoon: Set profile::base::additional_purged_packages to be empty [puppet] - 10https://gerrit.wikimedia.org/r/971345 (https://phabricator.wikimedia.org/T347665) (owner: 10Andrea Denisse) [05:33:17] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.365 second response time https://wikitech.wikimedia.org/wiki/Swift [05:35:59] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.136 second response time https://wikitech.wikimedia.org/wiki/Swift [05:45:27] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231103T0600) [06:29:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - 
https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:29:25] (03PS1) 10Marostegui: install_server: Do not reimage db1233 [puppet] - 10https://gerrit.wikimedia.org/r/971347 [06:34:06] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:36:50] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1233 [puppet] - 10https://gerrit.wikimedia.org/r/971347 (owner: 10Marostegui) [06:38:24] (03PS1) 10Marostegui: production-parsercache.sql.erb: Minor comment [puppet] - 10https://gerrit.wikimedia.org/r/971350 [06:41:32] (03CR) 10Marostegui: [C: 03+2] production-parsercache.sql.erb: Minor comment [puppet] - 10https://gerrit.wikimedia.org/r/971350 (owner: 10Marostegui) [06:54:46] (03PS2) 10Andrea Denisse: pontoon: Set additional_purged_packages to be empty [puppet] - 10https://gerrit.wikimedia.org/r/971345 (https://phabricator.wikimedia.org/T347665) [06:55:59] (03PS3) 10Andrea Denisse: pontoon: Set additional_purged_packages to be empty [puppet] - 10https://gerrit.wikimedia.org/r/971345 (https://phabricator.wikimedia.org/T347665) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231103T0700) [07:24:06] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:33:57] (03CR) 10Filippo Giunchedi: [C: 03+1] pontoon: Set additional_purged_packages to be empty [puppet] - 10https://gerrit.wikimedia.org/r/971345 (https://phabricator.wikimedia.org/T347665) (owner: 10Andrea Denisse) [07:34:19] (03CR) 10Andrea Denisse: [C: 03+2] pontoon: Set additional_purged_packages to be empty [puppet] - 10https://gerrit.wikimedia.org/r/971345 (https://phabricator.wikimedia.org/T347665) (owner: 10Andrea Denisse) [07:45:30] (03PS1) 10Filippo Giunchedi: alertmanager: route o11y alerts [puppet] - 10https://gerrit.wikimedia.org/r/971357 [07:47:40] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/971357 (owner: 10Filippo Giunchedi) [07:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:01:15] (03CR) 10Majavah: [V: 03+1] "I know this is a massive patch, sorry :-) hopefully the PCC should make reviewing easier, since no hosts have any changes to system resour" [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [08:02:37] (03CR) 10Majavah: "profile::pontoon::provider::cloud_vps already installs a different DHCP client, does that mean that with this patch there will be two?" 
[puppet] - 10https://gerrit.wikimedia.org/r/971345 (https://phabricator.wikimedia.org/T347665) (owner: 10Andrea Denisse) [08:04:51] 10SRE, 10SRE-Access-Requests: Requesting access to root for dzahn - https://phabricator.wikimedia.org/T350435 (10MoritzMuehlenhoff) No need for an access request, you can simply make a revert of your original patch to drop your access? If you changed the SSH key you can simply send it to a colleagure via an ou... [08:07:11] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.189 second response time https://wikitech.wikimedia.org/wiki/Swift [08:08:27] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.141 second response time https://wikitech.wikimedia.org/wiki/Swift [08:10:14] (03PS1) 10Muehlenhoff: Extend MOU for aitolkyn [puppet] - 10https://gerrit.wikimedia.org/r/971386 [08:12:37] (03CR) 10Muehlenhoff: [C: 03+2] Extend MOU for aitolkyn [puppet] - 10https://gerrit.wikimedia.org/r/971386 (owner: 10Muehlenhoff) [08:13:25] 10SRE, 10SRE-Access-Requests: Requesting access to `discovery.processed_external_sparql_query` for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T350426 (10karapayneWMDE) Request approved by myself [08:14:03] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.367 second response time https://wikitech.wikimedia.org/wiki/Swift [08:14:07] (03PS2) 10Elukey: changeprop: set num_workers to zero [deployment-charts] - 10https://gerrit.wikimedia.org/r/971225 (https://phabricator.wikimedia.org/T348950) [08:15:21] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Swift [08:21:03] (03PS1) 10Muehlenhoff: Remove access for cec [puppet] - 10https://gerrit.wikimedia.org/r/971389 [08:23:32] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for cec [puppet] - 
10https://gerrit.wikimedia.org/r/971389 (owner: 10Muehlenhoff) [08:34:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:36:29] 10SRE, 10SRE-Access-Requests: Requesting access to `discovery.processed_external_sparql_query` for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T350426 (10JMeybohm) 05Open→03Stalled From the conversation in slack it seems unclear how this should be solved. @ottomata and @mpopov suggested changing t... [08:40:41] RECOVERY - Check systemd state on kubernetes2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:41:04] ^ jayme :-) [08:41:17] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2035 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:43:36] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:48:08] (03PS1) 10Muehlenhoff: Failover to testreduce1002 [dns] - 10https://gerrit.wikimedia.org/r/971392 [08:48:30] (03PS2) 10Muehlenhoff: Failover to testreduce1002 [dns] - 10https://gerrit.wikimedia.org/r/971392 [08:52:33] (03CR) 10Muehlenhoff: [C: 03+2] Failover to testreduce1002 [dns] - 10https://gerrit.wikimedia.org/r/971392 (owner: 10Muehlenhoff) [08:54:37] (Wikidata Reliability Metrics - Median loading time alert) firing: - 
https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [08:58:45] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:59:01] (03CR) 10Muehlenhoff: [C: 03+2] profile::ci::php Also add the icu67 component following what was done for prod [puppet] - 10https://gerrit.wikimedia.org/r/971195 (https://phabricator.wikimedia.org/T345561) (owner: 10Muehlenhoff) [09:04:27] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.279 second response time https://wikitech.wikimedia.org/wiki/Swift [09:05:45] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Swift [09:10:48] (03CR) 10JMeybohm: [C: 03+2] Revert "admin: remove old ssh key from user dzahn" [puppet] - 10https://gerrit.wikimedia.org/r/971167 (owner: 10Dzahn) [09:11:36] 10SRE, 10SRE-Access-Requests: Requesting access to root for dzahn - https://phabricator.wikimedia.org/T350435 (10JMeybohm) 05Open→03Resolved a:03JMeybohm I did merge your revert. Welcome back! 
[09:14:37] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [09:40:55] 10SRE: Set nofail for raid0 recipes - https://phabricator.wikimedia.org/T350461 (10fgiunchedi) [09:45:27] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [09:55:14] (03PS1) 10Muehlenhoff: Revert "Failover to testreduce1002" [dns] - 10https://gerrit.wikimedia.org/r/971397 [09:57:13] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Failover to testreduce1002" [dns] - 10https://gerrit.wikimedia.org/r/971397 (owner: 10Muehlenhoff) [09:59:10] !log roll-restart swift frontends [09:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:46] !log mvernon@cumin1001 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe [10:02:47] 10SRE-swift-storage, 10API Platform, 10Commons, 10MediaWiki-File-management, and 4 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Ammarpad) [10:03:33] 10SRE-swift-storage, 10API Platform, 10Commons, 10MediaWiki-File-management, and 4 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Soda) [10:08:41] !log mvernon@cumin1001 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe [10:29:57] (03CR) 10Vgutierrez: [C: 03+1] "looking good" [puppet] - 10https://gerrit.wikimedia.org/r/957720 (owner: 10Majavah) [10:34:23] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1011.eqiad.wmnet with OS bookworm 
[10:36:04] (03PS2) 10Majavah: acme_chief: Make http_proxy optional [puppet] - 10https://gerrit.wikimedia.org/r/957720 [10:36:06] (03PS2) 10Majavah: acme_chief: remove backwards compat [puppet] - 10https://gerrit.wikimedia.org/r/957721 [10:37:58] (03PS3) 10Majavah: acme_chief: Make http_proxy optional [puppet] - 10https://gerrit.wikimedia.org/r/957720 [10:38:00] (03PS3) 10Majavah: acme_chief: remove backwards compat [puppet] - 10https://gerrit.wikimedia.org/r/957721 [10:41:27] (03CR) 10Majavah: [C: 03+2] acme_chief: Make http_proxy optional (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957720 (owner: 10Majavah) [10:50:29] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1011.eqiad.wmnet with reason: host reimage [10:53:00] (03PS1) 10Physikerwelt: mathoid: update version [deployment-charts] - 10https://gerrit.wikimedia.org/r/971400 (https://phabricator.wikimedia.org/T350004) [10:53:14] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1011.eqiad.wmnet with reason: host reimage [10:58:39] (03PS1) 10Majavah: hieradata: cloudgw: drop nfs-maps [puppet] - 10https://gerrit.wikimedia.org/r/971401 (https://phabricator.wikimedia.org/T350259) [11:07:54] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1011.eqiad.wmnet with OS bookworm [11:08:44] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1006.eqiad.wmnet with OS bookworm [11:12:03] 10SRE, 10SRE-Unowned: Set nofail for raid0 recipes - https://phabricator.wikimedia.org/T350461 (10JMeybohm) [11:13:08] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bookworm [11:21:29] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1006.eqiad.wmnet with reason: host reimage [11:24:13] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1006.eqiad.wmnet with 
reason: host reimage [11:26:06] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/970844 [11:26:08] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/970845 [11:29:33] (03CR) 10EoghanGaffney: "This change is ready for review." (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/971187 (https://phabricator.wikimedia.org/T347593) (owner: 10EoghanGaffney) [11:30:10] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1010.eqiad.wmnet with reason: host reimage [11:33:20] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1010.eqiad.wmnet with reason: host reimage [11:44:47] (03CR) 10Daniel Kinzler: [C: 03+1] "Looks good to me technically." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971244 (https://phabricator.wikimedia.org/T311620) (owner: 10Physikerwelt) [11:49:19] !log jynus@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup1010.eqiad.wmnet with OS bookworm [11:49:41] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bookworm [11:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:54:46] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1006.eqiad.wmnet with OS bookworm [12:01:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:01:44] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:03:59] (03PS1) 10Majavah: interface: attempt to resolve ordering 
issues with tagged interfaces [puppet] - 10https://gerrit.wikimedia.org/r/971406 [12:04:47] (03PS1) 10Hnowlan: wikifeeds: bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/971407 (https://phabricator.wikimedia.org/T349517) [12:08:31] (03PS5) 10Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - 10https://gerrit.wikimedia.org/r/969324 [12:10:48] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:11:48] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:11:58] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:12:26] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.063 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:13:13] (03PS1) 10Jbond: resolvconf: add nameservr_ips [puppet] - 10https://gerrit.wikimedia.org/r/971409 (https://phabricator.wikimedia.org/T350008) [12:13:15] (03PS1) 10Jbond: dynamicproxy: update to pull ips from proile::resolving [puppet] - 10https://gerrit.wikimedia.org/r/971410 (https://phabricator.wikimedia.org/T350008) [12:13:40] (03CR) 10Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969324 (owner: 10Muehlenhoff) [12:15:07] (03CR) 10CI reject: [V: 04-1] resolvconf: add nameservr_ips [puppet] - 10https://gerrit.wikimedia.org/r/971409 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:15:30] (03CR) 10CI reject: [V: 04-1] dynamicproxy: update to pull ips from proile::resolving [puppet] - 
10https://gerrit.wikimedia.org/r/971410 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:17:07] (03PS6) 10Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - 10https://gerrit.wikimedia.org/r/969324 [12:17:27] (03CR) 10Muehlenhoff: Provide a script to determine whether a given Puppet node can be swithed to nft (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969324 (owner: 10Muehlenhoff) [12:17:28] !log jynus@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host backup1010.eqiad.wmnet with OS bookworm [12:17:50] !log jynus@cumin1001 START - Cookbook sre.hosts.reimage for host backup1010.eqiad.wmnet with OS bookworm [12:23:18] (03PS1) 10Jbond: acme_chief::cloud: update to use dnsquery::a [puppet] - 10https://gerrit.wikimedia.org/r/971411 (https://phabricator.wikimedia.org/T350008) [12:23:53] (03CR) 10CI reject: [V: 04-1] acme_chief::cloud: update to use dnsquery::a [puppet] - 10https://gerrit.wikimedia.org/r/971411 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:27:09] (03PS1) 10Muehlenhoff: mailman: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971412 [12:27:10] 10SRE, 10Infrastructure-Foundations, 10netops: Netbox PuppetDB Import Script Failing for cloudnet2006 - https://phabricator.wikimedia.org/T350479 (10cmooney) p:05Triage→03Medium [12:27:56] (03PS1) 10Jbond: toolforge::docker::registry: update to use dnsquery::a [puppet] - 10https://gerrit.wikimedia.org/r/971413 (https://phabricator.wikimedia.org/T350008) [12:29:50] (03CR) 10CI reject: [V: 04-1] toolforge::docker::registry: update to use dnsquery::a [puppet] - 10https://gerrit.wikimedia.org/r/971413 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:30:53] (03PS1) 10Jbond: toolforge::legacy_redirector: don't use the nameservers global [puppet] - 10https://gerrit.wikimedia.org/r/971414 (https://phabricator.wikimedia.org/T350008) [12:32:34] (03PS1) 10Jbond: 
toolforge::static: don't use the nameservers global [puppet] - 10https://gerrit.wikimedia.org/r/971415 (https://phabricator.wikimedia.org/T350008) [12:32:43] (03CR) 10CI reject: [V: 04-1] toolforge::legacy_redirector: don't use the nameservers global [puppet] - 10https://gerrit.wikimedia.org/r/971414 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:34:23] (03PS1) 10Jbond: labs::ores::redisproxy: drop unused role [puppet] - 10https://gerrit.wikimedia.org/r/971417 (https://phabricator.wikimedia.org/T350008) [12:34:51] (03CR) 10CI reject: [V: 04-1] toolforge::static: don't use the nameservers global [puppet] - 10https://gerrit.wikimedia.org/r/971415 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:35:03] (03PS1) 10Majavah: openstack: neutron: remove unnecessary refreshonly [puppet] - 10https://gerrit.wikimedia.org/r/971418 [12:36:00] (03PS1) 10Jbond: scap::target: update to use dnsquery::a [puppet] - 10https://gerrit.wikimedia.org/r/971419 (https://phabricator.wikimedia.org/T350008) [12:37:20] (03CR) 10FNegri: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/971418 (owner: 10Majavah) [12:38:23] (03CR) 10CI reject: [V: 04-1] scap::target: update to use dnsquery::a [puppet] - 10https://gerrit.wikimedia.org/r/971419 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [12:39:13] (03PS1) 10Jbond: wikilabels::db_proxy: update to use dnsquery::a [puppet] - 10https://gerrit.wikimedia.org/r/971422 (https://phabricator.wikimedia.org/T350008) [12:39:15] (03PS1) 10Jbond: realm.pp: drop namservers global [puppet] - 10https://gerrit.wikimedia.org/r/971423 (https://phabricator.wikimedia.org/T350008) [12:42:01] (03PS1) 10Muehlenhoff: profile::openstack::base::puppetmaster::frontend: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971447 [12:42:16] (03PS2) 10Muehlenhoff: profile::openstack::base::puppetmaster::frontend: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971447 [12:42:16] 
!log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on backup1010.eqiad.wmnet with reason: host reimage [12:42:27] (03CR) 10Jbond: "not tested but lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/969324 (owner: 10Muehlenhoff) [12:44:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:45:15] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on backup1010.eqiad.wmnet with reason: host reimage [12:48:04] !log fnegri@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudnet1006.eqiad.wmnet [12:54:16] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudnet1006.eqiad.wmnet [12:54:48] (03CR) 10FNegri: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/971162 (owner: 10Majavah) [12:56:52] (03PS1) 10Muehlenhoff: profile::openstack::base::designate::service: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971449 [12:57:19] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openstack::base: fix project_grants ordering [puppet] - 10https://gerrit.wikimedia.org/r/971162 (owner: 10Majavah) [12:57:41] (03PS1) 10EoghanGaffney: [apt-staging] Add apt_staging role to new staging vm [puppet] - 10https://gerrit.wikimedia.org/r/971450 (https://phabricator.wikimedia.org/T347004) [12:58:27] (03PS2) 10Majavah: openstack: neutron: remove unnecessary refreshonly [puppet] - 10https://gerrit.wikimedia.org/r/971418 [12:58:32] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/971450 (https://phabricator.wikimedia.org/T347004) (owner: 10EoghanGaffney) [12:58:45] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:59:04] (03CR) 10Majavah: [C: 03+2] openstack: neutron: remove unnecessary refreshonly [puppet] - 10https://gerrit.wikimedia.org/r/971418 (owner: 10Majavah) [12:59:19] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/313/console" [puppet] - 10https://gerrit.wikimedia.org/r/971450 (https://phabricator.wikimedia.org/T347004) (owner: 10EoghanGaffney) [12:59:37] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [12:59:57] (03PS2) 10Majavah: interface: attempt to resolve ordering issues with tagged interfaces [puppet] - 10https://gerrit.wikimedia.org/r/971406 [13:00:12] (03CR) 10Majavah: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/output/971406/312/" [puppet] - 10https://gerrit.wikimedia.org/r/971406 (owner: 10Majavah) [13:00:44] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host backup1010.eqiad.wmnet with OS bookworm [13:01:33] (03CR) 10Majavah: [C: 03+1] "thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/971447 (owner: 10Muehlenhoff) [13:05:22] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971449 (owner: 10Muehlenhoff) [13:10:40] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: setup in progress [13:10:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: setup in progress [13:34:16] (03CR) 10Elukey: [C: 03+1] labs::ores::redisproxy: drop unused role [puppet] - 10https://gerrit.wikimedia.org/r/971417 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [13:35:59] (03CR) 10Herron: [C: 03+1] alertmanager: route o11y alerts [puppet] - 10https://gerrit.wikimedia.org/r/971357 (owner: 10Filippo Giunchedi) [13:36:31] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971455 (owner: 10Muehlenhoff) [13:43:26] (03CR) 10Jelto: [V: 03+1] "This change removes the old runner registration workflow and uses the new v4 api workflow. 
Unfortunately some of the settings can no longe" [puppet] - 10https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [13:45:27] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [13:48:07] (03CR) 10Kamila Součková: [C: 03+1] rest-gateway: change how AQS URLs enforce wikimedia.org domain [deployment-charts] - 10https://gerrit.wikimedia.org/r/971456 (https://phabricator.wikimedia.org/T348731) (owner: 10Hnowlan) [13:50:20] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudelastic1005.wikimedia.org [13:56:39] 10SRE, 10ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (10Jclark-ctr) 05Open→03Resolved [13:56:43] 10SRE, 10Infrastructure-Foundations, 10netops: Put Dell SONiC switches in production - https://phabricator.wikimedia.org/T335028 (10Jclark-ctr) [14:02:58] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cloudelastic1005.wikimedia.org [14:03:26] PROBLEM - Check systemd state on cloudelastic1005 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:46] ^^ Not sure about this one, I'm not seeing any failed units [14:06:16] RECOVERY - Check systemd state on cloudelastic1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:42] (03CR) 10Muehlenhoff: [C: 03+2] Provide a script to determine whether a given Puppet node can be swithed to nft [puppet] - 10https://gerrit.wikimedia.org/r/969324 (owner: 10Muehlenhoff) [14:16:59] 10SRE, 10Infrastructure-Foundations, 10netops: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 (10cmooney) p:05Triage→03Medium [14:22:20] (03PS1) 10Muehlenhoff: grafana: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971458 [14:27:07] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bookworm [14:32:34] (03CR) 10Hnowlan: [C: 03+1] "Good find!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/971225 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [14:36:54] (03CR) 10Hnowlan: [C: 03+1] changeprop: allow to define Kafka settings for Job Queues [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [14:38:46] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:46] !log adding irb interface in private1-a-codfw vlan to ssw1-a1-codfw T347191 [14:40:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:50] T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 [14:43:36] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch 
instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:44:57] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) [14:46:07] (03PS1) 10Jbond: realm.pp: drop $other_site global [puppet] - 10https://gerrit.wikimedia.org/r/971461 (https://phabricator.wikimedia.org/T350008) [14:49:07] (03CR) 10EoghanGaffney: [C: 03+2] [apt-staging] Add apt_staging role to new staging vm [puppet] - 10https://gerrit.wikimedia.org/r/971450 (https://phabricator.wikimedia.org/T347004) (owner: 10EoghanGaffney) [14:50:05] !log moving cr1-codfw <-> ssw1-a1-codfw EBGP session to private1-b-codfw IPs T347191 [14:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:10] T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 [14:50:56] PROBLEM - BGP status on ssw1-a1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS14907/IPv4: Connect - wmf_public_asn, AS14907/IPv6: Idle - wmf_public_asn https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:51:07] (03CR) 10Hnowlan: Reconfigure the PageViewInfo extension to use AQS 2.0 via the REST Gateway (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) (owner: 10BPirkle) [14:51:14] ^^ that's me, acking for now [14:51:15] (03PS1) 10Jbond: realm.pp: drop use_puppetdb global [puppet] - 10https://gerrit.wikimedia.org/r/971463 (https://phabricator.wikimedia.org/T350008) [14:51:17] (03PS1) 10Jbond: realm.pp: remove old comments [puppet] - 10https://gerrit.wikimedia.org/r/971464 
(https://phabricator.wikimedia.org/T350008) [14:51:31] topranks: wait, you can ACK BGP alerts? please share the wisdom! [14:51:45] ah ACK, I misread as silence, ok :) [14:52:14] nah just regular ACK in alertmanager or icinga [14:52:27] I misread that because we usually don't ACK it so my mind rushed to "silence" [14:53:40] (03PS1) 10DLynch: DiscussionTools visual enhancements on pages with __NEWSECTIONLINK__ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971465 (https://phabricator.wikimedia.org/T331635) [14:53:46] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:01] (03CR) 10Ahmon Dancy: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [14:55:08] (03PS6) 10DLynch: Turn off DiscussionTools A/B test, and enable features on those wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954920 (https://phabricator.wikimedia.org/T341491) (owner: 10Esanders) [14:55:36] (03PS1) 10Cwhite: remove loki image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/971427 (https://phabricator.wikimedia.org/T350366) [14:59:19] (03PS1) 10Hnowlan: jobqueue: increase concurrency for thumbnailrender job [deployment-charts] - 10https://gerrit.wikimedia.org/r/971467 [15:01:08] (03PS3) 10Jelto: gitlab_runner: Migrate to new runner registration scheme [puppet] - 10https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) [15:01:54] (03CR) 10Jelto: gitlab_runner: Migrate to new runner registration scheme (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [15:04:40] (03CR) 10Ahmon Dancy: [C: 03+1] gitlab_runner: Migrate to new runner registration scheme [puppet] - 
10https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [15:05:03] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [15:05:43] (03PS1) 10Jbond: sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) [15:06:17] (03CR) 10CI reject: [V: 04-1] sanitarium_multiinstance: over private_wiki and private_tables vars to hiera [puppet] - 10https://gerrit.wikimedia.org/r/971468 (https://phabricator.wikimedia.org/T350008) (owner: 10Jbond) [15:08:17] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [15:09:10] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp4052.ulsfo.wmnet with OS bookworm [15:10:49] (03PS1) 10Jbond: realm.pp: drop wikimail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971469 (https://phabricator.wikimedia.org/T350008) [15:12:13] (03CR) 10Filippo Giunchedi: [C: 03+1] grafana: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/971458 (owner: 10Muehlenhoff) [15:14:52] RECOVERY - BGP status on ssw1-a1-codfw.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:19:07] (03PS1) 10Jbond: airflow: convert to pull mail_smarthosts from hiera [puppet] - 10https://gerrit.wikimedia.org/r/971471 (https://phabricator.wikimedia.org/T350008) [15:19:09] (03PS1) 10Jbond: realm: drop mail_smarthost global [puppet] - 10https://gerrit.wikimedia.org/r/971472 (https://phabricator.wikimedia.org/T350008) [15:20:23] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on db2188.codfw.wmnet with reason: reimage via T343674 [15:20:27] T343674: Productionize db21[88-95] - https://phabricator.wikimedia.org/T343674 
[15:20:37] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on db2188.codfw.wmnet with reason: reimage via T343674 [15:20:51] !log arnaudb@cumin1001 START - Cookbook sre.hosts.reimage for host db2188.codfw.wmnet with OS bookworm [15:22:11] (03PS1) 10Jbond: realm.pp: drop ntp_peers [puppet] - 10https://gerrit.wikimedia.org/r/971476 (https://phabricator.wikimedia.org/T350008) [15:24:11] 10SRE, 10SRE-Unowned: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492 (10fgiunchedi) [15:26:02] 10SRE, 10SRE-Unowned, 10User-fgiunchedi: Set nofail for raid0 recipes - https://phabricator.wikimedia.org/T350461 (10fgiunchedi) [15:26:11] 10SRE, 10SRE-Unowned, 10User-fgiunchedi: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492 (10fgiunchedi) [15:29:01] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/971412 (owner: 10Muehlenhoff) [15:31:41] PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:33:45] RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:36:15] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2010.codfw.wmnet with OS bookworm [15:36:15] (03CR) 10Eevans: [C: 03+1] cassandra: Avoid Ferm-specific syntax and simplify analytics access [puppet] - 10https://gerrit.wikimedia.org/r/970799 (owner: 10Muehlenhoff) [15:38:21] (03CR) 10Bking: [C: 03+2] search: simplify flink parallelism configuration [alerts] - 10https://gerrit.wikimedia.org/r/961020 (https://phabricator.wikimedia.org/T346456) (owner: 10DCausse) [15:38:22] 10SRE, 10serviceops-radar, 10Patch-For-Review: ICU transition 
towards ICU 67 - https://phabricator.wikimedia.org/T329491 (10Jdforrester-WMF) [15:39:02] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2188.codfw.wmnet with reason: host reimage [15:39:30] 10SRE, 10Infrastructure-Foundations, 10User-fgiunchedi: Set nofail for raid0 recipes - https://phabricator.wikimedia.org/T350461 (10MoritzMuehlenhoff) [15:42:07] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2188.codfw.wmnet with reason: host reimage [15:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:57:14] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2188.codfw.wmnet with OS bookworm [15:57:49] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:58:41] (03CR) 10EoghanGaffney: [C: 03+1] gitlab_runner: Migrate to new runner registration scheme [puppet] - 10https://gerrit.wikimedia.org/r/968988 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto) [15:59:05] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:03:03] 10SRE, 10SRE-Access-Requests: Requesting access to `discovery.processed_external_sparql_query` for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T350426 (10EBernhardson) This dataset is derived from `event.wdqs_external_sparql_query` which is probably considered PII, as a direct log of queries issued ag... 
[16:16:45] !log jynus@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host backup2010.codfw.wmnet with OS bookworm [16:17:45] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2010.codfw.wmnet with OS bookworm [16:22:50] (03PS1) 10EoghanGaffney: [apt-staging] Add dns names for apt-staging.wm.o and discovery.w [dns] - 10https://gerrit.wikimedia.org/r/971486 [16:30:07] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) [16:31:01] (03Abandoned) 10Filippo Giunchedi: sre: add check for inodes free [alerts] - 10https://gerrit.wikimedia.org/r/904675 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi) [16:34:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:34:32] (03PS1) 10Cathal Mooney: Remove specific TTL values from server BGP groups [homer/public] - 10https://gerrit.wikimedia.org/r/971488 (https://phabricator.wikimedia.org/T350488) [16:45:45] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:47] !log jynus@cumin2002 START - Cookbook sre.hosts.reimage for host backup2011.codfw.wmnet with OS bookworm [16:51:46] (03PS1) 10Cathal Mooney: Change Bird multihop command to use default system TTL [puppet] - 10https://gerrit.wikimedia.org/r/971490 (https://phabricator.wikimedia.org/T350488) [16:53:29] (03PS1) 10FNegri: P:openstack:codfw1dev enable prom exporter [puppet] - 10https://gerrit.wikimedia.org/r/971491 (https://phabricator.wikimedia.org/T350154) [16:55:41] (03CR) 10CI reject: [V: 04-1] P:openstack:codfw1dev enable prom exporter [puppet] - 
10https://gerrit.wikimedia.org/r/971491 (https://phabricator.wikimedia.org/T350154) (owner: 10FNegri) [17:04:37] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [17:08:07] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [17:08:17] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3128: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:08:23] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [17:08:31] PROBLEM - Ensure traffic_exporter for the backend instance binds on port 9122 and responds to HTTP requests on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 9122: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [17:08:39] PROBLEM - Varnish HTTP upload-frontend - port 3124 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3124: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [17:09:23] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3120: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [17:09:35] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3126: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [17:09:39] (03PS1) 10Cathal Mooney: Block incoming packets on the edge for CR loopbacks on TCP 179 [homer/public] - 10https://gerrit.wikimedia.org/r/971498 (https://phabricator.wikimedia.org/T350488) [17:09:45] PROBLEM - Varnish HTTP upload-frontend - port 
3123 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3123: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [17:10:17] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3121: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [17:10:17] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3125: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [17:10:39] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3122: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [17:11:07] sukhe: ^ expired downtime? [17:12:27] PROBLEM - Varnish HTTP upload-frontend - port 3127 on cp4052 is CRITICAL: connect to address 10.128.0.12 and port 3127: Connection refused https://wikitech.wikimedia.org/wiki/Varnish [17:15:24] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 (10cmooney) [17:15:47] Yeah, cp4052 isn't pooled so nothing to worry about [17:15:51] I'll fix it [17:17:47] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 29 days, 4:00:00 on cp4052.ulsfo.wmnet with reason: testing instance [17:18:02] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 29 days, 4:00:00 on cp4052.ulsfo.wmnet with reason: testing instance [17:23:35] 10SRE, 10SRE-Access-Requests: Requesting access to root for dzahn - https://phabricator.wikimedia.org/T350435 (10Dzahn) >>! In T350435#9304170, @MoritzMuehlenhoff wrote: > No need for an access request, you can simply make a revert of your original patch to drop your access? I did make the revert and was told... 
[17:24:37] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [17:27:37] (03PS1) 10Ahmon Dancy: Halve profile::gitlab::runner::buildkitd_gckeepstorage [puppet] - 10https://gerrit.wikimedia.org/r/971502 (https://phabricator.wikimedia.org/T350478) [17:27:57] thanks taavi and brett [17:28:02] yes expired downtine [17:28:05] time [17:45:27] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [18:02:11] (03CR) 10Tchanders: [WIP] ipoid: Set an initialImport cron job (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/967245 (https://phabricator.wikimedia.org/T346861) (owner: 10Kosta Harlan) [18:08:47] !log jynus@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host backup2011.codfw.wmnet with OS bookworm [18:10:41] (03CR) 10Dzahn: "thanks! 
confirmed I have prod ssh access again" [puppet] - 10https://gerrit.wikimedia.org/r/971167 (owner: 10Dzahn) [18:11:01] PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:12:27] RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:51] PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:17:48] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: eqiad: Connect IC-374549 - https://phabricator.wikimedia.org/T350504 (10RobH) [18:17:56] 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: eqiad: Connect IC-374549 - https://phabricator.wikimedia.org/T350504 (10RobH) [18:18:25] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Znuny, 10collaboration-services: OTRS/mail: investigate why "T=remote_smtp_signed: all hosts for 'ticket.wikimedia.org' have been failing for a long time" - https://phabricator.wikimedia.org/T297160 (10Dzahn) ` [mx1001:/var/log/exim4] $ grep -ri "otrs@ticke... 
[18:18:42] (03PS4) 10Cathal Mooney: Adjust reimage cookbook config for DHCP binding clear workaround [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) [18:19:41] RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:23:18] (03CR) 10CI reject: [V: 04-1] Adjust reimage cookbook config for DHCP binding clear workaround [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [18:23:57] PROBLEM - Check systemd state on gitlab-runner2004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:21] RECOVERY - Check systemd state on gitlab-runner2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:43:47] (03PS1) 10Eevans: cassandra: password for mediawiki_services_mobileapps role [labs/private] - 10https://gerrit.wikimedia.org/r/971504 (https://phabricator.wikimedia.org/T348993) [18:53:46] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:04:45] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:05:53] (03PS1) 10Ahmon Dancy: docker::gc: Add timeout parameter [puppet] - 10https://gerrit.wikimedia.org/r/971514 (https://phabricator.wikimedia.org/T350478) [19:06:25] (03CR) 10CI reject: [V: 04-1] docker::gc: Add timeout parameter [puppet] - 10https://gerrit.wikimedia.org/r/971514 
(https://phabricator.wikimedia.org/T350478) (owner: Ahmon Dancy) [19:07:18] (PS2) Ahmon Dancy: docker::gc: Add timeout parameter [puppet] - https://gerrit.wikimedia.org/r/971514 (https://phabricator.wikimedia.org/T350478) [19:10:11] (CR) Ahmon Dancy: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/971514 (https://phabricator.wikimedia.org/T350478) (owner: Ahmon Dancy) [19:14:33] (CR) Dzahn: [C: +2] Phabricator monthly email: Remove Differential user activity stats [puppet] - https://gerrit.wikimedia.org/r/969430 (https://phabricator.wikimedia.org/T324131) (owner: Aklapper) [19:15:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:59:46] (PS1) Superpes15: [bnwikisource] Change the wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/971517 (https://phabricator.wikimedia.org/T350482) [20:01:06] Hi. I'm writing Tech News, and I'm not sure how to summarize yesterday's issue with s5. -- I think it's something like this, and I'd appreciate any corrections/improvements (whilst keeping it very simple and easily translatable), or just tell me the number of hours and a thumbs-up: -- [20:01:06] "Last week, there was a problem displaying some recent edits on a few wikis[link to s5], for XX?? hours. The edits were saved but not immediately shown. This was due to a database problem." 
[20:06:41] PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:08:05] RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:38:24] (PS1) Superpes15: [plwiki] Add 'abusefilter-log-private' flag to sysop [mediawiki-config] - https://gerrit.wikimedia.org/r/971518 (https://phabricator.wikimedia.org/T350509) [20:41:14] (PS2) Superpes15: [plwiki] Add 'abusefilter-log-private' flag to sysops [mediawiki-config] - https://gerrit.wikimedia.org/r/971518 (https://phabricator.wikimedia.org/T350509) [20:41:32] (PS2) Superpes15: [bnwikisource] Change the wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/971517 (https://phabricator.wikimedia.org/T350482) [20:48:51] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:53:05] (CR) Dzahn: [C: +2] wikistats:wikia: pause updates while changes are made to table [puppet] - https://gerrit.wikimedia.org/r/971526 (https://phabricator.wikimedia.org/T215534) (owner: RhinosF1) [21:00:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:01:46] (CR) Dzahn: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/971514 (https://phabricator.wikimedia.org/T350478) (owner: Ahmon Dancy) [21:04:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:09:37] (Wikidata Reliability Metrics - Median 
loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [21:15:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:29:37] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [21:45:27] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [21:55:10] (CR) Cwhite: [C: +1] alertmanager: route o11y alerts [puppet] - https://gerrit.wikimedia.org/r/971357 (owner: Filippo Giunchedi) [22:53:46] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure