[00:00:24] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:00:34] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:00:48] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:04:16] <icinga-wm>	 PROBLEM - Check systemd state on datahubsearch1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:48] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:07:52] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.048 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:08:00] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 2.853 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:08:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P50212 and previous config saved to /var/cache/conftool/dbconfig/20230809-000804-ladsgroup.json
[00:15:32] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:15:40] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:15:52] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:23:02] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.692 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:23:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P50213 and previous config saved to /var/cache/conftool/dbconfig/20230809-002310-ladsgroup.json
[00:23:12] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 3.953 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:23:18] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:30:50] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:31:00] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:31:08] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:37:14] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:38:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50214 and previous config saved to /var/cache/conftool/dbconfig/20230809-003817-ladsgroup.json
[00:38:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[00:38:21] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[00:38:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance
[00:38:40] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/945843
[00:38:46] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/945843 (owner: TrainBranchBot)
[00:39:52] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:40:00] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:43:00] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T342617)', diff saved to https://phabricator.wikimedia.org/P50215 and previous config saved to /var/cache/conftool/dbconfig/20230809-004605-ladsgroup.json
[00:46:10] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[00:47:30] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:50:36] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:50:42] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:50:48] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:53:38] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/945843 (owner: TrainBranchBot)
[00:55:08] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 7.042 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:55:10] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:56:34] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.127 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:01:10] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:01:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P50216 and previous config saved to /var/cache/conftool/dbconfig/20230809-010112-ladsgroup.json
[01:01:16] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:02:44] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.070 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:07:14] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:07:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:07:26] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:08:50] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:10:06] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:10:18] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 8.800 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:14:54] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:16:18] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:16:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P50217 and previous config saved to /var/cache/conftool/dbconfig/20230809-011618-ladsgroup.json
[01:16:30] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:17:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[01:20:50] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.940 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:20:52] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:20:56] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 9.552 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:30:02] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:30:06] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:31:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T342617)', diff saved to https://phabricator.wikimedia.org/P50218 and previous config saved to /var/cache/conftool/dbconfig/20230809-013124-ladsgroup.json
[01:31:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[01:31:28] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[01:31:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance
[01:31:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50219 and previous config saved to /var/cache/conftool/dbconfig/20230809-013145-ladsgroup.json
[01:34:36] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.420 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:34:38] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 7.481 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:42:08] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on cloudweb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[01:42:26] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on cloudweb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[01:54:06] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on cloudweb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 25998 bytes in 0.282 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[01:54:22] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on cloudweb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 25998 bytes in 0.299 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[02:06:33] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:27:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:31:33] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:32:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[02:56:50] <wikibugs>	 (PS2) KartikMistry: testwiki: Enable Section Translation for 7 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/946852 (https://phabricator.wikimedia.org/T343211)
[03:27:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:37:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:57:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:02:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:56:12] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1119 to m3 [puppet] - https://gerrit.wikimedia.org/r/947197 (https://phabricator.wikimedia.org/T335080)
[05:57:47] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Move db1119 to m3 [puppet] - https://gerrit.wikimedia.org/r/947197 (https://phabricator.wikimedia.org/T335080) (owner: Marostegui)
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T0600)
[06:03:33] <wikibugs>	 (PS1) Marostegui: Revert "cloudbackup200[12]: remove some spurious config from the last patch" [puppet] - https://gerrit.wikimedia.org/r/946654
[06:03:47] <wikibugs>	 (PS1) Marostegui: Revert "Correct the role for the new hadoop workers" [puppet] - https://gerrit.wikimedia.org/r/946655
[06:04:29] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "cloudbackup200[12]: remove some spurious config from the last patch" [puppet] - https://gerrit.wikimedia.org/r/946654 (owner: Marostegui)
[06:04:37] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "Correct the role for the new hadoop workers" [puppet] - https://gerrit.wikimedia.org/r/946655 (owner: Marostegui)
[06:06:00] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[06:07:21] <wikibugs>	 (PS1) Marostegui: site.pp: Remove db1119 from s1 [puppet] - https://gerrit.wikimedia.org/r/947198
[06:08:03] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Remove db1119 from s1 [puppet] - https://gerrit.wikimedia.org/r/947198 (owner: Marostegui)
[06:12:36] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:13:14] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:14:27] <wikibugs>	 (PS1) Marostegui: install_server: Add db12[34-49] to reimage list [puppet] - https://gerrit.wikimedia.org/r/947199 (https://phabricator.wikimedia.org/T342166)
[06:15:13] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Add db12[34-49] to reimage list [puppet] - https://gerrit.wikimedia.org/r/947199 (https://phabricator.wikimedia.org/T342166) (owner: Marostegui)
[06:15:49] <marostegui>	 haproxy alerts are expected
[06:18:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[06:18:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[06:18:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50222 and previous config saved to /var/cache/conftool/dbconfig/20230809-061826-ladsgroup.json
[06:18:30] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[06:18:30] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:19:32] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:19:34] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:19:48] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:20:56] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:21:04] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:21:06] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:21:20] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:22:05] <wikibugs>	 (CR) Marostegui: Drop old externallinks columns (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/946556 (https://phabricator.wikimedia.org/T343718) (owner: Ladsgroup)
[06:22:26] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:23:06] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:24:05] <wikibugs>	 (PS1) Marostegui: site.pp: Add db12[26-33] [puppet] - https://gerrit.wikimedia.org/r/947200 (https://phabricator.wikimedia.org/T342176)
[06:24:55] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Add db12[26-33] [puppet] - https://gerrit.wikimedia.org/r/947200 (https://phabricator.wikimedia.org/T342176) (owner: Marostegui)
[06:28:55] <wikibugs>	 (PS1) Muehlenhoff: Remove access for jkieserman [puppet] - https://gerrit.wikimedia.org/r/947201
[06:29:40] <wikibugs>	 (CR) CI reject: [V: -1] Remove access for jkieserman [puppet] - https://gerrit.wikimedia.org/r/947201 (owner: Muehlenhoff)
[06:32:55] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[06:33:55] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:35:31] <wikibugs>	 (PS2) Muehlenhoff: Remove access for jkieserman [puppet] - https://gerrit.wikimedia.org/r/947201
[06:36:16] <wikibugs>	 (CR) CI reject: [V: -1] Remove access for jkieserman [puppet] - https://gerrit.wikimedia.org/r/947201 (owner: Muehlenhoff)
[06:40:51] <wikibugs>	 (PS3) Muehlenhoff: Remove access for jkieserman [puppet] - https://gerrit.wikimedia.org/r/947201
[06:42:33] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:43:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:44:40] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove access for jkieserman [puppet] - https://gerrit.wikimedia.org/r/947201 (owner: Muehlenhoff)
[06:45:19] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[06:46:23] <logmsgbot>	 !log root@cumin2002 START - Cookbook sre.idm.logout Logging Jmads out of all services on: 1309 hosts
[06:46:27] <logmsgbot>	 !log root@cumin2002 END (FAIL) - Cookbook sre.idm.logout (exit_code=99) Logging Jmads out of all services on: 1309 hosts
[06:46:51] <logmsgbot>	 !log root@cumin2002 START - Cookbook sre.idm.logout Logging Jkieserman out of all services on: 1309 hosts
[06:47:26] <logmsgbot>	 !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jkieserman out of all services on: 1309 hosts
[06:48:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:51:26] <logmsgbot>	 !log root@cumin2002 START - Cookbook sre.idm.logout Logging Jkieserman out of all services on: 716 hosts
[06:51:39] <logmsgbot>	 !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jkieserman out of all services on: 716 hosts
[06:51:48] <logmsgbot>	 !log root@cumin2002 START - Cookbook sre.idm.logout Logging Jkieserman out of all services on: 33 hosts
[06:52:05] <logmsgbot>	 !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jkieserman out of all services on: 33 hosts
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T0700).
[07:00:05] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:17] <taavi>	 o/
[07:00:23] * kart_ is here
[07:00:24] <taavi>	 kart_: I assume you'll self-deploy?
[07:00:33] <kart_>	 taavi: yes :)
[07:00:43] <kart_>	 Starting deployment..
[07:01:02] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/946852 (https://phabricator.wikimedia.org/T343211) (owner: KartikMistry)
[07:01:45] <wikibugs>	 (Merged) jenkins-bot: testwiki: Enable Section Translation for 7 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/946852 (https://phabricator.wikimedia.org/T343211) (owner: KartikMistry)
[07:02:13] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:946852|testwiki: Enable Section Translation for 7 Wikipedias (T343211)]]
[07:02:27] <stashbot>	 T343211: Enable Content and Section translation on 12 Wikipedias - https://phabricator.wikimedia.org/T343211
[07:03:42] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:946852|testwiki: Enable Section Translation for 7 Wikipedias (T343211)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:05:47] <logmsgbot>	 !log kartik@deploy1002 kartik: Continuing with sync
[07:12:11] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:946852|testwiki: Enable Section Translation for 7 Wikipedias (T343211)]] (duration: 09m 58s)
[07:12:15] <stashbot>	 T343211: Enable Content and Section translation on 12 Wikipedias - https://phabricator.wikimedia.org/T343211
[07:12:21] <kart_>	 taavi: I'm done.
[07:12:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:12:57] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on cloudweb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:13:43] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on cloudweb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:17:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:19:37] <vgutierrez>	 hmmm something is going on with wikimedia-static
[07:20:04] <vgutierrez>	 *wikitech-static
[07:20:07] <vgutierrez>	 E_COFFEE
[07:23:15] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on debmonitor2003 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-maintenance-gc.service,uwsgi-debmonitor.service,wmf_auto_restart_uwsgi-debmonitor.service Jcrespo WIP host https://phabricator.wikimedia.org/T241049 - The acknowledgement expires at: 2023-09-06 08:00:00. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:23:15] <icinga-wm>	 ACKNOWLEDGEMENT - debmonitor.discovery.wmnet:443 internal on debmonitor2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable Jcrespo WIP host https://phabricator.wikimedia.org/T241049 - The acknowledgement expires at: 2023-09-06 08:00:00. https://wikitech.wikimedia.org/wiki/Debmonitor
[07:23:27] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on cloudweb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 26067 bytes in 9.194 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:24:01] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on cloudweb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 26067 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[07:52:17] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be1003.eqiad.wmnet
[07:56:03] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:58:35] <icinga-wm>	 RECOVERY - Disk space on thanos-be1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops
[07:58:35] <icinga-wm>	 RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:58:54] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1003.eqiad.wmnet
[07:59:43] <wikibugs>	 (CR) Muehlenhoff: thanos: Avoid Ferm-specific syntax (1 comment) [puppet] - https://gerrit.wikimedia.org/r/945752 (owner: Muehlenhoff)
[08:01:03] <jinxer-wm>	 (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:09:22] <wikibugs>	 (PS1) Muehlenhoff: Add a Firewall::Portrange define [puppet] - https://gerrit.wikimedia.org/r/947316
[08:15:17] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/947316 (owner: Muehlenhoff)
[08:19:40] <wikibugs>	 (CR) Ladsgroup: Drop old externallinks columns (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/946556 (https://phabricator.wikimedia.org/T343718) (owner: Ladsgroup)
[08:20:50] <wikibugs>	 (CR) Marostegui: [C: +1] Drop old externallinks columns [software/schema-changes] - https://gerrit.wikimedia.org/r/946556 (https://phabricator.wikimedia.org/T343718) (owner: Ladsgroup)
[08:21:29] <wikibugs>	 (CR) Ladsgroup: [C: +2] "\o/" [software/schema-changes] - https://gerrit.wikimedia.org/r/946556 (https://phabricator.wikimedia.org/T343718) (owner: Ladsgroup)
[08:21:52] <wikibugs>	 (Merged) jenkins-bot: Drop old externallinks columns [software/schema-changes] - https://gerrit.wikimedia.org/r/946556 (https://phabricator.wikimedia.org/T343718) (owner: Ladsgroup)
[08:28:11] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/947022 (https://phabricator.wikimedia.org/T342972) (owner: Eevans)
[08:29:13] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/946632 (https://phabricator.wikimedia.org/T342968) (owner: Eevans)
[08:32:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[08:32:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[08:32:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[08:32:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[08:32:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:33:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[08:33:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T342617)', diff saved to https://phabricator.wikimedia.org/P50223 and previous config saved to /var/cache/conftool/dbconfig/20230809-083319-ladsgroup.json
[08:33:23] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[08:34:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[08:34:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[08:37:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50224 and previous config saved to /var/cache/conftool/dbconfig/20230809-083738-ladsgroup.json
[08:49:21] <wikibugs>	 (PS1) Muehlenhoff: Add Nicholas as approver for wmcs-admin [puppet] - https://gerrit.wikimedia.org/r/947319
[08:52:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P50225 and previous config saved to /var/cache/conftool/dbconfig/20230809-085244-ladsgroup.json
[09:02:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[09:02:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[09:05:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[09:05:14] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[09:05:31] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[09:05:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[09:07:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P50226 and previous config saved to /var/cache/conftool/dbconfig/20230809-090750-ladsgroup.json
[09:09:43] <wikibugs>	 (PS1) Elukey: istio: increase max resources for envoy in ml-serve's config [deployment-charts] - https://gerrit.wikimedia.org/r/947322
[09:13:57] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:14:17] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:14:29] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:16:58] <wikibugs>	 (CR) Muehlenhoff: "Looks good, one comment inline. You can also remove the threedtopng classes, they are also only used by Thumbor." [puppet] - https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: Hnowlan)
[09:18:47] <wikibugs>	 (PS8) David Caro: replica_cnf_api: add envvars backend [puppet] - https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691)
[09:20:39] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42801/console" [puppet] - https://gerrit.wikimedia.org/r/946981 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[09:22:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50227 and previous config saved to /var/cache/conftool/dbconfig/20230809-092258-ladsgroup.json
[09:23:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[09:23:02] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[09:23:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[09:23:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50228 and previous config saved to /var/cache/conftool/dbconfig/20230809-092319-ladsgroup.json
[09:25:02] <wikibugs>	 (PS2) JMeybohm: deployment_server::general: Globally enable mesh.certmanager [puppet] - https://gerrit.wikimedia.org/r/946981 (https://phabricator.wikimedia.org/T300033)
[09:25:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[09:25:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[09:25:48] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[09:26:04] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance
[09:26:58] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42802/console" [puppet] - https://gerrit.wikimedia.org/r/946981 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[09:30:34] <wikibugs>	 (CR) Vgutierrez: [C: +1] trafficserver: route wikifeeds requests via the rest-gateway [puppet] - https://gerrit.wikimedia.org/r/945558 (https://phabricator.wikimedia.org/T339119) (owner: Hnowlan)
[09:31:15] <wikibugs>	 SRE, SRE-OnFire, Incident Tooling, Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377 (TheresNoTime) nb. [[ https://docs.google.com/document/d/1sOQ-b7Z4SLMevGEo9ar8B_8PksGhpXiGopAPLGnfzhk/edit | followup (docs)]] from {T343294}
[09:31:21] <hnowlan>	 !log disabling puppet on A:cp to test 945558
[09:31:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:55] <wikibugs>	 (CR) Hnowlan: [C: +2] trafficserver: route wikifeeds requests via the rest-gateway [puppet] - https://gerrit.wikimedia.org/r/945558 (https://phabricator.wikimedia.org/T339119) (owner: Hnowlan)
[09:33:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[09:33:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[09:33:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T342617)', diff saved to https://phabricator.wikimedia.org/P50229 and previous config saved to /var/cache/conftool/dbconfig/20230809-093341-ladsgroup.json
[09:33:45] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[09:33:46] <hnowlan>	 hmm, confctl didn't log here
[09:34:06] <wikibugs>	 (PS1) David Caro: cloud.haproxy: avoid keep-alive for stats scrapers [puppet] - https://gerrit.wikimedia.org/r/947325
[09:34:49] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1086 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:37:00] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[09:37:03] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[09:37:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:37:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:37:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T342617)', diff saved to https://phabricator.wikimedia.org/P50230 and previous config saved to /var/cache/conftool/dbconfig/20230809-093715-ladsgroup.json
[09:37:19] <wikibugs>	 (CR) JMeybohm: [V: +1 C: +2] deployment_server::general: Globally enable mesh.certmanager [puppet] - https://gerrit.wikimedia.org/r/946981 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[09:39:28] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42803/console" [puppet] - https://gerrit.wikimedia.org/r/947325 (owner: David Caro)
[09:39:52] <wikibugs>	 (PS12) JMeybohm: mediawiki: set requests based on php.workers [deployment-charts] - https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert)
[09:41:04] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42804/console" [puppet] - https://gerrit.wikimedia.org/r/947325 (owner: David Caro)
[09:41:37] <wikibugs>	 (PS1) Hnowlan: Revert "trafficserver: route wikifeeds requests via the rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/946656
[09:42:02] <wikibugs>	 (PS4) JMeybohm: Update apertium to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/946940 (https://phabricator.wikimedia.org/T300033)
[09:43:24] <wikibugs>	 (CR) JMeybohm: mediawiki: set requests based on php.workers (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert)
[09:43:28] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] cloud.haproxy: avoid keep-alive for stats scrapers [puppet] - https://gerrit.wikimedia.org/r/947325 (owner: David Caro)
[09:43:42] <wikibugs>	 (CR) David Caro: [V: +1 C: +2] cloud.haproxy: avoid keep-alive for stats scrapers [puppet] - https://gerrit.wikimedia.org/r/947325 (owner: David Caro)
[09:44:58] <wikibugs>	 (CR) JMeybohm: [C: +2] Update apertium to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/946940 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[09:45:49] <wikibugs>	 (Merged) jenkins-bot: Update apertium to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/946940 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[09:48:46] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/apertium: apply
[09:48:56] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/apertium: apply
[09:49:09] <wikibugs>	 (CR) Vgutierrez: [C: +1] Revert "trafficserver: route wikifeeds requests via the rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/946656 (owner: Hnowlan)
[09:49:49] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1086 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:53:59] <wikibugs>	 SRE, SRE-Access-Requests, Data-Platform-SRE: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (Gehel)
[09:54:23] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1084.eqiad.wmnet with OS bullseye
[09:55:13] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[09:55:36] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[09:55:37] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[09:55:56] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[09:56:03] <wikibugs>	 (CR) Klausman: [C: +1] istio: increase max resources for envoy in ml-serve's config [deployment-charts] - https://gerrit.wikimedia.org/r/947322 (owner: Elukey)
[09:57:11] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[09:57:25] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[09:57:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T343718)', diff saved to https://phabricator.wikimedia.org/P50231 and previous config saved to /var/cache/conftool/dbconfig/20230809-095730-ladsgroup.json
[09:57:34] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[09:58:56] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-master1002.eqiad.wmnet
[09:59:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T343718)', diff saved to https://phabricator.wikimedia.org/P50232 and previous config saved to /var/cache/conftool/dbconfig/20230809-095938-ladsgroup.json
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T1000)
[10:05:51] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-master1002.eqiad.wmnet
[10:07:07] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[10:07:13] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[10:08:23] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-coord1002.eqiad.wmnet
[10:08:46] <wikibugs>	 (CR) Hnowlan: [C: +2] Revert "trafficserver: route wikifeeds requests via the rest-gateway" [puppet] - https://gerrit.wikimedia.org/r/946656 (owner: Hnowlan)
[10:09:14] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1084.eqiad.wmnet with reason: host reimage
[10:12:23] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1084.eqiad.wmnet with reason: host reimage
[10:14:29] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1002.eqiad.wmnet
[10:14:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P50233 and previous config saved to /var/cache/conftool/dbconfig/20230809-101444-ladsgroup.json
[10:14:59] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "There an error in a fixture I think." [deployment-charts] - https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert)
[10:19:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[10:19:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[10:19:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T342617)', diff saved to https://phabricator.wikimedia.org/P50234 and previous config saved to /var/cache/conftool/dbconfig/20230809-101946-ladsgroup.json
[10:19:50] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[10:26:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T342617)', diff saved to https://phabricator.wikimedia.org/P50235 and previous config saved to /var/cache/conftool/dbconfig/20230809-102622-ladsgroup.json
[10:26:27] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[10:27:48] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (RickiJay-WMDE) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDxBiO5uB5mMR7mWih5KHZ3d9I0UhDiVI7AZ1/i8/LqMuuWSJ2Nf40a2vKmXzKPj2bIiV1PVHqr6+JO8X8PkVoKjl4DFg90IbXKO4CJOmy1Bs7FBTsf+yyFcP8C...
[10:29:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P50236 and previous config saved to /var/cache/conftool/dbconfig/20230809-102951-ladsgroup.json
[10:36:52] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1084.eqiad.wmnet with OS bullseye
[10:41:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P50237 and previous config saved to /var/cache/conftool/dbconfig/20230809-104128-ladsgroup.json
[10:44:29] <_joe_>	 !log ran requestctl commit, which removed the comma removal from the requestctl output as per T305582
[10:44:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:33] <stashbot>	 T305582: Annotate X-Analytics header with any matching actions - https://phabricator.wikimedia.org/T305582
[10:44:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T343718)', diff saved to https://phabricator.wikimedia.org/P50238 and previous config saved to /var/cache/conftool/dbconfig/20230809-104457-ladsgroup.json
[10:44:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[10:45:01] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[10:45:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[10:45:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T343718)', diff saved to https://phabricator.wikimedia.org/P50239 and previous config saved to /var/cache/conftool/dbconfig/20230809-104518-ladsgroup.json
[10:46:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T343718)', diff saved to https://phabricator.wikimedia.org/P50240 and previous config saved to /var/cache/conftool/dbconfig/20230809-104625-ladsgroup.json
[10:48:48] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/945844
[10:52:20] <wikibugs>	 (CR) Jgiannelos: [C: +2] wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/945844 (owner: PipelineBot)
[10:53:04] <wikibugs>	 (Merged) jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/945844 (owner: PipelineBot)
[10:54:38] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[10:54:55] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[10:55:00] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[10:55:36] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[10:55:43] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[10:56:16] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[10:56:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P50241 and previous config saved to /var/cache/conftool/dbconfig/20230809-105635-ladsgroup.json
[11:01:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P50242 and previous config saved to /var/cache/conftool/dbconfig/20230809-110132-ladsgroup.json
[11:02:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance
[11:02:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance
[11:06:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T342617)', diff saved to https://phabricator.wikimedia.org/P50243 and previous config saved to /var/cache/conftool/dbconfig/20230809-110647-ladsgroup.json
[11:06:51] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[11:07:37] <wikibugs>	 (CR) Elukey: [C: +2] istio: increase max resources for envoy in ml-serve's config [deployment-charts] - https://gerrit.wikimedia.org/r/947322 (owner: Elukey)
[11:08:04] <wikibugs>	 SRE, ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (MoritzMuehlenhoff) Resolved→Open >>! In T341546#9074281, @Jhancock.wm wrote: > @MoritzMuehlenhoff you should be okay to repool it now. but feel free to reopen the ticket if you need to (knocks on wood)  Thanks, th...
[11:08:16] <icinga-wm>	 PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:09:46] <icinga-wm>	 RECOVERY - Check systemd state on an-master1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:11:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T342617)', diff saved to https://phabricator.wikimedia.org/P50244 and previous config saved to /var/cache/conftool/dbconfig/20230809-111141-ladsgroup.json
[11:11:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[11:11:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[11:14:40] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:15:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[11:15:48] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:15:50] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:16:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P50245 and previous config saved to /var/cache/conftool/dbconfig/20230809-111638-ladsgroup.json
[11:20:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[11:20:19] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1085.eqiad.wmnet with OS bullseye
[11:20:46] <wikibugs>	 (PS1) AikoChou: changeprop: filter sourceswiki from stream for outlink LW service [deployment-charts] - https://gerrit.wikimedia.org/r/947328 (https://phabricator.wikimedia.org/T343740)
[11:21:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P50246 and previous config saved to /var/cache/conftool/dbconfig/20230809-112153-ladsgroup.json
[11:29:11] <icinga-wm>	 RECOVERY - Check systemd state on datahubsearch1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:31:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T343718)', diff saved to https://phabricator.wikimedia.org/P50247 and previous config saved to /var/cache/conftool/dbconfig/20230809-113144-ladsgroup.json
[11:31:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[11:31:48] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[11:31:59] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[11:32:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T343718)', diff saved to https://phabricator.wikimedia.org/P50248 and previous config saved to /var/cache/conftool/dbconfig/20230809-113205-ladsgroup.json
[11:33:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T343718)', diff saved to https://phabricator.wikimedia.org/P50249 and previous config saved to /var/cache/conftool/dbconfig/20230809-113312-ladsgroup.json
[11:34:02] <wikibugs>	 SRE, All-and-every-Wikisource, Product-Analytics, Bengali-Sites, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (SCherukuwada) This has been reported to Google. We're waiting for them to get back.
[11:35:13] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1085.eqiad.wmnet with reason: host reimage
[11:37:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P50250 and previous config saved to /var/cache/conftool/dbconfig/20230809-113659-ladsgroup.json
[11:37:26] <Amir1>	 jouncebot: nowandnext
[11:37:27] <jouncebot>	 No deployments scheduled for the next 2 hour(s) and 22 minute(s)
[11:37:27] <jouncebot>	 In 2 hour(s) and 22 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T1400)
[11:38:20] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1085.eqiad.wmnet with reason: host reimage
[11:38:33] <wikibugs>	 (PS10) Ladsgroup: sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: Kaleem Bhatti)
[11:38:51] <wikibugs>	 (CR) Ladsgroup: [C: +2] sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: Kaleem Bhatti)
[11:39:33] <wikibugs>	 (Merged) jenkins-bot: sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: Kaleem Bhatti)
[11:39:51] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:937922|sdwiki: set 'wgTranslateNumerals' to false (T268203)]]
[11:39:54] <stashbot>	 T268203: Set $digitTransformTable to use english-style 0123456789 digits on sdwiki - https://phabricator.wikimedia.org/T268203
[11:41:31] <logmsgbot>	 !log ladsgroup@deploy1002 kaleembhatti and ladsgroup: Backport for [[gerrit:937922|sdwiki: set 'wgTranslateNumerals' to false (T268203)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[11:41:35] <logmsgbot>	 !log ladsgroup@deploy1002 kaleembhatti and ladsgroup: Continuing with sync
[11:46:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:47:07] <jinxer-wm>	 (ProbeDown) firing: (2) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:47:07] <jinxer-wm>	 (ProbeDown) firing: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:47:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance
[11:47:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance
[11:47:37] <jynus>	 mmmm
[11:47:44] <jynus>	 is that you, Amir1 ?
[11:47:53] <Amir1>	 which?
[11:48:11] <Amir1>	 I'm doing schema changes right now
[11:48:16] <Amir1>	 https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance
[11:48:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P50251 and previous config saved to /var/cache/conftool/dbconfig/20230809-114819-ladsgroup.json
[11:48:22] <jayme>	 !incidents
[11:48:22] <sirenbot>	 3936 (UNACKED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[11:48:29] <jayme>	 !ack 3936
[11:48:29] <sirenbot>	 3936 (ACKED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[11:48:37] <jynus>	 are those old?
[11:48:41] <Amir1>	 I don't think that's me
[11:48:49] <jynus>	 no, sorry
[11:48:56] <jynus>	 I just checked last deploys
[11:49:13] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:937922|sdwiki: set 'wgTranslateNumerals' to false (T268203)]] (duration: 09m 22s)
[11:49:17] <stashbot>	 T268203: Set $digitTransformTable to use english-style 0123456789 digits on sdwiki - https://phabricator.wikimedia.org/T268203
[11:50:25] <jynus>	 do you see something, jayme?
[11:51:09] <jayme>	 https://alerts.wikimedia.org/?q=%40state%3Dactive&q=team%3Dsre says probes failing on some mw-* services
[11:51:22] <jynus>	 I see a spike on api 5xx but it is very small
[11:51:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PUT certificates) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:52:05] <jayme>	 there was a mw deployment to k8s ~10min ago
[11:52:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T342617)', diff saved to https://phabricator.wikimedia.org/P50252 and previous config saved to /var/cache/conftool/dbconfig/20230809-115206-ladsgroup.json
[11:52:07] <jinxer-wm>	 (ProbeDown) firing: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:52:07] <jinxer-wm>	 (ProbeDown) firing: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:52:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[11:52:16] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[11:52:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[11:52:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T342617)', diff saved to https://phabricator.wikimedia.org/P50253 and previous config saved to /var/cache/conftool/dbconfig/20230809-115227-ladsgroup.json
[11:52:45] <jynus>	 by whom?
[11:53:11] <jayme>	 not sure, I just saw the pod age in k8s directly
[11:53:36] <Amir1>	 I did a config deploy (which would deploy to k8s too) ten minutes ago https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/937922/10
[11:53:44] <Amir1>	 but this is quite tame
[11:54:24] <jynus>	 logs look clean
[11:54:36] <jynus>	 mw ones, I mean
[11:54:56] <jynus>	 maybe some issue on k8s? I will check how varnish sees that
[11:55:15] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[11:55:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[11:55:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T342617)', diff saved to https://phabricator.wikimedia.org/P50254 and previous config saved to /var/cache/conftool/dbconfig/20230809-115534-ladsgroup.json
[11:56:19] <jayme>	 jynus: I think it's k8s related, let me check something
[11:56:36] <jynus>	 that would explain it, if only 1% of traffic is affected
[11:56:43] <jynus>	 which means low to no user impact
[11:58:02] <jayme>	 !incidents
[11:58:03] <sirenbot>	 3936 (ACKED)  [2x] ProbeDown sre (ip4 probes/service eqiad)
[11:58:39] <jayme>	 I think it's my fault actually
[11:58:47] <jynus>	 ?
[11:59:03] <jayme>	 I'll explain in a bit, let me fix first
[11:59:08] <jynus>	 sure
[11:59:09] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:01:13] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1085.eqiad.wmnet with OS bullseye
[12:01:39] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1086.eqiad.wmnet with OS bullseye
[12:02:48] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:03:07] <jynus>	 clearly it was k8s: https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&refresh=30s
[12:03:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P50255 and previous config saved to /var/cache/conftool/dbconfig/20230809-120325-ladsgroup.json
[12:03:46] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:04:09] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:05:04] <wikibugs>	 (PS1) JMeybohm: Don't enable mesh.certmanager for mw deployments [deployment-charts] - https://gerrit.wikimedia.org/r/947331 (https://phabricator.wikimedia.org/T300033)
[12:06:26] <wikibugs>	 (CR) JMeybohm: [C: +2] Don't enable mesh.certmanager for mw deployments [deployment-charts] - https://gerrit.wikimedia.org/r/947331 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[12:07:03] <wikibugs>	 (Merged) jenkins-bot: Don't enable mesh.certmanager for mw deployments [deployment-charts] - https://gerrit.wikimedia.org/r/947331 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[12:07:55] <wikibugs>	 (PS1) Muehlenhoff: Extend the firewall::service shim with checks for legacy syntax [puppet] - https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497)
[12:08:08] <wikibugs>	 (PS2) Muehlenhoff: Extend the firewall::service shim with checks for legacy syntax [puppet] - https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497)
[12:08:22] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[12:08:31] <wikibugs>	 (CR) CI reject: [V: -1] Extend the firewall::service shim with checks for legacy syntax [puppet] - https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:08:47] <jayme>	 jynus: re-deploying mw on k8s now
[12:08:49] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[12:08:54] <wikibugs>	 (CR) Jelto: "looks mostly good. But I guess you also want to disable the restore on gitlab1003 then? https://gerrit.wikimedia.org/r/plugins/gitiles/ope" [puppet] - https://gerrit.wikimedia.org/r/947016 (owner: EoghanGaffney)
[12:08:59] <jynus>	 ok, checking graphs
[12:09:38] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[12:10:10] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:10:35] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[12:11:02] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[12:11:30] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[12:11:35] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[12:11:43] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[12:11:57] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[12:12:07] <jinxer-wm>	 (ProbeDown) firing: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:12:07] <jinxer-wm>	 (ProbeDown) firing: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:12:44] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[12:12:49] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[12:13:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:13:48] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[12:14:27] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1086.eqiad.wmnet with reason: host reimage
[12:14:58] <wikibugs>	 (CR) Muehlenhoff: "recheck" [puppet] - https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:15:13] <jayme>	 jynus: we should be good again
[12:15:26] <jynus>	 waiting for recoveries
[12:15:46] <jayme>	 I've restarted the httpbb systemd units
[12:15:48] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:15:59] <jynus>	 if not, let's please depool that stack
[12:16:24] <jayme>	 traffic is back already it seems
[12:17:07] <jinxer-wm>	 (ProbeDown) resolved: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:17:07] <jinxer-wm>	 (ProbeDown) resolved: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:17:14] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1086.eqiad.wmnet with reason: host reimage
[12:17:23] <jynus>	 there it is :-D
[12:17:42] <jayme>	 mw-debug is still to fit
[12:17:43] <jayme>	 *fix
[12:17:52] <icinga-wm>	 RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2023-08-08 00:00:06 (4677 GiB, +0.9 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[12:18:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:18:26] <jynus>	 interesting, wdqs may just use the k8s endpoint?
[12:18:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T343718)', diff saved to https://phabricator.wikimedia.org/P50256 and previous config saved to /var/cache/conftool/dbconfig/20230809-121831-ladsgroup.json
[12:18:33] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync
[12:18:33] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync
[12:18:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[12:18:35] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[12:18:44] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync
[12:18:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[12:18:50] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[12:18:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T343718)', diff saved to https://phabricator.wikimedia.org/P50257 and previous config saved to /var/cache/conftool/dbconfig/20230809-121852-ladsgroup.json
[12:18:58] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync
[12:18:58] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[12:19:08] <jynus>	 So Amir triggered the issue (obviously, without knowing it) right?
[12:19:14] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync
[12:19:26] <jynus>	 basically it failed on next deploy, right?
[12:19:31] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[12:20:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T343718)', diff saved to https://phabricator.wikimedia.org/P50258 and previous config saved to /var/cache/conftool/dbconfig/20230809-122000-ladsgroup.json
[12:20:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:20:30] <jynus>	 and that caused some certificate issue?
[12:20:41] <wikibugs>	 (PS1) JMeybohm: Don't enable mesh.certmanager for mw deployments [deployment-charts] - https://gerrit.wikimedia.org/r/947333 (https://phabricator.wikimedia.org/T300033)
[12:20:42] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:21:31] <gehel>	 jynus: I'm just back from lunch. Not sure I understand the link with wdqs. What made you think that?
[12:21:50] <wikibugs>	 (CR) JMeybohm: [C: +2] Don't enable mesh.certmanager for mw deployments [deployment-charts] - https://gerrit.wikimedia.org/r/947333 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[12:22:02] <jynus>	 gehel: It looked at first like it also failed while the main issue was ongoing, but now I see it may be just a coincidence
[12:22:31] <wikibugs>	 (Merged) jenkins-bot: Don't enable mesh.certmanager for mw deployments [deployment-charts] - https://gerrit.wikimedia.org/r/947333 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[12:22:33] <gehel>	 The allocator decreasing alert?
[12:22:44] <jynus>	 yeah, is it just noisy?
[12:23:11] <jayme>	 jynus: I flipped a switch in some global configuration that made the mw-on-k8s deployments use a different TLS cert on the next deploy (triggered by Amir.1)
[12:23:36] <jayme>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/946981/
[12:23:37] <jynus>	 so that is why I asked Amir, as it lined up with that, but obviously not his fault
[12:23:48] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:24:07] <jayme>	 completely my fault, not Amir.1's!
[12:24:27] <gehel>	 jynus: no, it's a real issue. It doesn't have to be addressed right away, but definitely soon! Some Blazegraph internals are going crazy and we'll need to recover the data from a different host eventually 
[12:24:55] <gehel>	 inflatador, ryankemper : see above 
[12:25:32] <jynus>	 I think getting a report, even a light one, could be interesting, not so much for user impact but to avoid something like that when there is a higher traffic percentage
[12:25:42] <jayme>	 I think this only happened because I never got that t-shirt after breaking wikipedia the first time 😇
[12:25:46] <jynus>	 one question, did k8s depool automatically?
[12:25:58] <jynus>	 or was the traffic just very small?
[12:26:13] <jynus>	 because the amount of errors was close to noise levels
[12:26:13] <jayme>	 the total traffic to k8s is only 1% currently
[12:26:27] <wikibugs>	 (PS7) Ayounsi: WIP: first scaffolding for gNMI support [software/homer] - https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638)
[12:27:01] <jayme>	 jynus: I'll write an incident report in wikitech
[12:27:03] <jynus>	 because I might have taken a different route: depool it now that we can, and fix the issues with time
[12:28:02] <jynus>	 I think a deploy can cause more issues than the actual issue
[12:28:15] <jynus>	 as in, the cache wiping and that
[12:28:48] <jayme>	 in some cases yes. In this case I just re-deployed the k8s part, so it's more like restarting the mediawiki appservers
[12:28:59] <jayme>	 no mw-version change or something
[12:29:25] <jayme>	 the actual issue was the TLS-terminating component of the mw deployments, not mw itself obviously
[12:30:04] <jynus>	 let me help with the doc https://grafana.wikimedia.org/goto/Mu3BrT64k?orgId=1
[12:35:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P50259 and previous config saved to /var/cache/conftool/dbconfig/20230809-123506-ladsgroup.json
[12:36:58] <wikibugs>	 (PS4) EoghanGaffney: gitlab: Configure object storage for gitlab1003 on Swift [puppet] - https://gerrit.wikimedia.org/r/947016
[12:37:22] <wikibugs>	 (CR) CI reject: [V: -1] gitlab: Configure object storage for gitlab1003 on Swift [puppet] - https://gerrit.wikimedia.org/r/947016 (owner: EoghanGaffney)
[12:39:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T342617)', diff saved to https://phabricator.wikimedia.org/P50260 and previous config saved to /var/cache/conftool/dbconfig/20230809-123906-ladsgroup.json
[12:39:10] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[12:39:18] <wikibugs>	 (PS1) ArielGlenn: Remove comments saying that the script doesn't verify/rename output files [dumps] - https://gerrit.wikimedia.org/r/947336
[12:39:21] <wikibugs>	 (PS5) EoghanGaffney: gitlab: Configure object storage for gitlab1003 on Swift [puppet] - https://gerrit.wikimedia.org/r/947016
[12:39:23] <wikibugs>	 (CR) CI reject: [V: -1] Remove comments saying that the script doesn't verify/rename output files [dumps] - https://gerrit.wikimedia.org/r/947336 (owner: ArielGlenn)
[12:39:51] <wikibugs>	 (PS3) Muehlenhoff: Extend the firewall::service shim with checks for legacy syntax [puppet] - https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497)
[12:40:15] <wikibugs>	 (CR) CI reject: [V: -1] Extend the firewall::service shim with checks for legacy syntax [puppet] - https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:40:24] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1086.eqiad.wmnet with OS bullseye
[12:42:06] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[12:42:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[12:42:38] <wikibugs>	 (PS1) Btullis: Revert "Revert "Correct the role for the new hadoop workers"" [puppet] - https://gerrit.wikimedia.org/r/946660
[12:43:01] <wikibugs>	 (CR) CI reject: [V: -1] Revert "Revert "Correct the role for the new hadoop workers"" [puppet] - https://gerrit.wikimedia.org/r/946660 (owner: Btullis)
[12:43:03] <wikibugs>	 (PS1) Ayounsi: Paramiko: remove version pin [software/homer] - https://gerrit.wikimedia.org/r/947337
[12:43:39] <wikibugs>	 (PS2) Btullis: Revert "Revert "Correct the role for the new hadoop workers"" [puppet] - https://gerrit.wikimedia.org/r/946660
[12:43:46] <wikibugs>	 (PS4) Muehlenhoff: Extend the firewall::service shim with checks for legacy syntax [puppet] - https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497)
[12:44:09] <wikibugs>	 (CR) CI reject: [V: -1] Extend the firewall::service shim with checks for legacy syntax [puppet] - https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[12:44:30] <wikibugs>	 (CR) Btullis: [C: +2] Revert "Revert "Correct the role for the new hadoop workers"" [puppet] - https://gerrit.wikimedia.org/r/946660 (owner: Btullis)
[12:47:13] <jayme>	 jynus: https://wikitech.wikimedia.org/wiki/Incidents/2023-08-09_mw-on-k8s_outage_due_to_wrong_tls_cert
[12:47:30] <wikibugs>	 (PS2) ArielGlenn: Remove comments saying that the script doesn't verify/rename output files [dumps] - https://gerrit.wikimedia.org/r/947336 (https://phabricator.wikimedia.org/T343882)
[12:47:34] <wikibugs>	 (PS5) Muehlenhoff: Extend the firewall::service shim with checks for legacy syntax [puppet] - https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497)
[12:47:52] <wikibugs>	 (CR) CI reject: [V: -1] Remove comments saying that the script doesn't verify/rename output files [dumps] - https://gerrit.wikimedia.org/r/947336 (https://phabricator.wikimedia.org/T343882) (owner: ArielGlenn)
[12:48:14] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[12:48:34] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[12:48:44] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[12:49:12] <dcausse>	 !log restarting blazegraph on wdqs1007 (BlazegraphFreeAllocatorsDecreasingRapidly)
[12:49:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:21] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[12:50:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[12:50:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P50261 and previous config saved to /var/cache/conftool/dbconfig/20230809-125012-ladsgroup.json
[12:50:29] <wikibugs>	 (PS1) Majavah: openstack: wmcs-enc-cli: allow loading data from stdin or file [puppet] - https://gerrit.wikimedia.org/r/947339 (https://phabricator.wikimedia.org/T343869)
[12:51:49] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/apertium: apply
[12:51:53] <jynus>	 jayme: a couple of extra views https://grafana.wikimedia.org/goto/GD1uCo6Vk?orgId=1 and https://logstash.wikimedia.org/goto/77727db3eb0ec80c9a80f64cef14ca06
[12:52:14] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply
[12:52:41] <wikibugs>	 (PS3) ArielGlenn: Remove comments saying that the script doesn't verify/rename output files [dumps] - https://gerrit.wikimedia.org/r/947336 (https://phabricator.wikimedia.org/T343882)
[12:53:02] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/apertium: apply
[12:53:02] <wikibugs>	 (CR) CI reject: [V: -1] Remove comments saying that the script doesn't verify/rename output files [dumps] - https://gerrit.wikimedia.org/r/947336 (https://phabricator.wikimedia.org/T343882) (owner: ArielGlenn)
[12:53:21] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: apply
[12:53:44] <wikibugs>	 (CR) Cathal Mooney: [C: +1] "LGTM, one typo I noticed." [software/netbox-extras] - https://gerrit.wikimedia.org/r/918518 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[12:54:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P50262 and previous config saved to /var/cache/conftool/dbconfig/20230809-125412-ladsgroup.json
[12:54:20] <wikibugs>	 (PS4) ArielGlenn: Remove comments saying that the script doesn't verify/rename output files [dumps] - https://gerrit.wikimedia.org/r/947336 (https://phabricator.wikimedia.org/T343882)
[12:54:41] <TheresNoTime>	 looking for a +1 on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/947338 (core-namespaces: Remove dupe wikifunctions alias) — I can't *think* of a reason why the dupe would be needed
[12:55:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T342617)', diff saved to https://phabricator.wikimedia.org/P50263 and previous config saved to /var/cache/conftool/dbconfig/20230809-125555-ladsgroup.json
[12:56:02] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[12:57:29] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
[13:00:00] <jayme>	 jynus: thanks. I'll add a screenshot of the ATS graph
[13:02:02] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42806/console" [puppet] - https://gerrit.wikimedia.org/r/947016 (owner: EoghanGaffney)
[13:05:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T343718)', diff saved to https://phabricator.wikimedia.org/P50264 and previous config saved to /var/cache/conftool/dbconfig/20230809-130518-ladsgroup.json
[13:05:21] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[13:05:23] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[13:05:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[13:05:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[13:05:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[13:05:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1212 (T343718)', diff saved to https://phabricator.wikimedia.org/P50265 and previous config saved to /var/cache/conftool/dbconfig/20230809-130557-ladsgroup.json
[13:06:10] <wikibugs>	 (PS1) Btullis: Preseed debian-installer not to prompt for additional firmware [puppet] - https://gerrit.wikimedia.org/r/947341 (https://phabricator.wikimedia.org/T308106)
[13:07:02] <wikibugs>	 (CR) Jelto: [V: +1 C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/947016 (owner: EoghanGaffney)
[13:08:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T343718)', diff saved to https://phabricator.wikimedia.org/P50266 and previous config saved to /var/cache/conftool/dbconfig/20230809-130805-ladsgroup.json
[13:09:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P50267 and previous config saved to /var/cache/conftool/dbconfig/20230809-130918-ladsgroup.json
[13:09:27] <wikibugs>	 (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42807/console" [puppet] - https://gerrit.wikimedia.org/r/947016 (owner: EoghanGaffney)
[13:10:03] <wikibugs>	 (CR) Btullis: "I have verified that the question still exists in bookworm and appears to work in the same way: https://www.debian.org/releases/bookworm/e" [puppet] - https://gerrit.wikimedia.org/r/947341 (https://phabricator.wikimedia.org/T308106) (owner: Btullis)
[13:10:12] <wikibugs>	 (PS13) JMeybohm: mediawiki: set requests based on php.workers [deployment-charts] - https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert)
[13:11:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P50268 and previous config saved to /var/cache/conftool/dbconfig/20230809-131103-ladsgroup.json
[13:11:38] <wikibugs>	 (PS8) Volans: WIP: first scaffolding for gNMI support [software/homer] - https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) (owner: Ayounsi)
[13:12:22] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1087.eqiad.wmnet with OS bullseye
[13:15:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance
[13:15:58] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance
[13:20:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50269 and previous config saved to /var/cache/conftool/dbconfig/20230809-132012-ladsgroup.json
[13:20:15] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[13:22:58] <wikibugs>	 (PS14) JMeybohm: mediawiki: set requests based on php.workers [deployment-charts] - https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert)
[13:23:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P50270 and previous config saved to /var/cache/conftool/dbconfig/20230809-132312-ladsgroup.json
[13:23:45] <wikibugs>	 (CR) JMeybohm: mediawiki: set requests based on php.workers (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: Clément Goubert)
[13:24:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T342617)', diff saved to https://phabricator.wikimedia.org/P50271 and previous config saved to /var/cache/conftool/dbconfig/20230809-132424-ladsgroup.json
[13:24:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance
[13:24:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance
[13:24:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T342617)', diff saved to https://phabricator.wikimedia.org/P50272 and previous config saved to /var/cache/conftool/dbconfig/20230809-132446-ladsgroup.json
[13:26:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P50273 and previous config saved to /var/cache/conftool/dbconfig/20230809-132609-ladsgroup.json
[13:28:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:29:41] <wikibugs>	 (PS1) Ayounsi: Junos: Add more info on commit errors [software/homer] - https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747)
[13:29:44] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1087.eqiad.wmnet with reason: host reimage
[13:31:17] <wikibugs>	 (CR) CI reject: [V: -1] Junos: Add more info on commit errors [software/homer] - https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: Ayounsi)
[13:31:57] <wikibugs>	 (PS1) David Caro: haproxy_exporter: allow setting as absent [puppet] - https://gerrit.wikimedia.org/r/947353 (https://phabricator.wikimedia.org/T343885)
[13:31:59] <wikibugs>	 (PS1) David Caro: prometheus: gather stats from haproxy for openstack and cloudlb [puppet] - https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885)
[13:32:53] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1087.eqiad.wmnet with reason: host reimage
[13:32:57] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/homer] - https://gerrit.wikimedia.org/r/947337 (owner: Ayounsi)
[13:33:04] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:33:13] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-master1002.eqiad.wmnet with OS bullseye
[13:34:41] <wikibugs>	 (CR) Muehlenhoff: "Looks good, but we don't need this for bookworm; in Bookworm the whole firmware in d-i handling was revised since firmware is now allowed " [puppet] - https://gerrit.wikimedia.org/r/947341 (https://phabricator.wikimedia.org/T308106) (owner: Btullis)
[13:34:47] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, netops, Patch-For-Review: Improve Homer output when Juniper device rejects config - https://phabricator.wikimedia.org/T328747 (ayounsi) a:ayounsi
[13:35:19] <wikibugs>	 (CR) Volans: [C: +1] "I haven't tested it but the idea to have more info seems good to me. Unit tests need some tweaking" [software/homer] - https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: Ayounsi)
[13:35:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P50274 and previous config saved to /var/cache/conftool/dbconfig/20230809-133518-ladsgroup.json
[13:35:29] <wikibugs>	 (CR) Ayounsi: [C: +2] Paramiko: remove version pin [software/homer] - https://gerrit.wikimedia.org/r/947337 (owner: Ayounsi)
[13:36:46] <wikibugs>	 (PS2) Btullis: Preseed debian-installer not to prompt for additional firmware [puppet] - https://gerrit.wikimedia.org/r/947341 (https://phabricator.wikimedia.org/T308106)
[13:37:03] <wikibugs>	 (CR) Ayounsi: Junos: Add more info on commit errors (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: Ayounsi)
[13:37:09] <wikibugs>	 (Merged) jenkins-bot: Paramiko: remove version pin [software/homer] - https://gerrit.wikimedia.org/r/947337 (owner: Ayounsi)
[13:38:00] <wikibugs>	 (CR) David Caro: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/947339 (https://phabricator.wikimedia.org/T343869) (owner: Majavah)
[13:38:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P50275 and previous config saved to /var/cache/conftool/dbconfig/20230809-133818-ladsgroup.json
[13:38:35] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42808/console" [puppet] - https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) (owner: David Caro)
[13:39:17] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (MoritzMuehlenhoff) Resolved→Open @adee_wmde You are using the same key to access Wikimedia Cloud Services and Wikimedia production, please generate a separate SSH key for a...
[13:39:45] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/947341 (https://phabricator.wikimedia.org/T308106) (owner: Btullis)
[13:40:39] <wikibugs>	 (CR) Btullis: [C: +2] Preseed debian-installer not to prompt for additional firmware [puppet] - https://gerrit.wikimedia.org/r/947341 (https://phabricator.wikimedia.org/T308106) (owner: Btullis)
[13:41:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T342617)', diff saved to https://phabricator.wikimedia.org/P50276 and previous config saved to /var/cache/conftool/dbconfig/20230809-134115-ladsgroup.json
[13:41:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[13:41:20] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[13:41:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[13:41:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T342617)', diff saved to https://phabricator.wikimedia.org/P50277 and previous config saved to /var/cache/conftool/dbconfig/20230809-134136-ladsgroup.json
[13:42:19] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (ayounsi) I sent a new email to Juniper yesterday to ask again about the best next steps here.
[13:43:37] <wikibugs>	 (CR) David Caro: [C: +2] openstack: wmcs-enc-cli: allow loading data from stdin or file [puppet] - https://gerrit.wikimedia.org/r/947339 (https://phabricator.wikimedia.org/T343869) (owner: Majavah)
[13:43:45] <wikibugs>	 (CR) David Caro: openstack: wmcs-enc-cli: allow loading data from stdin or file (1 comment) [puppet] - https://gerrit.wikimedia.org/r/947339 (https://phabricator.wikimedia.org/T343869) (owner: Majavah)
[13:44:43] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic-Icebox, netops: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (ayounsi) Resolved→Declined
[13:44:49] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic-Icebox, netops, Patch-For-Review: Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (ayounsi)
[13:44:57] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic-Icebox, netops, Patch-For-Review: Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (ayounsi)
[13:45:11] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic-Icebox, netops: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (ayounsi) Stalled→Resolved a:ayounsi Boldly closing this as Katran will solve some if not all those limitations.
[13:47:02] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: host reimage
[13:47:25] <wikibugs>	 (PS2) David Caro: prometheus: gather stats from haproxy for openstack and cloudlb [puppet] - https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885)
[13:47:28] <wikibugs>	 (CR) Elukey: [C: +2] changeprop: filter sourceswiki from stream for outlink LW service [deployment-charts] - https://gerrit.wikimedia.org/r/947328 (https://phabricator.wikimedia.org/T343740) (owner: AikoChou)
[13:48:51] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic-Icebox, netops, User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (ayounsi) @Vgutierrez do you know how the future L4LB will handle ICMP PTB packets? Can it route it to the proper source host?
[13:49:47] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: host reimage
[13:50:03] <wikibugs>	 (PS1) Ssingh: P:bird::anycast: require anycast setup on the bird class [puppet] - https://gerrit.wikimedia.org/r/947357
[13:50:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P50278 and previous config saved to /var/cache/conftool/dbconfig/20230809-135024-ladsgroup.json
[13:50:50] <wikibugs>	 (CR) David Caro: [C: +1] "Looks ok, did not test it though" [puppet] - https://gerrit.wikimedia.org/r/946643 (owner: Andrew Bogott)
[13:51:11] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42810/console" [puppet] - https://gerrit.wikimedia.org/r/947357 (owner: Ssingh)
[13:51:15] <wikibugs>	 (CR) David Caro: [C: +1] add volumes functionality to wmcs-backup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/946643 (owner: Andrew Bogott)
[13:52:33] <moritzm>	 !log installing tiff security updates
[13:52:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[13:52:52] <wikibugs>	 (CR) David Caro: [V: +1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42809/console" [puppet] - https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) (owner: David Caro)
[13:52:57] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[13:53:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T343718)', diff saved to https://phabricator.wikimedia.org/P50279 and previous config saved to /var/cache/conftool/dbconfig/20230809-135324-ladsgroup.json
[13:53:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[13:53:28] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[13:53:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[13:53:52] <wikibugs>	 (CR) Ssingh: [V: +1 C: +2] P:bird::anycast: require anycast setup on the bird class [puppet] - https://gerrit.wikimedia.org/r/947357 (owner: Ssingh)
[13:53:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1223 (T343718)', diff saved to https://phabricator.wikimedia.org/P50280 and previous config saved to /var/cache/conftool/dbconfig/20230809-135356-ladsgroup.json
[13:54:26] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[13:54:36] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6002.drmrs.wmnet with OS bookworm
[13:54:46] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[13:55:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T343718)', diff saved to https://phabricator.wikimedia.org/P50281 and previous config saved to /var/cache/conftool/dbconfig/20230809-135503-ladsgroup.json
[13:56:01] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1087.eqiad.wmnet with OS bullseye
[13:58:17] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (adee_wmde)
[14:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T1400)
[14:00:30] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (adee_wmde) >>! In T342969#9080463, @MoritzMuehlenhoff wrote: > @adee_wmde You are using the same key to access Wikimedia Cloud Services and Wikimedia production, please generate a...
[14:02:47] <wikibugs>	 SRE, Research, Wikimedia-Mailing-lists: Create research-engineering-alerts list - https://phabricator.wikimedia.org/T342833 (fkaelin) Thank you kindly.
[14:05:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50282 and previous config saved to /var/cache/conftool/dbconfig/20230809-140531-ladsgroup.json
[14:05:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[14:05:40] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[14:05:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance
[14:05:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T342617)', diff saved to https://phabricator.wikimedia.org/P50283 and previous config saved to /var/cache/conftool/dbconfig/20230809-140551-ladsgroup.json
[14:07:30] <moritzm>	 !log restarting FPM on mediawiki canaries to pick up tiff update
[14:07:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:07:35] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:09:58] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-master1002.eqiad.wmnet with OS bullseye
[14:10:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P50284 and previous config saved to /var/cache/conftool/dbconfig/20230809-141009-ladsgroup.json
[14:11:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T342617)', diff saved to https://phabricator.wikimedia.org/P50285 and previous config saved to /var/cache/conftool/dbconfig/20230809-141134-ladsgroup.json
[14:11:38] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[14:11:47] <wikibugs>	 (CR) EoghanGaffney: [V: +1 C: +2] gitlab: Configure object storage for gitlab1003 on Swift [puppet] - https://gerrit.wikimedia.org/r/947016 (owner: EoghanGaffney)
[14:12:41] <taavi>	 urbanecm: do you remember if the account creation is global or per-wiki?
[14:13:04] <urbanecm>	 taavi: you mean, the acc creation throttle? 
[14:13:16] <urbanecm>	 afaik it's counted across all sites, but the value can differ depending on the project you sign up on.
[14:14:00] <taavi>	 yeah. so for T343595 I'm wondering if 'value' => 250 (for example) means 250 accounts total or 250 accounts per-wiki
[14:14:00] <stashbot>	 T343595: Increase account creation at Wikimania 2023 August 14-20 [Note: incomplete IP list] - https://phabricator.wikimedia.org/T343595
[14:17:18] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic-Icebox, netops, User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (Vgutierrez) >>! In T253732#9080504, @ayounsi wrote: > @Vgutierrez do you know how the future L4LB will handle ICMP PTB packets? Can...
[14:17:35] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:17:56] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage
[14:18:05] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1088.eqiad.wmnet with OS bullseye
[14:18:33] <icinga-wm>	 PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:19:26] <wikibugs>	 SRE-tools, Cloud-VPS, Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (fnegri)
[14:19:35] <wikibugs>	 SRE-tools, Cloud-VPS, Infrastructure-Foundations, Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (fnegri) p:Triage→Low
[14:20:17] <icinga-wm>	 PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:21:30] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage
[14:22:05] <wikibugs>	 (PS1) Majavah: throttle: remove expired rules [mediawiki-config] - https://gerrit.wikimedia.org/r/947361
[14:22:07] <wikibugs>	 (PS1) Majavah: throttle: add rules for Wikimania 2023 [mediawiki-config] - https://gerrit.wikimedia.org/r/947362 (https://phabricator.wikimedia.org/T343595)
[14:22:48] <wikibugs>	 (CR) CI reject: [V: -1] throttle: add rules for Wikimania 2023 [mediawiki-config] - https://gerrit.wikimedia.org/r/947362 (https://phabricator.wikimedia.org/T343595) (owner: Majavah)
[14:23:17] <wikibugs>	 (PS2) Majavah: throttle: add rules for Wikimania 2023 [mediawiki-config] - https://gerrit.wikimedia.org/r/947362 (https://phabricator.wikimedia.org/T343595)
[14:24:53] <moritzm>	 !log installing sudo bugfix updates from Bookworm 12.1 point release
[14:24:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:56] <wikibugs>	 (CR) Urbanecm: [C: +1] "good catch! let me write a test for this." [mediawiki-config] - https://gerrit.wikimedia.org/r/947338 (https://phabricator.wikimedia.org/T342964) (owner: Samtar)
[14:25:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P50287 and previous config saved to /var/cache/conftool/dbconfig/20230809-142515-ladsgroup.json
[14:26:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P50288 and previous config saved to /var/cache/conftool/dbconfig/20230809-142640-ladsgroup.json
[14:28:45] <wikibugs>	 (PS1) Urbanecm: [tests] Ensure each config has at most one value per wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/947365
[14:30:24] <urbanecm>	 taavi: it means 250 accounts total. the counter is stored as a global memcached key, so it's fleet-wide
[14:30:34] <TheresNoTime>	 jouncebot: nowandnext
[14:30:34] <jouncebot>	 For the next 0 hour(s) and 29 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T1400)
[14:30:34] <jouncebot>	 In 2 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T1700)
[14:30:44] <taavi>	 oh perfect, thanks
[14:31:10] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1088.eqiad.wmnet with reason: host reimage
[14:32:33] <TheresNoTime>	 I intend to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/947338
[14:32:33] <wikibugs>	 sre-alert-triage, Platform Engineering: Alert triage: overdue alert [critical] The following units failed: wikidatardf-lexemes-dumps.service - https://phabricator.wikimedia.org/T343896 (JMeybohm)
[14:33:25] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/947338 (https://phabricator.wikimedia.org/T342964) (owner: Samtar)
[14:34:05] <wikibugs>	 (Merged) jenkins-bot: core-namespaces: Remove dupe wikifunctions alias [mediawiki-config] - https://gerrit.wikimedia.org/r/947338 (https://phabricator.wikimedia.org/T342964) (owner: Samtar)
[14:34:16] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1088.eqiad.wmnet with reason: host reimage
[14:34:23] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:947338|core-namespaces: Remove dupe wikifunctions alias (T342964)]]
[14:34:27] <stashbot>	 T342964: Add WF: as an alias of Wikifunctions namespace - https://phabricator.wikimedia.org/T342964
[14:36:05] <logmsgbot>	 !log samtar@deploy1002 samtar: Backport for [[gerrit:947338|core-namespaces: Remove dupe wikifunctions alias (T342964)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[14:36:15] * TheresNoTime testing
[14:37:13] <TheresNoTime>	 oh! The Wikimedia debug extension doesn't recognise `wikifunctions.org` ?
[14:38:56] <urbanecm>	 TheresNoTime: apparently not yet
[14:40:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T343718)', diff saved to https://phabricator.wikimedia.org/P50289 and previous config saved to /var/cache/conftool/dbconfig/20230809-144022-ladsgroup.json
[14:40:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:40:26] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[14:40:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:41:05] <TheresNoTime>	 `curl -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' https://www.wikifunctions.org/wiki/WF:Main_Page` is returning nothing, whereas `curl -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' https://www.wikifunctions.org/wiki/Wikifunctions:Main_Page` is returning as expected
[14:41:10] <TheresNoTime>	 hmm
[14:41:40] <wikibugs>	 (PS1) Btullis: Remove the manual check of reuse recipe on an-test-master hosts [puppet] - https://gerrit.wikimedia.org/r/947366 (https://phabricator.wikimedia.org/T329363)
[14:41:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P50290 and previous config saved to /var/cache/conftool/dbconfig/20230809-144147-ladsgroup.json
[14:41:57] <TheresNoTime>	 oops, forgot `-L` for the redirect
[14:42:04] <urbanecm>	 TheresNoTime: yup :)
[14:42:07] <urbanecm>	 or `-I` to see headers
[14:42:13] <logmsgbot>	 !log samtar@deploy1002 samtar: Continuing with sync
[14:42:22] <TheresNoTime>	 lgtm then, syncing :D
[14:42:32] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic-Icebox, netops, User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (ayounsi) Open→Declined Thanks, then like {T253666} I'm going to boldly close this task.
[14:42:42] <wikibugs>	 (CR) Btullis: [C: +2] Remove the manual check of reuse recipe on an-test-master hosts [puppet] - https://gerrit.wikimedia.org/r/947366 (https://phabricator.wikimedia.org/T329363) (owner: Btullis)
[14:42:48] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic-Icebox, netops, Patch-For-Review: Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (ayounsi)
[14:43:40] <wikibugs>	 (CR) Muehlenhoff: "recheck" [puppet] - https://gerrit.wikimedia.org/r/945782 (owner: Muehlenhoff)
[14:43:42] <wikibugs>	 SRE-tools, Cloud-VPS, Infrastructure-Foundations, cloud-services-team (FY2023/2024-Q1): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (fnegri) In progress→Resolved > I'll leave the last page to you.  That last page was https://wikitech.wikimedia.org/wiki/Wikim...
[14:43:48] <wikibugs>	 SRE-tools, Cloud-VPS, Infrastructure-Foundations, Goal, cloud-services-team (FY2023/2024-Q1): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (fnegri)
[14:43:52] <urbanecm>	 TheresNoTime: re the extension, we have https://gerrit.wikimedia.org/r/c/performance/WikimediaDebug/+/941883 merged
[14:43:56] <wikibugs>	 sre-alert-triage, Infrastructure-Foundations: Alert triage: overdue alert [warning] Systemd units failing on debmonitor2003 - https://phabricator.wikimedia.org/T343897 (JMeybohm)
[14:43:57] <urbanecm>	 we just need someone to release the extension
[14:44:02] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Detect IP address collisions - https://phabricator.wikimedia.org/T189522 (ayounsi) Open→Resolved a:ayounsi We have a working solution for the mgmt network (until it's time to split mgmt into smaller subnets). And for production, automation and per...
[14:44:10] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Uncomment sysctl-userns alias [puppet] - https://gerrit.wikimedia.org/r/945812 (owner: Muehlenhoff)
[14:44:18] <TheresNoTime>	 urbanecm: ah I see
[14:44:23] <wikibugs>	 (PS1) Ssingh: Revert "P:bird::anycast: require anycast setup on the bird class" [puppet] - https://gerrit.wikimedia.org/r/946664
[14:45:12] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:45:16] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:46:06] <wikibugs>	 sre-alert-triage, Infrastructure-Foundations: Alert triage: overdue alert [warning] puppet fails on idp-test1002 - https://phabricator.wikimedia.org/T343898 (JMeybohm)
[14:47:15] <wikibugs>	 (CR) Ssingh: [C: +2] Revert "P:bird::anycast: require anycast setup on the bird class" [puppet] - https://gerrit.wikimedia.org/r/946664 (owner: Ssingh)
[14:47:30] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack, cloud-services-team: wmcs.spicerack: Setup a host to run cookbooks from prod network - https://phabricator.wikimedia.org/T276440 (fnegri)
[14:47:57] <wikibugs>	 SRE-tools, Cloud-VPS, Infrastructure-Foundations, Epic, and 2 others: Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (fnegri) In progress→Resolved
[14:48:45] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:947338|core-namespaces: Remove dupe wikifunctions alias (T342964)]] (duration: 14m 21s)
[14:48:49] <stashbot>	 T342964: Add WF: as an alias of Wikifunctions namespace - https://phabricator.wikimedia.org/T342964
[14:48:57] <wikibugs>	 (CR) Muehlenhoff: profile::mirrors::serve: Remove Ferm-specific syntax (1 comment) [puppet] - https://gerrit.wikimedia.org/r/944768 (owner: Muehlenhoff)
[14:49:30] <TheresNoTime>	 !log `[samtar@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki wikifunctionswiki --fix` for T342964
[14:49:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:35] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/945782 (owner: Muehlenhoff)
[14:50:35] <wikibugs>	 SRE, SRE-Access-Requests, Infrastructure Security, Infrastructure-Foundations, fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (MoritzMuehlenhoff)
[14:52:22] <wikibugs>	 sre-alert-triage, Machine-Learning-Team: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (JMeybohm)
[14:53:11] <wikibugs>	 SRE, Infrastructure-Foundations, fundraising-tech-ops, Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (MoritzMuehlenhoff) Open→Resolved @Jgreen  @Dwisehaupt  AFAICT this should be resolved and you should be able to merge DNS changes just fine....
[14:55:10] <icinga-wm>	 RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:55:32] <icinga-wm>	 RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:56:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T342617)', diff saved to https://phabricator.wikimedia.org/P50291 and previous config saved to /var/cache/conftool/dbconfig/20230809-145653-ladsgroup.json
[14:56:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[14:56:57] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[14:57:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[14:57:12] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1088.eqiad.wmnet with OS bullseye
[14:57:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T342617)', diff saved to https://phabricator.wikimedia.org/P50292 and previous config saved to /var/cache/conftool/dbconfig/20230809-145714-ladsgroup.json
[14:57:40] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1089.eqiad.wmnet with OS bullseye
[14:58:35] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6002.drmrs.wmnet with OS bookworm
[15:04:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance
[15:04:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance
[15:04:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50293 and previous config saved to /var/cache/conftool/dbconfig/20230809-150443-ladsgroup.json
[15:04:49] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[15:05:37] <wikibugs>	 (CR) FNegri: [C: +2] haproxy_exporter: allow setting as absent [puppet] - https://gerrit.wikimedia.org/r/947353 (https://phabricator.wikimedia.org/T343885) (owner: David Caro)
[15:05:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T342617)', diff saved to https://phabricator.wikimedia.org/P50294 and previous config saved to /var/cache/conftool/dbconfig/20230809-150547-ladsgroup.json
[15:05:51] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[15:05:58] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50295 and previous config saved to /var/cache/conftool/dbconfig/20230809-150557-ladsgroup.json
[15:06:05] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-master1001.eqiad.wmnet with OS bullseye
[15:06:34] <wikibugs>	 (PS9) Ayounsi: WIP: first scaffolding for gNMI support [software/homer] - https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638)
[15:09:15] <wikibugs>	 (CR) FNegri: [C: +1] prometheus: gather stats from haproxy for openstack and cloudlb [puppet] - https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) (owner: David Caro)
[15:14:17] <wikibugs>	 (CR) Urbanecm: [C: -1] "of course this doesn't work. php silently merges the duplicate key when require'ing core-Namespaces.php. hmm..." [mediawiki-config] - https://gerrit.wikimedia.org/r/947365 (owner: Urbanecm)
[15:15:21] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] haproxy_exporter: allow setting as absent [puppet] - https://gerrit.wikimedia.org/r/947353 (https://phabricator.wikimedia.org/T343885) (owner: David Caro)
[15:16:34] <wikibugs>	 (CR) JHathaway: [C: +1] profile::mirrors::serve: Remove Ferm-specific syntax (1 comment) [puppet] - https://gerrit.wikimedia.org/r/944768 (owner: Muehlenhoff)
[15:17:52] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1089.eqiad.wmnet with reason: host reimage
[15:19:10] <wikibugs>	 (PS1) Hnowlan: trafficserver: route wikifeeds [puppet] - https://gerrit.wikimedia.org/r/947372 (https://phabricator.wikimedia.org/T339119)
[15:20:16] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:20:24] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[15:20:42] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1089.eqiad.wmnet with reason: host reimage
[15:20:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P50297 and previous config saved to /var/cache/conftool/dbconfig/20230809-152053-ladsgroup.json
[15:21:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P50298 and previous config saved to /var/cache/conftool/dbconfig/20230809-152103-ladsgroup.json
[15:22:44] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on cloudweb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[15:22:48] <wikibugs>	 (PS1) Effie Mouzeli: Update blubberoid to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033)
[15:23:34] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on cloudweb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 26067 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[15:26:10] <wikibugs>	 (CR) Ssingh: [C: +1] trafficserver: route wikifeeds [puppet] - https://gerrit.wikimedia.org/r/947372 (https://phabricator.wikimedia.org/T339119) (owner: Hnowlan)
[15:26:53] <wikibugs>	 sre-alert-triage, Machine-Learning-Team: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (elukey) Very interesting:  ` Aug 09 00:09:33 ml-serve1001 kubelet[3980749]: E0809 00:09:33.603646 3980749...
[15:27:43] <wikibugs>	 (CR) Hnowlan: [C: +2] trafficserver: route wikifeeds [puppet] - https://gerrit.wikimedia.org/r/947372 (https://phabricator.wikimedia.org/T339119) (owner: Hnowlan)
[15:27:55] <wikibugs>	 SRE, Infrastructure-Foundations: Interactive firmware prompts on Bullseye with some Broadcom NICs - https://phabricator.wikimedia.org/T308106 (BTullis) I think that this is now fixed. My test host was an-worker1088 and it was asking for firmware for the bnx2x NIC. After the change it went past this point...
[15:28:27] <wikibugs>	 (PS2) Effie Mouzeli: Update blubberoid to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033)
[15:29:08] <hnowlan>	 !log disabling puppet on A:cp to test r/947372
[15:29:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:32:27] <wikibugs>	 (PS1) Elukey: admin_ng: increase resources for calico kube-controllers in ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/947376 (https://phabricator.wikimedia.org/T343900)
[15:36:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P50299 and previous config saved to /var/cache/conftool/dbconfig/20230809-153600-ladsgroup.json
[15:36:10] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P50300 and previous config saved to /var/cache/conftool/dbconfig/20230809-153610-ladsgroup.json
[15:36:33] <wikibugs>	 (CR) Elukey: [C: +2] admin_ng: increase resources for calico kube-controllers in ml-serve [deployment-charts] - https://gerrit.wikimedia.org/r/947376 (https://phabricator.wikimedia.org/T343900) (owner: Elukey)
[15:43:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T342617)', diff saved to https://phabricator.wikimedia.org/P50301 and previous config saved to /var/cache/conftool/dbconfig/20230809-154317-ladsgroup.json
[15:43:25] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[15:43:36] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1089.eqiad.wmnet with OS bullseye
[15:44:07] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[15:45:30] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[15:47:20] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: host reimage
[15:47:30] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[15:47:42] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[15:48:33] <wikibugs>	 (PS3) Effie Mouzeli: Update blubberoid to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033)
[15:48:58] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[15:49:08] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[15:50:47] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: host reimage
[15:51:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T342617)', diff saved to https://phabricator.wikimedia.org/P50302 and previous config saved to /var/cache/conftool/dbconfig/20230809-155106-ladsgroup.json
[15:51:08] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[15:51:10] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[15:51:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50303 and previous config saved to /var/cache/conftool/dbconfig/20230809-155116-ladsgroup.json
[15:51:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[15:51:20] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[15:51:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[15:51:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T342617)', diff saved to https://phabricator.wikimedia.org/P50304 and previous config saved to /var/cache/conftool/dbconfig/20230809-155127-ladsgroup.json
[15:51:32] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
[15:51:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T343718)', diff saved to https://phabricator.wikimedia.org/P50305 and previous config saved to /var/cache/conftool/dbconfig/20230809-155137-ladsgroup.json
[15:53:01] <wikibugs>	 (PS1) Giuseppe Lavagetto: admin: update my bashrc [puppet] - https://gerrit.wikimedia.org/r/947379
[15:56:51] <wikibugs>	 (PS1) Ssingh: P:bird::anycast: require anycast setup on the bird class (take 2) [puppet] - https://gerrit.wikimedia.org/r/947381
[15:57:14] <wikibugs>	 (PS2) Ssingh: P:bird::anycast: require anycast setup on the bird service (take 2) [puppet] - https://gerrit.wikimedia.org/r/947381
[15:58:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P50306 and previous config saved to /var/cache/conftool/dbconfig/20230809-155824-ladsgroup.json
[15:59:46] <wikibugs>	 (CR) Urbanecm: [C: +1] "lgtm" [mediawiki-config] - https://gerrit.wikimedia.org/r/947361 (owner: Majavah)
[15:59:52] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42811/console" [puppet] - https://gerrit.wikimedia.org/r/947381 (owner: Ssingh)
[16:00:24] <wikibugs>	 (CR) Urbanecm: [C: +1] "lgtm, 250 should be more than enough." [mediawiki-config] - https://gerrit.wikimedia.org/r/947362 (https://phabricator.wikimedia.org/T343595) (owner: Majavah)
[16:02:38] <wikibugs>	 sre-alert-triage, Machine-Learning-Team, Patch-For-Review: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (elukey) Throttling is gone, but I still see the exec_sync elevated latency, errors...
[16:04:16] <wikibugs>	 (CR) Ssingh: [V: +1] "The systemd bindings break this: bird expects anycast-hc to be running so if we set up bird before that, the bird service will fail anyway" [puppet] - https://gerrit.wikimedia.org/r/947381 (owner: Ssingh)
[16:04:44] <wikibugs>	 (CR) Ssingh: [V: +1] P:bird::anycast: require anycast setup on the bird service (take 2) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/947381 (owner: Ssingh)
[16:05:42] <wikibugs>	 (PS3) Majavah: throttle: add rules for Wikimania 2023 [mediawiki-config] - https://gerrit.wikimedia.org/r/947362 (https://phabricator.wikimedia.org/T343595)
[16:09:43] <wikibugs>	 (Abandoned) Ssingh: P:bird::anycast: require anycast setup on the bird service (take 2) [puppet] - https://gerrit.wikimedia.org/r/947381 (owner: Ssingh)
[16:10:37] <wikibugs>	 (PS15) Winston Sung: SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035)
[16:11:02] <wikibugs>	 SRE, Infrastructure-Foundations, netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (ayounsi) Opened high priority case 2023-0809-747283 asking for a RMA.
[16:11:20] <wikibugs>	 (PS1) AikoChou: ml-services: update outlink docker images [deployment-charts] - https://gerrit.wikimedia.org/r/947409 (https://phabricator.wikimedia.org/T343740)
[16:12:11] <wikibugs>	 (CR) Elukey: [C: +1] ml-services: update outlink docker images [deployment-charts] - https://gerrit.wikimedia.org/r/947409 (https://phabricator.wikimedia.org/T343740) (owner: AikoChou)
[16:12:46] <wikibugs>	 (CR) AikoChou: [C: +2] ml-services: update outlink docker images [deployment-charts] - https://gerrit.wikimedia.org/r/947409 (https://phabricator.wikimedia.org/T343740) (owner: AikoChou)
[16:13:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P50307 and previous config saved to /var/cache/conftool/dbconfig/20230809-161330-ladsgroup.json
[16:13:38] <wikibugs>	 (Merged) jenkins-bot: ml-services: update outlink docker images [deployment-charts] - https://gerrit.wikimedia.org/r/947409 (https://phabricator.wikimedia.org/T343740) (owner: AikoChou)
[16:15:26] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-master1001.eqiad.wmnet with OS bullseye
[16:16:23] <wikibugs>	 (PS1) Ssingh: bird: drop support for buster [puppet] - https://gerrit.wikimedia.org/r/947412
[16:17:51] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[16:18:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T343718)', diff saved to https://phabricator.wikimedia.org/P50308 and previous config saved to /var/cache/conftool/dbconfig/20230809-161832-ladsgroup.json
[16:18:36] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[16:20:13] <jinxer-wm>	 (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from miscweb.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqsin%20prometheus/ops&var-cluster=text&var-origin=miscweb.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[16:20:59] <sukhe>	 uh oh
[16:21:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:21:58] <jelto>	 hmmm that alerts  only for miscweb? miscweb runs in wikikube now. I'll check the service 
[16:22:31] <elukey>	 it started receiving some traffic, and all seems 50x
[16:22:32] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[16:22:35] <sukhe>	 I have ACked it
[16:22:48] <jynus>	 what is it, phab?
[16:23:00] <jynus>	 (I lost scrollback)
[16:23:10] <jelto>	 somebody acked it and I accidentally resolved it because the icon changed
[16:23:21] <jynus>	 no worries
[16:23:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:23:47] <TheresNoTime>	 no services listed on https://wikitech.wikimedia.org/wiki/Miscweb (e.g. https://wikitech.wikimedia.org/wiki/Microsites#Bugzilla_Archive) seem down :/
[16:24:16] <jelto>	 bugzilla is up https://static-bugzilla.wikimedia.org
[16:24:17] <Amir1>	 tendril.wikimedia.org is up, it's miscweb
[16:24:37] <jelto>	 yes miscweb service seems up
[16:25:00] <elukey>	 there is annual report and transparency report too
[16:25:03] <elukey>	 (I mean  the pods)
[16:25:22] <sobanski>	 Both are accessible
[16:25:37] <jelto>	 yes this service works fine, pods in kubernetes look good
[16:25:51] <jynus>	 NEL doesn't give me a clear signal
[16:26:07] <Amir1>	 k8s have been acting up today
[16:26:11] <jynus>	 at least not in terms of http
[16:26:17] <Amir1>	 I don't know if it's related
[16:26:19] <jynus>	 checking other errors
[16:26:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[16:26:48] <taavi>	 is that the kubernetes service that's having issues? or the VMs with the same name?
[16:26:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:27:12] <jynus>	 but it is 5XX, so likely restbase?
[16:27:19] <jynus>	 or is it a result of that?
[16:28:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:28:15] <jynus>	 restbase revision request rates are 0
[16:28:36] <jynus>	 so it is either that or something that restbase queries
[16:28:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T342617)', diff saved to https://phabricator.wikimedia.org/P50310 and previous config saved to /var/cache/conftool/dbconfig/20230809-162836-ladsgroup.json
[16:28:39] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
[16:28:41] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[16:28:43] <wikibugs>	 (CR) Ayounsi: [C: +1] bird: drop support for buster [puppet] - https://gerrit.wikimedia.org/r/947412 (owner: Ssingh)
[16:28:53] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
[16:28:54] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[16:28:57] <elukey>	 the only thing that I see is bugzilla in k8s codfw - https://logstash.wikimedia.org/goto/d608b2c7caa0a2c91c2a7024a812993d
[16:29:07] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[16:29:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T342617)', diff saved to https://phabricator.wikimedia.org/P50311 and previous config saved to /var/cache/conftool/dbconfig/20230809-162913-ladsgroup.json
[16:29:17] <jelto>	 the "old" miscweb vms are called webserver-misc-eqiad and webserver-misc-codfw.discovery.wmnet as far as I can see, so this alert is related to k8s I think
[16:29:22] <jynus>	 nah, ignore me, that is older
[16:29:58] <elukey>	 miscweb.discovery.wmnet points to k8s ingress
[16:30:07] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:30:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50312 and previous config saved to /var/cache/conftool/dbconfig/20230809-163014-ladsgroup.json
[16:30:18] <elukey>	 yeah ok I see 504s for bugzilla in codfw
[16:30:23] <elukey>	 see the above logs
[16:30:29] <jynus>	 elukey: what rate?
[16:30:40] <jelto>	 the page was for eqsin
[16:31:21] <elukey>	 yeah but eqsin calls either eqiad or codfw 
[16:31:28] <elukey>	 ATS in eqsin I mean
[16:31:48] <elukey>	 jynus: the rate is low, but even the one in the ATS grafana link is low
[16:32:10] <jynus>	 yeah, wanted to know if it matched, as in, whether it was most of it or there was a part that was unaccounted for
[16:32:50] <jynus>	 on CDN we are producing around 9 5XX per second
[16:33:16] <elukey>	 so the 504s are marked with UT, namely "Upstream request timeout"
[16:33:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P50313 and previous config saved to /var/cache/conftool/dbconfig/20230809-163338-ladsgroup.json
[16:33:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:33:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:35:00] <elukey>	 ahh nice AH00484: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting
[16:35:12] <jelto>	 just to be sure, restbase and miscweb have the same issue? at least grafana logs show also increased 503/504 starting 16:04 UTC
[16:35:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:35:42] <hnowlan>	 those restbase errors are a little concerning alongside https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/191adcf20ba5fcb5c920dc885f79f0c958268546%5E%21/#F0 
[16:35:51] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:35:53] <jynus>	 jelto: we don't know, my theory is those are connected
[16:36:07] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:36:19] <sukhe>	 hnowlan: the timing doesn't match though if we see the increased 5xx around 16:04. the patch was merged later
[16:36:22] <jynus>	 e.g. restbase uses some api that fails or something failing because of restbase
[16:36:32] <jelto>	 ah sorry no that was a red herring (at least in the dashboard). There is no increase since 16:04.
[16:36:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:37:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:37:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:38:00] <elukey>	 !log temporarily bump miscweb bugzilla pods from 2 to 4 in k8s wikikube codfw
[16:38:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:38:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:38:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:39:02] <wikibugs>	 (PS1) AikoChou: ml-services: update outlink transformer image [deployment-charts] - https://gerrit.wikimedia.org/r/947416
[16:39:17] <wikibugs>	 (CR) Elukey: [C: +1] ml-services: update outlink transformer image [deployment-charts] - https://gerrit.wikimedia.org/r/947416 (owner: AikoChou)
[16:39:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T342617)', diff saved to https://phabricator.wikimedia.org/P50314 and previous config saved to /var/cache/conftool/dbconfig/20230809-163928-ladsgroup.json
[16:39:31] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:39:32] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[16:39:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:39:36] <wikibugs>	 (CR) AikoChou: [C: +2] ml-services: update outlink transformer image [deployment-charts] - https://gerrit.wikimedia.org/r/947416 (owner: AikoChou)
[16:40:35] <wikibugs>	 (Merged) jenkins-bot: ml-services: update outlink transformer image [deployment-charts] - https://gerrit.wikimedia.org/r/947416 (owner: AikoChou)
[16:41:53] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[16:41:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:42:29] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' .
[16:43:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:44:36] <elukey>	 !log temporarily bump miscweb bugzilla pods from 4 to 8 in k8s wikikube codfw
[16:44:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:44:51] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:45:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:45:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:45:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:45:07] <hnowlan>	 still debugging the restbase issues - the errors are having no user-facing impact as we're not using restbase for that endpoint 
[16:45:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P50315 and previous config saved to /var/cache/conftool/dbconfig/20230809-164520-ladsgroup.json
[16:45:21] <hnowlan>	 but I'm still trying to figure out why that's happening, the wikifeeds service itself is fine 
[16:45:22] <jynus>	 hnowlan: thanks, that is good to know
[16:45:56] <jynus>	 hnowlan: can you think of an upstream or downstream dependency that could link both issues?
[16:46:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:48:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:48:30] <hnowlan>	 jynus: not really :/ the request goes restbase->service mesh->wikifeeds 
[16:48:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P50316 and previous config saved to /var/cache/conftool/dbconfig/20230809-164844-ladsgroup.json
[16:49:19] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:49:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:49:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:50:13] <jinxer-wm>	 (ATSBackendErrorsHigh) resolved: ATS: elevated 5xx errors from miscweb.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqsin%20prometheus/ops&var-cluster=text&var-origin=miscweb.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh
[16:50:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:50:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:51:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:54:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P50317 and previous config saved to /var/cache/conftool/dbconfig/20230809-165434-ladsgroup.json
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T1700)
[17:00:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P50318 and previous config saved to /var/cache/conftool/dbconfig/20230809-170027-ladsgroup.json
[17:03:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T343718)', diff saved to https://phabricator.wikimedia.org/P50319 and previous config saved to /var/cache/conftool/dbconfig/20230809-170351-ladsgroup.json
[17:03:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[17:03:54] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[17:04:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[17:04:29] <wikibugs>	 (CR) Btullis: "Sorry for being late to the party on this review. Thanks so much for your work on this @Slyngshede and @elukey." [puppet] - https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
[17:05:57] <jynus>	 all should be good now
[17:09:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P50320 and previous config saved to /var/cache/conftool/dbconfig/20230809-170940-ladsgroup.json
[17:14:41] <wikibugs>	 (PS1) Hnowlan: Revert "trafficserver: route wikifeeds" [puppet] - https://gerrit.wikimedia.org/r/946665
[17:15:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50321 and previous config saved to /var/cache/conftool/dbconfig/20230809-171533-ladsgroup.json
[17:15:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[17:15:37] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[17:15:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
[17:16:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T342617)', diff saved to https://phabricator.wikimedia.org/P50322 and previous config saved to /var/cache/conftool/dbconfig/20230809-171604-ladsgroup.json
[17:23:43] <wikibugs>	 (CR) Hnowlan: [C: +2] Revert "trafficserver: route wikifeeds" [puppet] - https://gerrit.wikimedia.org/r/946665 (owner: Hnowlan)
[17:24:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T342617)', diff saved to https://phabricator.wikimedia.org/P50323 and previous config saved to /var/cache/conftool/dbconfig/20230809-172447-ladsgroup.json
[17:24:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[17:24:53] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[17:25:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[17:25:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1192 (T342617)', diff saved to https://phabricator.wikimedia.org/P50324 and previous config saved to /var/cache/conftool/dbconfig/20230809-172507-ladsgroup.json
[17:25:23] <wikibugs>	 (PS1) Btullis: Use python3 for the check_hdfs_active_namenode script [puppet] - https://gerrit.wikimedia.org/r/947421 (https://phabricator.wikimedia.org/T329363)
[17:27:15] <wikibugs>	 (CR) Btullis: [C: +2] Use python3 for the check_hdfs_active_namenode script [puppet] - https://gerrit.wikimedia.org/r/947421 (https://phabricator.wikimedia.org/T329363) (owner: Btullis)
[17:27:44] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[17:27:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
[17:28:04] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T343718)', diff saved to https://phabricator.wikimedia.org/P50325 and previous config saved to /var/cache/conftool/dbconfig/20230809-172803-ladsgroup.json
[17:28:09] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[17:31:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P50326 and previous config saved to /var/cache/conftool/dbconfig/20230809-173110-ladsgroup.json
[17:46:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P50327 and previous config saved to /var/cache/conftool/dbconfig/20230809-174616-ladsgroup.json
[17:48:48] <wikibugs>	 (PS1) Ssingh: P:bird::anycast: use systemd::sysuser for creating the bird user [puppet] - https://gerrit.wikimedia.org/r/947425
[17:50:55] <wikibugs>	 (CR) Ssingh: "https://puppet-compiler.wmflabs.org/output/947425/42813/" [puppet] - https://gerrit.wikimedia.org/r/947425 (owner: Ssingh)
[17:52:18] <wikibugs>	 (CR) Ssingh: "Note: bird2 postinst creates the user but we are" [puppet] - https://gerrit.wikimedia.org/r/947425 (owner: Ssingh)
[17:52:34] <wikibugs>	 (CR) Stevemunene: idp_test: add datahub_staging as a OIDC service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/944231 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene)
[17:52:49] <wikibugs>	 (CR) Ssingh: P:bird::anycast: use systemd::sysuser for creating the bird user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/947425 (owner: Ssingh)
[17:54:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T343718)', diff saved to https://phabricator.wikimedia.org/P50328 and previous config saved to /var/cache/conftool/dbconfig/20230809-175434-ladsgroup.json
[17:54:41] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[17:56:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:00:05] <jouncebot>	 brennen and dancy: Time to snap out of that daydream and deploy Train log triage with CPT. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T1800).
[18:01:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T342617)', diff saved to https://phabricator.wikimedia.org/P50329 and previous config saved to /var/cache/conftool/dbconfig/20230809-180122-ladsgroup.json
[18:01:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[18:01:28] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[18:01:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[18:01:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:01:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2165 (T342617)', diff saved to https://phabricator.wikimedia.org/P50330 and previous config saved to /var/cache/conftool/dbconfig/20230809-180143-ladsgroup.json
[18:09:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P50331 and previous config saved to /var/cache/conftool/dbconfig/20230809-180940-ladsgroup.json
[18:12:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T342617)', diff saved to https://phabricator.wikimedia.org/P50332 and previous config saved to /var/cache/conftool/dbconfig/20230809-181219-ladsgroup.json
[18:12:24] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[18:24:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P50333 and previous config saved to /var/cache/conftool/dbconfig/20230809-182446-ladsgroup.json
[18:27:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P50334 and previous config saved to /var/cache/conftool/dbconfig/20230809-182726-ladsgroup.json
[18:39:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T343718)', diff saved to https://phabricator.wikimedia.org/P50335 and previous config saved to /var/cache/conftool/dbconfig/20230809-183952-ladsgroup.json
[18:39:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance
[18:39:57] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[18:40:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance
[18:40:10] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[18:40:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
[18:40:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T343718)', diff saved to https://phabricator.wikimedia.org/P50336 and previous config saved to /var/cache/conftool/dbconfig/20230809-184018-ladsgroup.json
[18:42:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T343718)', diff saved to https://phabricator.wikimedia.org/P50337 and previous config saved to /var/cache/conftool/dbconfig/20230809-184228-ladsgroup.json
[18:42:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P50338 and previous config saved to /var/cache/conftool/dbconfig/20230809-184238-ladsgroup.json
[18:43:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:48:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[18:50:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T342617)', diff saved to https://phabricator.wikimedia.org/P50339 and previous config saved to /var/cache/conftool/dbconfig/20230809-185040-ladsgroup.json
[18:50:45] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[18:57:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P50340 and previous config saved to /var/cache/conftool/dbconfig/20230809-185734-ladsgroup.json
[18:57:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T342617)', diff saved to https://phabricator.wikimedia.org/P50341 and previous config saved to /var/cache/conftool/dbconfig/20230809-185745-ladsgroup.json
[18:57:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance
[18:57:50] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[18:58:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance
[18:58:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1193 (T342617)', diff saved to https://phabricator.wikimedia.org/P50342 and previous config saved to /var/cache/conftool/dbconfig/20230809-185805-ladsgroup.json
[19:05:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P50343 and previous config saved to /var/cache/conftool/dbconfig/20230809-190547-ladsgroup.json
[19:12:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P50344 and previous config saved to /var/cache/conftool/dbconfig/20230809-191240-ladsgroup.json
[19:20:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P50345 and previous config saved to /var/cache/conftool/dbconfig/20230809-192053-ladsgroup.json
[19:27:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T343718)', diff saved to https://phabricator.wikimedia.org/P50346 and previous config saved to /var/cache/conftool/dbconfig/20230809-192746-ladsgroup.json
[19:27:49] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance
[19:27:51] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[19:28:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance
[19:28:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T343718)', diff saved to https://phabricator.wikimedia.org/P50347 and previous config saved to /var/cache/conftool/dbconfig/20230809-192818-ladsgroup.json
[19:33:27] <wikibugs>	 SRE, Infrastructure-Foundations, netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (cmooney) RMA in progress, Juniper happy with address for replacement and staff at destination are aware of delivery.  I will decom the existing faulty card on Sunday when on site and prep...
[19:36:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T342617)', diff saved to https://phabricator.wikimedia.org/P50348 and previous config saved to /var/cache/conftool/dbconfig/20230809-193559-ladsgroup.json
[19:36:04] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
[19:36:10] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[19:36:17] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
[19:36:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T342617)', diff saved to https://phabricator.wikimedia.org/P50349 and previous config saved to /var/cache/conftool/dbconfig/20230809-193623-ladsgroup.json
[19:44:36] <wikibugs>	 (PS2) Cathal Mooney: Policy and definition updates for post-migration esams ranges [homer/public] - https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214)
[19:44:54] <wikibugs>	 (CR) Cathal Mooney: Policy and definition updates for post-migration esams ranges (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214) (owner: Cathal Mooney)
[19:45:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T342617)', diff saved to https://phabricator.wikimedia.org/P50350 and previous config saved to /var/cache/conftool/dbconfig/20230809-194501-ladsgroup.json
[19:45:08] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[19:52:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T343718)', diff saved to https://phabricator.wikimedia.org/P50351 and previous config saved to /var/cache/conftool/dbconfig/20230809-195212-ladsgroup.json
[19:52:17] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[19:58:17] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on contint2001.wikimedia.org with reason: Decommissioning
[19:58:30] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on contint2001.wikimedia.org with reason: Decommissioning
[19:59:13] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.decommission for hosts contint2001.wikimedia.org
[20:00:08] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P50352 and previous config saved to /var/cache/conftool/dbconfig/20230809-200007-ladsgroup.json
[20:01:57] <icinga-wm>	 RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:02:31] <icinga-wm>	 RECOVERY - BFD status on cr1-drmrs is OK: UP: 0 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:05:09] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:05:16] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.dns.netbox
[20:07:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P50353 and previous config saved to /var/cache/conftool/dbconfig/20230809-200718-ladsgroup.json
[20:08:00] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: contint2001.wikimedia.org decommissioned, removing all IPs except the asset tag one - aokoth@cumin1001"
[20:09:11] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: contint2001.wikimedia.org decommissioned, removing all IPs except the asset tag one - aokoth@cumin1001"
[20:09:11] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:09:12] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts contint2001.wikimedia.org
[20:13:55] <wikibugs>	 SRE, Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 (ssingh)
[20:15:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P50354 and previous config saved to /var/cache/conftool/dbconfig/20230809-201514-ladsgroup.json
[20:15:54] <wikibugs>	 SRE, Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 (ssingh) In discussion with @cmooney, we will be revisiting this task again when Traffic does some other authdns-related work, so removing it from the Traffic-Icebox.
[20:22:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P50355 and previous config saved to /var/cache/conftool/dbconfig/20230809-202225-ladsgroup.json
[20:23:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T342617)', diff saved to https://phabricator.wikimedia.org/P50356 and previous config saved to /var/cache/conftool/dbconfig/20230809-202316-ladsgroup.json
[20:23:20] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[20:30:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T342617)', diff saved to https://phabricator.wikimedia.org/P50357 and previous config saved to /var/cache/conftool/dbconfig/20230809-203020-ladsgroup.json
[20:30:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[20:30:31] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[20:30:35] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[20:30:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1203 (T342617)', diff saved to https://phabricator.wikimedia.org/P50358 and previous config saved to /var/cache/conftool/dbconfig/20230809-203041-ladsgroup.json
[20:35:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T342617)', diff saved to https://phabricator.wikimedia.org/P50359 and previous config saved to /var/cache/conftool/dbconfig/20230809-203502-ladsgroup.json
[20:37:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T343718)', diff saved to https://phabricator.wikimedia.org/P50360 and previous config saved to /var/cache/conftool/dbconfig/20230809-203731-ladsgroup.json
[20:37:35] <stashbot>	 T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
[20:38:23] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P50361 and previous config saved to /var/cache/conftool/dbconfig/20230809-203822-ladsgroup.json
[20:50:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P50362 and previous config saved to /var/cache/conftool/dbconfig/20230809-205008-ladsgroup.json
[20:53:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P50363 and previous config saved to /var/cache/conftool/dbconfig/20230809-205329-ladsgroup.json
[20:55:51] <wikibugs>	 (PS2) Stevemunene: airflow-wmde: configure wmde airflow instance [puppet] - https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648)
[20:55:53] <wikibugs>	 (PS2) Stevemunene: airflow-wmde: Create scap deployment source for wmde [puppet] - https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648)
[21:00:05] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T2100)
[21:05:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P50364 and previous config saved to /var/cache/conftool/dbconfig/20230809-210514-ladsgroup.json
[21:08:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T342617)', diff saved to https://phabricator.wikimedia.org/P50365 and previous config saved to /var/cache/conftool/dbconfig/20230809-210835-ladsgroup.json
[21:08:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[21:08:39] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[21:08:50] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[21:08:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T342617)', diff saved to https://phabricator.wikimedia.org/P50366 and previous config saved to /var/cache/conftool/dbconfig/20230809-210856-ladsgroup.json
[21:16:30] <wikibugs>	 (CR) Jforrester: [tests] Ensure each config has at most one value per wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/947365 (owner: Urbanecm)
[21:18:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T342617)', diff saved to https://phabricator.wikimedia.org/P50367 and previous config saved to /var/cache/conftool/dbconfig/20230809-211853-ladsgroup.json
[21:18:58] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[21:19:33] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:20:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T342617)', diff saved to https://phabricator.wikimedia.org/P50368 and previous config saved to /var/cache/conftool/dbconfig/20230809-212021-ladsgroup.json
[21:20:23] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[21:20:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance
[21:20:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T342617)', diff saved to https://phabricator.wikimedia.org/P50369 and previous config saved to /var/cache/conftool/dbconfig/20230809-212042-ladsgroup.json
[21:29:32] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[21:34:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P50371 and previous config saved to /var/cache/conftool/dbconfig/20230809-213359-ladsgroup.json
[21:49:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P50372 and previous config saved to /var/cache/conftool/dbconfig/20230809-214905-ladsgroup.json
[21:55:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T342617)', diff saved to https://phabricator.wikimedia.org/P50373 and previous config saved to /var/cache/conftool/dbconfig/20230809-215535-ladsgroup.json
[21:55:39] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[22:04:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T342617)', diff saved to https://phabricator.wikimedia.org/P50375 and previous config saved to /var/cache/conftool/dbconfig/20230809-220412-ladsgroup.json
[22:04:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[22:04:16] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[22:04:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[22:04:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1209 (T342617)', diff saved to https://phabricator.wikimedia.org/P50376 and previous config saved to /var/cache/conftool/dbconfig/20230809-220433-ladsgroup.json
[22:10:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P50377 and previous config saved to /var/cache/conftool/dbconfig/20230809-221041-ladsgroup.json
[22:25:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P50378 and previous config saved to /var/cache/conftool/dbconfig/20230809-222547-ladsgroup.json
[22:34:21] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:36:55] <icinga-wm>	 PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:40:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T342617)', diff saved to https://phabricator.wikimedia.org/P50379 and previous config saved to /var/cache/conftool/dbconfig/20230809-224053-ladsgroup.json
[22:40:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[22:40:58] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[22:41:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[22:41:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T342617)', diff saved to https://phabricator.wikimedia.org/P50380 and previous config saved to /var/cache/conftool/dbconfig/20230809-224114-ladsgroup.json
[22:56:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T342617)', diff saved to https://phabricator.wikimedia.org/P50381 and previous config saved to /var/cache/conftool/dbconfig/20230809-225605-ladsgroup.json
[22:56:11] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[23:03:19] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[23:03:33] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance
[23:03:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T342617)', diff saved to https://phabricator.wikimedia.org/P50382 and previous config saved to /var/cache/conftool/dbconfig/20230809-230339-ladsgroup.json
[23:03:42] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[23:11:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P50383 and previous config saved to /var/cache/conftool/dbconfig/20230809-231112-ladsgroup.json
[23:26:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P50384 and previous config saved to /var/cache/conftool/dbconfig/20230809-232619-ladsgroup.json
[23:28:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T342617)', diff saved to https://phabricator.wikimedia.org/P50385 and previous config saved to /var/cache/conftool/dbconfig/20230809-232855-ladsgroup.json
[23:29:02] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[23:41:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T342617)', diff saved to https://phabricator.wikimedia.org/P50386 and previous config saved to /var/cache/conftool/dbconfig/20230809-234125-ladsgroup.json
[23:41:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[23:41:29] <stashbot>	 T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617
[23:41:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[23:41:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1211 (T342617)', diff saved to https://phabricator.wikimedia.org/P50387 and previous config saved to /var/cache/conftool/dbconfig/20230809-234146-ladsgroup.json
[23:44:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P50388 and previous config saved to /var/cache/conftool/dbconfig/20230809-234402-ladsgroup.json
[23:59:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P50389 and previous config saved to /var/cache/conftool/dbconfig/20230809-235908-ladsgroup.json