[00:00:24] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:00:34] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:00:48] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:04:16] PROBLEM - Check systemd state on datahubsearch1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:06:48] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:07:52] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.048 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:08:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 2.853 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:08:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P50212 and previous config saved to /var/cache/conftool/dbconfig/20230809-000804-ladsgroup.json [00:15:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:15:40] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:15:52] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:23:02] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.692 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:23:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P50213 and previous config saved to /var/cache/conftool/dbconfig/20230809-002310-ladsgroup.json [00:23:12] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 3.953 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:23:18] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:30:50] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:31:00] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:31:08] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:37:14] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:38:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50214 and previous config saved to /var/cache/conftool/dbconfig/20230809-003817-ladsgroup.json [00:38:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [00:38:21] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [00:38:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1145.eqiad.wmnet with reason: Maintenance [00:38:40] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/945843 [00:38:46] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/945843 (owner: TrainBranchBot) [00:39:52] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.264 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:40:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:43:00] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T342617)', diff saved to https://phabricator.wikimedia.org/P50215 and previous config saved to /var/cache/conftool/dbconfig/20230809-004605-ladsgroup.json [00:46:10] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [00:47:30] PROBLEM - Check systemd state on 
logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:50:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:50:42] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:50:48] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:53:38] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/945843 (owner: TrainBranchBot) [00:55:08] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 7.042 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:55:10] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:56:34] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.127 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:01:10] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:01:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P50216 and previous config saved to /var/cache/conftool/dbconfig/20230809-010112-ladsgroup.json [01:01:16] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:02:44] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.070 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:07:14] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:07:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:07:26] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:08:50] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:10:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.182 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:10:18] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 8.800 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:14:54] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:16:18] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:16:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136', diff saved to https://phabricator.wikimedia.org/P50217 and previous config saved to /var/cache/conftool/dbconfig/20230809-011618-ladsgroup.json [01:16:30] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:17:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [01:20:50] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.940 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:20:52] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Wed 18 Oct 2023 03:52:32 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:20:56] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 9.552 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:30:02] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:30:06] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:31:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2136 (T342617)', diff saved to https://phabricator.wikimedia.org/P50218 and previous config saved to /var/cache/conftool/dbconfig/20230809-013124-ladsgroup.json [01:31:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [01:31:28] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [01:31:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2137.codfw.wmnet with reason: Maintenance [01:31:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2137:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50219 and previous config saved to /var/cache/conftool/dbconfig/20230809-013145-ladsgroup.json [01:34:36] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 9.420 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:34:38] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50422 bytes in 7.481 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:42:08] PROBLEM - Wikitech-static main page has content on cloudweb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Wikitech-static [01:42:26] PROBLEM - Wikitech-static main page has content on cloudweb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [01:54:06] RECOVERY - Wikitech-static main page has content on cloudweb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 25998 bytes in 0.282 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [01:54:22] RECOVERY - Wikitech-static main page has content on cloudweb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 25998 bytes in 0.299 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [02:06:33] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:27:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:31:33] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:32:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:56:50] (PS2) KartikMistry: testwiki: Enable Section Translation for 7 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/946852 (https://phabricator.wikimedia.org/T343211) [03:27:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:37:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [03:57:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:02:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [05:56:12] (PS1) Marostegui: mariadb: Move db1119 to m3 [puppet] - https://gerrit.wikimedia.org/r/947197 (https://phabricator.wikimedia.org/T335080) [05:57:47] (CR) Marostegui: [C: +2] mariadb: Move db1119 to m3 [puppet] - https://gerrit.wikimedia.org/r/947197 (https://phabricator.wikimedia.org/T335080) (owner: Marostegui) [06:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T0600) [06:03:33] (PS1) Marostegui: Revert "cloudbackup200[12]: remove some spurious config from the last patch" [puppet] - https://gerrit.wikimedia.org/r/946654 [06:03:47] (PS1) Marostegui: Revert "Correct the role for the new hadoop workers" [puppet] - https://gerrit.wikimedia.org/r/946655 [06:04:29] (CR) Marostegui: [C: +2] Revert "cloudbackup200[12]: remove some spurious config from the last patch" [puppet] - https://gerrit.wikimedia.org/r/946654 (owner: Marostegui) [06:04:37] (CR) Marostegui: [C: +2] Revert "Correct the role for the new hadoop workers" [puppet] - https://gerrit.wikimedia.org/r/946655 (owner: Marostegui) [06:06:00] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [06:07:21] (PS1) Marostegui: site.pp: Remove db1119 from s1 [puppet] - https://gerrit.wikimedia.org/r/947198 [06:08:03] (CR) Marostegui: [C: +2] site.pp: Remove db1119 from s1 [puppet] - https://gerrit.wikimedia.org/r/947198 (owner: Marostegui) [06:12:36] PROBLEM - haproxy failover on dbproxy1026 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [06:13:14] PROBLEM - haproxy failover on dbproxy1020 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [06:14:27] (PS1) Marostegui: install_server: Add db12[34-49] to reimage list [puppet] - https://gerrit.wikimedia.org/r/947199 (https://phabricator.wikimedia.org/T342166) [06:15:13] (CR) Marostegui: [C: +2] install_server: Add db12[34-49] to reimage list [puppet] - https://gerrit.wikimedia.org/r/947199 (https://phabricator.wikimedia.org/T342166) (owner: Marostegui) [06:15:49] haproxy alerts are expected [06:18:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [06:18:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance [06:18:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50222 and previous config saved to /var/cache/conftool/dbconfig/20230809-061826-ladsgroup.json [06:18:30] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [06:18:30] PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [06:19:32] PROBLEM - haproxy failover on dbproxy1024 is CRITICAL: CRITICAL check_failover servers up 1 down 1: 
https://wikitech.wikimedia.org/wiki/HAProxy [06:19:34] PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [06:19:48] PROBLEM - haproxy failover on dbproxy1027 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [06:20:56] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [06:21:04] RECOVERY - haproxy failover on dbproxy1024 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:21:06] RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:21:20] RECOVERY - haproxy failover on dbproxy1027 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:22:05] (CR) Marostegui: Drop old externallinks columns (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/946556 (https://phabricator.wikimedia.org/T343718) (owner: Ladsgroup) [06:22:26] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:23:06] RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:24:05] (PS1) Marostegui: site.pp: Add db12[26-33] [puppet] - https://gerrit.wikimedia.org/r/947200 (https://phabricator.wikimedia.org/T342176) [06:24:55] (CR) Marostegui: [C: +2] site.pp: Add db12[26-33] [puppet] - https://gerrit.wikimedia.org/r/947200 (https://phabricator.wikimedia.org/T342176) (owner: Marostegui) [06:28:55] (PS1) Muehlenhoff: Remove access for jkieserman [puppet] - https://gerrit.wikimedia.org/r/947201 [06:29:40] (CR) CI reject: [V: -1] Remove access for jkieserman [puppet] - 
https://gerrit.wikimedia.org/r/947201 (owner: Muehlenhoff) [06:32:55] PROBLEM - haproxy failover on dbproxy1022 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [06:33:55] RECOVERY - haproxy failover on dbproxy1022 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:35:31] (PS2) Muehlenhoff: Remove access for jkieserman [puppet] - https://gerrit.wikimedia.org/r/947201 [06:36:16] (CR) CI reject: [V: -1] Remove access for jkieserman [puppet] - https://gerrit.wikimedia.org/r/947201 (owner: Muehlenhoff) [06:40:51] (PS3) Muehlenhoff: Remove access for jkieserman [puppet] - https://gerrit.wikimedia.org/r/947201 [06:42:33] RECOVERY - haproxy failover on dbproxy1020 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:43:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:44:40] (CR) Muehlenhoff: [C: +2] Remove access for jkieserman [puppet] - https://gerrit.wikimedia.org/r/947201 (owner: Muehlenhoff) [06:45:19] RECOVERY - haproxy failover on dbproxy1026 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [06:46:23] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Jmads out of all services on: 1309 hosts [06:46:27] !log root@cumin2002 END (FAIL) - Cookbook sre.idm.logout (exit_code=99) Logging Jmads out of all services on: 1309 hosts [06:46:51] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Jkieserman out of all services on: 1309 hosts [06:47:26] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jkieserman out of all services on: 1309 
hosts [06:48:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:51:26] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Jkieserman out of all services on: 716 hosts [06:51:39] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jkieserman out of all services on: 716 hosts [06:51:48] !log root@cumin2002 START - Cookbook sre.idm.logout Logging Jkieserman out of all services on: 33 hosts [06:52:05] !log root@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jkieserman out of all services on: 33 hosts [07:00:05] Amir1, Urbanecm, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T0700). [07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:17] o/ [07:00:23] * kart_ is here [07:00:24] kart_: I assume you'll self-deploy? [07:00:33] taavi: yes :) [07:00:43] Starting deployment.. 
[07:01:02] (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/946852 (https://phabricator.wikimedia.org/T343211) (owner: KartikMistry) [07:01:45] (Merged) jenkins-bot: testwiki: Enable Section Translation for 7 Wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/946852 (https://phabricator.wikimedia.org/T343211) (owner: KartikMistry) [07:02:13] !log kartik@deploy1002 Started scap: Backport for [[gerrit:946852|testwiki: Enable Section Translation for 7 Wikipedias (T343211)]] [07:02:27] T343211: Enable Content and Section translation on 12 Wikipedias - https://phabricator.wikimedia.org/T343211 [07:03:42] !log kartik@deploy1002 kartik: Backport for [[gerrit:946852|testwiki: Enable Section Translation for 7 Wikipedias (T343211)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:05:47] !log kartik@deploy1002 kartik: Continuing with sync [07:12:11] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:946852|testwiki: Enable Section Translation for 7 Wikipedias (T343211)]] (duration: 09m 58s) [07:12:15] T343211: Enable Content and Section translation on 12 Wikipedias - https://phabricator.wikimedia.org/T343211 [07:12:21] taavi: I'm done. 
[07:12:34] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:12:57] PROBLEM - Wikitech-static main page has content on cloudweb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [07:13:43] PROBLEM - Wikitech-static main page has content on cloudweb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [07:17:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:19:37] hmmm something it's going on with wikimedia-static [07:20:04] *wikitech-static [07:20:07] E_COFFEE [07:23:15] ACKNOWLEDGEMENT - Check systemd state on debmonitor2003 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-maintenance-gc.service,uwsgi-debmonitor.service,wmf_auto_restart_uwsgi-debmonitor.service Jcrespo WIP host https://phabricator.wikimedia.org/T241049 - The acknowledgement expires at: 2023-09-06 08:00:00. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:23:15] ACKNOWLEDGEMENT - debmonitor.discovery.wmnet:443 internal on debmonitor2003 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 443: HTTP/1.1 503 Service Unavailable Jcrespo WIP host https://phabricator.wikimedia.org/T241049 - The acknowledgement expires at: 2023-09-06 08:00:00. 
https://wikitech.wikimedia.org/wiki/Debmonitor [07:23:27] RECOVERY - Wikitech-static main page has content on cloudweb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 26067 bytes in 9.194 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [07:24:01] RECOVERY - Wikitech-static main page has content on cloudweb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 26067 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [07:52:17] !log mvernon@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be1003.eqiad.wmnet [07:56:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:58:35] RECOVERY - Disk space on thanos-be1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [07:58:35] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:58:54] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1003.eqiad.wmnet [07:59:43] (CR) Muehlenhoff: thanos: Avoid Ferm-specific syntax (1 comment) [puppet] - https://gerrit.wikimedia.org/r/945752 (owner: Muehlenhoff) [08:01:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:09:22] (PS1) Muehlenhoff: Add a Firewall::Portrange define [puppet] - https://gerrit.wikimedia.org/r/947316 [08:15:17] (CR) Muehlenhoff: "check experimental" [puppet] - 
https://gerrit.wikimedia.org/r/947316 (owner: Muehlenhoff) [08:19:40] (CR) Ladsgroup: Drop old externallinks columns (1 comment) [software/schema-changes] - https://gerrit.wikimedia.org/r/946556 (https://phabricator.wikimedia.org/T343718) (owner: Ladsgroup) [08:20:50] (CR) Marostegui: [C: +1] Drop old externallinks columns [software/schema-changes] - https://gerrit.wikimedia.org/r/946556 (https://phabricator.wikimedia.org/T343718) (owner: Ladsgroup) [08:21:29] (CR) Ladsgroup: [C: +2] "\o/" [software/schema-changes] - https://gerrit.wikimedia.org/r/946556 (https://phabricator.wikimedia.org/T343718) (owner: Ladsgroup) [08:21:52] (Merged) jenkins-bot: Drop old externallinks columns [software/schema-changes] - https://gerrit.wikimedia.org/r/946556 (https://phabricator.wikimedia.org/T343718) (owner: Ladsgroup) [08:28:11] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/947022 (https://phabricator.wikimedia.org/T342972) (owner: Eevans) [08:29:13] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/946632 (https://phabricator.wikimedia.org/T342968) (owner: Eevans) [08:32:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [08:32:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [08:32:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [08:32:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [08:32:58] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:33:14] !log ladsgroup@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:33:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T342617)', diff saved to https://phabricator.wikimedia.org/P50223 and previous config saved to /var/cache/conftool/dbconfig/20230809-083319-ladsgroup.json [08:33:23] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [08:34:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [08:34:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [08:37:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50224 and previous config saved to /var/cache/conftool/dbconfig/20230809-083738-ladsgroup.json [08:49:21] (PS1) Muehlenhoff: Add Nicholas as approver for wmcs-admin [puppet] - https://gerrit.wikimedia.org/r/947319 [08:52:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P50225 and previous config saved to /var/cache/conftool/dbconfig/20230809-085244-ladsgroup.json [09:02:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [09:02:12] !log ladsgroup@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [09:05:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [09:05:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: 
Maintenance [09:05:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [09:05:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [09:07:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314', diff saved to https://phabricator.wikimedia.org/P50226 and previous config saved to /var/cache/conftool/dbconfig/20230809-090750-ladsgroup.json [09:09:43] (03PS1) 10Elukey: istio: increase max resources for envoy in ml-serve's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/947322 [09:13:57] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:14:17] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:14:29] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:16:58] (03CR) 10Muehlenhoff: "Looks good, one comment inline. You can also remove the threedtopng classes, they are also only used by Thumbor." 
[puppet] - 10https://gerrit.wikimedia.org/r/946951 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [09:18:47] (03PS8) 10David Caro: replica_cnf_api: add envvars backend [puppet] - 10https://gerrit.wikimedia.org/r/936232 (https://phabricator.wikimedia.org/T265691) [09:20:39] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42801/console" [puppet] - 10https://gerrit.wikimedia.org/r/946981 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:22:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2137:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50227 and previous config saved to /var/cache/conftool/dbconfig/20230809-092258-ladsgroup.json [09:23:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [09:23:02] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [09:23:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [09:23:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50228 and previous config saved to /var/cache/conftool/dbconfig/20230809-092319-ladsgroup.json [09:25:02] (03PS2) 10JMeybohm: deployment_server::general: Globally enable mesh.certmanager [puppet] - 10https://gerrit.wikimedia.org/r/946981 (https://phabricator.wikimedia.org/T300033) [09:25:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [09:25:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance [09:25:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [09:26:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2118.codfw.wmnet with reason: Maintenance [09:26:58] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42802/console" [puppet] - 10https://gerrit.wikimedia.org/r/946981 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:30:34] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: route wikifeeds requests via the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/945558 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [09:31:15] 10SRE, 10SRE-OnFire, 10Incident Tooling, 10Sustainability (Incident Followup): Grant slightly broader access to Klaxon - https://phabricator.wikimedia.org/T343377 (10TheresNoTime) nb. [[ https://docs.google.com/document/d/1sOQ-b7Z4SLMevGEo9ar8B_8PksGhpXiGopAPLGnfzhk/edit | followup (docs)]] from {T343294} [09:31:21] !log disabling puppet on A:cp to test 945558 [09:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:55] (03CR) 10Hnowlan: [C: 03+2] trafficserver: route wikifeeds requests via the rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/945558 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [09:33:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:33:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:33:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T342617)', diff saved to https://phabricator.wikimedia.org/P50229 and previous config saved to /var/cache/conftool/dbconfig/20230809-093341-ladsgroup.json [09:33:45] T342617: Make old columns of 
externallinks nullable - https://phabricator.wikimedia.org/T342617 [09:33:46] hmm, confctl didn't log here [09:34:06] (03PS1) 10David Caro: cloud.haproxy: avoid keep-alive for stats scrapers [puppet] - 10https://gerrit.wikimedia.org/r/947325 [09:34:49] PROBLEM - Check systemd state on an-worker1086 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:37:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [09:37:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance [09:37:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:37:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [09:37:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T342617)', diff saved to https://phabricator.wikimedia.org/P50230 and previous config saved to /var/cache/conftool/dbconfig/20230809-093715-ladsgroup.json [09:37:19] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] deployment_server::general: Globally enable mesh.certmanager [puppet] - 10https://gerrit.wikimedia.org/r/946981 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:39:28] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42803/console" [puppet] - 10https://gerrit.wikimedia.org/r/947325 (owner: 10David Caro) [09:39:52] (03PS12) 10JMeybohm: mediawiki: set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) 
(owner: 10Clément Goubert) [09:41:04] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42804/console" [puppet] - 10https://gerrit.wikimedia.org/r/947325 (owner: 10David Caro) [09:41:37] (03PS1) 10Hnowlan: Revert "trafficserver: route wikifeeds requests via the rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/946656 [09:42:02] (03PS4) 10JMeybohm: Update apertium to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/946940 (https://phabricator.wikimedia.org/T300033) [09:43:24] (03CR) 10JMeybohm: mediawiki: set requests based on php.workers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [09:43:28] (03CR) 10Filippo Giunchedi: [C: 03+1] cloud.haproxy: avoid keep-alive for stats scrapers [puppet] - 10https://gerrit.wikimedia.org/r/947325 (owner: 10David Caro) [09:43:42] (03CR) 10David Caro: [V: 03+1 C: 03+2] cloud.haproxy: avoid keep-alive for stats scrapers [puppet] - 10https://gerrit.wikimedia.org/r/947325 (owner: 10David Caro) [09:44:58] (03CR) 10JMeybohm: [C: 03+2] Update apertium to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/946940 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:45:49] (03Merged) 10jenkins-bot: Update apertium to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/946940 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:48:46] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/apertium: apply [09:48:56] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/apertium: apply [09:49:09] (03CR) 10Vgutierrez: [C: 03+1] Revert "trafficserver: route wikifeeds requests via the rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/946656 (owner: 10Hnowlan) [09:49:49] RECOVERY - Check systemd state on an-worker1086 
is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:59] 10SRE, 10SRE-Access-Requests, 10Data-Platform-SRE: Requesting access to analytics-wmde-users (no kerberos, with ssh) for karapayneWMDE - https://phabricator.wikimedia.org/T342546 (10Gehel) [09:54:23] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1084.eqiad.wmnet with OS bullseye [09:55:13] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [09:55:36] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [09:55:37] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [09:55:56] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [09:56:03] (03CR) 10Klausman: [C: 03+1] istio: increase max resources for envoy in ml-serve's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/947322 (owner: 10Elukey) [09:57:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [09:57:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [09:57:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T343718)', diff saved to https://phabricator.wikimedia.org/P50231 and previous config saved to /var/cache/conftool/dbconfig/20230809-095730-ladsgroup.json [09:57:34] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [09:58:56] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-master1002.eqiad.wmnet [09:59:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T343718)', diff saved to https://phabricator.wikimedia.org/P50232 and previous config saved to /var/cache/conftool/dbconfig/20230809-095938-ladsgroup.json 
[10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T1000) [10:05:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-master1002.eqiad.wmnet [10:07:07] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [10:07:13] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [10:08:23] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-coord1002.eqiad.wmnet [10:08:46] (03CR) 10Hnowlan: [C: 03+2] Revert "trafficserver: route wikifeeds requests via the rest-gateway" [puppet] - 10https://gerrit.wikimedia.org/r/946656 (owner: 10Hnowlan) [10:09:14] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1084.eqiad.wmnet with reason: host reimage [10:12:23] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1084.eqiad.wmnet with reason: host reimage [10:14:29] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1002.eqiad.wmnet [10:14:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P50233 and previous config saved to /var/cache/conftool/dbconfig/20230809-101444-ladsgroup.json [10:14:59] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "There an error in a fixture I think." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [10:19:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [10:19:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance [10:19:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T342617)', diff saved to https://phabricator.wikimedia.org/P50234 and previous config saved to /var/cache/conftool/dbconfig/20230809-101946-ladsgroup.json [10:19:50] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [10:26:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T342617)', diff saved to https://phabricator.wikimedia.org/P50235 and previous config saved to /var/cache/conftool/dbconfig/20230809-102622-ladsgroup.json [10:26:27] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [10:27:48] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for RickiJay-WMDE - https://phabricator.wikimedia.org/T343508 (10RickiJay-WMDE) ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDxBiO5uB5mMR7mWih5KHZ3d9I0UhDiVI7AZ1/i8/LqMuuWSJ2Nf40a2vKmXzKPj2bIiV1PVHqr6+JO8X8PkVoKjl4DFg90IbXKO4CJOmy1Bs7FBTsf+yyFcP8C... 
[10:29:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P50236 and previous config saved to /var/cache/conftool/dbconfig/20230809-102951-ladsgroup.json [10:36:52] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1084.eqiad.wmnet with OS bullseye [10:41:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P50237 and previous config saved to /var/cache/conftool/dbconfig/20230809-104128-ladsgroup.json [10:44:29] <_joe_> !log ran requestctl commit, which removed the comma removal from the requestctl output as per T305582 [10:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:33] T305582: Annotate X-Analytics header with any matching actions - https://phabricator.wikimedia.org/T305582 [10:44:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T343718)', diff saved to https://phabricator.wikimedia.org/P50238 and previous config saved to /var/cache/conftool/dbconfig/20230809-104457-ladsgroup.json [10:44:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:45:01] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [10:45:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [10:45:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T343718)', diff saved to https://phabricator.wikimedia.org/P50239 and previous config saved to /var/cache/conftool/dbconfig/20230809-104518-ladsgroup.json [10:46:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T343718)', diff saved to https://phabricator.wikimedia.org/P50240 and previous 
config saved to /var/cache/conftool/dbconfig/20230809-104625-ladsgroup.json [10:48:48] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/945844 [10:52:20] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/945844 (owner: 10PipelineBot) [10:53:04] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/945844 (owner: 10PipelineBot) [10:54:38] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [10:54:55] !log jgiannelos@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [10:55:00] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [10:55:36] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [10:55:43] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [10:56:16] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [10:56:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P50241 and previous config saved to /var/cache/conftool/dbconfig/20230809-105635-ladsgroup.json [11:01:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P50242 and previous config saved to /var/cache/conftool/dbconfig/20230809-110132-ladsgroup.json [11:02:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [11:02:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [11:06:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T342617)', diff 
saved to https://phabricator.wikimedia.org/P50243 and previous config saved to /var/cache/conftool/dbconfig/20230809-110647-ladsgroup.json [11:06:51] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [11:07:37] (03CR) 10Elukey: [C: 03+2] istio: increase max resources for envoy in ml-serve's config [deployment-charts] - 10https://gerrit.wikimedia.org/r/947322 (owner: 10Elukey) [11:08:04] 10SRE, 10ops-codfw: ganeti2014: broken RAM - https://phabricator.wikimedia.org/T341546 (10MoritzMuehlenhoff) 05Resolved→03Open >>! In T341546#9074281, @Jhancock.wm wrote: > @MoritzMuehlenhoff you should be okay to repool it now. but feel free to reopen the ticket if you need to (knocks on wood) Thanks, th... [11:08:16] PROBLEM - Check systemd state on an-master1002 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-hdfs-namenode.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:09:46] RECOVERY - Check systemd state on an-master1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:11:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T342617)', diff saved to https://phabricator.wikimedia.org/P50244 and previous config saved to /var/cache/conftool/dbconfig/20230809-111141-ladsgroup.json [11:11:43] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:11:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:14:40] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:15:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance 
wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:15:48] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:15:50] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:16:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P50245 and previous config saved to /var/cache/conftool/dbconfig/20230809-111638-ladsgroup.json [11:20:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [11:20:19] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1085.eqiad.wmnet with OS bullseye [11:20:46] (03PS1) 10AikoChou: changeprop: filter sourceswiki from stream for outlink LW service [deployment-charts] - 10https://gerrit.wikimedia.org/r/947328 (https://phabricator.wikimedia.org/T343740) [11:21:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P50246 and previous config saved to /var/cache/conftool/dbconfig/20230809-112153-ladsgroup.json [11:29:11] 
RECOVERY - Check systemd state on datahubsearch1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T343718)', diff saved to https://phabricator.wikimedia.org/P50247 and previous config saved to /var/cache/conftool/dbconfig/20230809-113144-ladsgroup.json [11:31:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [11:31:48] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [11:31:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [11:32:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T343718)', diff saved to https://phabricator.wikimedia.org/P50248 and previous config saved to /var/cache/conftool/dbconfig/20230809-113205-ladsgroup.json [11:33:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T343718)', diff saved to https://phabricator.wikimedia.org/P50249 and previous config saved to /var/cache/conftool/dbconfig/20230809-113312-ladsgroup.json [11:34:02] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10Bengali-Sites, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10SCherukuwada) This has been reported to Google. We're waiting for them to get back. 
[11:35:13] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1085.eqiad.wmnet with reason: host reimage [11:37:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P50250 and previous config saved to /var/cache/conftool/dbconfig/20230809-113659-ladsgroup.json [11:37:26] jouncebot: nowandnext [11:37:27] No deployments scheduled for the next 2 hour(s) and 22 minute(s) [11:37:27] In 2 hour(s) and 22 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T1400) [11:38:20] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1085.eqiad.wmnet with reason: host reimage [11:38:33] (03PS10) 10Ladsgroup: sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [11:38:51] (03CR) 10Ladsgroup: [C: 03+2] sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [11:39:33] (03Merged) 10jenkins-bot: sdwiki: set 'wgTranslateNumerals' to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/937922 (https://phabricator.wikimedia.org/T268203) (owner: 10Kaleem Bhatti) [11:39:51] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:937922|sdwiki: set 'wgTranslateNumerals' to false (T268203)]] [11:39:54] T268203: Set $digitTransformTable to use english-style 0123456789 digits on sdwiki - https://phabricator.wikimedia.org/T268203 [11:41:31] !log ladsgroup@deploy1002 kaleembhatti and ladsgroup: Backport for [[gerrit:937922|sdwiki: set 'wgTranslateNumerals' to false (T268203)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug 
kubernetes deployment (accessible via k8s-experimental XWD option) [11:41:35] !log ladsgroup@deploy1002 kaleembhatti and ladsgroup: Continuing with sync [11:46:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:47:07] (ProbeDown) firing: (2) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:07] (ProbeDown) firing: Service mw-api-int:4446 has failed probes (http_mw-api-int_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-int:4446 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance [11:47:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2113.codfw.wmnet with reason: Maintenance [11:47:37] mmmm [11:47:44] is that you, Amir1 ? [11:47:53] which? [11:48:11] I'm doing schema changes right now [11:48:16] https://wikitech.wikimedia.org/wiki/Map_of_database_maintenance [11:48:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P50251 and previous config saved to /var/cache/conftool/dbconfig/20230809-114819-ladsgroup.json [11:48:22] !incidents [11:48:22] 3936 (UNACKED) [2x] ProbeDown sre (ip4 probes/service eqiad) [11:48:29] !ack 3936 [11:48:29] 3936 (ACKED) [2x] ProbeDown sre (ip4 probes/service eqiad) [11:48:37] are those old? 
[11:48:41] I don't think that's me [11:48:49] no, sorry [11:48:56] I just checked last deploys [11:49:13] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:937922|sdwiki: set 'wgTranslateNumerals' to false (T268203)]] (duration: 09m 22s) [11:49:17] T268203: Set $digitTransformTable to use english-style 0123456789 digits on sdwiki - https://phabricator.wikimedia.org/T268203 [11:50:25] do you see something, jayme? [11:51:09] https://alerts.wikimedia.org/?q=%40state%3Dactive&q=team%3Dsre says probes failing on some mw-* services [11:51:22] I see a spike on api 5xx but it is very small [11:51:34] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (PUT certificates) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:52:05] there was a mw deployment to k8s ~10min ago [11:52:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T342617)', diff saved to https://phabricator.wikimedia.org/P50252 and previous config saved to /var/cache/conftool/dbconfig/20230809-115206-ladsgroup.json [11:52:07] (ProbeDown) firing: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:07] (ProbeDown) firing: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:52:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [11:52:16] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [11:52:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance [11:52:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T342617)', diff saved to https://phabricator.wikimedia.org/P50253 and previous config saved to /var/cache/conftool/dbconfig/20230809-115227-ladsgroup.json [11:52:45] by whom? [11:53:11] not sure, I just saw the pod age in k8s directly [11:53:36] I did a config deploy (which would deploy to k8s too) ten minutes ago https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/937922/10 [11:53:44] but this is quite tame [11:54:24] logs look clean [11:54:36] mw ones, I mean [11:54:56] maybe some issue on k8s? I will check how varnish sees that [11:55:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [11:55:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance [11:55:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T342617)', diff saved to https://phabricator.wikimedia.org/P50254 and previous config saved to /var/cache/conftool/dbconfig/20230809-115534-ladsgroup.json [11:56:19] jynus: I think it's k8s related, let me check something [11:56:36] that would explain if only a 1% of traffic is affected [11:56:43] which means low to no user impact [11:58:02] !incidents [11:58:03] 3936 (ACKED) [2x] ProbeDown sre (ip4 probes/service eqiad) [11:58:39] I think it's my fault actually [11:58:47] ? 
[11:59:03] I'll explain in a bit, let me fix first [11:59:08] sure [11:59:09] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:01:13] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1085.eqiad.wmnet with OS bullseye [12:01:39] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1086.eqiad.wmnet with OS bullseye [12:02:48] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:03:07] clearly it was k8s: https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&refresh=30s [12:03:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P50255 and previous config saved to /var/cache/conftool/dbconfig/20230809-120325-ladsgroup.json [12:03:46] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service,httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:09] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly 
[12:05:04] (03PS1) 10JMeybohm: Don't enable mesh.certmanager for mw deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/947331 (https://phabricator.wikimedia.org/T300033) [12:06:26] (03CR) 10JMeybohm: [C: 03+2] Don't enable mesh.certmanager for mw deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/947331 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:07:03] (03Merged) 10jenkins-bot: Don't enable mesh.certmanager for mw deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/947331 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:07:55] (03PS1) 10Muehlenhoff: Extend the firewall::service shim with checks for legacy syntax [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) [12:08:08] (03PS2) 10Muehlenhoff: Extend the firewall::service shim with checks for legacy syntax [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) [12:08:22] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [12:08:31] (03CR) 10CI reject: [V: 04-1] Extend the firewall::service shim with checks for legacy syntax [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:08:47] jynus: re-deploying mw on k8s now [12:08:49] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [12:08:54] (03CR) 10Jelto: "looks mostly good. But I guess you also want to disable the restore on gitlab1003 then? 
https://gerrit.wikimedia.org/r/plugins/gitiles/ope" [puppet] - 10https://gerrit.wikimedia.org/r/947016 (owner: 10EoghanGaffney) [12:08:59] ok, checking graphs [12:09:38] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [12:10:10] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:10:35] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [12:11:02] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [12:11:30] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:11:35] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [12:11:43] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [12:11:57] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:12:07] (ProbeDown) firing: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:12:07] (ProbeDown) firing: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:12:44] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [12:12:49] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [12:13:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:13:48] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [12:14:27] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1086.eqiad.wmnet with reason: host reimage [12:14:58] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:15:13] jynus: we should be good again [12:15:26] waiting for recoveries [12:15:46] I've restarted the httpbb systemd units [12:15:48] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:15:59] if not, let's please depool that stack [12:16:24] traffic is back already it seems [12:17:07] (ProbeDown) resolved: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:17:07] (ProbeDown) resolved: (6) Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:17:14] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1086.eqiad.wmnet with reason: host reimage [12:17:23] there it is :-D [12:17:42] mw-debug is still to fit [12:17:43] *fix [12:17:52] RECOVERY - dump of es5 in codfw on backupmon1001 is OK: Last dump for es5 at codfw (es2025) taken on 2023-08-08 00:00:06 (4677 GiB, +0.9 %) 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:18:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:18:26] interesting, wdqs may just use the k8s endpoint? [12:18:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T343718)', diff saved to https://phabricator.wikimedia.org/P50256 and previous config saved to /var/cache/conftool/dbconfig/20230809-121831-ladsgroup.json [12:18:33] !log jayme@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [12:18:33] !log jayme@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [12:18:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [12:18:35] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [12:18:44] !log jayme@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [12:18:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [12:18:50] !log jayme@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [12:18:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T343718)', diff saved to https://phabricator.wikimedia.org/P50257 and previous config saved to /var/cache/conftool/dbconfig/20230809-121852-ladsgroup.json [12:18:58] !log jayme@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [12:18:58] !log jayme@deploy1002 helmfile [codfw] [main] START 
helmfile.d/services/mw-jobrunner : sync [12:19:08] So Amir triggered the issue (obviously, unknowingly to him) right? [12:19:14] !log jayme@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [12:19:26] basically it failed on next deploy, right? [12:19:31] !log jayme@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [12:20:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T343718)', diff saved to https://phabricator.wikimedia.org/P50258 and previous config saved to /var/cache/conftool/dbconfig/20230809-122000-ladsgroup.json [12:20:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:20:30] and that caused some certificate issue? [12:20:41] (03PS1) 10JMeybohm: Don't enable mesh.certmanager for mw deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/947333 (https://phabricator.wikimedia.org/T300033) [12:20:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:21:31] jynus: I'm just back from lunch. Not sure I understand the link with wdqs. What made you think that? 
[12:21:50] (03CR) 10JMeybohm: [C: 03+2] Don't enable mesh.certmanager for mw deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/947333 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:22:02] gehel: It looked at first like it also failed while the main issue was ongoing, but now I see it may be just a coincidence [12:22:31] (03Merged) 10jenkins-bot: Don't enable mesh.certmanager for mw deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/947333 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:22:33] The allocator decreasing alert? [12:22:44] yeah, is it just noisy? [12:23:11] jynus: I flipped a switch in some global configuration that made the mw-on-k8s deployments use a different TLS cert on the next deploy (triggered by Amir.1) [12:23:36] https://gerrit.wikimedia.org/r/c/operations/puppet/+/946981/ [12:23:37] so that is why I asked Amir, as it lined up with that, but obviously not his fault [12:23:48] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:24:07] completely my fault, not Amir.1's! [12:24:27] jynus: no, it's a real issue. It doesn't have to be addressed right away, but definitely soon! Some Blazegraph internals are going crazy and we'll need to recover the data from a different host eventually [12:24:55] inflatador, ryankemper : see above [12:25:32] I think getting a report, even a light one, could be interesting, not as much for user impact but to avoid something like that when there is more traffic pct [12:25:42] I think this only happened because I never got that t-shirt after breaking wikipedia the first time 😇 [12:25:46] one question, did k8s depool automatically? [12:25:58] or just the traffic was very small? 
[12:26:13] because the amount of errors was close to noise levels [12:26:13] the total traffic to k8s is only 1% currently [12:26:27] (03PS7) 10Ayounsi: WIP: first scaffolding for gNMI support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) [12:27:01] jynus: I'll write an incident report in wikitech [12:27:03] because I might have taken a different route: depool it now that we can, and with time fix the issues [12:28:02] I think a deploy can cause more issues than the actual issue [12:28:15] as in, the cache wiping and that [12:28:48] in some cases yes. In this case I just re-deployed the k8s part, so it's more like restarting the mediawiki appservers [12:28:59] no mw-version change or something [12:29:25] the actual issue was the tls terminating component of the mw deployments, not mw itself obviously [12:30:04] let me help with the doc https://grafana.wikimedia.org/goto/Mu3BrT64k?orgId=1 [12:35:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P50259 and previous config saved to /var/cache/conftool/dbconfig/20230809-123506-ladsgroup.json [12:36:58] (03PS4) 10EoghanGaffney: gitlab: Configure object storage for gitlab1003 on Swift [puppet] - 10https://gerrit.wikimedia.org/r/947016 [12:37:22] (03CR) 10CI reject: [V: 04-1] gitlab: Configure object storage for gitlab1003 on Swift [puppet] - 10https://gerrit.wikimedia.org/r/947016 (owner: 10EoghanGaffney) [12:39:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T342617)', diff saved to https://phabricator.wikimedia.org/P50260 and previous config saved to /var/cache/conftool/dbconfig/20230809-123906-ladsgroup.json [12:39:10] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [12:39:18] (03PS1) 10ArielGlenn: Remove comments saying that the script doesn't verify/rename output files [dumps] - 
10https://gerrit.wikimedia.org/r/947336 [12:39:21] (03PS5) 10EoghanGaffney: gitlab: Configure object storage for gitlab1003 on Swift [puppet] - 10https://gerrit.wikimedia.org/r/947016 [12:39:23] (03CR) 10CI reject: [V: 04-1] Remove comments saying that the script doesn't verify/rename output files [dumps] - 10https://gerrit.wikimedia.org/r/947336 (owner: 10ArielGlenn) [12:39:51] (03PS3) 10Muehlenhoff: Extend the firewall::service shim with checks for legacy syntax [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) [12:40:15] (03CR) 10CI reject: [V: 04-1] Extend the firewall::service shim with checks for legacy syntax [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:40:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1086.eqiad.wmnet with OS bullseye [12:42:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [12:42:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [12:42:38] (03PS1) 10Btullis: Revert "Revert "Correct the role for the new hadoop workers"" [puppet] - 10https://gerrit.wikimedia.org/r/946660 [12:43:01] (03CR) 10CI reject: [V: 04-1] Revert "Revert "Correct the role for the new hadoop workers"" [puppet] - 10https://gerrit.wikimedia.org/r/946660 (owner: 10Btullis) [12:43:03] (03PS1) 10Ayounsi: Paramiko: remove version pin [software/homer] - 10https://gerrit.wikimedia.org/r/947337 [12:43:39] (03PS2) 10Btullis: Revert "Revert "Correct the role for the new hadoop workers"" [puppet] - 10https://gerrit.wikimedia.org/r/946660 [12:43:46] (03PS4) 10Muehlenhoff: Extend the firewall::service shim with checks for legacy syntax [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) 
[12:44:09] (03CR) 10CI reject: [V: 04-1] Extend the firewall::service shim with checks for legacy syntax [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:44:30] (03CR) 10Btullis: [C: 03+2] Revert "Revert "Correct the role for the new hadoop workers"" [puppet] - 10https://gerrit.wikimedia.org/r/946660 (owner: 10Btullis) [12:47:13] jynus: https://wikitech.wikimedia.org/wiki/Incidents/2023-08-09_mw-on-k8s_outage_due_to_wrong_tls_cert [12:47:30] (03PS2) 10ArielGlenn: Remove comments saying that the script doesn't verify/rename output files [dumps] - 10https://gerrit.wikimedia.org/r/947336 (https://phabricator.wikimedia.org/T343882) [12:47:34] (03PS5) 10Muehlenhoff: Extend the firewall::service shim with checks for legacy syntax [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) [12:47:52] (03CR) 10CI reject: [V: 04-1] Remove comments saying that the script doesn't verify/rename output files [dumps] - 10https://gerrit.wikimedia.org/r/947336 (https://phabricator.wikimedia.org/T343882) (owner: 10ArielGlenn) [12:48:14] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [12:48:34] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [12:48:44] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [12:49:12] !log restarting blazegraph on wdqs1007 (BlazegraphFreeAllocatorsDecreasingRapidly) [12:49:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:21] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [12:50:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [12:50:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P50261 and previous config saved to /var/cache/conftool/dbconfig/20230809-125012-ladsgroup.json [12:50:29] (03PS1) 10Majavah: openstack: wmcs-enc-cli: allow loading data from stdin or file [puppet] - 10https://gerrit.wikimedia.org/r/947339 (https://phabricator.wikimedia.org/T343869) [12:51:49] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/apertium: apply [12:51:53] jayme: a couple of extra views https://grafana.wikimedia.org/goto/GD1uCo6Vk?orgId=1 and https://logstash.wikimedia.org/goto/77727db3eb0ec80c9a80f64cef14ca06 [12:52:14] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply [12:52:41] (03PS3) 10ArielGlenn: Remove comments saying that the script doesn't verify/rename output files [dumps] - 10https://gerrit.wikimedia.org/r/947336 (https://phabricator.wikimedia.org/T343882) [12:53:02] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/apertium: apply [12:53:02] (03CR) 10CI reject: [V: 04-1] Remove comments saying that the script doesn't verify/rename output files [dumps] - 10https://gerrit.wikimedia.org/r/947336 (https://phabricator.wikimedia.org/T343882) (owner: 10ArielGlenn) [12:53:21] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: apply [12:53:44] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM, one typo I noticed." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/918518 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [12:54:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P50262 and previous config saved to /var/cache/conftool/dbconfig/20230809-125412-ladsgroup.json [12:54:20] (03PS4) 10ArielGlenn: Remove comments saying that the script doesn't verify/rename output files [dumps] - 10https://gerrit.wikimedia.org/r/947336 (https://phabricator.wikimedia.org/T343882) [12:54:41] looking for a +1 on https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/947338 (core-namespaces: Remove dupe wikifunctions alias) — I can't *think* of a reason why the dupe would be needed [12:55:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T342617)', diff saved to https://phabricator.wikimedia.org/P50263 and previous config saved to /var/cache/conftool/dbconfig/20230809-125555-ladsgroup.json [12:56:02] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [12:57:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/947332 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:00:00] jynus: thanks. 
I'll add a screenshot of the ATS graph [13:02:02] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42806/console" [puppet] - 10https://gerrit.wikimedia.org/r/947016 (owner: 10EoghanGaffney) [13:05:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T343718)', diff saved to https://phabricator.wikimedia.org/P50264 and previous config saved to /var/cache/conftool/dbconfig/20230809-130518-ladsgroup.json [13:05:21] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [13:05:23] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [13:05:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [13:05:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:05:51] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:05:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1212 (T343718)', diff saved to https://phabricator.wikimedia.org/P50265 and previous config saved to /var/cache/conftool/dbconfig/20230809-130557-ladsgroup.json [13:06:10] (03PS1) 10Btullis: Preseed debian-installer not to prompt for additional firmware [puppet] - 10https://gerrit.wikimedia.org/r/947341 (https://phabricator.wikimedia.org/T308106) [13:07:02] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/947016 (owner: 10EoghanGaffney) [13:08:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T343718)', diff saved to https://phabricator.wikimedia.org/P50266 and 
previous config saved to /var/cache/conftool/dbconfig/20230809-130805-ladsgroup.json [13:09:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P50267 and previous config saved to /var/cache/conftool/dbconfig/20230809-130918-ladsgroup.json [13:09:27] (03CR) 10EoghanGaffney: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42807/console" [puppet] - 10https://gerrit.wikimedia.org/r/947016 (owner: 10EoghanGaffney) [13:10:03] (03CR) 10Btullis: "I have verified that the question still exists in bookworm and appears to work in the same way: https://www.debian.org/releases/bookworm/e" [puppet] - 10https://gerrit.wikimedia.org/r/947341 (https://phabricator.wikimedia.org/T308106) (owner: 10Btullis) [13:10:12] (03PS13) 10JMeybohm: mediawiki: set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [13:11:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P50268 and previous config saved to /var/cache/conftool/dbconfig/20230809-131103-ladsgroup.json [13:11:38] (03PS8) 10Volans: WIP: first scaffolding for gNMI support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) (owner: 10Ayounsi) [13:12:22] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1087.eqiad.wmnet with OS bullseye [13:15:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance [13:15:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2107.codfw.wmnet with reason: Maintenance [13:20:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1146:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50269 and previous config saved to /var/cache/conftool/dbconfig/20230809-132012-ladsgroup.json [13:20:15] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [13:22:58] (03PS14) 10JMeybohm: mediawiki: set requests based on php.workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [13:23:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P50270 and previous config saved to /var/cache/conftool/dbconfig/20230809-132312-ladsgroup.json [13:23:45] (03CR) 10JMeybohm: mediawiki: set requests based on php.workers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/943560 (https://phabricator.wikimedia.org/T342748) (owner: 10Clément Goubert) [13:24:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T342617)', diff saved to https://phabricator.wikimedia.org/P50271 and previous config saved to /var/cache/conftool/dbconfig/20230809-132424-ladsgroup.json [13:24:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [13:24:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance [13:24:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T342617)', diff saved to https://phabricator.wikimedia.org/P50272 and previous config saved to /var/cache/conftool/dbconfig/20230809-132446-ladsgroup.json [13:26:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P50273 and previous config saved to /var/cache/conftool/dbconfig/20230809-132609-ladsgroup.json [13:28:03] 
(KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:29:41] (03PS1) 10Ayounsi: Junos: Add more info on commit errors [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) [13:29:44] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1087.eqiad.wmnet with reason: host reimage [13:31:17] (03CR) 10CI reject: [V: 04-1] Junos: Add more info on commit errors [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: 10Ayounsi) [13:31:57] (03PS1) 10David Caro: haproxy_exporter: allow setting as absent [puppet] - 10https://gerrit.wikimedia.org/r/947353 (https://phabricator.wikimedia.org/T343885) [13:31:59] (03PS1) 10David Caro: prometheus: gather stats from haproxy for openstack and cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) [13:32:53] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1087.eqiad.wmnet with reason: host reimage [13:32:57] (03CR) 10Volans: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/947337 (owner: 10Ayounsi) [13:33:04] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:33:13] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-master1002.eqiad.wmnet with OS bullseye [13:34:41] (03CR) 10Muehlenhoff: "Looks good, but we don't need this for 
bookworm; in Bookworm the whole firmware in d-i handling was revised since firmware is now allowed " [puppet] - 10https://gerrit.wikimedia.org/r/947341 (https://phabricator.wikimedia.org/T308106) (owner: 10Btullis) [13:34:47] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Improve Homer output when Juniper device rejects config - https://phabricator.wikimedia.org/T328747 (10ayounsi) a:03ayounsi [13:35:19] (03CR) 10Volans: [C: 03+1] "I haven't tested it but the idea to have more info seems good to me. Unit tests need some tweaking" [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: 10Ayounsi) [13:35:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P50274 and previous config saved to /var/cache/conftool/dbconfig/20230809-133518-ladsgroup.json [13:35:29] (03CR) 10Ayounsi: [C: 03+2] Paramiko: remove version pin [software/homer] - 10https://gerrit.wikimedia.org/r/947337 (owner: 10Ayounsi) [13:36:46] (03PS2) 10Btullis: Preseed debian-installer not to prompt for additional firmware [puppet] - 10https://gerrit.wikimedia.org/r/947341 (https://phabricator.wikimedia.org/T308106) [13:37:03] (03CR) 10Ayounsi: Junos: Add more info on commit errors (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/947352 (https://phabricator.wikimedia.org/T328747) (owner: 10Ayounsi) [13:37:09] (03Merged) 10jenkins-bot: Paramiko: remove version pin [software/homer] - 10https://gerrit.wikimedia.org/r/947337 (owner: 10Ayounsi) [13:38:00] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/947339 (https://phabricator.wikimedia.org/T343869) (owner: 10Majavah) [13:38:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P50275 and previous config saved to 
/var/cache/conftool/dbconfig/20230809-133818-ladsgroup.json [13:38:35] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42808/console" [puppet] - 10https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [13:39:17] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10MoritzMuehlenhoff) 05Resolved→03Open @adee_wmde You are using the same key to access Wikimedia Cloud Services and Wikimedia production, please generate a separate SSH key for a... [13:39:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/947341 (https://phabricator.wikimedia.org/T308106) (owner: 10Btullis) [13:40:39] (03CR) 10Btullis: [C: 03+2] Preseed debian-installer not to prompt for additional firmware [puppet] - 10https://gerrit.wikimedia.org/r/947341 (https://phabricator.wikimedia.org/T308106) (owner: 10Btullis) [13:41:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T342617)', diff saved to https://phabricator.wikimedia.org/P50276 and previous config saved to /var/cache/conftool/dbconfig/20230809-134115-ladsgroup.json [13:41:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [13:41:20] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [13:41:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance [13:41:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T342617)', diff saved to https://phabricator.wikimedia.org/P50277 and previous config saved to /var/cache/conftool/dbconfig/20230809-134136-ladsgroup.json [13:42:19] 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops: 
Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (10ayounsi) I sent a new email to Juniper yesterday to ask again about the best next steps here. [13:43:37] (03CR) 10David Caro: [C: 03+2] openstack: wmcs-enc-cli: allow loading data from stdin or file [puppet] - 10https://gerrit.wikimedia.org/r/947339 (https://phabricator.wikimedia.org/T343869) (owner: 10Majavah) [13:43:45] (03CR) 10David Caro: openstack: wmcs-enc-cli: allow loading data from stdin or file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947339 (https://phabricator.wikimedia.org/T343869) (owner: 10Majavah) [13:44:43] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (10ayounsi) 05Resolved→03Declined [13:44:49] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10Patch-For-Review: Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10ayounsi) [13:44:57] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10Patch-For-Review: Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10ayounsi) [13:45:11] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (10ayounsi) 05Stalled→03Resolved a:03ayounsi Boldly closing this as Katran will solve some if not all those limitations. 
[13:47:02] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: host reimage [13:47:25] (03PS2) 10David Caro: prometheus: gather stats from haproxy for openstack and cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) [13:47:28] (03CR) 10Elukey: [C: 03+2] changeprop: filter sourceswiki from stream for outlink LW service [deployment-charts] - 10https://gerrit.wikimedia.org/r/947328 (https://phabricator.wikimedia.org/T343740) (owner: 10AikoChou) [13:48:51] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (10ayounsi) @Vgutierrez do you know how the future L4LB will handle ICMP PTB packets? Can it route it to the proper source host? [13:49:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1002.eqiad.wmnet with reason: host reimage [13:50:03] (03PS1) 10Ssingh: P:bird::anycast: require anycast setup on the bird class [puppet] - 10https://gerrit.wikimedia.org/r/947357 [13:50:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P50278 and previous config saved to /var/cache/conftool/dbconfig/20230809-135024-ladsgroup.json [13:50:50] (03CR) 10David Caro: [C: 03+1] "Looks ok, did not test it though" [puppet] - 10https://gerrit.wikimedia.org/r/946643 (owner: 10Andrew Bogott) [13:51:11] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42810/console" [puppet] - 10https://gerrit.wikimedia.org/r/947357 (owner: 10Ssingh) [13:51:15] (03CR) 10David Caro: [C: 03+1] add volumes functionality to wmcs-backup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/946643 (owner: 10Andrew Bogott) [13:52:33] !log installing tiff 
security updates [13:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:40] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync [13:52:52] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42809/console" [puppet] - 10https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [13:52:57] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [13:53:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T343718)', diff saved to https://phabricator.wikimedia.org/P50279 and previous config saved to /var/cache/conftool/dbconfig/20230809-135324-ladsgroup.json [13:53:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [13:53:28] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [13:53:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [13:53:52] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:bird::anycast: require anycast setup on the bird class [puppet] - 10https://gerrit.wikimedia.org/r/947357 (owner: 10Ssingh) [13:53:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1223 (T343718)', diff saved to https://phabricator.wikimedia.org/P50280 and previous config saved to /var/cache/conftool/dbconfig/20230809-135356-ladsgroup.json [13:54:26] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [13:54:36] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6002.drmrs.wmnet with OS bookworm [13:54:46] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [13:55:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1223 (T343718)', diff saved to https://phabricator.wikimedia.org/P50281 and previous config saved to /var/cache/conftool/dbconfig/20230809-135503-ladsgroup.json [13:56:01] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1087.eqiad.wmnet with OS bullseye [13:58:17] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10adee_wmde) [14:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T1400) [14:00:30] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for adee_wmde - https://phabricator.wikimedia.org/T342969 (10adee_wmde) >>! In T342969#9080463, @MoritzMuehlenhoff wrote: > @adee_wmde You are using the same key to access Wikimedia Cloud Services and Wikimedia production, please generate a... [14:02:47] 10SRE, 10Research, 10Wikimedia-Mailing-lists: Create research-engineering-alerts list - https://phabricator.wikimedia.org/T342833 (10fkaelin) Thank you kindly. 
[14:05:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50282 and previous config saved to /var/cache/conftool/dbconfig/20230809-140531-ladsgroup.json [14:05:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [14:05:40] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:05:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1147.eqiad.wmnet with reason: Maintenance [14:05:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T342617)', diff saved to https://phabricator.wikimedia.org/P50283 and previous config saved to /var/cache/conftool/dbconfig/20230809-140551-ladsgroup.json [14:07:30] !log restarting FPM on mediawiki canaries to pick up tiff update [14:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:35] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:58] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-master1002.eqiad.wmnet with OS bullseye [14:10:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P50284 and previous config saved to /var/cache/conftool/dbconfig/20230809-141009-ladsgroup.json [14:11:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T342617)', diff saved to https://phabricator.wikimedia.org/P50285 and previous config saved to /var/cache/conftool/dbconfig/20230809-141134-ladsgroup.json [14:11:38] T342617: 
Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:11:47] (03CR) 10EoghanGaffney: [V: 03+1 C: 03+2] gitlab: Configure object storage for gitlab1003 on Swift [puppet] - 10https://gerrit.wikimedia.org/r/947016 (owner: 10EoghanGaffney) [14:12:41] urbanecm: do you remember if the account creation is global or per-wiki? [14:13:04] taavi: you mean, the acc creation throttle? [14:13:16] afaik it's counted across all sites, but the value can be different depending on the project you sign on. [14:14:00] yeah. so for T343595 I'm wondering if 'value' => 250 (for example) means 250 accounts total or 250 accounts per-wiki [14:14:00] T343595: Increase account creation at Wikimania 2023 August 14-20 [Note: incomplete IP list] - https://phabricator.wikimedia.org/T343595 [14:17:18] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10User-jbond: Anycast: consistent ICMP packet too big routing - https://phabricator.wikimedia.org/T253732 (10Vgutierrez) >>! In T253732#9080504, @ayounsi wrote: > @Vgutierrez do you know how the future L4LB will handle ICMP PTB packets? Can... 
[14:17:35] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:56] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [14:18:05] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1088.eqiad.wmnet with OS bullseye [14:18:33] PROBLEM - BFD status on asw1-b13-drmrs.mgmt is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:19:26] 10SRE-tools, 10Cloud-VPS, 10Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (10fnegri) [14:19:35] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack: [spicerack] split SRE cookbooks into "shared" and "SRE-only" - https://phabricator.wikimedia.org/T343894 (10fnegri) p:05Triage→03Low [14:20:17] PROBLEM - BGP status on asw1-b13-drmrs.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:21:30] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on durum6002.drmrs.wmnet with reason: host reimage [14:22:05] (03PS1) 10Majavah: throttle: remove expired rules [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947361 [14:22:07] (03PS1) 10Majavah: throttle: add rules for Wikimania 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947362 (https://phabricator.wikimedia.org/T343595) [14:22:48] (03CR) 10CI reject: [V: 04-1] throttle: add rules for Wikimania 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947362 (https://phabricator.wikimedia.org/T343595) (owner: 10Majavah) [14:23:17] (03PS2) 10Majavah: throttle: add rules for Wikimania 2023 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/947362 (https://phabricator.wikimedia.org/T343595) [14:24:53] !log installing sudo bugfix updates from Bookworm 12.1 point release [14:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:56] (03CR) 10Urbanecm: [C: 03+1] "good catch! let me write a test for this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947338 (https://phabricator.wikimedia.org/T342964) (owner: 10Samtar) [14:25:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P50287 and previous config saved to /var/cache/conftool/dbconfig/20230809-142515-ladsgroup.json [14:26:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P50288 and previous config saved to /var/cache/conftool/dbconfig/20230809-142640-ladsgroup.json [14:28:45] (03PS1) 10Urbanecm: [tests] Ensure each config has at most one value per wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947365 [14:30:24] taavi: it means 250 accounts total. 
the counter is stored as a global memcached key, so it's fleet-wide [14:30:34] jouncebot: nowandnext [14:30:34] For the next 0 hour(s) and 29 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T1400) [14:30:34] In 2 hour(s) and 29 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T1700) [14:30:44] oh perfect, thanks [14:31:10] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1088.eqiad.wmnet with reason: host reimage [14:32:33] I intend to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/947338 [14:32:33] 10sre-alert-triage, 10Platform Engineering: Alert triage: overdue alert [critical] The following units failed: wikidatardf-lexemes-dumps.service - https://phabricator.wikimedia.org/T343896 (10JMeybohm) [14:33:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947338 (https://phabricator.wikimedia.org/T342964) (owner: 10Samtar) [14:34:05] (03Merged) 10jenkins-bot: core-namespaces: Remove dupe wikifunctions alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947338 (https://phabricator.wikimedia.org/T342964) (owner: 10Samtar) [14:34:16] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1088.eqiad.wmnet with reason: host reimage [14:34:23] !log samtar@deploy1002 Started scap: Backport for [[gerrit:947338|core-namespaces: Remove dupe wikifunctions alias (T342964)]] [14:34:27] T342964: Add WF: as an alias of Wikifunctions namespace - https://phabricator.wikimedia.org/T342964 [14:36:05] !log samtar@deploy1002 samtar: Backport for [[gerrit:947338|core-namespaces: Remove dupe wikifunctions alias (T342964)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, 
and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [14:36:15] * TheresNoTime testing [14:37:13] oh! The Wikimedia debug extension doesn't recognise `wikifunctions.org` ? [14:38:56] TheresNoTime: apparently not yet [14:40:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T343718)', diff saved to https://phabricator.wikimedia.org/P50289 and previous config saved to /var/cache/conftool/dbconfig/20230809-144022-ladsgroup.json [14:40:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:40:26] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [14:40:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:41:05] `curl -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' https://www.wikifunctions.org/wiki/WF:Main_Page` is returning nothing, whereas `curl -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' https://www.wikifunctions.org/wiki/Wikifunctions:Main_Page` is returning as expected [14:41:10] hmm [14:41:40] (03PS1) 10Btullis: Remove the manual check of reuse recipe on an-test-master hosts [puppet] - 10https://gerrit.wikimedia.org/r/947366 (https://phabricator.wikimedia.org/T329363) [14:41:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P50290 and previous config saved to /var/cache/conftool/dbconfig/20230809-144147-ladsgroup.json [14:41:57] oops, forgot `-L` for the redirect [14:42:04] TheresNoTime: yup :) [14:42:07] or `-I` to see headers [14:42:13] !log samtar@deploy1002 samtar: Continuing with sync [14:42:22] lgtm then, syncing :D [14:42:32] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10User-jbond: Anycast: consistent ICMP packet too big routing - 
https://phabricator.wikimedia.org/T253732 (10ayounsi) 05Open→03Declined Thanks, then like {T253666} I'm going to boldly close this task. [14:42:42] (03CR) 10Btullis: [C: 03+2] Remove the manual check of reuse recipe on an-test-master hosts [puppet] - 10https://gerrit.wikimedia.org/r/947366 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [14:42:48] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10Patch-For-Review: Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (10ayounsi) [14:43:40] (03CR) 10Muehlenhoff: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/945782 (owner: 10Muehlenhoff) [14:43:42] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team (FY2023/2024-Q1): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (10fnegri) 05In progress→03Resolved > I'll leave the last page to you. That last page was https://wikitech.wikimedia.org/wiki/Wikim... [14:43:48] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Goal, 10cloud-services-team (FY2023/2024-Q1): Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (10fnegri) [14:43:52] TheresNoTime: ad the extension, we have https://gerrit.wikimedia.org/r/c/performance/WikimediaDebug/+/941883 merged [14:43:56] 10sre-alert-triage, 10Infrastructure-Foundations: Alert triage: overdue alert [warning] Systemd units failing on debmonitor2003 - https://phabricator.wikimedia.org/T343897 (10JMeybohm) [14:43:57] we just need someone to release the extension [14:44:02] 10SRE, 10Infrastructure-Foundations, 10netops: Detect IP address collisions - https://phabricator.wikimedia.org/T189522 (10ayounsi) 05Open→03Resolved a:03ayounsi We have a working solution for the mgmt network (until it's time to split mgmt into smaller subnets). And for production, automation and per... 
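Note on the account creation throttle exchange above (14:12 to 14:30): the answer given is that `'value' => 250` is one total across all sites, because the counter is stored under a single global key. A minimal Python sketch of that difference follows. The `Throttle` class, the key layout, and the in-memory dict standing in for memcached are all hypothetical and are not MediaWiki's actual implementation.

```python
from collections import defaultdict


class Throttle:
    """Toy account-creation throttle backed by a dict instead of memcached."""

    def __init__(self, limit, per_wiki=False):
        self.limit = limit
        self.per_wiki = per_wiki
        self.counters = defaultdict(int)  # stand-in for the shared cache

    def _key(self, ip, wiki):
        # Hypothetical key layout: the global key omits the wiki name, so
        # every wiki increments the same counter for a given IP.
        if self.per_wiki:
            return f"{wiki}:acctcreate:{ip}"
        return f"global:acctcreate:{ip}"

    def allow(self, ip, wiki):
        key = self._key(ip, wiki)
        if self.counters[key] >= self.limit:
            return False
        self.counters[key] += 1
        return True


def count_allowed(throttle, ip, wikis, attempts_per_wiki):
    """Return how many creations succeed across all the given wikis."""
    return sum(
        throttle.allow(ip, wiki)
        for wiki in wikis
        for _ in range(attempts_per_wiki)
    )
```

With a limit of 250 and 200 attempts on each of two wikis from the same IP, the global variant stops at 250 in total, while the per-wiki variant would let all 400 through. That is the distinction taavi was asking about for T343595.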
[14:44:10] (03CR) 10Muehlenhoff: [C: 03+2] Uncomment sysctl-userns alias [puppet] - 10https://gerrit.wikimedia.org/r/945812 (owner: 10Muehlenhoff) [14:44:18] urbanecm: ah I see [14:44:23] (03PS1) 10Ssingh: Revert "P:bird::anycast: require anycast setup on the bird class" [puppet] - 10https://gerrit.wikimedia.org/r/946664 [14:45:12] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:45:16] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [14:46:06] 10sre-alert-triage, 10Infrastructure-Foundations: Alert triage: overdue alert [warning] puppet fails on idp-test1002 - https://phabricator.wikimedia.org/T343898 (10JMeybohm) [14:47:15] (03CR) 10Ssingh: [C: 03+2] Revert "P:bird::anycast: require anycast setup on the bird class" [puppet] - 10https://gerrit.wikimedia.org/r/946664 (owner: 10Ssingh) [14:47:30] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: wmcs.spicerack: Setup a host to run cookbooks from prod network - https://phabricator.wikimedia.org/T276440 (10fnegri) [14:47:57] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Epic, and 2 others: Improve how we run WMCS cookbooks - https://phabricator.wikimedia.org/T319401 (10fnegri) 05In progress→03Resolved [14:48:45] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:947338|core-namespaces: Remove dupe wikifunctions alias (T342964)]] (duration: 14m 21s) [14:48:49] T342964: Add WF: as an alias of Wikifunctions namespace - https://phabricator.wikimedia.org/T342964 [14:48:57] (03CR) 10Muehlenhoff: profile::mirrors::serve: Remove Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944768 
(owner: 10Muehlenhoff) [14:49:30] !log `[samtar@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki wikifunctionswiki --fix` for T342964 [14:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/945782 (owner: 10Muehlenhoff) [14:50:35] 10SRE, 10SRE-Access-Requests, 10Infrastructure Security, 10Infrastructure-Foundations, 10fundraising-tech-ops: As an FR-Tech SRE, we want to be able to designate a host for decommissioning - https://phabricator.wikimedia.org/T334154 (10MoritzMuehlenhoff) [14:52:22] 10sre-alert-triage, 10Machine-Learning-Team: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (10JMeybohm) [14:53:11] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10Patch-For-Review: Create a new group dns-admins - https://phabricator.wikimedia.org/T341440 (10MoritzMuehlenhoff) 05Open→03Resolved @Jgreen @Dwisehaupt AFAICT this should be resolved and you should be able to merge DNS changes just fine.... 
[14:55:10] RECOVERY - BGP status on asw1-b13-drmrs.mgmt is OK: BGP OK - up: 12, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:55:32] RECOVERY - BFD status on asw1-b13-drmrs.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:56:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T342617)', diff saved to https://phabricator.wikimedia.org/P50291 and previous config saved to /var/cache/conftool/dbconfig/20230809-145653-ladsgroup.json [14:56:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [14:56:57] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [14:57:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [14:57:12] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1088.eqiad.wmnet with OS bullseye [14:57:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T342617)', diff saved to https://phabricator.wikimedia.org/P50292 and previous config saved to /var/cache/conftool/dbconfig/20230809-145714-ladsgroup.json [14:57:40] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1089.eqiad.wmnet with OS bullseye [14:58:35] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host durum6002.drmrs.wmnet with OS bookworm [15:04:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [15:04:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [15:04:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T343718)', diff 
saved to https://phabricator.wikimedia.org/P50293 and previous config saved to /var/cache/conftool/dbconfig/20230809-150443-ladsgroup.json [15:04:49] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [15:05:37] (03CR) 10FNegri: [C: 03+2] haproxy_exporter: allow setting as absent [puppet] - 10https://gerrit.wikimedia.org/r/947353 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [15:05:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T342617)', diff saved to https://phabricator.wikimedia.org/P50294 and previous config saved to /var/cache/conftool/dbconfig/20230809-150547-ladsgroup.json [15:05:51] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [15:05:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50295 and previous config saved to /var/cache/conftool/dbconfig/20230809-150557-ladsgroup.json [15:06:05] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-master1001.eqiad.wmnet with OS bullseye [15:06:34] (03PS9) 10Ayounsi: WIP: first scaffolding for gNMI support [software/homer] - 10https://gerrit.wikimedia.org/r/939681 (https://phabricator.wikimedia.org/T320638) [15:09:15] (03CR) 10FNegri: [C: 03+1] prometheus: gather stats from haproxy for openstack and cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/947354 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [15:14:17] (03CR) 10Urbanecm: [C: 04-1] "of course this doesn't work. php silently merges the duplicate key when require'ing core-Namespaces.php. hmm..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/947365 (owner: 10Urbanecm) [15:15:21] (03CR) 10Filippo Giunchedi: [C: 03+1] haproxy_exporter: allow setting as absent [puppet] - 10https://gerrit.wikimedia.org/r/947353 (https://phabricator.wikimedia.org/T343885) (owner: 10David Caro) [15:16:34] (03CR) 10JHathaway: [C: 03+1] profile::mirrors::serve: Remove Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944768 (owner: 10Muehlenhoff) [15:17:52] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1089.eqiad.wmnet with reason: host reimage [15:19:10] (03PS1) 10Hnowlan: trafficserver: route wikifeeds [puppet] - 10https://gerrit.wikimedia.org/r/947372 (https://phabricator.wikimedia.org/T339119) [15:20:16] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:20:24] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [15:20:42] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1089.eqiad.wmnet with reason: host reimage [15:20:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P50297 and previous config saved to /var/cache/conftool/dbconfig/20230809-152053-ladsgroup.json [15:21:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P50298 and previous config saved to /var/cache/conftool/dbconfig/20230809-152103-ladsgroup.json [15:22:44] PROBLEM - Wikitech-static main page has content on cloudweb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Wikitech-static [15:22:48] (03PS1) 10Effie Mouzeli: Update blubberoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033) [15:23:34] RECOVERY - Wikitech-static main page has content on cloudweb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 26067 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [15:26:10] (03CR) 10Ssingh: [C: 03+1] trafficserver: route wikifeeds [puppet] - 10https://gerrit.wikimedia.org/r/947372 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [15:26:53] 10sre-alert-triage, 10Machine-Learning-Team: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (10elukey) Very interesting: ` Aug 09 00:09:33 ml-serve1001 kubelet[3980749]: E0809 00:09:33.603646 3980749... [15:27:43] (03CR) 10Hnowlan: [C: 03+2] trafficserver: route wikifeeds [puppet] - 10https://gerrit.wikimedia.org/r/947372 (https://phabricator.wikimedia.org/T339119) (owner: 10Hnowlan) [15:27:55] 10SRE, 10Infrastructure-Foundations: Interactive firmware prompts on Bullseye with some Broadcom NICs - https://phabricator.wikimedia.org/T308106 (10BTullis) I think that this is now fixed. My test host was an-worker1088 and it was asking for firmware for the bnx2x NIC. After the change it went past this point... 
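Note on Urbanecm's -1 at 15:14:17 on r/947365: a test that loads the config and inspects the resulting array cannot see a duplicate key, because the duplicate has already collapsed by the time the array exists. Python dict literals behave the same way (the last value wins), so the sketch below uses Python as an analogy, not PHP. It checks the source text with `ast` instead of the loaded value. The function name and the sample data are hypothetical.

```python
import ast


def duplicate_keys(source):
    """Return constant keys that appear more than once in any dict literal."""
    dupes = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Dict):
            continue
        seen = set()
        for key in node.keys:
            # key is None for `**other` unpacking; skip non-constant keys too.
            if not isinstance(key, ast.Constant):
                continue
            if key.value in seen:
                dupes.append(key.value)
            seen.add(key.value)
    return dupes


# Hypothetical alias map with the same key listed twice.
SAMPLE = "{'WF': 4, 'Project': 4, 'WF': 4}"

# Loading the literal silently drops the duplicate, so a check on the
# loaded value has nothing left to find.
loaded = ast.literal_eval(SAMPLE)
```

A check along these lines would have to run against the file's source (or a tokenised form of it), not against the value obtained by loading the file.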
[15:28:27] (03PS2) 10Effie Mouzeli: Update blubberoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033) [15:29:08] !log disabling puppet on A:cp to test r/947372 [15:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:27] (03PS1) 10Elukey: admin_ng: increase resources for calico kube-controllers in ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/947376 (https://phabricator.wikimedia.org/T343900) [15:36:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P50299 and previous config saved to /var/cache/conftool/dbconfig/20230809-153600-ladsgroup.json [15:36:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P50300 and previous config saved to /var/cache/conftool/dbconfig/20230809-153610-ladsgroup.json [15:36:33] (03CR) 10Elukey: [C: 03+2] admin_ng: increase resources for calico kube-controllers in ml-serve [deployment-charts] - 10https://gerrit.wikimedia.org/r/947376 (https://phabricator.wikimedia.org/T343900) (owner: 10Elukey) [15:43:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T342617)', diff saved to https://phabricator.wikimedia.org/P50301 and previous config saved to /var/cache/conftool/dbconfig/20230809-154317-ladsgroup.json [15:43:25] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [15:43:36] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1089.eqiad.wmnet with OS bullseye [15:44:07] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:45:30] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[15:47:20] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: host reimage [15:47:30] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:47:42] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:48:33] (03PS3) 10Effie Mouzeli: Update blubberoid to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/947373 (https://phabricator.wikimedia.org/T300033) [15:48:58] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [15:49:08] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [15:50:47] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-master1001.eqiad.wmnet with reason: host reimage [15:51:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T342617)', diff saved to https://phabricator.wikimedia.org/P50302 and previous config saved to /var/cache/conftool/dbconfig/20230809-155106-ladsgroup.json [15:51:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance [15:51:10] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [15:51:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50303 and previous config saved to /var/cache/conftool/dbconfig/20230809-155116-ladsgroup.json [15:51:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [15:51:20] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [15:51:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance 
[15:51:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T342617)', diff saved to https://phabricator.wikimedia.org/P50304 and previous config saved to /var/cache/conftool/dbconfig/20230809-155127-ladsgroup.json [15:51:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance [15:51:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T343718)', diff saved to https://phabricator.wikimedia.org/P50305 and previous config saved to /var/cache/conftool/dbconfig/20230809-155137-ladsgroup.json [15:53:01] (03PS1) 10Giuseppe Lavagetto: admin: update my bashrc [puppet] - 10https://gerrit.wikimedia.org/r/947379 [15:56:51] (03PS1) 10Ssingh: P:bird::anycast: require anycast setup on the bird class (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/947381 [15:57:14] (03PS2) 10Ssingh: P:bird::anycast: require anycast setup on the bird service (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/947381 [15:58:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P50306 and previous config saved to /var/cache/conftool/dbconfig/20230809-155824-ladsgroup.json [15:59:46] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947361 (owner: 10Majavah) [15:59:52] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42811/console" [puppet] - 10https://gerrit.wikimedia.org/r/947381 (owner: 10Ssingh) [16:00:24] (03CR) 10Urbanecm: [C: 03+1] "lgtm, 250 should be more than enough." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/947362 (https://phabricator.wikimedia.org/T343595) (owner: 10Majavah) [16:02:38] 10sre-alert-triage, 10Machine-Learning-Team, 10Patch-For-Review: Alert triage: overdue alert [warning] Kubelet exec_sync operations on ml-serve1001.eqiad.wmnet take 1.133s in p99 - https://phabricator.wikimedia.org/T343900 (10elukey) Throttling is gone, but I still see the exec_sync elevated latency, errors... [16:04:16] (03CR) 10Ssingh: [V: 03+1] "The systemd bindings break this: bird expects anycast-hc to be running so if we set up bird before that, the bird service will fail anyway" [puppet] - 10https://gerrit.wikimedia.org/r/947381 (owner: 10Ssingh) [16:04:44] (03CR) 10Ssingh: [V: 03+1] P:bird::anycast: require anycast setup on the bird service (take 2) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947381 (owner: 10Ssingh) [16:05:42] (03PS3) 10Majavah: throttle: add rules for Wikimania 2023 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947362 (https://phabricator.wikimedia.org/T343595) [16:09:43] (03Abandoned) 10Ssingh: P:bird::anycast: require anycast setup on the bird service (take 2) [puppet] - 10https://gerrit.wikimedia.org/r/947381 (owner: 10Ssingh) [16:10:37] (03PS15) 10Winston Sung: SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035) [16:11:02] 10SRE, 10Infrastructure-Foundations, 10netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (10ayounsi) Opened high priority case 2023-0809-747283 asking for a RMA. 
[16:11:20] (03PS1) 10AikoChou: ml-services: update outlink docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/947409 (https://phabricator.wikimedia.org/T343740) [16:12:11] (03CR) 10Elukey: [C: 03+1] ml-services: update outlink docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/947409 (https://phabricator.wikimedia.org/T343740) (owner: 10AikoChou) [16:12:46] (03CR) 10AikoChou: [C: 03+2] ml-services: update outlink docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/947409 (https://phabricator.wikimedia.org/T343740) (owner: 10AikoChou) [16:13:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P50307 and previous config saved to /var/cache/conftool/dbconfig/20230809-161330-ladsgroup.json [16:13:38] (03Merged) 10jenkins-bot: ml-services: update outlink docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/947409 (https://phabricator.wikimedia.org/T343740) (owner: 10AikoChou) [16:15:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-test-master1001.eqiad.wmnet with OS bullseye [16:16:23] (03PS1) 10Ssingh: bird: drop support for buster [puppet] - 10https://gerrit.wikimedia.org/r/947412 [16:17:51] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[16:18:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T343718)', diff saved to https://phabricator.wikimedia.org/P50308 and previous config saved to /var/cache/conftool/dbconfig/20230809-161832-ladsgroup.json [16:18:36] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [16:20:13] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from miscweb.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqsin%20prometheus/ops&var-cluster=text&var-origin=miscweb.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:20:59] uh oh [16:21:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:21:58] hmmm that alerts only for miscweb? miscweb runs in wikikube now. I'll check the service [16:22:31] it started receiving some traffic, and all seems 50x [16:22:32] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [16:22:35] I have ACked it [16:22:48] what is it, phab? 
[16:23:00] (I lost scrollback) [16:23:10] somebody acked it and I accidentally resolved it because the icon changed [16:23:21] no worries [16:23:33] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:23:47] no services listed on https://wikitech.wikimedia.org/wiki/Miscweb (e.g. https://wikitech.wikimedia.org/wiki/Microsites#Bugzilla_Archive) seem down :/ [16:24:16] bugzilla is up https://static-bugzilla.wikimedia.org [16:24:17] tendril.wikimedia.org is up, it's miscweb [16:24:37] yes miscweb service seems up [16:25:00] there is annual report and transparency report too [16:25:03] (I mean the pods) [16:25:22] Both are accessible [16:25:37] yes this service works fine, pods in kubernetes look good [16:25:51] NEL doesn't give me a clear signal [16:26:07] k8s has been acting up today [16:26:11] at least not in terms of http [16:26:17] I don't know if it's related [16:26:19] checking other errors [16:26:37] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:26:48] is that the kubernetes service that's having issues? or the VMs with the same name? [16:26:59] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:27:12] but it is 5XX, so likely restbase? [16:27:19] or is it a result of that? 
[16:28:11] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:28:15] restbase revision request rates are 0 [16:28:36] so it is either that or something that restbase queries [16:28:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T342617)', diff saved to https://phabricator.wikimedia.org/P50310 and previous config saved to /var/cache/conftool/dbconfig/20230809-162836-ladsgroup.json [16:28:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [16:28:41] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:28:43] (03CR) 10Ayounsi: [C: 03+1] bird: drop support for buster [puppet] - 10https://gerrit.wikimedia.org/r/947412 (owner: 10Ssingh) [16:28:53] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [16:28:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [16:28:57] the only thing that I see is bugzilla in k8s codfw - https://logstash.wikimedia.org/goto/d608b2c7caa0a2c91c2a7024a812993d [16:29:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [16:29:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T342617)', diff saved to https://phabricator.wikimedia.org/P50311 and previous config saved to /var/cache/conftool/dbconfig/20230809-162913-ladsgroup.json [16:29:17] the "old" miscweb vms are called webserver-misc-eqiad and webserver-misc-codfw.discovery.wmnet as far as I can see, so this alert is related k8s I think [16:29:22] nah, ignore me, that is older [16:29:58] miscweb.discovery.wmnet points to k8s ingress 
[16:30:07] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:30:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50312 and previous config saved to /var/cache/conftool/dbconfig/20230809-163014-ladsgroup.json [16:30:18] yeah ok I see 504s for bugzilla in codfw [16:30:23] see the above logs [16:30:29] elukey: what rate? [16:30:40] the pa.age was for eqsin [16:31:21] yeah but eqsin calls either eqiad or codfw [16:31:28] ATS in eqsin I mean [16:31:48] jynus: the rate is low, but even the one in the ATS grafana link is low [16:32:10] yeah, wanted to know if it matched, as if, it was most of it or there was part that was unaccounted [16:32:50] on CDN we are producing around 9 5XX per second [16:33:16] so the 504s are marked with UT, namely "Upstream request timeout" [16:33:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P50313 and previous config saved to /var/cache/conftool/dbconfig/20230809-163338-ladsgroup.json [16:33:47] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:33:59] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:35:00] ahh nice AH00484: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting [16:35:12] just to be sure, restbase and miscweb have the 
same issue? at least grafana logs also show increased 503/504 starting 16:04 UTC [16:35:19] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:35:42] those restbase errors are a little concerning alongside https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/191adcf20ba5fcb5c920dc885f79f0c958268546%5E%21/#F0 [16:35:51] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:35:53] jelto: we don't know, my theory is those are connected [16:36:07] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:36:19] hnowlan: the timing doesn't match though if we see the increased 5xx around 16:04. the patch was merged later [16:36:22] e.g. restbase uses some api that fails or something failing because of restbase [16:36:32] ah sorry no that was a red herring (at least in the dashboard). There is no increase since 16:04. 
[16:36:59] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:37:23] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:37:37] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:38:00] !log temporarly bump miscweb bugzilla pods from 2 to 4 in k8s wikikube codfw [16:38:01] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:03] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:39:02] (03PS1) 10AikoChou: ml-services: update outlink transformer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/947416 [16:39:17] (03CR) 10Elukey: [C: 03+1] ml-services: update outlink transformer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/947416 (owner: 10AikoChou) [16:39:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T342617)', diff saved to https://phabricator.wikimedia.org/P50314 and previous config saved to /var/cache/conftool/dbconfig/20230809-163928-ladsgroup.json [16:39:31] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:39:32] T342617: 
Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [16:39:33] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:39:36] (03CR) 10AikoChou: [C: 03+2] ml-services: update outlink transformer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/947416 (owner: 10AikoChou) [16:40:35] (03Merged) 10jenkins-bot: ml-services: update outlink transformer image [deployment-charts] - 10https://gerrit.wikimedia.org/r/947416 (owner: 10AikoChou) [16:41:53] !log aikochou@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . [16:41:53] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:42:29] !log aikochou@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'articletopic-outlink' for release 'main' . 
[16:43:33] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:44:36] !log temporarly bump miscweb bugzilla pods from 4 to 8 in k8s wikikube codfw [16:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:51] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:45:03] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:45:03] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:45:03] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:45:07] still debugging the restbase issues - the errors are having no user-facing impact as we're not using restbase for that endpoint [16:45:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P50315 and previous config saved to /var/cache/conftool/dbconfig/20230809-164520-ladsgroup.json [16:45:21] but I'm still trying to figure out why that's happening, the wikifeeds service itself is fine [16:45:22] hnowlan: thanks, that is good to know [16:45:56] hnowlan: can you think of an upstream or downstream 
dependency that could link both issues? [16:46:33] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:48:03] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:48:30] jynus: not really :/ the request goes restbase->service mesh->wikifeeds [16:48:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P50316 and previous config saved to /var/cache/conftool/dbconfig/20230809-164844-ladsgroup.json [16:49:19] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:49:23] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:49:33] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:50:13] (ATSBackendErrorsHigh) resolved: ATS: elevated 5xx errors from miscweb.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=eqsin%20prometheus/ops&var-cluster=text&var-origin=miscweb.discovery.wmnet&editPanel=12 - 
https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:50:49] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:50:53] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:51:03] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:54:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P50317 and previous config saved to /var/cache/conftool/dbconfig/20230809-165434-ladsgroup.json [17:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T1700) [17:00:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314', diff saved to https://phabricator.wikimedia.org/P50318 and previous config saved to /var/cache/conftool/dbconfig/20230809-170027-ladsgroup.json [17:03:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T343718)', diff saved to https://phabricator.wikimedia.org/P50319 and previous config saved to /var/cache/conftool/dbconfig/20230809-170351-ladsgroup.json [17:03:53] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [17:03:54] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [17:04:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [17:04:29] (03CR) 10Btullis: "Sorry for being late to the party on this review. Thanks so much for your work on this @Slyngshede and @elukey." 
[puppet] - 10https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: 10Slyngshede) [17:05:57] all should be good now [17:09:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P50320 and previous config saved to /var/cache/conftool/dbconfig/20230809-170940-ladsgroup.json [17:14:41] (03PS1) 10Hnowlan: Revert "trafficserver: route wikifeeds" [puppet] - 10https://gerrit.wikimedia.org/r/946665 [17:15:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3314 (T342617)', diff saved to https://phabricator.wikimedia.org/P50321 and previous config saved to /var/cache/conftool/dbconfig/20230809-171533-ladsgroup.json [17:15:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [17:15:37] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [17:15:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [17:16:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T342617)', diff saved to https://phabricator.wikimedia.org/P50322 and previous config saved to /var/cache/conftool/dbconfig/20230809-171604-ladsgroup.json [17:23:43] (03CR) 10Hnowlan: [C: 03+2] Revert "trafficserver: route wikifeeds" [puppet] - 10https://gerrit.wikimedia.org/r/946665 (owner: 10Hnowlan) [17:24:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T342617)', diff saved to https://phabricator.wikimedia.org/P50323 and previous config saved to /var/cache/conftool/dbconfig/20230809-172447-ladsgroup.json [17:24:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [17:24:53] T342617: Make old columns of 
externallinks nullable - https://phabricator.wikimedia.org/T342617 [17:25:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Maintenance [17:25:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1192 (T342617)', diff saved to https://phabricator.wikimedia.org/P50324 and previous config saved to /var/cache/conftool/dbconfig/20230809-172507-ladsgroup.json [17:25:23] (03PS1) 10Btullis: Use python3 for the check_hdfs_active_namenode script [puppet] - 10https://gerrit.wikimedia.org/r/947421 (https://phabricator.wikimedia.org/T329363) [17:27:15] (03CR) 10Btullis: [C: 03+2] Use python3 for the check_hdfs_active_namenode script [puppet] - 10https://gerrit.wikimedia.org/r/947421 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [17:27:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [17:27:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [17:28:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T343718)', diff saved to https://phabricator.wikimedia.org/P50325 and previous config saved to /var/cache/conftool/dbconfig/20230809-172803-ladsgroup.json [17:28:09] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [17:31:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P50326 and previous config saved to /var/cache/conftool/dbconfig/20230809-173110-ladsgroup.json [17:46:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P50327 and previous config saved to /var/cache/conftool/dbconfig/20230809-174616-ladsgroup.json [17:48:48] (03PS1) 10Ssingh: P:bird::anycast: 
use systemd::sysuser for creating the bird user [puppet] - 10https://gerrit.wikimedia.org/r/947425 [17:50:55] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/947425/42813/" [puppet] - 10https://gerrit.wikimedia.org/r/947425 (owner: 10Ssingh) [17:52:18] (03CR) 10Ssingh: "Note: bird2 postint creates the user but we are" [puppet] - 10https://gerrit.wikimedia.org/r/947425 (owner: 10Ssingh) [17:52:34] (03CR) 10Stevemunene: idp_test: add datahub_staging as a OIDC service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/944231 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [17:52:49] (03CR) 10Ssingh: P:bird::anycast: use systemd::sysuser for creating the bird user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/947425 (owner: 10Ssingh) [17:54:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T343718)', diff saved to https://phabricator.wikimedia.org/P50328 and previous config saved to /var/cache/conftool/dbconfig/20230809-175434-ladsgroup.json [17:54:41] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [17:56:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:00:05] brennen and dancy: Time to snap out of that daydream and deploy Train log triage with CPT. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T1800). 
[18:01:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T342617)', diff saved to https://phabricator.wikimedia.org/P50329 and previous config saved to /var/cache/conftool/dbconfig/20230809-180122-ladsgroup.json [18:01:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [18:01:28] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:01:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [18:01:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:01:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2165 (T342617)', diff saved to https://phabricator.wikimedia.org/P50330 and previous config saved to /var/cache/conftool/dbconfig/20230809-180143-ladsgroup.json [18:09:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P50331 and previous config saved to /var/cache/conftool/dbconfig/20230809-180940-ladsgroup.json [18:12:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T342617)', diff saved to https://phabricator.wikimedia.org/P50332 and previous config saved to /var/cache/conftool/dbconfig/20230809-181219-ladsgroup.json [18:12:24] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:24:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P50333 and previous config saved to 
/var/cache/conftool/dbconfig/20230809-182446-ladsgroup.json [18:27:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P50334 and previous config saved to /var/cache/conftool/dbconfig/20230809-182726-ladsgroup.json [18:39:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T343718)', diff saved to https://phabricator.wikimedia.org/P50335 and previous config saved to /var/cache/conftool/dbconfig/20230809-183952-ladsgroup.json [18:39:55] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [18:39:57] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [18:40:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [18:40:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [18:40:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [18:40:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T343718)', diff saved to https://phabricator.wikimedia.org/P50336 and previous config saved to /var/cache/conftool/dbconfig/20230809-184018-ladsgroup.json [18:42:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T343718)', diff saved to https://phabricator.wikimedia.org/P50337 and previous config saved to /var/cache/conftool/dbconfig/20230809-184228-ladsgroup.json [18:42:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P50338 and previous config saved to /var/cache/conftool/dbconfig/20230809-184238-ladsgroup.json [18:43:18] 
(KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:48:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:50:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T342617)', diff saved to https://phabricator.wikimedia.org/P50339 and previous config saved to /var/cache/conftool/dbconfig/20230809-185040-ladsgroup.json [18:50:45] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:57:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P50340 and previous config saved to /var/cache/conftool/dbconfig/20230809-185734-ladsgroup.json [18:57:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T342617)', diff saved to https://phabricator.wikimedia.org/P50341 and previous config saved to /var/cache/conftool/dbconfig/20230809-185745-ladsgroup.json [18:57:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [18:57:50] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [18:58:00] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1193.eqiad.wmnet with reason: Maintenance [18:58:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1193 (T342617)', diff saved to https://phabricator.wikimedia.org/P50342 and 
previous config saved to /var/cache/conftool/dbconfig/20230809-185805-ladsgroup.json [19:05:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P50343 and previous config saved to /var/cache/conftool/dbconfig/20230809-190547-ladsgroup.json [19:12:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P50344 and previous config saved to /var/cache/conftool/dbconfig/20230809-191240-ladsgroup.json [19:20:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P50345 and previous config saved to /var/cache/conftool/dbconfig/20230809-192053-ladsgroup.json [19:27:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T343718)', diff saved to https://phabricator.wikimedia.org/P50346 and previous config saved to /var/cache/conftool/dbconfig/20230809-192746-ladsgroup.json [19:27:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [19:27:51] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [19:28:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [19:28:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T343718)', diff saved to https://phabricator.wikimedia.org/P50347 and previous config saved to /var/cache/conftool/dbconfig/20230809-192818-ladsgroup.json [19:33:27] SRE, Infrastructure-Foundations, netops: cr2-esams:FPC0 Parity error - https://phabricator.wikimedia.org/T318783 (cmooney) RMA in progress, Juniper happy with address for replacement and staff at destination are aware of delivery.
I will decom the existing faulty card on Sunday when on site and prep... [19:36:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T342617)', diff saved to https://phabricator.wikimedia.org/P50348 and previous config saved to /var/cache/conftool/dbconfig/20230809-193559-ladsgroup.json [19:36:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [19:36:10] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [19:36:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [19:36:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T342617)', diff saved to https://phabricator.wikimedia.org/P50349 and previous config saved to /var/cache/conftool/dbconfig/20230809-193623-ladsgroup.json [19:44:36] (PS2) Cathal Mooney: Policy and definition updates for post-migration esams ranges [homer/public] - https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214) [19:44:54] (CR) Cathal Mooney: Policy and definition updates for post-migration esams ranges (1 comment) [homer/public] - https://gerrit.wikimedia.org/r/944216 (https://phabricator.wikimedia.org/T343214) (owner: Cathal Mooney) [19:45:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T342617)', diff saved to https://phabricator.wikimedia.org/P50350 and previous config saved to /var/cache/conftool/dbconfig/20230809-194501-ladsgroup.json [19:45:08] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [19:52:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T343718)', diff saved to https://phabricator.wikimedia.org/P50351 and previous config saved to
/var/cache/conftool/dbconfig/20230809-195212-ladsgroup.json [19:52:17] T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [19:58:17] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on contint2001.wikimedia.org with reason: Decommissioning [19:58:30] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on contint2001.wikimedia.org with reason: Decommissioning [19:59:13] !log aokoth@cumin1001 START - Cookbook sre.hosts.decommission for hosts contint2001.wikimedia.org [20:00:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P50352 and previous config saved to /var/cache/conftool/dbconfig/20230809-200007-ladsgroup.json [20:01:57] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:02:31] RECOVERY - BFD status on cr1-drmrs is OK: UP: 0 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:05:09] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:05:16] !log aokoth@cumin1001 START - Cookbook sre.dns.netbox [20:07:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P50353 and previous config saved to /var/cache/conftool/dbconfig/20230809-200718-ladsgroup.json [20:08:00] !log aokoth@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: contint2001.wikimedia.org decommissioned, removing all IPs except the asset tag one - aokoth@cumin1001" [20:09:11] !log aokoth@cumin1001 END (PASS) - Cookbook 
sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: contint2001.wikimedia.org decommissioned, removing all IPs except the asset tag one - aokoth@cumin1001" [20:09:11] !log aokoth@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:09:12] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts contint2001.wikimedia.org [20:13:55] SRE, Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 (ssingh) [20:15:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P50354 and previous config saved to /var/cache/conftool/dbconfig/20230809-201514-ladsgroup.json [20:15:54] SRE, Traffic: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 (ssingh) In discussion with @cmooney, we will be revisiting this task again when Traffic does some other authdns-related work, so removing it from the Traffic-Icebox.
[20:22:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P50355 and previous config saved to /var/cache/conftool/dbconfig/20230809-202225-ladsgroup.json [20:23:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T342617)', diff saved to https://phabricator.wikimedia.org/P50356 and previous config saved to /var/cache/conftool/dbconfig/20230809-202316-ladsgroup.json [20:23:20] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [20:30:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T342617)', diff saved to https://phabricator.wikimedia.org/P50357 and previous config saved to /var/cache/conftool/dbconfig/20230809-203020-ladsgroup.json [20:30:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [20:30:31] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [20:30:35] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [20:30:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1203 (T342617)', diff saved to https://phabricator.wikimedia.org/P50358 and previous config saved to /var/cache/conftool/dbconfig/20230809-203041-ladsgroup.json [20:35:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T342617)', diff saved to https://phabricator.wikimedia.org/P50359 and previous config saved to /var/cache/conftool/dbconfig/20230809-203502-ladsgroup.json [20:37:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T343718)', diff saved to https://phabricator.wikimedia.org/P50360 and previous config saved to /var/cache/conftool/dbconfig/20230809-203731-ladsgroup.json [20:37:35] 
T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718 [20:38:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P50361 and previous config saved to /var/cache/conftool/dbconfig/20230809-203822-ladsgroup.json [20:50:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P50362 and previous config saved to /var/cache/conftool/dbconfig/20230809-205008-ladsgroup.json [20:53:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P50363 and previous config saved to /var/cache/conftool/dbconfig/20230809-205329-ladsgroup.json [20:55:51] (PS2) Stevemunene: airflow-wmde: configure wmde airflow instance [puppet] - https://gerrit.wikimedia.org/r/940938 (https://phabricator.wikimedia.org/T340648) [20:55:53] (PS2) Stevemunene: airflow-wmde: Create scap deployment source for wmde [puppet] - https://gerrit.wikimedia.org/r/940939 (https://phabricator.wikimedia.org/T340648) [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230809T2100) [21:05:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P50364 and previous config saved to /var/cache/conftool/dbconfig/20230809-210514-ladsgroup.json [21:08:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T342617)', diff saved to https://phabricator.wikimedia.org/P50365 and previous config saved to /var/cache/conftool/dbconfig/20230809-210835-ladsgroup.json [21:08:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [21:08:39] T342617: Make old columns of externallinks nullable -
https://phabricator.wikimedia.org/T342617 [21:08:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance [21:08:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T342617)', diff saved to https://phabricator.wikimedia.org/P50366 and previous config saved to /var/cache/conftool/dbconfig/20230809-210856-ladsgroup.json [21:16:30] (CR) Jforrester: [tests] Ensure each config has at most one value per wiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/947365 (owner: Urbanecm) [21:18:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T342617)', diff saved to https://phabricator.wikimedia.org/P50367 and previous config saved to /var/cache/conftool/dbconfig/20230809-211853-ladsgroup.json [21:18:58] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [21:19:33] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:20:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T342617)', diff saved to https://phabricator.wikimedia.org/P50368 and previous config saved to /var/cache/conftool/dbconfig/20230809-212021-ladsgroup.json [21:20:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [21:20:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1148.eqiad.wmnet with reason: Maintenance [21:20:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T342617)', diff saved to https://phabricator.wikimedia.org/P50369 and previous config saved to
/var/cache/conftool/dbconfig/20230809-212042-ladsgroup.json [21:29:32] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:34:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P50371 and previous config saved to /var/cache/conftool/dbconfig/20230809-213359-ladsgroup.json [21:49:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P50372 and previous config saved to /var/cache/conftool/dbconfig/20230809-214905-ladsgroup.json [21:55:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T342617)', diff saved to https://phabricator.wikimedia.org/P50373 and previous config saved to /var/cache/conftool/dbconfig/20230809-215535-ladsgroup.json [21:55:39] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [22:04:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T342617)', diff saved to https://phabricator.wikimedia.org/P50375 and previous config saved to /var/cache/conftool/dbconfig/20230809-220412-ladsgroup.json [22:04:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [22:04:16] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [22:04:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [22:04:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1209 (T342617)', diff saved to https://phabricator.wikimedia.org/P50376 and 
previous config saved to /var/cache/conftool/dbconfig/20230809-220433-ladsgroup.json [22:10:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P50377 and previous config saved to /var/cache/conftool/dbconfig/20230809-221041-ladsgroup.json [22:25:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P50378 and previous config saved to /var/cache/conftool/dbconfig/20230809-222547-ladsgroup.json [22:34:21] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:36:55] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:40:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T342617)', diff saved to https://phabricator.wikimedia.org/P50379 and previous config saved to /var/cache/conftool/dbconfig/20230809-224053-ladsgroup.json [22:40:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [22:40:58] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [22:41:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [22:41:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T342617)', diff saved to https://phabricator.wikimedia.org/P50380 and previous config saved to /var/cache/conftool/dbconfig/20230809-224114-ladsgroup.json [22:56:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T342617)', diff saved to 
https://phabricator.wikimedia.org/P50381 and previous config saved to /var/cache/conftool/dbconfig/20230809-225605-ladsgroup.json [22:56:11] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [23:03:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [23:03:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2147.codfw.wmnet with reason: Maintenance [23:03:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2147 (T342617)', diff saved to https://phabricator.wikimedia.org/P50382 and previous config saved to /var/cache/conftool/dbconfig/20230809-230339-ladsgroup.json [23:03:42] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [23:11:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P50383 and previous config saved to /var/cache/conftool/dbconfig/20230809-231112-ladsgroup.json [23:26:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P50384 and previous config saved to /var/cache/conftool/dbconfig/20230809-232619-ladsgroup.json [23:28:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T342617)', diff saved to https://phabricator.wikimedia.org/P50385 and previous config saved to /var/cache/conftool/dbconfig/20230809-232855-ladsgroup.json [23:29:02] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [23:41:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T342617)', diff saved to https://phabricator.wikimedia.org/P50386 and previous config saved to /var/cache/conftool/dbconfig/20230809-234125-ladsgroup.json [23:41:27] !log 
ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [23:41:29] T342617: Make old columns of externallinks nullable - https://phabricator.wikimedia.org/T342617 [23:41:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [23:41:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1211 (T342617)', diff saved to https://phabricator.wikimedia.org/P50387 and previous config saved to /var/cache/conftool/dbconfig/20230809-234146-ladsgroup.json [23:44:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P50388 and previous config saved to /var/cache/conftool/dbconfig/20230809-234402-ladsgroup.json [23:59:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P50389 and previous config saved to /var/cache/conftool/dbconfig/20230809-235908-ladsgroup.json