[00:02:04] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:03:32] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 70 probes of 667 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:05:18] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:07:04] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:10:18] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [00:10:26] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 62 probes of 667 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [00:12:38] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:13:12] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:15:56] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:19:56] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:20:05] SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus) As in T300324#7752134, I've rolled out all the k8s services where Envoy version was the only diff. We're now up to 1.18 everywhere, except for k8s servi... [00:20:15] SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus) [00:22:10] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [00:24:50] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [00:26:12] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service,rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:26:26] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [00:30:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:31:24] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:31:58] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:33:44] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : 
OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:35:06] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye [00:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:35:17] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bu... [00:39:40] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:40:28] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [00:45:06] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:48:22] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:51:06] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:52:52] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:57:04] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [00:58:48] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:59:20] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:00:18] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:00:33] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:02:08] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:04:24] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:05:30] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:10:12] SRE, SRE-swift-storage, Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (Aklapper) Stalled→Open > I'm stalling the task since it'll likely be resolvable once we've decom'd all the old swift backends that still use old... 
[01:18:32] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:25:32] (PS2) SBassett: Revert "Revert "wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia"" [mediawiki-config] - https://gerrit.wikimedia.org/r/772466 (https://phabricator.wikimedia.org/T300978) (owner: Stang) [01:26:34] RECOVERY - Check systemd state on dumpsdata1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:27:06] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service,rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:29:18] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:32:40] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:35:34] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:35:35] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye [01:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:35:42] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1047.eqiad.wmnet with O... 
[01:36:06] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:37:45] (JobUnavailable) firing: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:41:52] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:42:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:44:10] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:47:06] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:47:38] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [01:54:54] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:57:42] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:00:05] Deploy window Automatic 🚂🧪Trainsperiment Week branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T0200) [02:01:04] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 
UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:03:46] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye [02:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:03:52] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bu... [02:04:42] (CR) SBassett: [C: +1] "Per my comment at T300978#7795258" [mediawiki-config] - https://gerrit.wikimedia.org/r/772466 (https://phabricator.wikimedia.org/T300978) (owner: Stang) [02:05:58] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:07:32] (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.3 [core] (wmf/1.39.0-wmf.3) - https://gerrit.wikimedia.org/r/772545 [02:07:34] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.3 [core] (wmf/1.39.0-wmf.3) - https://gerrit.wikimedia.org/r/772545 (owner: TrainBranchBot) [02:07:38] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:08:36] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-joal-singleuser.service,session-259166.scope,session-259172.scope,session-259184.scope,session-259210.scope,session-259530.scope,session-259534.scope,session-259540.scope,user@20171.service,user@38373.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:08:48] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet 
is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:09:18] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:11:36] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [02:14:26] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:16:01] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [02:16:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:16:12] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with O... [02:17:02] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-joal-singleuser.service,session-259166.scope,session-259172.scope,session-259184.scope,session-259210.scope,session-259530.scope,session-259534.scope,session-259540.scope,user@20171.service,user@38373.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:17:46] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:18:58] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:20:40] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:25:46] (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.3 [core] (wmf/1.39.0-wmf.3) - https://gerrit.wikimedia.org/r/772545 (owner: TrainBranchBot) [02:29:14] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:31:22] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-joal-singleuser.service,session-259166.scope,session-259172.scope,session-259184.scope,session-259210.scope,session-259530.scope,session-259534.scope,session-259540.scope,user@20171.service,user@38373.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:34] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:33:14] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:37:18] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:41:16] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:44:46] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:47:00] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:50:28] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [02:54:28] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:56:10] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:00:44] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:21:42] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:23:06] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:29:24] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:30:30] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:31:40] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:32:46] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:39:10] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:40:16] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:41:26] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:43:42] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:47:48] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:48:54] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:50:38] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:51:46] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:52:20] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:57:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:02:01] (CirrusSearchHighOldGCFrequency) firing: (4) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [04:03:48] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:07:14] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:07:48] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:10:40] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:12:58] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:21:00] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:24:18] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:25:02] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:26:10] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:27:54] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:29:38] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:35:22] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:35:56] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:41:38] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:42:14] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:43:58] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:45:06] (PS2) Aaron Schulz: Add "db-mainstash" entry to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129) [04:45:10] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [04:46:14] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:49:40] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:58:16] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:00:33] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:02:18] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:04:00] 
RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:05:14] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:06:18] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:07:28] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:10:20] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:18:22] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:24:06] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:25:20] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:26:02] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [05:38:44] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:41:28] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:41:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1175.eqiad.wmnet with OS bullseye [05:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[05:43:08] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:43:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [05:43:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance [05:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:43:40] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:45:18] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:47:46] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:50:04] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:50:18] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:51:42] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:52:54] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [05:53:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1175.eqiad.wmnet with reason: host reimage [05:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:41] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1175.eqiad.wmnet with reason: host reimage [05:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T300775)', diff saved to https://phabricator.wikimedia.org/P22916 and previous config saved to /var/cache/conftool/dbconfig/20220322-055707-marostegui.json [05:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:12] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [05:57:23] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:58:31] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:09:12] SRE, ops-eqiad: db1175 not booting up - https://phabricator.wikimedia.org/T304280 (Marostegui) Thanks Chris! 
The server was able to get reimaged [06:10:13] (PS1) Marostegui: Revert "db1175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/772482 [06:11:29] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:12:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P22917 and previous config saved to /var/cache/conftool/dbconfig/20220322-061212-marostegui.json [06:12:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1175.eqiad.wmnet with OS bullseye [06:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:38] (PS1) Marostegui: instances.yaml: Add db1132 to dbctl [puppet] - https://gerrit.wikimedia.org/r/772665 (https://phabricator.wikimedia.org/T301879) [06:19:32] (CR) Marostegui: [C: +2] instances.yaml: Add db1132 to dbctl [puppet] - https://gerrit.wikimedia.org/r/772665 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui) [06:21:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1132 to dbctl T301879', diff saved to https://phabricator.wikimedia.org/P22918 and previous config saved to /var/cache/conftool/dbconfig/20220322-062140-marostegui.json [06:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:46] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [06:23:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1132 to s1 with minimal weight T301879', diff saved to https://phabricator.wikimedia.org/P22919 and previous config saved to /var/cache/conftool/dbconfig/20220322-062310-marostegui.json [06:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:46] RECOVERY 
- OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:27:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P22920 and previous config saved to /var/cache/conftool/dbconfig/20220322-062717-marostegui.json [06:27:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:27:34] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:30:06] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:32:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [06:32:18] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance [06:32:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298557)', diff saved to https://phabricator.wikimedia.org/P22921 and previous config saved to /var/cache/conftool/dbconfig/20220322-063223-marostegui.json [06:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:27] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [06:35:16] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:35:42] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [06:36:50] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:37:04] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:38:48] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:41:33] (03PS4) 10Juan90264: Create "editautopatrolprotected" protection level for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772481 (https://phabricator.wikimedia.org/T303579) [06:42:01] (CirrusSearchHighOldGCFrequency) firing: (5) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:42:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T300775)', diff saved to https://phabricator.wikimedia.org/P22922 and previous config saved to /var/cache/conftool/dbconfig/20220322-064222-marostegui.json [06:42:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [06:42:25] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance [06:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:28] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [06:42:30] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T300775)', diff saved to https://phabricator.wikimedia.org/P22923 and previous config saved to /var/cache/conftool/dbconfig/20220322-064230-marostegui.json [06:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:48] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:45:49] (03PS3) 10Elukey: Set bullseye + overlayfs for kubernetes1007 [puppet] - 10https://gerrit.wikimedia.org/r/770440 (https://phabricator.wikimedia.org/T300744) [06:47:14] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:50:59] (03CR) 10Elukey: [C: 03+2] Set bullseye + overlayfs for kubernetes1007 [puppet] - 10https://gerrit.wikimedia.org/r/770440 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [06:51:59] (03PS1) 10Urbanecm: MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772483 (https://phabricator.wikimedia.org/T304353) [06:52:14] (03PS1) 10Urbanecm: MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.2) - 10https://gerrit.wikimedia.org/r/772484 (https://phabricator.wikimedia.org/T304353) [06:52:26] (03PS1) 10Urbanecm: MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.3) - 10https://gerrit.wikimedia.org/r/772485 (https://phabricator.wikimedia.org/T304353) [06:54:00] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1007.eqiad.wmnet with OS bullseye [06:54:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:36] 
Hello [06:56:53] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [06:57:59] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [07:00:05] Amir1, awight, Urbanecm, and taavi: Dear deployers, time to do the UTC morning backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T0700). [07:00:05] koi and Juan_90264: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:16] o/ [07:00:24] Hello! [07:00:28] I can deploy today [07:00:29] I'm present [07:00:34] And i also have my own fixes [07:00:35] o/ here, but I'd rather not deploy today [07:00:58] (KubernetesCalicoDown) firing: kubernetes1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:01:45] (03CR) 10Urbanecm: [C: 03+2] MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.2) - 10https://gerrit.wikimedia.org/r/772484 (https://phabricator.wikimedia.org/T304353) (owner: 10Urbanecm) [07:01:49] Impressive, these hours only help me to be available. Thankful for that changed that! 
[07:01:54] (03CR) 10Urbanecm: [C: 03+2] MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.3) - 10https://gerrit.wikimedia.org/r/772485 (https://phabricator.wikimedia.org/T304353) (owner: 10Urbanecm) [07:02:18] I'm creating one more change and I'm going to send it to this backport [07:02:57] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:03:58] taavi: glad you're around, since you were the one to do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/760552, any objections to reverting the revert today? :) [07:04:11] (03CR) 10Urbanecm: [C: 03+2] Create "editautopatrolprotected" protection level for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772481 (https://phabricator.wikimedia.org/T303579) (owner: 10Juan90264) [07:04:18] Juan_90264: I'll start with your patch [07:04:44] Okay [07:04:53] (03Merged) 10jenkins-bot: Create "editautopatrolprotected" protection level for viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772481 (https://phabricator.wikimedia.org/T303579) (owner: 10Juan90264) [07:04:57] urbanecm: I don't have any objections as long as secteam is still happy with it [07:05:05] thanks taavi [07:05:46] hashar: jnuche: good morning, if either of you is around, for the T304353 fix, i guess i don't have to do the wmf.1 patch as well, since we're now fully at wmf.2, is that right? 
[07:05:47] T304353: PHP Warning: preg_match() expects parameter 2 to be string, array given - https://phabricator.wikimedia.org/T304353 [07:06:02] Juan_90264: your patch is at mwdebug1001 [07:06:04] please test [07:06:17] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1007.eqiad.wmnet with reason: host reimage [07:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:29] Okay, I will test [07:08:37] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:08:59] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1007.eqiad.wmnet with reason: host reimage [07:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:43] (03PS3) 10Urbanecm: Revert "Revert "wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772466 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [07:09:57] (03CR) 10Urbanecm: [C: 03+2] Revert "Revert "wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772466 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [07:10:13] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:10:37] (03Merged) 10jenkins-bot: Revert "Revert "wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772466 (https://phabricator.wikimedia.org/T300978) (owner: 10Stang) [07:11:06] Urbanecm: I tested and approved [07:11:07] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:11:11] syncing [07:11:13] PROBLEM - OSPF status on cr4-ulsfo is 
CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:12:13] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:13:17] (03CR) 10jerkins-bot: [V: 04-1] MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772483 (https://phabricator.wikimedia.org/T304353) (owner: 10Urbanecm) [07:13:22] :( [07:13:38] (03PS1) 10Elukey: Set bullseye + overlayfs for kubernetes1008 [puppet] - 10https://gerrit.wikimedia.org/r/772686 (https://phabricator.wikimedia.org/T300744) [07:13:44] (03PS4) 10Juan90264: Allow flooders to remove the group from themselves in viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772666 (https://phabricator.wikimedia.org/T303578) [07:13:58] (03CR) 10jerkins-bot: [V: 04-1] MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.2) - 10https://gerrit.wikimedia.org/r/772484 (https://phabricator.wikimedia.org/T304353) (owner: 10Urbanecm) [07:14:08] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b4a9935: Create "editautopatrolprotected" protection level for viwiki (T303579) (duration: 00m 57s) [07:14:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:14:13] T303579: Create "editautopatrolprotected" protection level for viwiki - https://phabricator.wikimedia.org/T303579 [07:14:15] Juan_90264: should be live now [07:14:41] koi: your patch is at mwdebug1001, please have a look [07:15:17] RECOVERY - PHP opcache health on mw1414 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [07:15:23] (03CR) 10jerkins-bot: [V: 04-1] MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.3) - 
10https://gerrit.wikimedia.org/r/772485 (https://phabricator.wikimedia.org/T304353) (owner: 10Urbanecm) [07:15:25] urbanecm, lgtm [07:15:30] syncing [07:16:16] It already seems to be working, thanks Urbanecm. [07:16:52] So I'm going to put in one more change now. [07:17:03] okay [07:17:47] !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: caad5a4df35c0daa5fd3423e4abf5aa4d5c38a7a: wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia (T300978) (duration: 00m 49s) [07:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:52] T300978: Update $wgCrossSiteAJAXdomains to include {foundation, ee, ge, punjabi}.wm - https://phabricator.wikimedia.org/T300978 [07:18:23] koi: and, live [07:18:31] ty! [07:18:33] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:18:39] np [07:20:58] (KubernetesCalicoDown) resolved: kubernetes1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [07:21:02] Hello, I already put [07:21:23] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1007.eqiad.wmnet with OS bullseye [07:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:59] (03CR) 10Urbanecm: [C: 03+2] Allow flooders to remove the group from themselves in viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772666 (https://phabricator.wikimedia.org/T303578) (owner: 10Juan90264) [07:22:42] (03Merged) 10jenkins-bot: Allow flooders to remove the group from themselves in viwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772666 (https://phabricator.wikimedia.org/T303578) (owner: 10Juan90264) [07:23:01] Okay merged [07:23:22] Juan_90264: and pulled to mwdebug1001 [07:23:22] (03CR) 10jerkins-bot: [V: 04-1] 
MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.2) - 10https://gerrit.wikimedia.org/r/772484 (https://phabricator.wikimedia.org/T304353) (owner: 10Urbanecm) [07:23:24] can you test? [07:23:31] Yes [07:24:15] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "all tests passed in master, most tests passed here as well, to unbreak the feature" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.2) - 10https://gerrit.wikimedia.org/r/772484 (https://phabricator.wikimedia.org/T304353) (owner: 10Urbanecm) [07:24:25] 10SRE, 10observability, 10Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (10elukey) >>! In T300130#7791627, @elukey wrote: > > If the Beta experiment works, I think that we are ready for https://gerrit.wikimedia.org/r/c/operations/puppet/+/763172... [07:24:49] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] "per the wmf.2 variant" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.3) - 10https://gerrit.wikimedia.org/r/772485 (https://phabricator.wikimedia.org/T304353) (owner: 10Urbanecm) [07:25:17] Urbanecm: I tested and approved [07:25:34] deploying [07:26:50] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 8151bf2: Allow flooders to remove the group from themselves in viwiki (T303578) (duration: 00m 50s) [07:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:55] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:26:55] T303578: Allow viwiki flooders to remove the group from themselves - https://phabricator.wikimedia.org/T303578 [07:28:27] scap complains about mw1448, saying `/wiki/{title} (Special Version) timed out before a response was received` [07:28:46] I SSH'ed into the host and all its cores are very busy [07:28:55] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:28:55] (=at 100%) [07:29:23] can someone check what's with that host? [07:30:52] Wasn't it then? [07:31:09] my backports are fetched to the debug server and work, waiting on info re mw1448 before i deploy [07:31:44] Okay [07:32:15] urbanecm: from a quick check it seems that php-fpm is consuming cpu, and it started yesterday at around 21 UTC [07:32:18] https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1448&var-datasource=eqiad%20prometheus%2Fops&orgId=1&var-cluster=api_appserver&from=now-2d&to=now [07:32:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298557)', diff saved to https://phabricator.wikimedia.org/P22924 and previous config saved to /var/cache/conftool/dbconfig/20220322-073243-marostegui.json [07:32:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:48] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [07:33:01] and there was a deployment around that time [07:33:10] Urbanecm: The change already seems to be working too [07:33:55] elukey: I'm wondering why it didn't happen with the prior few syncs. Or maybe it did, but since it printed the msg in the middle, i didn't see it? [07:34:10] elukey: which deployment? The train [07:34:12] Juan_90264: yep, i synced your config patch :) [07:34:48] Thank you Urbanecm, bye and good morning! [07:34:56] See you later Juan_90264 ! [07:35:26] urbanecm: afaics only mw1448 and mw1449 are showing up this behavior, is it urgent to unblock the deployment or can we spend 10/15 mins in debugging them? Otherwise we can restart php-fpm on one, and depool the other [07:35:44] elukey: we can definitely wait 15 mins, no problem :) [07:36:15] ack, checking a few things :) [07:36:28] Okay -- thanks. Please ping me once i can continue. 
[07:36:53] there are also 3 api appservers depooled https://config-master.wikimedia.org/pybal/eqiad/api-https [07:38:13] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [07:40:18] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:40:56] more metrics about the node: [07:40:57] https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-node=mw1448&from=now-24h&to=now [07:42:17] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:42:21] it seems that it started to slow down a lot [07:42:40] same thing for mw1449 [07:43:09] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:43:42] elukey: that's perfectly in line with group2 wmf.2 [07:43:58] But why only those few servers? [07:46:03] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:47:21] !log depool mw1448 manually on the node (high cpu usage from php-fpm) [07:47:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:37] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:47:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22925 and
previous config saved to /var/cache/conftool/dbconfig/20220322-074748-marostegui.json [07:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:53] depooling it causes the cpu usage to drop [07:48:55] RECOVERY - PHP opcache health on mw1448 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [07:49:29] !log restart php-fpm on mw1448 - high cpu usage right after yesterday's deployment at 21 UTC [07:49:31] That alerted during the train [07:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:40] why wasn't it checked after scap [07:49:49] (Which should have restarted anyway) [07:50:18] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:50:21] urbanecm: I just restarted php-fpm on mw1448, I want to check how it behaves with some requests [07:50:46] Ack. Take the time needed :) [07:51:49] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:51:51] for the curious, the opcache stats before the restart are https://phabricator.wikimedia.org/P22926 [07:52:07] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:52:21] so it seems an issue with opcache [07:53:00] !log restart php-fpm on mw1449 - opcache full after deployment [07:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:49] RECOVERY - PHP opcache health on mw1449 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [07:53:53] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [07:54:13] RECOVERY - OSPF status on cr2-eqsin is
OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:55:27] urbanecm: you can proceed from my point of view, metrics are good now [07:55:31] lemme know how it goes [07:55:34] elukey: thanks! Syncing [07:57:03] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.2/extensions/GrowthExperiments/modules/ext.growthExperiments.MentorDashboard/MenteeOverview/MenteeOverviewPresets.js: 84877bd: MenteeOverviewPresets.getUsersToShow: Fix typo (T304353) (duration: 00m 49s) [07:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:57:07] RhinosF1: to answer your question see https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#PHP7_opcache_health - we have a daily systemd timer that checks the opcache status, in this case the deployment caused some increase in usage and two appservers were waiting for a run of the timer to get php-fpm restarted (this is my understanding) [07:57:07] T304353: PHP Warning: preg_match() expects parameter 2 to be string, array given - https://phabricator.wikimedia.org/T304353 [07:57:16] elukey: everything went just fine now [07:57:18] thanks again [07:57:21] super [07:57:32] !log UTC morning backport window completed [07:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:13] elukey: doesn't scap also do it during the train if high [07:59:24] (It did alert then, not sure why releng didn't react) [08:00:05] dancy, hashar, brennen, dduvall, jeena, and jnuche: Your horoscope predicts another unfortunate 🚂🧪Trainsperiment Week Deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T0800). [08:00:16] hashar: jnuche: fyi T304353 errors should no longer happen [08:00:24] (I just synced the fix for it and merged to wmf.3) [08:00:49] urbanecm: thanks! 
[08:00:50] RhinosF1: it may increase right after a deployment, this is why we have the timer, and the alerts probably fell through the cracks (it happens, nothing major) [08:02:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22927 and previous config saved to /var/cache/conftool/dbconfig/20220322-080253-marostegui.json [08:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:18] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/pcc-worker1001/34467/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/772448 (https://phabricator.wikimedia.org/T304323) (owner: 10Filippo Giunchedi) [08:05:01] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:05:03] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:07:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1132 some more weight T301879', diff saved to https://phabricator.wikimedia.org/P22928 and previous config saved to /var/cache/conftool/dbconfig/20220322-080713-marostegui.json [08:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:17] T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879 [08:10:35] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:10:52] (03PS1) 10Elukey: role::kafka::logging: add PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) [08:12:20] (03CR) 10Elukey: [C: 03+2] Set bullseye + overlayfs for kubernetes1008 [puppet] - 10https://gerrit.wikimedia.org/r/772686 (https://phabricator.wikimedia.org/T300744) (owner: 10Elukey) [08:14:11] good morning [08:14:17] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:15:48] urbanecm: thanks for the patch! [08:16:31] (03PS1) 10Marostegui: Revert "db1175: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/772667 [08:17:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P22929 and previous config saved to /var/cache/conftool/dbconfig/20220322-081702-root.json [08:17:05] 10SRE, 10SRE Observability, 10observability, 10Sustainability (Incident Followup): Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (10fgiunchedi) I thought about this a little bit, perhaps the easiest to start with would be to revert the following reviews: * ht... 
[08:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:10] np hashar : [08:17:33] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:17:35] (03CR) 10Marostegui: [C: 03+2] Revert "db1175: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/772667 (owner: 10Marostegui) [08:17:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298557)', diff saved to https://phabricator.wikimedia.org/P22930 and previous config saved to /var/cache/conftool/dbconfig/20220322-081758-marostegui.json [08:18:00] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [08:18:01] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance [08:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:03] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [08:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22931 and previous config saved to /var/cache/conftool/dbconfig/20220322-081806-marostegui.json [08:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:35] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1008.eqiad.wmnet with OS bullseye [08:19:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:09] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:22:23] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34468/console" [puppet] - 10https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [08:23:58] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (10MatthewVernon) @Aklapper I don't think that's right: ` mvernon@cumin1001:~$ sudo cumin O:swift::storage 'id swift' #[...] ===== NODE GROUP =====... [08:24:34] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34469/console" [puppet] - 10https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [08:24:47] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:25:52] (03PS2) 10Elukey: role::kafka::logging: add PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) [08:26:45] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34470/console" [puppet] - 10https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [08:28:25] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:28:25] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:29:58] (KubernetesCalicoDown) firing: kubernetes1008.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:30:33] (03PS3) 10Elukey: role::kafka::logging: add PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) [08:31:13] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:31:35] (downtiming k8s alerts for 1008, reimage in progress) [08:32:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P22932 and previous config saved to /var/cache/conftool/dbconfig/20220322-083206-root.json [08:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:21] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:35:05] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1008.eqiad.wmnet with reason: host reimage [08:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:45] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (10Aklapper) Eh, thanks (and sorry). In that case, this task should depend on whatever task is about decommissioning all the old swift backends that still u... 
[08:36:57] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:37:49] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1008.eqiad.wmnet with reason: host reimage [08:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:46] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:44:58] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:47:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P22933 and previous config saved to /var/cache/conftool/dbconfig/20220322-084710-root.json [08:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:00] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:49:30] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:49:56] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1008.eqiad.wmnet with OS bullseye [08:49:58] (KubernetesCalicoDown) resolved: kubernetes1008.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:49:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:18] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (10MatthewVernon) 
I think the newest host with the old id is ms-be2056, which arrived on 2019-09-18, so we won't be decommissioning the last of these nodes... [08:51:33] (03PS1) 10Jaime Nuche: testwikis wikis to 1.39.0-wmf.3 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772789 [08:51:35] (03CR) 10Jaime Nuche: [C: 03+2] testwikis wikis to 1.39.0-wmf.3 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772789 (owner: 10Jaime Nuche) [08:52:20] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.3 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772789 (owner: 10Jaime Nuche) [08:52:26] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.3 refs T300203 [08:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:30] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [08:53:20] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [08:53:36] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [08:55:11] 10SRE, 10SRE Observability, 10observability, 10Patch-For-Review, and 2 others: Unquoted URL parameter - https://phabricator.wikimedia.org/T304323 (10fgiunchedi) [08:55:23] 10SRE, 10SRE Observability, 10observability, 10Sustainability (Incident Followup), 10User-fgiunchedi: Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (10fgiunchedi) [08:59:19] 10SRE, 10SRE Observability, 10observability, 10Sustainability (Incident Followup), 10User-fgiunchedi: Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (10Volans) I agree with this direction, as long as all the involved parties that were adding them are aware of... 
[08:59:46] !log drmrs propagate LVS med to core routers [08:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P22934 and previous config saved to /var/cache/conftool/dbconfig/20220322-090214-root.json [09:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:02:50] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:03:38] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:03:47] (03CR) 10Volans: [C: 03+1] "LGTM, is unlikely but this could cause some alert to fire, and that's a good thing :)" [puppet] - 10https://gerrit.wikimedia.org/r/772448 (https://phabricator.wikimedia.org/T304323) (owner: 10Filippo Giunchedi) [09:08:44] (03PS1) 10JMeybohm: Renew certificates for appservers and apiservers [puppet] - 10https://gerrit.wikimedia.org/r/772790 (https://phabricator.wikimedia.org/T304237) [09:09:10] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:11:56] !log restarted blazegraph on wdqs2002 (deadlocked) [09:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:06] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:14:15] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Renew certificates for appservers and apiservers [puppet] - 10https://gerrit.wikimedia.org/r/772790 (https://phabricator.wikimedia.org/T304237) (owner: 10JMeybohm) [09:17:18] !log marostegui@cumin1001 dbctl commit (dc=all): 
'db1175 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P22935 and previous config saved to /var/cache/conftool/dbconfig/20220322-091718-root.json [09:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:10] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:22:28] (03CR) 10JMeybohm: [C: 03+2] Renew certificates for appservers and apiservers [puppet] - 10https://gerrit.wikimedia.org/r/772790 (https://phabricator.wikimedia.org/T304237) (owner: 10JMeybohm) [09:23:58] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:24:23] huh, Wikibase wmf.3 is missing a backport that we did for wmf.1 and mwf.2 [09:24:33] and that (I think?) was also merged on master [09:24:43] I don’t know how it dropped out of wmf.3… [09:24:55] oh wait, sorry, nevermind. it is in there [09:24:59] all good :) [09:25:12] !log depool cp1077 for reimage - T290005 [09:25:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:16] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:25:16] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [09:28:06] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:28:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22936 and previous config saved to /var/cache/conftool/dbconfig/20220322-092830-marostegui.json [09:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:36] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [09:28:40] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:29:48] (03PS1) 10JMeybohm: Renew certificates for appservers and apiservers [puppet] - 10https://gerrit.wikimedia.org/r/772792 (https://phabricator.wikimedia.org/T304237) [09:29:51] (03CR) 10MMandere: [C: 03+2] site: Reimage cp1077 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/772431 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [09:30:47] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Renew certificates for appservers and apiservers [puppet] - 10https://gerrit.wikimedia.org/r/772792 (https://phabricator.wikimedia.org/T304237) (owner: 10JMeybohm) [09:31:34] (03CR) 10JMeybohm: [C: 03+2] Renew certificates for appservers and apiservers [puppet] - 10https://gerrit.wikimedia.org/r/772792 (https://phabricator.wikimedia.org/T304237) (owner: 10JMeybohm) [09:34:13] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1077.eqiad.wmnet with OS buster [09:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:22] 10SRE, 10Traffic, 
10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1077.eqiad.wmnet with OS buster [09:34:24] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:40:32] 10SRE, 10SRE-swift-storage, 10Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (10Aklapper) 05Open→03Stalled Let's do that [09:43:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P22937 and previous config saved to /var/cache/conftool/dbconfig/20220322-094335-marostegui.json [09:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:01] (03PS1) 10MMandere: site: Reimage cp1079 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/772793 (https://phabricator.wikimedia.org/T290005) [09:44:36] (03PS2) 10Filippo Giunchedi: nagios: quote check_http url/string parameters [puppet] - 10https://gerrit.wikimedia.org/r/772448 (https://phabricator.wikimedia.org/T304323) [09:45:06] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thanks Volans for the review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/772448 (https://phabricator.wikimedia.org/T304323) (owner: 10Filippo Giunchedi) [09:46:02] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cloudcontrol1005.wikimedia.org with reason: dcaro testing backups [09:46:04] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cloudcontrol1005.wikimedia.org with reason: dcaro testing backups [09:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:16] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:48:46] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:50:18] (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:50:20] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [09:51:07] !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1077.eqiad.wmnet with reason: host reimage [09:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:35] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:53:22] RECOVERY - LVS appservers-https codfw port 443/tcp - Main MediaWiki application server cluster- appservers.svc.codfw.wmnet -https- IPv4 #page on appservers.svc.codfw.wmnet is OK: OK - Certificate appservers-rw.discovery.wmnet will expire on Mon 06 Jul 2026 02:13:19 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:54:34] !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.3 refs T300203 (duration: 62m 07s) [09:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:38] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [09:54:58] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1077.eqiad.wmnet with reason: host reimage [09:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:15] (03Abandoned) 10Urbanecm: MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.1) - 10https://gerrit.wikimedia.org/r/772483 (https://phabricator.wikimedia.org/T304353) (owner: 10Urbanecm) [09:58:17] PROBLEM - PHP opcache health on mw1406 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [09:58:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P22938 and previous config saved to /var/cache/conftool/dbconfig/20220322-095841-marostegui.json [09:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:01] PROBLEM - Docker registry health on registry1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 143 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Docker [09:59:01] PROBLEM - Docker registry health on registry2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 143 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Docker [10:00:55] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status 
[10:00:57] PROBLEM - PHP opcache health on mw1426 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:00:59] PROBLEM - PHP opcache health on mw1411 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:01:13] (03PS1) 10Jaime Nuche: group0 wikis to 1.39.0-wmf.3 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772794 [10:01:16] (03CR) 10Jaime Nuche: [C: 03+2] group0 wikis to 1.39.0-wmf.3 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772794 (owner: 10Jaime Nuche) [10:01:19] PROBLEM - PHP opcache health on mw1404 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:01:39] PROBLEM - PHP opcache health on mw1354 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:01:45] PROBLEM - PHP opcache health on mw1361 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:01:54] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.3 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772794 (owner: 10Jaime Nuche) [10:02:27] PROBLEM - PHP opcache health on mw1322 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:02:59] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:03:01] PROBLEM - PHP opcache health on mw1365 is CRITICAL: CRITICAL: opcache full on php 7.2. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:03:13] PROBLEM - PHP opcache health on mw1454 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:03:13] PROBLEM - PHP opcache health on mw1385 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:03:32] this is not really good [10:03:37] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.3 refs T300203 [10:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:42] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [10:03:45] hashar: --^ [10:04:01] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:04:32] _joe_ jayme around ? [10:04:43] <_joe_> yes, already looking [10:04:44] elukey: we are [10:04:56] jnuche: here :] [10:04:57] o/ [10:04:59] <_joe_> elukey: scap should restart the servers in a few minutes [10:05:02] it happened after yesterday's deployment as well, only two api appservers though [10:05:08] okok [10:05:08] we have finished promoting to group 0 [10:05:11] <_joe_> when it finished sending the updates [10:05:13] I am in a google meet with Jaime [10:05:13] <_joe_> so let's wait [10:05:18] (CertAlmostExpired) resolved: Certificate for api-https:443 is about to expire - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:05:21] didn't know it okok [10:05:30] hashar: ack thanks :) [10:05:33] PROBLEM - PHP opcache health on mw1418 is CRITICAL: CRITICAL: opcache full on php 7.2. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:05:37] don't we restart the php7.2 opcache on deployment? [10:05:48] <_joe_> hashar: yes but only at the end of the rsync [10:05:51] <_joe_> which isn't ideal [10:06:07] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:06:14] <_joe_> hashar: maybe we needed to actually disable opcache revalidation for the "trainsperiment" [10:06:31] PROBLEM - PHP opcache health on mw1424 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:06:31] PROBLEM - PHP opcache health on mw1434 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:06:45] PROBLEM - PHP opcache health on mw1450 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:06:54] <_joe_> yeah this is going to get bad soon [10:06:55] could it also be filled up by the old mw versions we no more care about? I am not sure whether we cleaned them up [10:07:05] should we rollback? [10:07:09] PROBLEM - PHP opcache health on mw1353 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:07:44] <_joe_> hashar: did scap finish? [10:07:53] <_joe_> I'd wait for that [10:07:54] yes [10:08:03] <_joe_> did it run check and restart of the appservers? [10:08:25] PROBLEM - PHP opcache health on mw1420 is CRITICAL: CRITICAL: opcache full on php 7.2. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:08:30] PROBLEM - Etcd replication lag #page on conf2005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 149 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Etcd [10:08:53] * Emperor here [10:08:58] * volans here [10:09:05] PROBLEM - PHP opcache health on mw1345 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:09:05] PROBLEM - PHP opcache health on mw1351 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:09:07] PROBLEM - PHP opcache health on mw1380 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:09:07] PROBLEM - PHP opcache health on mw1405 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:09:07] PROBLEM - PHP opcache health on mw1320 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:09:08] <_joe_> ok, can someone look at the etcd replication thing? [10:09:10] _joe_: scap finished deploying/promoting to group0, no idea what that implies for the appservers [10:09:13] <_joe_> I have to work on appservers [10:09:17] * volans looking at etcd [10:09:24] I would assume scap to have restarted the opcache [10:09:31] should we rollback? 
[10:09:35] <_joe_> no [10:10:20] <_joe_> I'm not sure why the alerts are even firing tbh [10:10:39] volans: I'm reading wikitech about etcd replication [10:10:57] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:11:10] https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication does not fill me with joy [10:11:46] <_joe_> can someone ack that alert? [10:11:47] PROBLEM - PHP opcache health on mw1327 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:11:49] PROBLEM - PHP opcache health on mw1386 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:11:51] PROBLEM - PHP opcache health on mw1433 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:11:53] <_joe_> ok so [10:12:00] RECOVERY - LVS appservers-https eqiad port 443/tcp - Main MediaWiki application server cluster- appservers.svc.eqiad.wmnet -https- IPv4 #page on appservers.svc.eqiad.wmnet is OK: OK - Certificate appservers-rw.discovery.wmnet will expire on Mon 06 Jul 2026 02:13:19 PM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:12:20] <_joe_> I am going to run a rolling restart of appservers [10:12:35] <_joe_> volans: is replication actually running or not? 
[10:12:58] <_joe_> because that tells me how should I do the rolling restart [10:13:30] if I look at https://grafana.wikimedia.org/d/GuHySj3mz/mediawiki-php-service?orgId=1&from=now-2d&to=now [10:13:43] _joe_: the /lag endpoint returns -1, the cdfw cluster is heathy (from etcdctl), I'm checking the replication process now [10:13:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22939 and previous config saved to /var/cache/conftool/dbconfig/20220322-101346-marostegui.json [10:13:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [10:13:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [10:13:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:51] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [10:13:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22940 and previous config saved to /var/cache/conftool/dbconfig/20220322-101354-marostegui.json [10:13:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:57] the used memory / number of keys seem to get flushed from time to time since yesterday [10:13:59] PROBLEM - LVS datahubsearch eqiad port 9200/tcp - Search cluster serving DataHub IPv4 on datahubsearch.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 495 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:07] so I am guessing 
we are now overflowing the opcache [10:14:28] https://grafana.wikimedia.org/d/GuHySj3mz/mediawiki-php-service?orgId=1&from=now-2d&to=now&viewPanel=34 [10:14:31] etcdmirror is logging things [10:14:31] Mar 22 10:13:54 conf2005 etcdmirror-conftool-eqiad-wmnet[1440]: [etcd-mirror] INFO: Replicating key /conftool/v1/mediawiki-config/eqiad/dbconfig at index 460777 [10:14:34] so seems replicating [10:14:42] <_joe_> volans: yes replication works [10:14:47] <_joe_> not sure what the page is about [10:14:57] <_joe_> ok I'll work on the actual production problem [10:15:17] _joe_: the check checks for [10:15:18] check_http_url_for_regexp_on_port!conf2005.codfw.wmnet!8000!/lag!'^(-[1-9]|[0-5][^0-9]+)' [10:15:36] nad that endpoint currently returns -1 [10:15:51] "HTTP/1.1 200 OK - pattern not found" is thrown by docker registry servers as well...maybe something in the check changed [10:15:53] but this might be an artifact of the added quotes to URL [10:15:55] checking [10:16:00] +1 [10:16:07] etcdctl cluster-health says OK (sorry, I'm starting from ~0 knowledge here) [10:16:13] RECOVERY - PHP opcache health on mw1433 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:16:37] yes I think this might be an artifact of added quotes to the URL parameter in icinga command, I'm checking [10:16:57] ack [10:16:59] PROBLEM - PHP opcache health on mw1314 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:17:01] PROBLEM - PHP opcache health on mw1333 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:17:01] PROBLEM - PHP opcache health on mw1343 is CRITICAL: CRITICAL: opcache full on php 7.2. 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:17:03] PROBLEM - PHP opcache health on mw1371 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:17:05] PROBLEM - PHP opcache health on mw1394 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:17:23] PROBLEM - Docker registry health on registry1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 143 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Docker [10:17:23] PROBLEM - Docker registry health on registry2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 143 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Docker [10:17:37] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:18:04] _joe_: confirmed etcd is all good, I'm sending a patch to fix the check [10:18:04] <_joe_> ok I understood what the problem is [10:18:16] <_joe_> volans: what happened? [10:18:18] _joe_: should we clean up the old mediawiki versions? [10:18:27] double quoting, one on the check definition oe on the commands [10:18:36] <_joe_> hashar: that doesn't matter about that [10:18:43] <_joe_> volans: ok who changed that? [10:18:51] RECOVERY - PHP opcache health on mw1424 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:18:51] <_joe_> what changed that I mean [10:18:52] cause the opcache is only filed when files are being read isn't it ? 
[10:19:13] _joe_: we changed some quoting on Sunday when looking at the cert expiry page [10:19:13] <_joe_> hashar: correct [10:19:14] _joe_: https://github.com/wikimedia/puppet/commit/033278f474e09e1ef2d24ceced220c0673e2b840 [10:19:18] _joe_: filippo's patch to fi the unquote URL parameter earlier [10:19:43] PROBLEM - PHP opcache health on mw1313 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:19:45] PROBLEM - PHP opcache health on mw1326 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:19:49] PROBLEM - PHP opcache health on mw1400 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:20:32] To clarify: do we think all these pages are in fact the quoting issue, or is there also something unhappy? [10:20:37] RECOVERY - PHP opcache health on mw1354 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:20:37] <_joe_> hashar: can you paste me somewhere the output of your scap command? [10:20:43] (sorry to still be asking the stupid questions) [10:20:47] jnuche is running the train [10:20:52] <_joe_> jnuche then [10:21:02] <_joe_> because I'm not sure why the restart didn't happen. [10:21:07] volans: are you fixing that on the caller-side? 
[10:21:28] _joe_ one sec [10:21:39] (03PS1) 10Volans: icinga: remove quotes from ereg parameter [puppet] - 10https://gerrit.wikimedia.org/r/772802 [10:21:52] _joe_: I don't think it did yesterday either as there was still some with issues from yesterday that elukey had to restart this morning [10:21:59] patch here ^^^ Emperor [10:22:09] ah [10:22:16] jayme: no on the command because there are multple callers and some with many escapes [10:22:19] <_joe_> !log running check-and-restart on mw-eqiad-appservers [10:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:22] Emperor: some are due to that, but opcache is different [10:22:24] so the quick fix is to reveer the added quote [10:22:29] the TODO is to do it properly later [10:22:34] yes [10:22:36] RECOVERY - PHP opcache health on mw1327 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:22:43] wanted to point that out :) [10:22:48] https://usercontent.irccloud-cdn.com/file/RlXEwrXK/trainsperiment-tues.log [10:22:58] _joe_: ^^ [10:23:02] RECOVERY - PHP opcache health on mw1351 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:23:04] RECOVERY - PHP opcache health on mw1405 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:23:04] RECOVERY - PHP opcache health on mw1320 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:23:15] <_joe_> jnuche: please use phabricator's pastes [10:23:24] <_joe_> so we can refer to them in tasks [10:23:38] RECOVERY - PHP opcache health on mw1365 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:23:46] (03PS2) 10Volans: icinga: remove quotes from ereg parameter [puppet] - 10https://gerrit.wikimedia.org/r/772802 
(https://phabricator.wikimedia.org/T304323) [10:23:48] (03CR) 10JMeybohm: [C: 03+1] icinga: remove quotes from ereg parameter [puppet] - 10https://gerrit.wikimedia.org/r/772802 (https://phabricator.wikimedia.org/T304323) (owner: 10Volans) [10:23:49] <_joe_> jnuche: wait so the sync-apaches is still not finished? [10:23:50] (03CR) 10MVernon: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/772802 (https://phabricator.wikimedia.org/T304323) (owner: 10Volans) [10:24:02] I am in transit, my apologies for the disruption :( [10:24:03] (03CR) 10Volans: [V: 03+2 C: 03+2] icinga: remove quotes from ereg parameter [puppet] - 10https://gerrit.wikimedia.org/r/772802 (https://phabricator.wikimedia.org/T304323) (owner: 10Volans) [10:24:10] (got 5 mins left on the ack by the way, just FYI) [10:24:12] RECOVERY - PHP opcache health on mw1326 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:24:12] RECOVERY - PHP opcache health on mw1353 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:24:30] <_joe_> TheresNoTime: which ack? [10:24:39] * volans running puppet on alert1001 [10:24:40] _joe_: no, it finished, it seems the file didn't flush [10:24:54] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:24:58] `alertname="PHP opcache health"`, someone asked for the alert to be ack'd? [10:25:02] <_joe_> jnuche: please post a complete log to phabricator [10:25:08] <_joe_> TheresNoTime: not that one :) [10:25:12] _joe_: on it [10:25:15] oh, sorry _joe_ [10:25:31] (removed) [10:26:07] <_joe_> TheresNoTime: sorry, where did you "ack" it? 
[10:26:22] RECOVERY - PHP opcache health on mw1385 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:26:24] RECOVERY - PHP opcache health on mw1434 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:26:32] s/ack/silence at alerts.wikimedia.org [10:26:34] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: The command defined for service Gerrit JSON does not exist https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [10:26:38] PROBLEM - PHP opcache health on mw1447 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:26:44] RECOVERY - PHP opcache health on mw1345 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:26:54] <_joe_> !log running check-restart-php on api appservers [10:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:56] volans: I've resolved the etcd incident in VO [10:27:00] the `script` command refuses to flush the log :] [10:27:01] (CirrusSearchHighOldGCFrequency) firing: (6) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:27:22] Emperor: ack, it should recover shortly and would have done that automatically [10:27:27] but that's ok too :) [10:27:27] volans: is the gerrit alert another monitoring issue [10:27:54] (03PS3) 10Jgiannelos: Remove unused wgKartographerDfltStyle after tegola roll out [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T298249) [10:27:58] RECOVERY - PHP opcache health on mw1386 is OK: OK:
opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:28:00] the docker registry should be the same [10:28:10] RECOVERY - PHP opcache health on mw1447 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:28:11] yes. So is datahubsearch [10:28:12] RECOVERY - PHP opcache health on mw1343 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:28:18] RECOVERY - PHP opcache health on mw1380 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:28:21] <_joe_> hashar, jnuche to be clear, scap *should have* restarted php-fpm [10:28:22] RECOVERY - Docker registry health on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Docker [10:28:31] Gerrit could be something different, though [10:28:39] RhinosF1: the Gerrit JSON one? [10:28:50] volans: yes [10:28:54] _joe_: yeah that is my expectation. jnuche script output doesn't have the full output most probably cause script has output buffering [10:28:58] RECOVERY - PHP opcache health on mw1394 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:29:10] maybe the scap logs in kibana have some details.
I am digging there [10:29:14] yes it's similar, has quotes in the URL parameter on the caller side [10:29:17] fixing thx [10:29:21] looking at gerrit dashboards, so far nothing obvious [10:30:40] RECOVERY - PHP opcache health on mw1415 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:30:40] RECOVERY - PHP opcache health on mw1411 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:30:46] _joe_: https://phabricator.wikimedia.org/P22941 [10:30:48] RECOVERY - LVS datahubsearch eqiad port 9200/tcp - Search cluster serving DataHub IPv4 on datahubsearch.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 495 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:30:48] RECOVERY - Docker registry health on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Docker [10:30:48] RECOVERY - Docker registry health on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Docker [10:30:49] RECOVERY - Etcd replication lag #page on conf2005 is OK: HTTP OK: HTTP/1.1 200 OK - 149 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Etcd [10:30:49] RECOVERY - Docker registry health on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Docker [10:31:03] (03PS1) 10Volans: icinga: avoid double quoting for the URL [puppet] - 10https://gerrit.wikimedia.org/r/772806 (https://phabricator.wikimedia.org/T304323) [10:31:08] RhinosF1: ^^^ [10:31:23] Mar 22, 2022 @ 09:54:25 sync-world Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 86 host(s) [10:31:23] Mar 22, 2022 @ 10:03:28 sync-wikiversions Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 86 host(s) 
[10:31:36] (03CR) 10RhinosF1: [C: 03+1] icinga: avoid double quoting for the URL [puppet] - 10https://gerrit.wikimedia.org/r/772806 (https://phabricator.wikimedia.org/T304323) (owner: 10Volans) [10:31:56] <_joe_> 10:03:28 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 86 host(s) [10:32:00] <_joe_> ok this is the issue [10:32:05] volans: looks ok [10:32:05] <_joe_> it just ran on 86 hosts [10:32:08] <_joe_> no idea why [10:32:11] yeah no idea why [10:32:12] ahah [10:32:16] (03CR) 10Volans: [C: 03+2] icinga: avoid double quoting for the URL [puppet] - 10https://gerrit.wikimedia.org/r/772806 (https://phabricator.wikimedia.org/T304323) (owner: 10Volans) [10:32:43] the check for gerrit still looks to have too many quotes [10:32:46] RECOVERY - PHP opcache health on mw1400 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:32:46] RECOVERY - PHP opcache health on mw1418 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:32:57] Emperor: see the patch just merged [10:32:59] (03PS1) 10David Caro: Removed mdipietro and added vrook [puppet] - 10https://gerrit.wikimedia.org/r/772807 [10:33:01] _joe_: have you manually restarted php on all app servers? [10:33:12] <_joe_> hashar: yes [10:33:27] <_joe_> hashar: with this scap bug unsolved, we can't proceed further. [10:33:34] volans: as ever you are ahead of me :) [10:33:41] (03CR) 10RhinosF1: [C: 04-1] "took already has a data.yaml entry" [puppet] - 10https://gerrit.wikimedia.org/r/772807 (owner: 10David Caro) [10:34:06] dcaro: that data.yaml is completely wrong [10:34:28] <_joe_> hashar: will you open a task or should I? 
[10:34:49] a) you're adding rook to absented there b) we never replace people c) rook is already in data.yaml [10:36:07] _joe_: please do :) [10:36:20] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:23] <_joe_> hashar: frankly, I'd prefer if you did own the issue. [10:36:30] <_joe_> but ok [10:36:37] looks like that issue has been there for a while. March 7th had 86 hosts, March 3rd 91 hosts [10:36:42] <_joe_> how do I make it a blocker for all trains this week? [10:36:50] RECOVERY - PHP opcache health on mw1314 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:37:05] _joe_: I will file it :) [10:37:17] I don't want you to be burdened by too many tasks! :D [10:37:26] volans: since you're working on icinga, I see it's complaining about config errors [10:37:26] _joe_: it's all one task for this week [10:37:26] (03CR) 10Marostegui: [C: 03+2] Revert "db1175: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/772482 (owner: 10Marostegui) [10:37:59] Emperor: that's for dcaro [10:38:00] Error: Could not find any contact matching 'mdipietro' (config file '/etc/icinga/objects/contactgroups.cfg', starting on line 67) [10:38:27] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10JMeybohm) [10:38:28] RECOVERY - PHP opcache health on mw1426 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:38:46] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:39:04] RECOVERY - PHP opcache health on mw1404 is OK: OK: opcache is healthy
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:39:26] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [10:39:48] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10JMeybohm) Certs have been renewed (with cergen managed ones). Thanks @Joe for pairing! [10:40:15] (03CR) 10David Caro: Removed mdipietro and added vrook (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/772807 (owner: 10David Caro) [10:40:59] (03CR) 10David Caro: Removed mdipietro and added vrook (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/772807 (owner: 10David Caro) [10:41:19] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1077.eqiad.wmnet with OS buster [10:41:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:23] volans: yep, just changed that [10:41:26] RECOVERY - PHP opcache health on mw1361 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:41:27] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1077.eqiad.wmnet with OS buster com... 
[10:41:32] dcaro: if you're removing an entry from [10:41:35] Data.yaml [10:41:44] Or replacing someone's shell name, it's probably wrong [10:41:50] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:10] RECOVERY - PHP opcache health on mw1313 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:42:10] RECOVERY - PHP opcache health on mw1333 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:42:37] <_joe_> jayme: can you take a look at deploy_to_mwdebug ? [10:42:59] RhinosF1: can you elaborate on what's the right thing? [10:43:21] dcaro: if you're adding a new person to data.yaml, you add a new entry [10:43:32] (03CR) 10MSantos: [C: 03+1] "This is ready to go. LGTM." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T298249) (owner: 10Jgiannelos) [10:43:36] RhinosF1: what if I'm replacing a person? [10:43:38] And move your now left worker to absent, drop their ssh key and add them to absented [10:43:41] (renaming) [10:43:41] dcaro: you don't [10:43:50] dcaro: and vrook should not be in the absented group IMHO [10:43:57] we don't replace shell accounts in data.yaml [10:44:05] because a new staff member joined [10:44:10] And old left [10:44:14] that's different though [10:44:16] RECOVERY - PHP opcache health on mw1371 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:44:46] <_joe_> RhinosF1: there's context you're clearly missing. 
[10:44:58] dcaro: https://github.com/wikimedia/puppet/commit/115af1f6971775168cb49fc21c5809c280badbcb was done ages ago for the same user though [10:45:04] They already have a shell account [10:45:14] RECOVERY - PHP opcache health on mw1454 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:45:23] <_joe_> RhinosF1: please let's stop discussing this here. [10:45:57] +1 there's quite a lot of noise right now [10:46:12] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:46:19] !log pool cp1077 with HAProxy as TLS termination layer - T290005 [10:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:24] T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 [10:46:42] RECOVERY - PHP opcache health on mw1420 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:48:18] RECOVERY - PHP opcache health on mw1406 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:48:46] _joe_: yes [10:49:00] RECOVERY - PHP opcache health on mw1322 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:49:50] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:51:08] RECOVERY - PHP opcache health on mw1450 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [10:51:10] _joe_: seems broken since friday [10:51:51] filed as https://phabricator.wikimedia.org/T304414 [10:52:05] <_joe_> jayme: uhh since after I fixed it? 
[10:52:22] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [10:52:26] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:52:33] <_joe_> jayme: anyways, can you take a look and fix it? I have to work on other stuff [10:52:33] _joe_: not sure when exactly that was. error file is from 2022-03-18T14:43:22.703871 [10:52:39] sure, sure [10:54:02] (03PS2) 10David Caro: Removed mdipietro and added vrook [puppet] - 10https://gerrit.wikimedia.org/r/772807 [10:54:10] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:55:38] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [10:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:28] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:38] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:56:43] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: Cert renewal for 
{appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (10Volans) Thanks! I think we can now destroy the ones in the Puppet CA mentioned in T304237#7790839 at this point. [10:57:02] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:57:05] ^ looking [10:58:00] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:59:26] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:59:50] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:00:08] 10SRE, 10SRE Observability, 10observability, 10Patch-For-Review, and 2 others: Unquoted URL parameter - https://phabricator.wikimedia.org/T304323 (10Volans) Unfortunately this had some follow up alert (some expected) due to double quoting, done both in the caller and the command definition. I think we shou... 
[11:00:50] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/772807 (owner: 10David Caro) [11:01:16] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:38] (03CR) 10David Caro: [C: 03+2] Removed mdipietro and added vrook [puppet] - 10https://gerrit.wikimedia.org/r/772807 (owner: 10David Caro) [11:01:56] (03CR) 10David Caro: [C: 03+2] Removed mdipietro and added vrook (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/772807 (owner: 10David Caro) [11:02:44] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:03:31] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [11:03:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [11:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:32] (03CR) 10Jgiannelos: Remove unused wgKartographerDfltStyle after tegola roll out [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T298249) (owner: 10Jgiannelos) [11:07:38] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Sat 28 May 2022 08:33:22 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [11:08:00] RhinosF1: here the recovery you asked for ^ [11:08:17] volans: :), thanks for looking into it [11:09:27] (03PS3) 10Majavah: metricsinfra: Use prometheus-configurator [puppet] - 10https://gerrit.wikimedia.org/r/763664 (https://phabricator.wikimedia.org/T286299) [11:09:50] _joe_: fyi the failing release did not get ready because "Readiness probe failed: HTTP probe failed with statuscode: 503" - should be good now [11:10:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [11:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:06] <_joe_> jayme: uh [11:10:12] <_joe_> that's pretty bad though :P [11:10:22] yeah...it rolled back ofc [11:10:27] <_joe_> also, imagine we have to do a release during an outage [11:10:35] <_joe_> sigh. [11:11:44] in that case, we should potentially force it. But it's part of the deployment strategy of k8s to not continue when new pods don't come up healthy [11:13:06] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [11:15:25] (03PS3) 10Ladsgroup: idp: Open up orchestrator to cumin host, take IV [puppet] - 10https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: 10Jbond) [11:16:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22942 and previous config saved to /var/cache/conftool/dbconfig/20220322-111607-marostegui.json [11:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:16] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [11:22:44] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:25:11] (03PS1) 10AikoChou: ml-services: update draft/article quality docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/772811 (https://phabricator.wikimedia.org/T300270) [11:27:14] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:28:44] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:29:22] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:29:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1100 for reboot', diff saved to https://phabricator.wikimedia.org/P22943 and previous config saved to /var/cache/conftool/dbconfig/20220322-112931-marostegui.json [11:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1123 for reboot', diff saved to https://phabricator.wikimedia.org/P22944 and previous config saved to /var/cache/conftool/dbconfig/20220322-113003-marostegui.json [11:30:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:01] !log Reboot db1100 and db1123 for kernel upgrade before master swap [11:31:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P22945 and previous config saved to /var/cache/conftool/dbconfig/20220322-113113-marostegui.json [11:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:44] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:35:54] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:36:16] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:36:42] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:40:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 10%: After reboot', diff saved to https://phabricator.wikimedia.org/P22946 and previous config saved to /var/cache/conftool/dbconfig/20220322-114051-root.json [11:40:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 10%: After reboot', diff saved to https://phabricator.wikimedia.org/P22948 and previous config saved to /var/cache/conftool/dbconfig/20220322-114102-root.json [11:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:20] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [11:46:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P22949 and previous config saved to /var/cache/conftool/dbconfig/20220322-114618-marostegui.json [11:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:48] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:48:49] _joe_: I think the issue is the `appserver` dsh group which is empty. 
It is generated from a hiera value having `service: apache2` but apparently that is now using `nginx` [11:49:08] scap to all servers work though cause it uses another group: `mediawiki-installation` [11:49:22] my debug digging is in https://phabricator.wikimedia.org/T304414#7796144 and following comment [11:49:35] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (10cmooney) Also @valerio.bozzolan you should feel free to email the IPs to noc@wikimedia.org if you wish to avoid putting them here wh... [11:49:35] essentially /etc/dsh/group/appserver is empty [11:49:39] <_joe_> hashar: yeah that's probably it, I was sure I did change it when we removed the cluster [11:49:45] so we do not restart php opcache there [11:49:48] <_joe_> hashar: yeah I'll fix that [11:50:01] <_joe_> hashar: although some of the servers having issues were apis [11:50:07] <_joe_> so I guess there's more going on [11:50:19] with https://gerrit.wikimedia.org/r/c/operations/puppet/+/767203 you have updated mediawiki-installation but haven't updated the appserver group [11:50:31] <_joe_> yeah I was looking at that exactly [11:50:39] <_joe_> it's an easy fix thankfully [11:51:07] and I have no idea why we run the opcache restart against hosts of `appserver,api_appserver,jobrunner,testserver,parsoid_php` [11:51:08] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (10valerio.bozzolan) [11:51:15] instead of all the ones from `mediawiki-installation` [11:51:20] maybe cause of dumps host [11:51:22] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (10valerio.bozzolan) I've added all the details in a nice private Paste visible to you (P22947) 
and added it in the Task description. T... [11:51:26] anyway issue found ;] [11:51:28] <_joe_> hashar: so that we run in parallel on multiple clusters [11:51:41] <_joe_> instead of running sequentially through all mw servers [11:51:51] <_joe_> we can run on 10% of each cluster safely [11:51:59] ah maybe [11:52:09] <_joe_> instead of being forced to run on 10% of the smallest cluster to be safe [11:52:22] anyway problem solved! I am going to have lunch and we will resume the train :] [11:53:13] jnuche: I have found the issue. The list of servers to restart php opcache on is incomplete ^ [11:53:17] <_joe_> hashar: yeah gimme the time to fix the issue :) [11:53:30] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:53:31] s/apache2/nginx/ ! [11:53:40] I am getting lunch & [11:54:33] 10SRE, 10Traffic: Remove image check on Varnish Dockerized Test Environment - https://phabricator.wikimedia.org/T303794 (10MMandere) 05Open→03Resolved [11:55:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 25%: After reboot', diff saved to https://phabricator.wikimedia.org/P22950 and previous config saved to /var/cache/conftool/dbconfig/20220322-115557-root.json [11:56:00] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (10valerio.bozzolan) [11:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 25%: After reboot', diff saved to https://phabricator.wikimedia.org/P22951 and previous config saved to /var/cache/conftool/dbconfig/20220322-115606-root.json [11:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317
(T298557)', diff saved to https://phabricator.wikimedia.org/P22952 and previous config saved to /var/cache/conftool/dbconfig/20220322-120123-marostegui.json [12:01:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:01:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [12:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:29] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [12:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:36] 10SRE, 10Analytics, 10Data-Engineering: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10Ottomata) This would be easier if {T276972} was done, but it doesn't look like there's enthusiasm for it. I'd love to be able to automate ingestion fro... [12:04:28] (03PS1) 10Cathal Mooney: Re-enable direct path to Seabone / Telecom Italia in Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/772815 [12:04:30] (03PS1) 10Ladsgroup: Enable WRITE BOTH for templatelinks normalization in wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772816 (https://phabricator.wikimedia.org/T299421) [12:04:44] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:05:53] (03CR) 10Cathal Mooney: [C: 03+2] Re-enable direct path to Seabone / Telecom Italia in Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/772815 (owner: 10Cathal Mooney) [12:06:19] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Re-enable direct path to Seabone / Telecom Italia in Eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/772815 (owner: 10Cathal Mooney) [12:08:23] jouncebot: nowandnext [12:08:23] No deployments scheduled for the next 0 hour(s) and 51 minute(s) [12:08:23] In 0 hour(s) and 51 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T1300) [12:08:42] (03CR) 10Ladsgroup: [C: 03+2] Enable WRITE BOTH for templatelinks normalization in wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772816 (https://phabricator.wikimedia.org/T299421) (owner: 10Ladsgroup) [12:09:05] (03CR) 10EllenR: "This looks good; however I like having the tags (T123456) for the various changes. 
It is very helpful to understand why particular pieces " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan) [12:09:24] (03Merged) 10jenkins-bot: Enable WRITE BOTH for templatelinks normalization in wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772816 (https://phabricator.wikimedia.org/T299421) (owner: 10Ladsgroup) [12:09:47] (03CR) 10EllenR: [C: 03+1] "Sorry, forgot to get the code review number in -" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan) [12:11:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 50%: After reboot', diff saved to https://phabricator.wikimedia.org/P22953 and previous config saved to /var/cache/conftool/dbconfig/20220322-121101-root.json [12:11:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 50%: After reboot', diff saved to https://phabricator.wikimedia.org/P22954 and previous config saved to /var/cache/conftool/dbconfig/20220322-121110-root.json [12:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:10] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:12:12] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:772816|Enable WRITE BOTH for templatelinks normalization in wikitech (T299421)]] (duration: 01m 41s) [12:12:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:16] T299421: Turn on write both in production for templatelinks normalization - https://phabricator.wikimedia.org/T299421 [12:13:26] (03PS1) 10Ladsgroup: Enable WRITE BOTH on rest of s6 for templatelinks normalization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772817 
(https://phabricator.wikimedia.org/T299421) [12:14:52] 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10DMburugu) I approve the request. [12:15:02] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:15:26] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:15:41] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:15:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:49] (03PS1) 10Cathal Mooney: Revert "Re-enable direct path to Seabone / Telecom Italia in Eqiad" [homer/public] - 10https://gerrit.wikimedia.org/r/772668 [12:16:04] !log dbmaint s8@eqiad T300992 [12:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:07] (03PS2) 10Ladsgroup: Enable WRITE BOTH on rest of s6 for templatelinks normalization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772817 (https://phabricator.wikimedia.org/T299421) [12:16:08] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:16:08] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [12:17:31] 10SRE, 10SRE Observability, 10observability, 10Sustainability (Incident Followup), 10User-fgiunchedi: Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (10jbond) +1 i think the -C change was mostly introduced by me and happy for it to be reverted, other options... 
[12:17:34] !log dbmaint s5@eqiad T300992 [12:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:14] (03CR) 10Cathal Mooney: [C: 03+2] Revert "Re-enable direct path to Seabone / Telecom Italia in Eqiad" [homer/public] - 10https://gerrit.wikimedia.org/r/772668 (owner: 10Cathal Mooney) [12:18:38] (03CR) 10Cathal Mooney: [V: 03+2 C: 03+2] Revert "Re-enable direct path to Seabone / Telecom Italia in Eqiad" [homer/public] - 10https://gerrit.wikimedia.org/r/772668 (owner: 10Cathal Mooney) [12:18:53] !log dbmaint s6@eqiad T300992 [12:18:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:19:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:14] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (10cmooney) Thanks for the info @valerio.bozzolan It seems the return traffic to that address was routing out of our network to Telia... 
[12:20:40] (03CR) 10Ladsgroup: [C: 03+2] Enable WRITE BOTH on rest of s6 for templatelinks normalization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772817 (https://phabricator.wikimedia.org/T299421) (owner: 10Ladsgroup) [12:21:15] !log dbmaint s7@eqiad T300992 [12:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:19] T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992 [12:21:23] (03Merged) 10jenkins-bot: Enable WRITE BOTH on rest of s6 for templatelinks normalization [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772817 (https://phabricator.wikimedia.org/T299421) (owner: 10Ladsgroup) [12:23:12] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:36] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:772817|Enable WRITE BOTH on rest of s6 for templatelinks normalization (T299421)]] (duration: 00m 54s) [12:24:38] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:41] T299421: Turn on write both in production for templatelinks normalization - https://phabricator.wikimedia.org/T299421 [12:24:49] !log dbmaint s3@eqiad T300600 [12:24:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:55] T300600: Upgrade s3 to Bullseye - https://phabricator.wikimedia.org/T300600 [12:26:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 75%: After reboot', diff saved to https://phabricator.wikimedia.org/P22955 and previous config saved to /var/cache/conftool/dbconfig/20220322-122605-root.json [12:26:08] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 75%: After reboot', diff saved to https://phabricator.wikimedia.org/P22956 and previous config saved to /var/cache/conftool/dbconfig/20220322-122613-root.json [12:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:02] (03PS1) 10Arturo Borrero Gonzalez: openstack: networktests: discard even more hostkey checking stuff [puppet] - 10https://gerrit.wikimedia.org/r/772818 (https://phabricator.wikimedia.org/T304420) [12:28:19] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [12:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:30:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132 after testing', diff saved to https://phabricator.wikimedia.org/P22957 and previous config saved to /var/cache/conftool/dbconfig/20220322-123056-marostegui.json [12:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [12:32:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [12:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:50] (03CR) 10Aklapper: "Please abandon if this is not wanted/needed anymore" [deployment-charts] - 10https://gerrit.wikimedia.org/r/748734 (owner: 10Varac) [12:33:14] (03CR) 10JMeybohm: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/770556 (https://phabricator.wikimedia.org/T290966) (owner: 10JMeybohm) [12:33:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [12:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:08] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:36:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [12:36:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [12:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 100%: After reboot', diff saved to https://phabricator.wikimedia.org/P22958 and previous config saved to /var/cache/conftool/dbconfig/20220322-124109-root.json [12:41:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 100%: After reboot', diff saved to https://phabricator.wikimedia.org/P22959 and previous config saved to /var/cache/conftool/dbconfig/20220322-124117-root.json [12:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [12:41:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [12:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:27] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:00] (03PS2) 10Arturo Borrero Gonzalez: openstack: networktests: discard even more hostkey checking stuff [puppet] - 10https://gerrit.wikimedia.org/r/772818 (https://phabricator.wikimedia.org/T304420) [12:44:18] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:44:24] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:45:18] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (10cmooney) Ok I've emailed Seabone/TI NOC now, hopefully they come back with something meaningful. There isn't a whole lot more we ca... [12:51:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [12:51:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance [12:51:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance [12:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:14] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance [12:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:52] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:52:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 
6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [12:52:06] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance [12:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:52:20] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:54:28] !log installing 5.10.103 kernels on servers running a kernel from buster backports T303179 [12:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:44] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:54:46] 10SRE, 10Znuny, 10serviceops, 10Patch-For-Review: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (10jbond) The change has been made on the private repo ` git show b9303238 [12:52:... [12:55:50] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 
4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [12:56:05] (03PS1) 10Jbond: vtrs: move password to profile name space [labs/private] - 10https://gerrit.wikimedia.org/r/772821 (https://phabricator.wikimedia.org/T303272) [12:56:47] (03CR) 10Jbond: [V: 03+2 C: 03+2] vtrs: move password to profile name space [labs/private] - 10https://gerrit.wikimedia.org/r/772821 (https://phabricator.wikimedia.org/T303272) (owner: 10Jbond) [12:56:56] o/ [12:56:58] backkk [12:57:20] <_joe_> hashar: sorry I got diverted by other stuff, will do the patch now [12:57:37] :D [12:58:00] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:58:20] the expected outcome is `/etc/dsh/group/appserver` should have hosts defined [12:58:33] (03PS5) 10Jbond: mariadb: Reference the actual VRTS passwords in the m2 grants file. [puppet] - 10https://gerrit.wikimedia.org/r/764744 (https://phabricator.wikimedia.org/T303272) (owner: 10Kormat) [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T1300). [13:00:05] nemo-yiannis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:27] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34472/console" [puppet] - 10https://gerrit.wikimedia.org/r/764744 (https://phabricator.wikimedia.org/T303272) (owner: 10Kormat) [13:01:24] o/ [13:01:28] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:01:32] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:01:33] (03CR) 10Jbond: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/764744 (https://phabricator.wikimedia.org/T303272) (owner: 10Kormat) [13:02:47] (03PS1) 10Giuseppe Lavagetto: scap: fix dsh targets for php restarts [puppet] - 10https://gerrit.wikimedia.org/r/772822 (https://phabricator.wikimedia.org/T304414) [13:03:16] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:03:31] (03CR) 10Giuseppe Lavagetto: [C: 03+2] scap: fix dsh targets for php restarts [puppet] - 10https://gerrit.wikimedia.org/r/772822 (https://phabricator.wikimedia.org/T304414) (owner: 10Giuseppe Lavagetto) [13:04:02] (03PS6) 10Ladsgroup: mariadb: DRY username of wikiuser to hiera [puppet] - 10https://gerrit.wikimedia.org/r/770890 [13:06:32] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:07:52] 10SRE, 10SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10jbond) [13:07:56] 10SRE, 10SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10jbond) @thcipriani are you able to approve @TThoabala membership of the deployment group @Tchanders Sounds good to me, ill get all the approvals in [lace and create the change... 
[13:08:23] (03PS7) 10Ladsgroup: mariadb: DRY username of wikiuser to hiera [puppet] - 10https://gerrit.wikimedia.org/r/770890 [13:08:32] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: DRY username of wikiuser to hiera [puppet] - 10https://gerrit.wikimedia.org/r/770890 (owner: 10Ladsgroup) [13:08:42] <_joe_> hashar: fixed [13:08:56] <_joe_> thanks for the analysis and sorry for the issue arising in the first place :/ [13:08:58] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:10:06] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:10:08] _joe_: it happens :D [13:10:32] there are so many layers of config it is hard to figure it out entirely [13:10:42] conftool / hiera / dsh files / scap itself etc [13:11:21] (03PS1) 10Jbond: admin: add tsepothoabala to deployment [puppet] - 10https://gerrit.wikimedia.org/r/772823 (https://phabricator.wikimedia.org/T303398) [13:11:26] if we ran the php opcache restart via scap, we surely would have noticed it [13:11:52] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:12:56] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:13:00] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:14:50] (03CR) 10Jbond: [C: 04-1] "ill -1 this until TsepoThoabala returns" [puppet] - 10https://gerrit.wikimedia.org/r/772823 (https://phabricator.wikimedia.org/T303398) (owner: 10Jbond) [13:15:10] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status 
[13:16:26] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (10jbond) 05Open→03Stalled Change to stalled until TsepoThoabala return [13:19:25] !log aborrero@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudgw2002-dev.codfw.wmnet [13:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:28] !log aborrero@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudgw2002-dev.codfw.wmnet [13:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:01] !log aborrero@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudgw2002-dev.codfw.wmnet [13:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:35] (03PS30) 10Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) [13:21:09] (03PS16) 10MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [13:21:14] we are promoting 1.39.0-wmf.3 to group 1 [13:21:44] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/pcc-worker1001/34473/" [puppet] - 10https://gerrit.wikimedia.org/r/772818 (https://phabricator.wikimedia.org/T304420) (owner: 10Arturo Borrero Gonzalez) [13:22:32] (03PS1) 10Jcrespo: Initial release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772824 [13:23:00] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [13:23:16] (03PS1) 10Jaime Nuche: group1 wikis to 1.39.0-wmf.3 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772826 [13:23:18] (03CR) 10Jaime Nuche: [C: 03+2] 
group1 wikis to 1.39.0-wmf.3 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772826 (owner: 10Jaime Nuche) [13:23:58] (03CR) 10Btullis: Add helm charts and a helmfile configuration for datahub (0312 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: 10Btullis) [13:24:08] (03Merged) 10jenkins-bot: group1 wikis to 1.39.0-wmf.3 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772826 (owner: 10Jaime Nuche) [13:24:14] (03PS1) 10Majavah: update cloud-vps bastion ip to bastion-eqiad1-03 (bullseye) [puppet] - 10https://gerrit.wikimedia.org/r/772827 [13:24:26] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:25:37] (03PS2) 10Jcrespo: Initial release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772824 (https://phabricator.wikimedia.org/T276445) [13:25:54] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2002-dev.codfw.wmnet [13:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:02] !log aborrero@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudgw2001-dev.codfw.wmnet [13:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:07] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.3 refs T300203 [13:26:10] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:12] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [13:26:38] PROBLEM - BFD status on cr4-ulsfo is 
CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:27:00] !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.3 refs T300203 (duration: 00m 52s) [13:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:52] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudmetrics1003.eqiad.wmnet [13:27:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:04] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:29:09] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:29:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:06] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:30:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:30:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:31:09] (03PS1) 10Ottomata: Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) [13:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:56] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (10cmooney) @valerio.bozzolan the affected users are direct Telecom Italia customers is that correct? 
It certainly wouldn't hurt if th... [13:32:38] (03CR) 10jerkins-bot: [V: 04-1] Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: 10Ottomata) [13:33:00] (03CR) 10MVernon: swift: deploy swift_ring_manager to one node per cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [13:33:58] !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2001-dev.codfw.wmnet [13:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:52] (03PS2) 10Ottomata: Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) [13:35:06] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:35:40] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudmetrics1003.eqiad.wmnet [13:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:50] promoting 1.39.0-wmf.3 to group 2 now [13:36:09] (03PS1) 10Jaime Nuche: all wikis to 1.39.0-wmf.3 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772830 [13:36:11] (03CR) 10Jaime Nuche: [C: 03+2] all wikis to 1.39.0-wmf.3 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772830 (owner: 10Jaime Nuche) [13:36:26] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudmetrics1004.eqiad.wmnet [13:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:49] (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.3 refs T300203 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772830 (owner: 10Jaime Nuche) [13:37:01] (03CR) 10jerkins-bot: [V: 04-1] Add 2 new alerts for 
data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: 10Ottomata) [13:37:36] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:38:54] (03PS3) 10Jcrespo: Initial release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772824 (https://phabricator.wikimedia.org/T276445) [13:39:55] !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.3 refs T300203 [13:40:05] _joe_: 13:39:27 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 347 host(s) [13:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:09] \o/ [13:40:10] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [13:40:15] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet [13:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [13:41:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance [13:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T298557)', diff saved to https://phabricator.wikimedia.org/P22960 and previous config saved to /var/cache/conftool/dbconfig/20220322-134148-marostegui.json 
[13:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:52] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [13:42:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:42:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:15] (03PS4) 10Jgiannelos: Remove unused wgKartographerDfltStyle after tegola roll out [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T298249) [13:43:18] PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:43:26] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:46] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:44:07] (03PS3) 10Ottomata: Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) [13:44:34] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudmetrics1004.eqiad.wmnet [13:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:43] (03PS1) 10Jcrespo: mediabackups: Install package and its dependencies through .deb [puppet] - 10https://gerrit.wikimedia.org/r/772831 (https://phabricator.wikimedia.org/T300020) [13:45:14] (03CR) 10jerkins-bot: [V: 04-1] mediabackups: Install package and its dependencies through .deb [puppet] - 10https://gerrit.wikimedia.org/r/772831 
(https://phabricator.wikimedia.org/T300020) (owner: 10Jcrespo) [13:46:11] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet [13:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:34] (03CR) 10jerkins-bot: [V: 04-1] Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: 10Ottomata) [13:47:57] (03PS2) 10Jcrespo: mediabackups: Install package and its dependencies through .deb [puppet] - 10https://gerrit.wikimedia.org/r/772831 (https://phabricator.wikimedia.org/T300020) [13:49:00] RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:49:26] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:49:49] (03PS3) 10Ssingh: P:icinga: add profile for performance tweaking [puppet] - 10https://gerrit.wikimedia.org/r/771610 (https://phabricator.wikimedia.org/T303593) [13:50:04] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:51:19] (03CR) 10Ssingh: P:icinga: add profile for performance tweaking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/771610 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh) [13:51:32] (03CR) 10Ssingh: "rebased and added bug #, no code change." 
[puppet] - 10https://gerrit.wikimedia.org/r/771610 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh) [13:52:07] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet [13:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:56] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:56:09] (03CR) 10Andrew Bogott: [C: 03+2] update cloud-vps bastion ip to bastion-eqiad1-03 (bullseye) [puppet] - 10https://gerrit.wikimedia.org/r/772827 (owner: 10Majavah) [13:57:01] (CirrusSearchHighOldGCFrequency) firing: (6) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:57:04] 10SRE, 10SRE-OnFire (FY2021/2022-Q3), 10Infrastructure-Foundations, 10SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (10Elitre) >>! In T202061#7774033, @CDanis wrote: > @lmata yeah, sorry, that's been on... 
[13:58:31] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet [13:58:38] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:58:40] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:59:06] (03CR) 10Elukey: [C: 03+1] hiera: add dummy tokens for ML staging k8s setup [labs/private] - 10https://gerrit.wikimedia.org/r/772430 (owner: 10Klausman) [14:01:08] (03CR) 10Jcrespo: [C: 04-1] "Not until a definitive (even if first iteration) package is ready and uploaded." [puppet] - 10https://gerrit.wikimedia.org/r/772831 (https://phabricator.wikimedia.org/T300020) (owner: 10Jcrespo) [14:01:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] update cloud-vps bastion ip to bastion-eqiad1-03 (bullseye) [puppet] - 10https://gerrit.wikimedia.org/r/772827 (owner: 10Majavah) [14:03:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298557)', diff saved to https://phabricator.wikimedia.org/P22961 and previous config saved to /var/cache/conftool/dbconfig/20220322-140331-marostegui.json [14:07:10] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:07:12] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:09:48] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) 
- https://phabricator.wikimedia.org/T245183 (10Krinkle) [14:10:00] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:10:31] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Krinkle) [14:11:03] 10SRE, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Krinkle) [14:11:08] (03CR) 10Andrew Bogott: [C: 03+1] openstack: networktests: discard even more hostkey checking stuff [puppet] - 10https://gerrit.wikimedia.org/r/772818 (https://phabricator.wikimedia.org/T304420) (owner: 10Arturo Borrero Gonzalez) [14:12:22] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:13:52] (03PS4) 10Ottomata: Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) [14:15:45] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] openstack: networktests: discard even more hostkey checking stuff [puppet] - 10https://gerrit.wikimedia.org/r/772818 (https://phabricator.wikimedia.org/T304420) (owner: 10Arturo Borrero Gonzalez) [14:16:14] (03CR) 10jerkins-bot: [V: 04-1] Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: 10Ottomata) [14:18:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P22962 and previous config saved to /var/cache/conftool/dbconfig/20220322-141836-marostegui.json [14:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:27] (03PS1) 10Ladsgroup: mariadb: Change wikiuser to wikiuser2022 
[puppet] - 10https://gerrit.wikimedia.org/r/772834 [14:19:31] (03PS5) 10Ottomata: Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) [14:19:54] (03PS4) 10Klausman: hiera: add dummy tokens for ML staging k8s setup [labs/private] - 10https://gerrit.wikimedia.org/r/772430 [14:19:59] (03CR) 10Klausman: [C: 03+2] hiera: add dummy tokens for ML staging k8s setup [labs/private] - 10https://gerrit.wikimedia.org/r/772430 (owner: 10Klausman) [14:21:50] (03CR) 10jerkins-bot: [V: 04-1] Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: 10Ottomata) [14:21:56] (03CR) 10Klausman: [V: 03+2 C: 03+2] hiera: add dummy tokens for ML staging k8s setup [labs/private] - 10https://gerrit.wikimedia.org/r/772430 (owner: 10Klausman) [14:22:02] (03CR) 10Marostegui: [C: 03+1] mariadb: Change wikiuser to wikiuser2022 [puppet] - 10https://gerrit.wikimedia.org/r/772834 (owner: 10Ladsgroup) [14:22:16] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:22:27] (03CR) 10Ladsgroup: [C: 03+2] mariadb: Change wikiuser to wikiuser2022 [puppet] - 10https://gerrit.wikimedia.org/r/772834 (owner: 10Ladsgroup) [14:23:01] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/771610 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh) [14:25:02] (03PS6) 10MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) [14:25:20] (03PS3) 10Elukey: Initial debianization of istio-cni [debs/istio] - 10https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) [14:26:07] (03PS6) 10Ottomata: Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) [14:27:05] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (10cmooney) Hmm ok. I can see in the traceroute it now makes it a few hops further: ` cmooney@re0.cr2-eqiad> traceroute wait 1 no-reso... [14:27:10] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:27:15] (03CR) 10Filippo Giunchedi: swift: deploy swift_ring_manager to one node per cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:27:58] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3871 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:28:19] (03CR) 10jerkins-bot: [V: 04-1] Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: 10Ottomata) [14:29:26] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:29:32] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (10cmooney) Hmm ok. I can see in the traceroute it now makes it a few hops further: ` cmooney@re0.cr2-eqiad> traceroute wait 1 no-reso... [14:30:13] 10SRE, 10ops-eqsin, 10DC-Ops: Q2(Need By: TBD) rack/setup/install new mr1-eqsin - https://phabricator.wikimedia.org/T294872 (10ayounsi) The SRX300 is ready to be put in production. Because the way it was staged, it will need a small config change (renumber irb.900 from 10.132.128.3 to 10.132.128.1) for devi... 
[14:33:38] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.03226 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:33:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P22963 and previous config saved to /var/cache/conftool/dbconfig/20220322-143341-marostegui.json [14:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:22] 10SRE, 10Wikimedia-Site-requests, 10Chinese-Sites: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (10Stang) [14:35:36] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:35:38] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:35:45] (03PS7) 10MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) [14:36:06] 10SRE, 10Wikimedia-Site-requests, 10Chinese-Sites: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (10Stang) Consensus reached hundreds of years ago, removing tag [14:37:59] (03PS1) 10Hashar: Merge tag 'v3.3.10' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/772838 (https://phabricator.wikimedia.org/T304226) [14:38:13] (03CR) 10Hashar: [C: 03+2] Merge tag 'v3.3.10' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/772838 (https://phabricator.wikimedia.org/T304226) (owner: 10Hashar) [14:38:22] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769942 
(https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:40:32] (03PS7) 10Ottomata: Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) [14:40:53] (03PS1) 10David Caro: wmcs.backy2: add link to the runbook for backup_vms [puppet] - 10https://gerrit.wikimedia.org/r/772839 (https://phabricator.wikimedia.org/T304408) [14:41:28] (03PS17) 10MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [14:44:47] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:45:36] (03CR) 10MVernon: swift: deploy swift_ring_manager to one node per cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:45:56] (03CR) 10Ssingh: [C: 03+1] site: Reimage cp1079 as cache::text_haproxy [puppet] - 10https://gerrit.wikimedia.org/r/772793 (https://phabricator.wikimedia.org/T290005) (owner: 10MMandere) [14:46:26] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:47:05] (03PS18) 10MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) [14:48:44] (03Merged) 10jenkins-bot: Merge tag 'v3.3.10' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/772838 (https://phabricator.wikimedia.org/T304226) (owner: 10Hashar) [14:48:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298557)', diff saved to https://phabricator.wikimedia.org/P22964 and previous config saved to /var/cache/conftool/dbconfig/20220322-144847-marostegui.json [14:48:48] !log 
marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [14:48:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [14:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:52] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [14:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298557)', diff saved to https://phabricator.wikimedia.org/P22965 and previous config saved to /var/cache/conftool/dbconfig/20220322-144855-marostegui.json [14:48:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:18] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:49:48] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:49:50] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:50:43] (03CR) 10Ssingh: [C: 03+2] P:icinga: add profile for performance tweaking [puppet] - 10https://gerrit.wikimedia.org/r/771610 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh) [14:53:20] (03CR) 10Filippo Giunchedi: [C: 03+1] Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: 10Ottomata) [14:53:23] (03PS1) 10Ayounsi: Setup new mr1-eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/772843 (https://phabricator.wikimedia.org/T294872) [14:53:51] (03CR) 10Ottomata: [C: 03+2] Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: 10Ottomata) [14:54:13] (03PS2) 10Ayounsi: Setup new mr1-eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/772843 (https://phabricator.wikimedia.org/T294872) [14:54:54] (03PS6) 10JMeybohm: Switch service type to ClusterIP in case Ingress is enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/770556 (https://phabricator.wikimedia.org/T290966) [14:57:26] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:57:42] (03PS1) 10Arturo Borrero Gonzalez: openstac: networktests: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/772844 [14:58:47] (03CR) 10Andrew Bogott: [C: 03+2] openstac: networktests: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/772844 (owner: 10Arturo Borrero Gonzalez) [14:59:25] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM modulo testing the CR chain in Pontoon" [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:59:33] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM modulo testing the CR chain in Pontoon" [puppet] - 10https://gerrit.wikimedia.org/r/769942 
(https://phabricator.wikimedia.org/T265117) (owner: 10MVernon) [14:59:47] (03PS11) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) [15:00:39] (03CR) 10Filippo Giunchedi: "I'll let Cole vote though since I'm not super familiar with the changes, idea LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey) [15:00:44] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:01:06] (03PS1) 10Hashar: Update Gerrit to v3.3.10 [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/772846 (https://phabricator.wikimedia.org/T304226) [15:01:40] (03PS12) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) [15:02:01] (CirrusSearchHighOldGCFrequency) firing: (6) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:02:19] (03CR) 10Hashar: [C: 03+2] "git fat works!" 
[software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/772846 (https://phabricator.wikimedia.org/T304226) (owner: 10Hashar) [15:02:42] (03Merged) 10jenkins-bot: Update Gerrit to v3.3.10 [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/772846 (https://phabricator.wikimedia.org/T304226) (owner: 10Hashar) [15:05:12] (03Abandoned) 10SBassett: admin: replace existing ssh key for sbassett [puppet] - 10https://gerrit.wikimedia.org/r/772410 (https://phabricator.wikimedia.org/T304319) (owner: 10SBassett) [15:06:22] !log hashar@deploy1002 Started deploy [gerrit/gerrit@967b0d7]: Gerrit to 3.3.10 on gerrit2001 T304226 [15:06:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:27] T304226: Gerrit security release 3.3.10 - https://phabricator.wikimedia.org/T304226 [15:06:35] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@967b0d7]: Gerrit to 3.3.10 on gerrit2001 T304226 (duration: 00m 12s) [15:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:39] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1003.eqiad.wmnet with OS bullseye [15:06:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:05] (03PS9) 10Btullis: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi) [15:08:52] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 
2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:10:41] !log Upgrading and starting Gerrit on gerrit2001 (replica) [15:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:05] jouncebot: now [15:13:05] No deployments scheduled for the next 0 hour(s) and 46 minute(s) [15:13:46] !log hashar@deploy1002 Started deploy [gerrit/gerrit@967b0d7]: Gerrit to 3.3.10 on gerrit1001 T304226 [15:13:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:50] T304226: Gerrit security release 3.3.10 - https://phabricator.wikimedia.org/T304226 [15:13:56] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@967b0d7]: Gerrit to 3.3.10 on gerrit1001 T304226 (duration: 00m 10s) [15:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:14] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:14:31] !log Stopping Gerrit for security update T304226 [15:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:04] !log Gerrit 3.3.10 up and running T304226 [15:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:30] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:21:22] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10Papaul) I asked @Cmjohnson to connect cloudvrit1024 to asw2-b4 yesterday for testing, the result was the same ` Failed to load ld... 
[15:21:46] (03CR) 10JMeybohm: [C: 03+1] Add dummy secrets for datahub deployment [labs/private] - 10https://gerrit.wikimedia.org/r/771563 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [15:22:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: After reboot', diff saved to https://phabricator.wikimedia.org/P22967 and previous config saved to /var/cache/conftool/dbconfig/20220322-152247-root.json [15:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:02] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:25:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance [15:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298557)', diff saved to https://phabricator.wikimedia.org/P22968 and previous config saved to /var/cache/conftool/dbconfig/20220322-152508-marostegui.json [15:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:15] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [15:26:08] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:26:34] (03PS1) 10Klausman: hiera: Add k8s dummy tokens for ML staging env [labs/private] - 10https://gerrit.wikimedia.org/r/772866 (https://phabricator.wikimedia.org/T302195) [15:29:42] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1003.eqiad.wmnet with reason: host reimage [15:29:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:54] (03CR) 
10Klausman: [C: 03+2] hiera: Add k8s dummy tokens for ML staging env [labs/private] - 10https://gerrit.wikimedia.org/r/772866 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [15:30:00] (03CR) 10Klausman: [V: 03+2 C: 03+2] hiera: Add k8s dummy tokens for ML staging env [labs/private] - 10https://gerrit.wikimedia.org/r/772866 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [15:30:04] (03CR) 10Btullis: karapace: add karapace role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi) [15:30:05] (03PS13) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) [15:30:44] (03CR) 10JMeybohm: [C: 03+1] Add a kubeconfig configuration for datahub [puppet] - 10https://gerrit.wikimedia.org/r/771407 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [15:30:46] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:32:01] (CirrusSearchHighOldGCFrequency) firing: (6) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:32:08] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:32:14] (03CR) 10JMeybohm: [C: 03+1] "Feel free to ping me when you're ready to merge/deploy this (if you feel like you want somebody around)." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/771409 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [15:33:13] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1003.eqiad.wmnet with reason: host reimage [15:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:12] (03PS10) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) [15:36:21] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:36:32] (03CR) 10Btullis: [C: 03+2] Add a kubeconfig configuration for datahub [puppet] - 10https://gerrit.wikimedia.org/r/771407 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [15:38:03] (03PS11) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) [15:38:15] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Ladsgroup) The code to do db switchover is https://github.com/wikimedia... 
[15:38:55] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:39:26] (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy secrets for datahub deployment [labs/private] - 10https://gerrit.wikimedia.org/r/771563 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [15:39:51] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:42:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr) [15:42:19] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.7419 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:42:29] (03PS14) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) [15:43:22] (03PS1) 10Cathal Mooney: Add new Analytics subnets to static Capirca net definitions [homer/public] - 10https://gerrit.wikimedia.org/r/772868 (https://phabricator.wikimedia.org/T299758) [15:43:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298557)', diff saved to https://phabricator.wikimedia.org/P22969 and previous config saved to /var/cache/conftool/dbconfig/20220322-154349-marostegui.json [15:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:54] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [15:45:02] 10SRE, 10SRE Observability, 10observability, 10Sustainability (Incident Followup), 10User-fgiunchedi: Most Icinga http checks ignore the URL parameter - 
https://phabricator.wikimedia.org/T304321 (10fgiunchedi) Thank you for the feedback! >>! In T304321#7795644, @Volans wrote: > I agree with this direct... [15:45:18] 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Marostegui) That's the main thing and what {T196366} also needs. The di... [15:46:10] (03PS1) 10Filippo Giunchedi: nagios_common: remove -C from check_http [puppet] - 10https://gerrit.wikimedia.org/r/772869 (https://phabricator.wikimedia.org/T304321) [15:47:01] (CirrusSearchHighOldGCFrequency) firing: (6) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:47:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10Cmjohnson) @Vgutierrez the new DIMM is here, please let me know when I can make the swap [15:48:06] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34482/console" [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi) [15:48:23] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:50:37] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:50:59] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi) [15:51:02] (03CR) 10Razzi: [V: 03+1 C: 03+2] karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi) [15:51:55] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:53:01] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3548 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:53:12] (03CR) 10Btullis: [C: 03+2] Add a namespace for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/771409 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis) [15:54:29] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:54:44] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1003.eqiad.wmnet with OS bullseye [15:54:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:05] !log btullis@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[15:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:56] (03PS1) 10Sergio Gimeno: beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) [15:57:01] (CirrusSearchHighOldGCFrequency) firing: (5) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:57:56] (03CR) 10jerkins-bot: [V: 04-1] beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno) [15:58:31] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:58:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P22970 and previous config saved to /var/cache/conftool/dbconfig/20220322-155854-marostegui.json [15:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:16] (03PS2) 10Sergio Gimeno: beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) [15:59:36] !log imported jvmquake 1.0.1 for stretch/buster (JDK8) and bullseye (JDK11) [15:59:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:04] jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T1600). 
[16:00:04] taavi: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:11] (03CR) 10jerkins-bot: [V: 04-1] beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno) [16:00:15] taavi: 👋 looking [16:00:46] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1003.eqiad.wmnet with OS bullseye [16:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:19] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:02:01] (CirrusSearchHighOldGCFrequency) firing: (5) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:02:05] o/ hey rzl [16:02:12] _joe_: if you're around, do you have a moment to look at https://gerrit.wikimedia.org/r/724049 for the puppet request window? 
I want to make sure your -1 is addressed [16:02:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10Cmjohnson) 05Open→03Resolved Received the DIMM and replaced it, resolving this task [16:02:43] taavi: looking in the meantime but I'll be a sec to chew through these regexes myself :) [16:03:00] (03PS3) 10Sergio Gimeno: beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) [16:03:21] I know my patch may not be the exact fit to this window per https://wikitech.wikimedia.org/wiki/Puppet_request_window#What_kind_of_patches_can_go_through_Puppet_request_windows? but I don't see any other way to push that patch forward, sorry :-/ [16:03:43] <_joe_> rzl: oof [16:03:49] (03CR) 10jerkins-bot: [V: 04-1] beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno) [16:03:53] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06452 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:04:03] <_joe_> rzl: not really time actually to re-vet that [16:04:37] <_joe_> taavi: sadly our team is down to 3 people and a bit thin on resources; if people need us to merge patches that are not immediate blockers they'll have to wait. 
[16:05:02] (03PS1) 10Klausman: labs: Add dummy keyfile for ML staging k8s in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/772871 (https://phabricator.wikimedia.org/T302195) [16:05:59] (03CR) 10Klausman: [V: 03+2 C: 03+2] labs: Add dummy keyfile for ML staging k8s in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/772871 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [16:07:02] _joe_: ack, thanks for checking -- taavi: sorry, I probably can't get this merged in the puppet window but I'll keep it on my radar, and give it a proper review as soon as time permits [16:07:09] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34485/console" [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) [16:07:09] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudnet1003.eqiad.wmnet with OS bullseye [16:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:26] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1003.eqiad.wmnet with OS bullseye [16:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:36] :/ fair enough, thanks anyways [16:07:54] <_joe_> taavi: I'll put that patch at the end of my current queue of smaller things I can do in the leftover time though [16:08:17] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:08:20] <3 [16:08:24] (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34486/console" [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman) 
[16:09:14] !log btullis@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:19] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:11:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10Cmjohnson) [16:11:48] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [16:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:25] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [16:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:46] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1003.eqiad.wmnet with reason: host reimage [16:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P22971 and previous config saved to /var/cache/conftool/dbconfig/20220322-161359-marostegui.json [16:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:11] (03CR) 10Klausman: [C: 03+1] Initial debianization of istio-cni [debs/istio] - 10https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [16:15:06] 10SRE, 10Analytics, 10Data-Engineering: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10jbond) p:05Triage→03Medium [16:15:35] 10SRE, 10Thumbor, 10Service-deployment-requests: New Service Request Wikimedia-Thumbor - https://phabricator.wikimedia.org/T304436 
(10WDoranWMF) [16:15:49] 10SRE, 10Thumbor, 10Service-deployment-requests: New Service Request Wikimedia-Thumbor - https://phabricator.wikimedia.org/T304436 (10WDoranWMF) [16:16:44] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [16:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:51] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1003.eqiad.wmnet with reason: host reimage [16:16:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:25] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1003.eqiad.wmnet with OS bullseye [16:17:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:42] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [16:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:06] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1003.eqiad.wmnet with OS bullseye [16:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:20] PROBLEM - Check systemd state on karapace1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:18:56] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1003.eqiad.wmnet with reason: host reimage [16:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:04] (03CR) 10Muehlenhoff: [C: 03+1] "Seems sane (to the extent possible :-), two nits inline." 
[debs/istio] - 10https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey) [16:19:40] 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (10BTullis) 05Open→03Resolved [16:19:50] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1158 - https://phabricator.wikimedia.org/T303910 (10Cmjohnson) 05Open→03Resolved The SSD has been replaced and is rebuilding. [16:20:13] 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (10BTullis) [16:22:33] (03PS2) 10Zabe: Stop writing to $wmfDatacenter(s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771469 (https://phabricator.wikimedia.org/T45956) [16:22:45] (03PS3) 10Zabe: Stop writing to wmf* constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956) [16:23:40] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1003.eqiad.wmnet with reason: host reimage [16:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:54] !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on karapace1001.eqiad.wmnet with reason: Setting up karapace for the first time [16:27:56] !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on karapace1001.eqiad.wmnet with reason: Setting up karapace for the first time [16:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298557)', diff saved to https://phabricator.wikimedia.org/P22972 and previous config saved to 
/var/cache/conftool/dbconfig/20220322-162904-marostegui.json [16:29:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [16:29:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance [16:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:29:09] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [16:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298557)', diff saved to https://phabricator.wikimedia.org/P22973 and previous config saved to /var/cache/conftool/dbconfig/20220322-162917-marostegui.json [16:29:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:04] RECOVERY - Check systemd state on karapace1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:30:23] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10BBlack) Arzhel and I discussed this a bit, and we're going add a few 
more countries manually for now before proceeding with the esams-resiliency... [16:30:49] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1003.eqiad.wmnet with OS bullseye [16:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:13] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10BBlack) [16:33:41] (03PS1) 10BBlack: map Portugal to drmrs [dns] - 10https://gerrit.wikimedia.org/r/772876 (https://phabricator.wikimedia.org/T304089) [16:35:14] !log T303548 start wikidatawiki reindexing on eqiad codfw and cloudelastic cirrus clusters [16:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:18] T303548: CirrusSearchIndexTooOld - https://phabricator.wikimedia.org/T303548 [16:39:04] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:39:10] (03PS1) 10Btullis: Reenable the sflow job [puppet] - 10https://gerrit.wikimedia.org/r/772877 (https://phabricator.wikimedia.org/T302263) [16:40:32] 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (10KFrancis) @jbond I am confirming the signed NDA. Please proceed with the access request. Thanks! 
[16:41:20] (03CR) 10Ayounsi: [C: 03+1] map Portugal to drmrs [dns] - 10https://gerrit.wikimedia.org/r/772876 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack) [16:42:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:06:11] (03PS1) 10Jcrespo: mediabackups: Add reference key for file decryption on recovery config [puppet] - 10https://gerrit.wikimedia.org/r/772885 (https://phabricator.wikimedia.org/T300020) [17:07:24] (03PS2) 10Jcrespo: mediabackups: Add reference to key for decryption on recovery config too [puppet] - 10https://gerrit.wikimedia.org/r/772885 (https://phabricator.wikimedia.org/T300020) [17:07:41] 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (10jbond) >>! In T302287#7797378, @KFrancis wrote: > @jbond I am confirming the signed NDA. Please proceed with the access request. Thanks! thanks :) @MarkAHershberger as a voluntee... 
[17:08:19] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09677 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [17:08:29] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Add reference to key for decryption on recovery config too [puppet] - 10https://gerrit.wikimedia.org/r/772885 (https://phabricator.wikimedia.org/T300020) (owner: 10Jcrespo) [17:09:19] jouncebot: nowandnext [17:09:20] No deployments scheduled for the next 0 hour(s) and 50 minute(s) [17:09:20] In 0 hour(s) and 50 minute(s): 🚂🧪Trainsperiment Week Deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T1800) [17:09:26] * taavi deploys a sec patch [17:10:34] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1004.eqiad.wmnet with reason: host reimage [17:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:51] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34487/console" [puppet] - 10https://gerrit.wikimedia.org/r/772877 (https://phabricator.wikimedia.org/T302263) (owner: 10Btullis) [17:14:19] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1004.eqiad.wmnet with reason: host reimage [17:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:12] !log deploy security patch for T304354 [17:15:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P22979 and previous config saved to /var/cache/conftool/dbconfig/20220322-171748-marostegui.json [17:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:18:20] (03PS7) 10Eigyan: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) [17:18:34] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ayounsi) So on the `Failed to load ldlinux.c32`: I got cloudvirt1024 to boot the debian installer using: ` install1003:~$ cat /etc/dhcp/automatio... [17:19:36] (03CR) 10Eigyan: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan) [17:20:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:20:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:35] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:21:36] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:24] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:07] !log trainsperiment (T300203): with 1.39.0-wmf.3 on all wikis, we're paused for a planned catchup window - nothing to do at the moment, we'll deploy 1.39.0-wmf.4 tomorrow (2022-03-23). 
[17:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:11] T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203 [17:25:17] 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10jbond) [17:25:24] 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10jbond) @Sgs the analytics users group is now deprecated. i believe you will need analytics-privatedata-users with kerberos access, @Ottomata should be able to both confirm and aprove this. P... [17:25:43] (03CR) 10Ayounsi: [C: 03+1] Add new Analytics subnets to static Capirca net definitions [homer/public] - 10https://gerrit.wikimedia.org/r/772868 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [17:25:47] (03PS4) 10Jcrespo: Initial release of mediabackups software [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772824 (https://phabricator.wikimedia.org/T276445) [17:26:01] (03CR) 10Ayounsi: [C: 03+1] Reenable the sflow job [puppet] - 10https://gerrit.wikimedia.org/r/772877 (https://phabricator.wikimedia.org/T302263) (owner: 10Btullis) [17:26:41] 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10jbond) p:05Triage→03Medium [17:32:22] 10SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (10jbond) @KFrancis yes please this still needs an NDA the previous ticket relates to signing L2 [17:32:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298557)', diff saved to https://phabricator.wikimedia.org/P22980 and previous config saved to /var/cache/conftool/dbconfig/20220322-173253-marostegui.json [17:32:55] !log marostegui@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [17:32:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [17:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:59] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [17:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22981 and previous config saved to /var/cache/conftool/dbconfig/20220322-173301-marostegui.json [17:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:44] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (10valerio.bozzolan) Maybe totally unrelated, but maybe yes: https://lists.wikimedia.org/hyperkitty/list/cloud@lists.wikimedia.org/thr... [17:33:54] (03CR) 10Cathal Mooney: [C: 03+2] Add new Analytics subnets to static Capirca net definitions [homer/public] - 10https://gerrit.wikimedia.org/r/772868 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [17:34:23] (03Merged) 10jenkins-bot: Add new Analytics subnets to static Capirca net definitions [homer/public] - 10https://gerrit.wikimedia.org/r/772868 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney) [17:43:13] 10SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (10KFrancis) @TheDJ Please send your personal email and mailing address to me at kfrancis@wikimedia.org and I'll put together the agreement. Thank you! 
[17:45:38] (03PS8) 10Jdlrobson: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan) [17:45:42] (03CR) 10Jdlrobson: [C: 03+1] [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan) [17:47:16] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1004.eqiad.wmnet with OS bullseye [17:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:51] !log dcausse@deploy1002 Started scap: (no justification provided) [17:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:53] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10Andrew) In case this is an additional data point: I just reimaged cloundnet1003 and cloudnet1004 without any pxe or image issues. [17:51:55] dcausse: ^^ hey, what's going on with that full scap? [17:52:20] taavi: just wanted to /srv/deployment/wikimedia/discovery/analytics [17:52:48] umh that's not going to do it, `scap sync-file` and `scap sync-world` are mediawiki specific [17:53:09] you're likely looking for `scap deploy` [17:53:35] yes my bad totally messed that up [17:54:05] cancelled it (it just Finished l10n-update) [17:55:43] 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (10RhinosF1) That wasn't sent until way after your issues started nor were fixed. 
[17:55:54] !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@c4d0736]: (no justification provided) [17:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:50] (03PS1) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 [18:00:02] (03PS2) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 [18:00:05] dancy, hashar, brennen, dduvall, jeena, and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for 🚂🧪Trainsperiment Week Deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T1800). [18:01:10] !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@c4d0736]: (no justification provided) (duration: 05m 16s) [18:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:29] (03CR) 10ArielGlenn: [C: 03+1] "Sounds great to me." [puppet] - 10https://gerrit.wikimedia.org/r/772335 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [18:04:01] (03PS5) 10Jcrespo: Initial release of mediabackups software [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772824 (https://phabricator.wikimedia.org/T276445) [18:04:11] (03PS3) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 [18:05:24] (03PS4) 10Sergio Gimeno: beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) [18:05:26] (03PS6) 10Jcrespo: Initial release of mediabackups software [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772824 (https://phabricator.wikimedia.org/T276445) [18:06:00] (03CR) 10Ottomata: [C: 03+1] Reenable the sflow job [puppet] - 10https://gerrit.wikimedia.org/r/772877 (https://phabricator.wikimedia.org/T302263) (owner: 10Btullis) [18:06:59] (03PS4) 
10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 [18:08:25] (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Initial release of mediabackups software [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772824 (https://phabricator.wikimedia.org/T276445) (owner: 10Jcrespo) [18:09:23] 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10Ottomata) Approved. But, @sgs can you edit the description and describe a little more what access you need? See https://wikitech.wikimedia.org/wiki/Analytics/Data_access#What_access_should_I... [18:13:54] (03PS3) 10Jcrespo: mediabackups: Install package and its dependencies through .deb [puppet] - 10https://gerrit.wikimedia.org/r/772831 (https://phabricator.wikimedia.org/T300020) [18:14:44] (03CR) 10Jcrespo: [C: 03+2] mediabackups: Install package and its dependencies through .deb [puppet] - 10https://gerrit.wikimedia.org/r/772831 (https://phabricator.wikimedia.org/T300020) (owner: 10Jcrespo) [18:19:03] (03PS1) 10Razzi: karapace: use karapace included python; set hostname [puppet] - 10https://gerrit.wikimedia.org/r/772912 [18:19:57] 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10Sgs) [18:20:48] 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10Sgs) >>! In T304361#7797589, @jbond wrote: > @Sgs the analytics users group is now deprecated. i believe you will need analytics-privatedata-users with kerberos access, @Ottomata should be ab... 
[18:21:17] (03PS4) 10Giuseppe Lavagetto: [WiP] Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 [18:22:35] (03PS5) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 [18:25:20] (03PS2) 10Razzi: karapace: use karapace included python; set hostname [puppet] - 10https://gerrit.wikimedia.org/r/772912 (https://phabricator.wikimedia.org/T301565) [18:25:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22982 and previous config saved to /var/cache/conftool/dbconfig/20220322-182531-marostegui.json [18:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:37] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [18:26:14] (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34492/console" [puppet] - 10https://gerrit.wikimedia.org/r/772912 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi) [18:28:05] 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo) [18:28:46] 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Create a first release of the media backups automation tools - https://phabricator.wikimedia.org/T276445 (10jcrespo) 05Open→03Resolved Done: * https://phabricator.wikimedia.org/diffusion/OSMB/history/master/;v0.1 * https://github.com/wikimed... 
[18:29:01] (03PS6) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 [18:30:03] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34493/console" [puppet] - 10https://gerrit.wikimedia.org/r/772909 (owner: 10Elukey) [18:30:52] !log remove old karapace1001 known hosts following reimage: `razzi@puppetmaster1001:~$ ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "karapace1001.eqiad.wmnet"` [18:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:02] (03PS7) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 [18:32:46] razzi: what are you trying to do? We shouldn't change that file on puppet masters [18:33:48] it gets populated after each puppet run, but in general it doesn't need to be touched [18:33:52] elukey: I reimaged a virtual machine following https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#Virtual_hosts, and it was giving a warning since it had the old hostname [18:34:54] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:35:26] razzi: what warning are you talking about? 
(to understand what is the problem) [18:35:39] you need to clean the old host tls certificate first [18:35:58] and sign the new one after using install console [18:36:16] https://wikitech.wikimedia.org/wiki/Ganeti#Reinstall_/_Reimage_a_VM [18:36:32] and https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Manual_installation [18:37:32] elukey: here's the paste of where I got the warning, it was upon the /usr/local/bin/install_console https://phabricator.wikimedia.org/P22983 [18:40:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P22984 and previous config saved to /var/cache/conftool/dbconfig/20220322-184037-marostegui.json [18:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:18] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:42:18] razzi: it is not a big problem if you get that warning, the root console is available. But you need to clean the old puppet host cert first, then install console + run puppet (that generates a new csr for the vm to the puppetmaster), sign on puppetmaster to accept the new key and finally you'll be able to run puppet on install console [18:43:18] eventually the new fingerprint will be available for the new node [18:44:30] (03PS8) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 [18:47:31] (03CR) 10Herron: "Shall we give this a try?" 
[puppet] - 10https://gerrit.wikimedia.org/r/769749 (https://phabricator.wikimedia.org/T300119) (owner: 10Herron) [18:49:04] (03PS9) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 [18:50:04] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34496/console" [puppet] - 10https://gerrit.wikimedia.org/r/772909 (owner: 10Elukey) [18:51:26] (03Abandoned) 10Jcrespo: WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383 (owner: 10Jcrespo) [18:51:57] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 (10ssingh) >>! In T303593#7796742, @gerritbot wrote: > Change 771610 **merged** by Ssingh: > %%%[operations/puppet@production] P:icinga: add prof... [18:52:58] PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:53:07] (03Abandoned) 10Jcrespo: [QIP] Add second prototype to handle File metadata directly from the db [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/637769 (https://phabricator.wikimedia.org/T264189) (owner: 10Jcrespo) [18:53:12] (03PS10) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 [18:54:02] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34497/console" [puppet] - 10https://gerrit.wikimedia.org/r/772909 (owner: 10Elukey) [18:54:26] (03Abandoned) 10Jcrespo: Add 4 line naive prototype for downloading all images 
from a wiki [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/636007 (https://phabricator.wikimedia.org/T264189) (owner: 10Jcrespo) [18:55:21] (03Abandoned) 10Jcrespo: POC: Testing interfacing with swift to gather metadata [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/638665 (https://phabricator.wikimedia.org/T264189) (owner: 10Jcrespo) [18:55:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P22985 and previous config saved to /var/cache/conftool/dbconfig/20220322-185542-marostegui.json [18:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:10] elukey: ok yeah I did all those steps; I didn't realize the new fingerprint would update automatically. The new server is online [18:57:10] I added my understanding to https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#Virtual_hosts, let me know how that looks [18:59:40] razzi: it is a bit generic, I'd suggest to review all the bits involved and to add a more precise explanation (nothing big) [19:00:52] (I mean to familiarize with how the host fingerprints are populated etc..) [19:01:08] RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:01:24] (03PS5) 10Giuseppe Lavagetto: Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471) [19:01:57] (03CR) 10Razzi: [V: 03+1] "Small changes that should fix karapace1001." 
[puppet] - 10https://gerrit.wikimedia.org/r/772912 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi) [19:02:53] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye [19:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:04] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bu... [19:04:17] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bullseye [19:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:27] (03PS11) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 [19:04:29] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1025.eqiad.wmnet with O... 
[19:06:42] (03CR) 10MewOphaswongse: [C: 03+1] beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno) [19:07:32] (03PS12) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 [19:10:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22986 and previous config saved to /var/cache/conftool/dbconfig/20220322-191049-marostegui.json [19:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:56] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [19:13:22] 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10Ottomata) +1 sounds good. Approved! 
[19:14:11] (03CR) 10Ottomata: [C: 03+1] karapace: use karapace included python; set hostname [puppet] - 10https://gerrit.wikimedia.org/r/772912 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi) [19:14:45] (03PS13) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909 [19:20:01] (03CR) 10Razzi: [V: 03+1 C: 03+2] karapace: use karapace included python; set hostname [puppet] - 10https://gerrit.wikimedia.org/r/772912 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi) [19:23:15] (03PS1) 10RLazarus: slo: Move most of the text panel content to a description field, so it can be overridden [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/772923 (https://phabricator.wikimedia.org/T302842) [19:24:52] (03CR) 10Kosta Harlan: [C: 03+1] beta, testwiki: enable testing of topics match mode for GLAM events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno) [19:31:55] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 (10herron) >>! In T303593#7797822, @ssingh wrote: > - as per the commit message above, "We first start by setting interface::rps to the alerting_... [19:32:07] 10SRE, 10Data-Engineering: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (10Ottomata) [19:34:09] 10SRE, 10Data-Engineering: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (10Ottomata) @MoritzMuehlenhoff advice? Can I import [[ https://docs.conda.io/projects/conda/en/latest/user-guide/install/rpm-debian.html | conda's official .deb ]] into our apt repo, or would you prefer... 
[19:36:34] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:45:04] PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:46:36] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4839 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [19:47:45] That keeps going off [19:50:06] _joe_: irc says you've been active a few minutes ago, is ^ worth a task? That's 3rd time I can remember it going off today [19:59:33] greetings all! [20:00:05] RoanKattouw and Urbanecm: Dear deployers, time to do the UTC late backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T2000). [20:00:05] eigyan, jandrewniak, and mewoph: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:01:10] 👋 [20:01:16] ✔️ [20:01:38] 👋 [20:05:41] Is there anyone around to do the deploys? Or should we do it ourselves? 
[20:06:24] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3871 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:07:59] eigyan, mewoph, I can do the deploy if RoanKattouw and Urbanecm are not around [20:08:24] I'm here if you want me to do it [20:08:29] I am, but i didn't see the ping [20:08:30] Sorry, was just getting back from lunch [20:08:49] jan_drewniak: if you're comfortable deploying, feel free to, otherwise me or Roan can do it :) [20:09:02] hello all, I am happy with any decision made [20:09:16] I am prepared to watch with vigor :) [20:10:04] I have only attended one deploy training so far with many more to come :) [20:10:05] urbanecm: I'm comfortable copy & pasting some bash commands :P but since you guys do this every day, I'll leave it to the pros :) [20:10:18] Okay :)) [20:10:19] :) [20:10:25] In that case, I can deploy today [20:10:39] urbanecm spoke from a pro! [20:10:52] ^spoken [20:13:08] 10SRE, 10Commons, 10MediaWiki-File-management, 10RESTBase-API, and 4 others: RFC: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847 (10LGoto) [20:14:17] eigyan: hello, can you please clarify why is hewiki => true removed from wmgUseQuickSurveys at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/772433? 
[20:14:39] Yes I can urbanecm [20:15:20] Per J Robson's code review that was a redundant piece of code, it is mentioned in the patch [20:15:25] to be removed [20:15:33] per his suggestion [20:16:13] it appears the wmgUseQuickSurveys value is set higher upstream [20:16:23] ah, ok, makes sense [20:16:28] (03PS9) 10Urbanecm: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan) [20:16:32] (03CR) 10Urbanecm: [C: 03+2] [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan) [20:16:43] cool urbanecm [20:17:01] eigyan: since it is a beta-only patch, it will be deployed to beta automatically within ~30 minutes (if not, feel free to ping me and I can investigate) [20:17:10] PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:17:22] excellent, thanks urbanecm [20:17:25] (03Merged) 10jenkins-bot: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan) [20:18:31] (03PS2) 10Urbanecm: Enable EventGate logging for WikipediaPortal schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772507 (https://phabricator.wikimedia.org/T271163) (owner: 10Jdrewniak) [20:18:47] jan_drewniak: your patch is next :). Will you be able to test it at a debug srv? [20:19:36] urbanecm: I don't think so, it's enabling an event-logging schema, so I'll test it by sending events after it's deployed. 
[20:20:00] you can test on a debug server [20:20:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:08] but it is also pretty safe to just deploy [20:20:09] since it is new [20:20:16] okay [20:20:30] i should probably add docs in event platform on wikitech on how to do that! :) [20:20:33] in that case I'll just sync it and let jan_drewniak test it later on [20:20:35] ya [20:20:36] ottomata: would be great :) [20:20:44] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.04839 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [20:20:48] (03CR) 10Urbanecm: [C: 03+2] Enable EventGate logging for WikipediaPortal schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772507 (https://phabricator.wikimedia.org/T271163) (owner: 10Jdrewniak) [20:20:53] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:20:54] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:37] (03Merged) 10jenkins-bot: Enable EventGate logging for WikipediaPortal schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772507 (https://phabricator.wikimedia.org/T271163) (owner: 10Jdrewniak) [20:21:42] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:23:15] (03PS5) 10Urbanecm: beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 
(https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno) [20:23:19] (03CR) 10Urbanecm: [C: 03+2] beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno) [20:24:03] (03Merged) 10jenkins-bot: beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno) [20:24:08] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 17caf0359b99b69c0b3e0d7a5fa2f5c7fb7464ef: Enable EventGate logging for WikipediaPortal schema (T271163) (duration: 01m 54s) [20:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:15] T271163: TranslationRecommendation* Schemas Event Platform Migration - https://phabricator.wikimedia.org/T271163 [20:24:17] jan_drewniak: should be live! [20:24:31] mewoph: your patch is at mwdebug1001. can you have a look? [20:24:48] checking now [20:25:02] urbanecm: great! thanks [20:25:09] happy to help :) [20:26:19] and ottomata: just tested an event, getting 201 so I think that's good :) [20:26:45] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:14] urbanecm: lgtm thanks! [20:28:22] mewoph: syncing, thanks for checking [20:28:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Jclark-ctr) cloudstore1010 B7 U41 port12 cableid #5014 cloudstore1011 C4 U1 port23. 
cableid #20220273 [20:29:00] mewoph: fyi, the beta part will be deployed automatically within ~30 minutes (if not, please let me know and I can investigate) [20:29:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Jclark-ctr) [20:29:47] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ce18d4eeb255349e27163d5e5472fbe21c320322: testwiki: enable testing of topics match mode for GLAM events (T301825) (duration: 01m 06s) [20:29:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson [20:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:52] T301825: Account creation: add toggle to enable AND selection of interest topics - https://phabricator.wikimedia.org/T301825 [20:29:56] mewoph: and the testwiki part is live now [20:30:05] anyone anything else to deploy? [20:30:10] jan_drewniak: that's good! [20:30:11] let's see! [20:31:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:31:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:09] oh jan_drewniak the WP code is not out yet, right? this was just the config change? [20:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:40] ottomata: that's true, just testing it by sending the event from my local. 
[20:32:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:06] PROBLEM - Juniper virtual chassis ports on asw2-b-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [20:32:34] With the expected payload that I have in the portals patch. I'll deploy the portal change tomorrow though. [20:32:36] !log UTC late backport window done [20:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:41] jan_drewniak: awesome, but that's good! [20:33:11] once the code is out and looking good, i can finalize the migration process in the backend [20:33:14] :) [20:34:08] ottomata: cool. the event from my local won't show up because it's from a non-wikimedia domain, right? [20:34:27] hmmmm, it won't show up in the event table, iirc, but it is in kafka [20:34:41] you can consume from kafka and grep something for your event [20:34:46] then produce and see [20:35:27] jan_drewniak: do you have access to a stat box? [20:36:01] ottomata: honestly I haven't looked at that stuff in ages, I probably don't even have access right now. [20:36:05] okay [20:36:16] i'll grep for you, what's the event you are posting? 
[20:37:01] jan_drewniak: you're still in the access group though, so I think you should be able to ssh to stat1004.eqiad.wmnet, for instance [20:37:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:17] jan_drewniak: if you can ssh to stat1004 i'll give you a command to grep [20:37:25] it'll be like https://wikitech.wikimedia.org/wiki/Kafka#Consume [20:37:46] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye [20:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS b... [20:37:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:37:59] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:47] ottomata: I'm sending a request like this [20:38:49] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:04] https://www.irccloud.com/pastebin/B0GyjMg8/ [20:39:37] ok gonna grep for c5860eb99af6d7d9 [20:39:43] that's via postman (I like my guis). [20:40:10] ok jan_drewniak post again plz [20:40:13] i'm grepping :) [20:40:58] perfect jan_drewniak i see it! 
[20:41:14] 10SRE, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10LGoto) 05Open→03Resolved a:03LGoto [20:41:23] 10SRE, 10Wikidata, 10serviceops, 10wdwb-tech: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10LGoto) [20:42:12] and thanks for reminding me urbanecm, I do still have access to the stats boxes (like stat1004.eqiad.wmnet) [20:42:22] no problem :) [20:42:36] fwiw, this is the command I ran: [20:42:52] kafkacat -C -u -b kafka-jumbo1001.eqiad.wmnet:9092 -t eventlogging_WikipediaPortal | grep --line-buffered c5860eb99af6d7d9 | jq . [20:43:44] ottomata: thanks! I see it too [20:44:01] nice [20:45:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10Andrew) @Jclark-ctr just swapped the network cables and now I see: ` Lifecycle Controller: Done No PXE-capable device available.... 
[20:45:18] I have verified my changes, thanks urbanecm [20:45:24] happy to help [20:58:28] (03PS2) 10RLazarus: slo: Move most of the text panel content to a description field, so it can be overridden [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/772923 (https://phabricator.wikimedia.org/T302842) [21:05:58] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye [21:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:04] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bulls... [21:12:04] RECOVERY - ElasticSearch setting check - 9600 on elastic1083 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:04] RECOVERY - ElasticSearch setting check - 9400 on elastic1076 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:04] RECOVERY - ElasticSearch setting check - 9600 on elastic1073 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:04] RECOVERY - ElasticSearch setting check - 9600 on elastic1075 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:04] RECOVERY - ElasticSearch setting check - 9400 on elastic1068 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [21:12:05] RECOVERY - ElasticSearch setting check - 9400 on elastic1057 is OK: OK - All good! 
https://wikitech.wikimedia.org/wiki/Search%23Administration [21:15:17] (03PS1) 10Ryan Kemper: Revert "elastic: fix cirrus settings check false negative" [puppet] - 10https://gerrit.wikimedia.org/r/772893 [21:15:40] (03PS2) 10Ryan Kemper: Revert "elastic: fix cirrus settings check false negative" [puppet] - 10https://gerrit.wikimedia.org/r/772893 (https://phabricator.wikimedia.org/T301511) [21:16:17] (03CR) 10jerkins-bot: [V: 04-1] Revert "elastic: fix cirrus settings check false negative" [puppet] - 10https://gerrit.wikimedia.org/r/772893 (https://phabricator.wikimedia.org/T301511) (owner: 10Ryan Kemper) [21:17:59] (03PS3) 10Ryan Kemper: elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/772893 (https://phabricator.wikimedia.org/T301511) [21:18:25] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1025.eqiad.wmnet with OS bullseye [21:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:35] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bu... [21:18:58] RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:19:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10Papaul) @Jclark-ctr swapped the cable and now the server NIC 1 is connected to the right switch port ` papaul@cloudsw2-d5-eqiad> ..... 
[21:20:22] (03CR) 10Bking: [C: 03+1] elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/772893 (https://phabricator.wikimedia.org/T301511) (owner: 10Ryan Kemper) [21:21:56] (03PS4) 10Ryan Kemper: elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/772893 (https://phabricator.wikimedia.org/T301511) [21:21:58] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bullseye [21:22:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:08] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye [21:22:33] (03PS5) 10Ryan Kemper: elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/772893 (https://phabricator.wikimedia.org/T301511) [21:23:36] (03CR) 10Ryan Kemper: [C: 03+2] elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/772893 (https://phabricator.wikimedia.org/T301511) (owner: 10Ryan Kemper) [21:26:14] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:29:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T300775)', diff saved to https://phabricator.wikimedia.org/P22989 and previous config saved to /var/cache/conftool/dbconfig/20220322-212939-marostegui.json [21:29:41] (03PS1) 10Andrew Bogott: Changes to reuse-labvirt.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/772932 [21:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:46] T300775: Add tl_target_id column to 
templatelinks - https://phabricator.wikimedia.org/T300775 [21:31:09] (03CR) 10Andrew Bogott: [C: 03+2] Changes to reuse-labvirt.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/772932 (owner: 10Andrew Bogott) [21:31:15] (03PS1) 10Ryan Kemper: Revert "Revert "elastic: fix cirrus settings check false negative"" [puppet] - 10https://gerrit.wikimedia.org/r/772894 [21:32:39] (03PS2) 10Ryan Kemper: elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/772894 (https://phabricator.wikimedia.org/T301511) [21:33:24] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/772894 (https://phabricator.wikimedia.org/T301511) (owner: 10Ryan Kemper) [21:33:56] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1026.eqiad.wmnet with OS bullseye [21:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:08] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1026.eqiad.wmnet with O... 
[21:35:27] !log T301511 Fixed elastic* eqiad cross-cluster search settings (see https://phabricator.wikimedia.org/T301511#7798267) to resolve the `ElasticSearch setting check` alerts in eqiad [21:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:35:31] T301511: Address false negatives in Elasticsearch cross-cluster monitoring checks - https://phabricator.wikimedia.org/T301511 [21:38:59] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10Papaul) Downgrade NIC firmware on cloudvrit1025 and cloudvirt1026 from 22.00.07.60 to 21.60.22.11 fixed the `` Failed to load ldlinux.c32'' ` is... [21:39:32] !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1026.eqiad.wmnet with OS bullseye [21:39:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:43] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bu... [21:40:35] FYI @cjming and I are running some database maintenance scripts so if you see any slight changes in https://grafana.wikimedia.org/d/GpL5R8CGz/mysql-query-rate?orgId=1&from=now-15m&to=now&viewPanel=14 that's to be expected. 
[21:44:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P22990 and previous config saved to /var/cache/conftool/dbconfig/20220322-214445-marostegui.json [21:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:09] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1025.eqiad.wmnet with OS bullseye [21:45:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:17] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye executed... [21:46:23] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1026.eqiad.wmnet with OS bullseye [21:46:25] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bullseye [21:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:31] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye [21:46:36] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye [21:58:45] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 
cloudvirt1026.eqiad.wmnet with reason: host reimage [21:58:46] (PS1) Daniel Kinzler: Set MW_USE_CONFIG_SCHEMA constant of the file use-config-schema exists. [mediawiki-config] - https://gerrit.wikimedia.org/r/772937 (https://phabricator.wikimedia.org/T304460) [21:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:15] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1025.eqiad.wmnet with reason: host reimage [21:59:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:29] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye [21:59:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:41] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with O...
[21:59:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P22991 and previous config saved to /var/cache/conftool/dbconfig/20220322-215950-marostegui.json [21:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:23] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1026.eqiad.wmnet with reason: host reimage [22:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:34] (PS1) RLazarus: envoy: Remove v2 config API support [puppet] - https://gerrit.wikimedia.org/r/772938 (https://phabricator.wikimedia.org/T303770) [22:03:20] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (8) node(s) change every puppet run: build2001, cloudcontrol1003, cloudcontrol1004, cloudvirt1025, cp1085, deploy1002, deploy2002, ms-be1068 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [22:04:15] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1025.eqiad.wmnet with reason: host reimage [22:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:10] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 66 probes of 676 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:09:20] PROBLEM - ElasticSearch setting check - 9400 on elastic2047 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2048.codfw.wmnet:9700] does not match [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2049.codfw.wmnet:9700] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:09:20]
PROBLEM - ElasticSearch setting check - 9600 on elastic2049 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:09:20] PROBLEM - ElasticSearch setting check - 9200 on elastic2025 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:09:20] PROBLEM - ElasticSearch setting check - 9200 on elastic2042 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:09:21] PROBLEM - ElasticSearch setting check - 9400 on elastic2042 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2048.codfw.wmnet:9700] does not match [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2049.codfw.wmnet:9700] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:09:21] PROBLEM - ElasticSearch setting check - 9200 on elastic2031 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:09:21] PROBLEM - ElasticSearch setting check - 9600 on elastic2027 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, 
elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:09:22] PROBLEM - ElasticSearch setting check - 9600 on elastic2029 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration [22:09:23] !log T301511 Forcing recheck of codfw cirrus setting check [22:09:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:09:27] T301511: Address false negatives in Elasticsearch cross-cluster monitoring checks - https://phabricator.wikimedia.org/T301511 [22:10:02] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (8) node(s) change every puppet run: build2001, cloudcontrol1003, cloudcontrol1004, cloudvirt1025, cp1085, deploy1002, deploy2002, ms-be1068 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [22:11:27] (CR) RLazarus: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34501/console" [puppet] - https://gerrit.wikimedia.org/r/772938 (https://phabricator.wikimedia.org/T303770) (owner: RLazarus) [22:11:36] ACKNOWLEDGEMENT - ElasticSearch setting check - 9200 on elastic2025 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:11:36] ACKNOWLEDGEMENT - ElasticSearch setting check - 9600 on elastic2027 is
CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:11:36] ACKNOWLEDGEMENT - ElasticSearch setting check - 9600 on elastic2029 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:11:36] ACKNOWLEDGEMENT - ElasticSearch setting check - 9200 on elastic2031 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:11:37] ACKNOWLEDGEMENT - ElasticSearch setting check - 9200 on elastic2042 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:11:37] ACKNOWLEDGEMENT - ElasticSearch setting check - 9400 on elastic2042 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2048.codfw.wmnet:9700] does not match [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2049.codfw.wmnet:9700] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 
https://wikitech.wikimedia.org/wiki/Search%23Administration [22:11:37] ACKNOWLEDGEMENT - ElasticSearch setting check - 9400 on elastic2047 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2048.codfw.wmnet:9700] does not match [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2049.codfw.wmnet:9700] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:11:38] ACKNOWLEDGEMENT - ElasticSearch setting check - 9600 on elastic2049 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:11:43] (sorry for the noise) [22:13:37] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 59 probes of 676 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:14:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T300775)', diff saved to https://phabricator.wikimedia.org/P22992 and previous config saved to /var/cache/conftool/dbconfig/20220322-221455-marostegui.json [22:14:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [22:14:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance [22:14:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:00] T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775 [22:15:03] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T300775)', diff saved to https://phabricator.wikimedia.org/P22993 and previous config saved to /var/cache/conftool/dbconfig/20220322-221503-marostegui.json [22:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:56] !log T301511 Mutated cirrus codfw cluster settings to what [I think] they should be, see https://phabricator.wikimedia.org/T301511#7798415; forcing re-check [22:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:02] T301511: Address false negatives in Elasticsearch cross-cluster monitoring checks - https://phabricator.wikimedia.org/T301511 [22:21:52] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1026.eqiad.wmnet with OS bullseye [22:21:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:01] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye completed... [22:22:04] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (MoritzMuehlenhoff) >>! In T303776#7798337, @Papaul wrote: > Downgrade NIC firmware on cloudvirt1025 and cloudvirt1026 from 22.00.07.60 to 21.60.22... [22:22:24] RECOVERY - ElasticSearch setting check - 9600 on elastic2049 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [22:22:24] RECOVERY - ElasticSearch setting check - 9400 on elastic2047 is OK: OK - All good!
https://wikitech.wikimedia.org/wiki/Search%23Administration [22:22:24] RECOVERY - ElasticSearch setting check - 9200 on elastic2025 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [22:22:24] RECOVERY - ElasticSearch setting check - 9200 on elastic2042 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [22:22:24] RECOVERY - ElasticSearch setting check - 9400 on elastic2042 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [22:22:25] RECOVERY - ElasticSearch setting check - 9200 on elastic2031 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [22:22:25] RECOVERY - ElasticSearch setting check - 9600 on elastic2029 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [22:22:26] RECOVERY - ElasticSearch setting check - 9600 on elastic2027 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration [22:22:32] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1025.eqiad.wmnet with OS bullseye [22:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:40] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye completed...
[22:24:35] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye [22:24:37] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [22:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:42] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye [22:24:43] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bullseye [22:24:44] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye [22:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:48] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye [22:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:24:51] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bullseye executed...
[22:24:58] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed... [22:25:29] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [22:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:25:36] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye [22:26:08] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye [22:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:16] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bullseye [22:26:26] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:27:14] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (Andrew) Open→Resolved These hosts are now reimaged and running VMs. Thanks for all the attention everyone!
[22:27:41] !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye [22:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:54] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bu... [22:34:00] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1024.eqiad.wmnet DHCP problems - https://phabricator.wikimedia.org/T303773 (Andrew) Last run: ` CLIENT MAC ADDR: B0 26 28 29 5D F0 GUID: 4C4C4544-005A-5910-805A-C4C04F515032 CLIENT IP: 10.64.20.43 MASK: 255.255.255.0 DHCP IP:... [22:41:30] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye [22:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:34] !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1024.eqiad.wmnet with OS bullseye [22:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:38] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bullseye executed... [22:41:41] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed...
[22:46:57] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [22:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:06] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye [22:50:32] RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:04:00] (PS2) Esanders: Disable autotopicsub user option by default [mediawiki-config] - https://gerrit.wikimedia.org/r/771872 (https://phabricator.wikimedia.org/T297966) [23:11:05] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye [23:11:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:11:13] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed... [23:23:21] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye [23:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:33] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with O...
[23:35:25] !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1024.eqiad.wmnet with reason: host reimage [23:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:39:03] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1024.eqiad.wmnet with reason: host reimage [23:39:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:53:25] SRE, SRE-Access-Requests, Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (thcipriani) >>! In T303398#7796390, @jbond wrote: > @thcipriani are you able to approve @TThoabala membership of the deployment group Approved! [23:56:32] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1024.eqiad.wmnet with OS bullseye [23:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:42] SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bu... [23:59:40] PROBLEM - ensure kvm processes are running on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting