[00:02:04] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:03:32] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 70 probes of 667 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:05:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire   - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[00:07:04] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:10:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire   - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[00:10:26] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 62 probes of 667 (alerts on 65) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[00:12:38] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:13:12] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:15:56] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:19:56] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:20:05] <wikibugs>	 SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus) As in T300324#7752134, I've rolled out all the k8s services where Envoy version was the only diff. We're now up to 1.18 everywhere, except for k8s servi...
[00:20:15] <wikibugs>	 SRE, Traffic, envoy, serviceops, Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (RLazarus)
[00:22:10] <icinga-wm>	 PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:24:50] <icinga-wm>	 RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[00:26:12] <icinga-wm>	 PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service,rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:26:26] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[00:30:50] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:24] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:31:58] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:44] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:35:06] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[00:35:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:17] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bu...
[00:39:40] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:40:28] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:45:06] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:48:22] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:51:06] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:52:52] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:57:04] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:58:48] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:59:20] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:00:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire   - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:00:33] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire   - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[01:02:08] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:04:24] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:05:30] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:10:12] <wikibugs>	 SRE, SRE-swift-storage, Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (Aklapper) Stalled→Open > I'm stalling the task since it'll likely be resolvable once we've decom'd all the old swift backends that still use old...
[01:18:32] <icinga-wm>	 RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:25:32] <wikibugs>	 (PS2) SBassett: Revert "Revert "wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia"" [mediawiki-config] - https://gerrit.wikimedia.org/r/772466 (https://phabricator.wikimedia.org/T300978) (owner: Stang)
[01:26:34] <icinga-wm>	 RECOVERY - Check systemd state on dumpsdata1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:27:06] <icinga-wm>	 PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service,rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:29:18] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:32:40] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:35:34] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:35:35] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye
[01:35:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:42] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudvirt1047.eqiad.wmnet with O...
[01:36:06] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:37:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:41:52] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:42:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:44:10] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:47:06] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:47:38] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:54:54] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:57:42] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:00:05] <jouncebot>	 Deploy window Automatic 🚂🧪Trainsperiment Week branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T0200)
[02:01:04] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:03:46] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye
[02:03:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:03:52] <wikibugs>	 SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bu...
[02:04:42] <wikibugs>	 (CR) SBassett: [C: +1] "Per my comment at T300978#7795258" [mediawiki-config] - https://gerrit.wikimedia.org/r/772466 (https://phabricator.wikimedia.org/T300978) (owner: Stang)
[02:05:58] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:07:32] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.3 [core] (wmf/1.39.0-wmf.3) - https://gerrit.wikimedia.org/r/772545
[02:07:34] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.3 [core] (wmf/1.39.0-wmf.3) - https://gerrit.wikimedia.org/r/772545 (owner: TrainBranchBot)
[02:07:38] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:08:36] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-joal-singleuser.service,session-259166.scope,session-259172.scope,session-259184.scope,session-259210.scope,session-259530.scope,session-259534.scope,session-259540.scope,user@20171.service,user@38373.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:48] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[02:09:18] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:11:36] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[02:14:26] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:16:01] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[02:16:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:16:12] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with O...
[02:17:02] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-joal-singleuser.service,session-259166.scope,session-259172.scope,session-259184.scope,session-259210.scope,session-259530.scope,session-259534.scope,session-259540.scope,user@20171.service,user@38373.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:17:46] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:18:58] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:20:40] <icinga-wm>	 RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:25:46] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.3 [core] (wmf/1.39.0-wmf.3) - https://gerrit.wikimedia.org/r/772545 (owner: TrainBranchBot)
[02:29:14] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:31:22] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-joal-singleuser.service,session-259166.scope,session-259172.scope,session-259184.scope,session-259210.scope,session-259530.scope,session-259534.scope,session-259540.scope,user@20171.service,user@38373.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:31:34] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:33:14] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:37:18] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:41:16] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:44:46] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:47:00] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:50:28] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:54:28] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:56:10] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:00:44] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:21:42] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:23:06] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:29:24] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:30:30] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:31:40] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:32:46] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:39:10] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:40:16] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:41:26] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:43:42] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:47:48] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:48:54] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:50:38] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:51:46] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:52:20] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:57:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:02:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (4) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:03:48] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:07:14] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:07:48] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:10:40] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:12:58] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:21:00] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:24:18] <icinga-wm>	 PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:25:02] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:26:10] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:27:54] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:29:38] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:35:22] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:35:56] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:41:38] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:42:14] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:43:58] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:45:06] <wikibugs>	 (PS2) Aaron Schulz: Add "db-mainstash" entry to $wgObjectCaches [mediawiki-config] - https://gerrit.wikimedia.org/r/752807 (https://phabricator.wikimedia.org/T212129)
[04:45:10] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:46:14] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:49:40] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:58:16] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:00:33] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire   - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:02:18] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:04:00] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:05:14] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:06:18] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:07:28] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:10:20] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:18:22] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:24:06] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:25:20] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:26:02] <icinga-wm>	 RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[05:38:44] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:41:28] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:41:59] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1175.eqiad.wmnet with OS bullseye
[05:42:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:08] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:43:17] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[05:43:18] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[05:43:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:40] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:45:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire   - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:47:46] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:50:04] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:50:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire   - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[05:51:42] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:52:54] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:53:16] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1175.eqiad.wmnet with reason: host reimage
[05:53:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:56:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1175.eqiad.wmnet with reason: host reimage
[05:56:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T300775)', diff saved to https://phabricator.wikimedia.org/P22916 and previous config saved to /var/cache/conftool/dbconfig/20220322-055707-marostegui.json
[05:57:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:57:12] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[05:57:23] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:58:31] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:09:12] <wikibugs>	 SRE, ops-eqiad: db1175 not booting up - https://phabricator.wikimedia.org/T304280 (Marostegui) Thanks Chris! The server was able to get reimaged
[06:10:13] <wikibugs>	 (PS1) Marostegui: Revert "db1175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/772482
[06:11:29] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:12:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P22917 and previous config saved to /var/cache/conftool/dbconfig/20220322-061212-marostegui.json
[06:12:13] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1175.eqiad.wmnet with OS bullseye
[06:12:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:18:38] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1132 to dbctl [puppet] - https://gerrit.wikimedia.org/r/772665 (https://phabricator.wikimedia.org/T301879)
[06:19:32] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1132 to dbctl [puppet] - https://gerrit.wikimedia.org/r/772665 (https://phabricator.wikimedia.org/T301879) (owner: Marostegui)
[06:21:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1132 to dbctl T301879', diff saved to https://phabricator.wikimedia.org/P22918 and previous config saved to /var/cache/conftool/dbconfig/20220322-062140-marostegui.json
[06:21:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:21:46] <stashbot>	 T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879
[06:23:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1132 to s1 with minimal weight T301879', diff saved to https://phabricator.wikimedia.org/P22919 and previous config saved to /var/cache/conftool/dbconfig/20220322-062310-marostegui.json
[06:23:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:46] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:27:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P22920 and previous config saved to /var/cache/conftool/dbconfig/20220322-062717-marostegui.json
[06:27:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:27:34] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:30:06] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:32:16] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[06:32:18] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[06:32:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1127 (T298557)', diff saved to https://phabricator.wikimedia.org/P22921 and previous config saved to /var/cache/conftool/dbconfig/20220322-063223-marostegui.json
[06:32:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:32:27] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[06:35:16] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:35:42] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[06:36:50] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:37:04] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:38:48] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:41:33] <wikibugs>	 (PS4) Juan90264: Create "editautopatrolprotected" protection level for viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/772481 (https://phabricator.wikimedia.org/T303579)
[06:42:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (5) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[06:42:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T300775)', diff saved to https://phabricator.wikimedia.org/P22922 and previous config saved to /var/cache/conftool/dbconfig/20220322-064222-marostegui.json
[06:42:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[06:42:25] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1142.eqiad.wmnet with reason: Maintenance
[06:42:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:28] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[06:42:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T300775)', diff saved to https://phabricator.wikimedia.org/P22923 and previous config saved to /var/cache/conftool/dbconfig/20220322-064230-marostegui.json
[06:42:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:42:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:48] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:45:49] <wikibugs>	 (PS3) Elukey: Set bullseye + overlayfs for kubernetes1007 [puppet] - https://gerrit.wikimedia.org/r/770440 (https://phabricator.wikimedia.org/T300744)
[06:47:14] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:50:59] <wikibugs>	 (CR) Elukey: [C: +2] Set bullseye + overlayfs for kubernetes1007 [puppet] - https://gerrit.wikimedia.org/r/770440 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[06:51:59] <wikibugs>	 (PS1) Urbanecm: MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.1) - https://gerrit.wikimedia.org/r/772483 (https://phabricator.wikimedia.org/T304353)
[06:52:14] <wikibugs>	 (PS1) Urbanecm: MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772484 (https://phabricator.wikimedia.org/T304353)
[06:52:26] <wikibugs>	 (PS1) Urbanecm: MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.3) - https://gerrit.wikimedia.org/r/772485 (https://phabricator.wikimedia.org/T304353)
[06:54:00] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1007.eqiad.wmnet with OS bullseye
[06:54:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:36] <Juan_90264>	 Hello
[06:56:53] <icinga-wm>	 PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[06:57:59] <icinga-wm>	 RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[07:00:05] <jouncebot>	 Amir1, awight, Urbanecm, and taavi: Dear deployers, time to do the UTC morning backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T0700).
[07:00:05] <jouncebot>	 koi and Juan_90264: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:16] <koi>	 o/
[07:00:24] <urbanecm>	 Hello!
[07:00:28] <urbanecm>	 I can deploy today
[07:00:29] <Juan_90264>	 I'm present
[07:00:34] <urbanecm>	 And i also have my own fixes
[07:00:35] <taavi>	 o/ here, but I'd rather not deploy today
[07:00:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:01:45] <wikibugs>	 (CR) Urbanecm: [C: +2] MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772484 (https://phabricator.wikimedia.org/T304353) (owner: Urbanecm)
[07:01:49] <Juan_90264>	 Impressive, these hours are the ones that let me be available. Thankful that this was changed!
[07:01:54] <wikibugs>	 (CR) Urbanecm: [C: +2] MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.3) - https://gerrit.wikimedia.org/r/772485 (https://phabricator.wikimedia.org/T304353) (owner: Urbanecm)
[07:02:18] <Juan_90264>	 I'm creating one more change and I'm going to send it to this backport
[07:02:57] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:03:58] <urbanecm>	 taavi: glad you're around, since you were the one to do https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/760552, any objections to reverting the revert today? :)
[07:04:11] <wikibugs>	 (CR) Urbanecm: [C: +2] Create "editautopatrolprotected" protection level for viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/772481 (https://phabricator.wikimedia.org/T303579) (owner: Juan90264)
[07:04:18] <urbanecm>	 Juan_90264: I'll start with your patch
[07:04:44] <Juan_90264>	 Okay
[07:04:53] <wikibugs>	 (Merged) jenkins-bot: Create "editautopatrolprotected" protection level for viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/772481 (https://phabricator.wikimedia.org/T303579) (owner: Juan90264)
[07:04:57] <taavi>	 urbanecm: I don't have any objections as long as secteam is still happy with it
[07:05:05] <urbanecm>	 thanks taavi
[07:05:46] <urbanecm>	 hashar: jnuche: good morning, if either of you is around, for the T304353 fix, i guess i don't have to do the wmf.1 patch as well, since we're now fully at wmf.2, is that right?
[07:05:47] <stashbot>	 T304353: PHP Warning: preg_match() expects parameter 2 to be string, array given - https://phabricator.wikimedia.org/T304353
[07:06:02] <urbanecm>	 Juan_90264: your patch is at mwdebug1001
[07:06:04] <urbanecm>	 please test
[07:06:17] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1007.eqiad.wmnet with reason: host reimage
[07:06:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:29] <Juan_90264>	 Okay, I will test
[07:08:37] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:08:59] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1007.eqiad.wmnet with reason: host reimage
[07:09:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:43] <wikibugs>	 (PS3) Urbanecm: Revert "Revert "wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia"" [mediawiki-config] - https://gerrit.wikimedia.org/r/772466 (https://phabricator.wikimedia.org/T300978) (owner: Stang)
[07:09:57] <wikibugs>	 (CR) Urbanecm: [C: +2] Revert "Revert "wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia"" [mediawiki-config] - https://gerrit.wikimedia.org/r/772466 (https://phabricator.wikimedia.org/T300978) (owner: Stang)
[07:10:13] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:10:37] <wikibugs>	 (Merged) jenkins-bot: Revert "Revert "wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia"" [mediawiki-config] - https://gerrit.wikimedia.org/r/772466 (https://phabricator.wikimedia.org/T300978) (owner: Stang)
[07:11:06] <Juan_90264>	 Urbanecm: I tested and approved
[07:11:07] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:11:11] <urbanecm>	 syncing
[07:11:13] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:12:13] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:13:17] <wikibugs>	 (CR) jerkins-bot: [V: -1] MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.1) - https://gerrit.wikimedia.org/r/772483 (https://phabricator.wikimedia.org/T304353) (owner: Urbanecm)
[07:13:22] <urbanecm>	 :(
[07:13:38] <wikibugs>	 (PS1) Elukey: Set bullseye + overlayfs for kubernetes1008 [puppet] - https://gerrit.wikimedia.org/r/772686 (https://phabricator.wikimedia.org/T300744)
[07:13:44] <wikibugs>	 (PS4) Juan90264: Allow flooders to remove the group from themselves in viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/772666 (https://phabricator.wikimedia.org/T303578)
[07:13:58] <wikibugs>	 (CR) jerkins-bot: [V: -1] MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772484 (https://phabricator.wikimedia.org/T304353) (owner: Urbanecm)
[07:14:08] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: b4a9935: Create "editautopatrolprotected" protection level for viwiki (T303579) (duration: 00m 57s)
[07:14:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:14:13] <stashbot>	 T303579: Create "editautopatrolprotected" protection level for viwiki - https://phabricator.wikimedia.org/T303579
[07:14:15] <urbanecm>	 Juan_90264: should be live now
[07:14:41] <urbanecm>	 koi: your patch is at mwdebug1001, please have a look
[07:15:17] <icinga-wm>	 RECOVERY - PHP opcache health on mw1414 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:15:23] <wikibugs>	 (CR) jerkins-bot: [V: -1] MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.3) - https://gerrit.wikimedia.org/r/772485 (https://phabricator.wikimedia.org/T304353) (owner: Urbanecm)
[07:15:25] <koi>	 urbanecm, lgtm
[07:15:30] <urbanecm>	 syncing
[07:16:16] <Juan_90264>	 It already seems to be working, thanks Urbanecm.
[07:16:52] <Juan_90264>	 So I'm going to put in one more change now.
[07:17:03] <urbanecm>	 okay
[07:17:47] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/CommonSettings.php: caad5a4df35c0daa5fd3423e4abf5aa4d5c38a7a: wgCrossSiteAJAXdomains: Add foundationwiki and {ee,ge,punjabi}wikimedia (T300978) (duration: 00m 49s)
[07:17:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:52] <stashbot>	 T300978: Update $wgCrossSiteAJAXdomains to include {foundation, ee, ge, punjabi}.wm - https://phabricator.wikimedia.org/T300978
[07:18:23] <urbanecm>	 koi: and, live
[07:18:31] <koi>	 ty!
[07:18:33] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:18:39] <urbanecm>	 np
[07:20:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes1007.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[07:21:02] <Juan_90264>	 Hello, I already put
[07:21:23] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1007.eqiad.wmnet with OS bullseye
[07:21:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:21:59] <wikibugs>	 (CR) Urbanecm: [C: +2] Allow flooders to remove the group from themselves in viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/772666 (https://phabricator.wikimedia.org/T303578) (owner: Juan90264)
[07:22:42] <wikibugs>	 (Merged) jenkins-bot: Allow flooders to remove the group from themselves in viwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/772666 (https://phabricator.wikimedia.org/T303578) (owner: Juan90264)
[07:23:01] <Juan_90264>	 Okay merged
[07:23:22] <urbanecm>	 Juan_90264: and pulled to mwdebug1001
[07:23:22] <wikibugs>	 (CR) jerkins-bot: [V: -1] MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772484 (https://phabricator.wikimedia.org/T304353) (owner: Urbanecm)
[07:23:24] <urbanecm>	 can you test?
[07:23:31] <Juan_90264>	 Yes
[07:24:15] <wikibugs>	 (CR) Urbanecm: [V: +2 C: +2] "all tests passed in master, most tests passed here as well, to unbreak the feature" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.2) - https://gerrit.wikimedia.org/r/772484 (https://phabricator.wikimedia.org/T304353) (owner: Urbanecm)
[07:24:25] <wikibugs>	 SRE, observability, Patch-For-Review: Move Kafka logging to the new intermediate PKI - https://phabricator.wikimedia.org/T300130 (elukey) >>! In T300130#7791627, @elukey wrote: >  > If the Beta experiment works, I think that we are ready for https://gerrit.wikimedia.org/r/c/operations/puppet/+/763172...
[07:24:49] <wikibugs>	 (CR) Urbanecm: [V: +2 C: +2] "per the wmf.2 variant" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.3) - https://gerrit.wikimedia.org/r/772485 (https://phabricator.wikimedia.org/T304353) (owner: Urbanecm)
[07:25:17] <Juan_90264>	 Urbanecm: I tested and approved
[07:25:34] <urbanecm>	 deploying
[07:26:50] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 8151bf2: Allow flooders to remove the group from themselves in viwiki (T303578) (duration: 00m 50s)
[07:26:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:26:55] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:26:55] <stashbot>	 T303578: Allow viwiki flooders to remove the group from themselves - https://phabricator.wikimedia.org/T303578
[07:28:27] <urbanecm>	 scap complains about mw1448, saying `/wiki/{title} (Special Version) timed out before a response was received`
[07:28:46] <urbanecm>	 I SSH'ed into the host and all its cores are very busy
[07:28:55] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:28:55] <urbanecm>	 (=at 100%)
[07:29:23] <urbanecm>	 can someone check what's with that host?
[07:30:52] <Juan_90264>	 Wasn't it then?
[07:31:09] <urbanecm>	 my backports are fetched to the debug server and work, waiting on info re mw1448 before i deploy
[07:31:44] <Juan_90264>	 Okay
[07:32:15] <elukey>	 urbanecm: from a quick check it seems that php-fpm is consuming cpu, and it started yesterday at around 21 UTC 
[07:32:18] <elukey>	 https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw1448&var-datasource=eqiad%20prometheus%2Fops&orgId=1&var-cluster=api_appserver&from=now-2d&to=now
[07:32:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298557)', diff saved to https://phabricator.wikimedia.org/P22924 and previous config saved to /var/cache/conftool/dbconfig/20220322-073243-marostegui.json
[07:32:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:32:48] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[07:33:01] <elukey>	 and there was a deployment around that time
[07:33:10] <Juan_90264>	 Urbanecm: The change already seems to be working too
[07:33:55] <urbanecm>	 elukey: I'm wondering why it didn't happen with the prior few syncs. Or maybe it did, but since it printed the msg in the middle, i didn't see it?
[07:34:10] <RhinosF1>	 elukey: which deployment? The train
[07:34:12] <urbanecm>	 Juan_90264: yep, i synced your config patch :)
[07:34:48] <Juan_90264>	 Thank you Urbanecm, bye and good morning!
[07:34:56] <urbanecm>	 See you later Juan_90264 !
[07:35:26] <elukey>	 urbanecm: afaics only mw1448 and mw1449 are showing this behavior, is it urgent to unblock the deployment or can we spend 10/15 mins debugging them? Otherwise we can restart php-fpm on one, and depool the other
[07:35:44] <urbanecm>	 elukey: we can definitely wait 15 mins, no problem :)
[07:36:15] <elukey>	 ack, checking a few things :)
[07:36:28] <urbanecm>	 Okay -- thanks. Please ping me once i can continue. 
[07:36:53] <elukey>	 there are also 3 api appservers depooled https://config-master.wikimedia.org/pybal/eqiad/api-https
[07:38:13] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:40:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire   - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:40:56] <elukey>	 more metrics about the node: 
[07:40:57] <elukey>	 https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-node=mw1448&from=now-24h&to=now
[07:42:17] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:42:21] <elukey>	 it seems that it started to slow down a lot
[07:42:40] <elukey>	 same thing for mw1449
[07:43:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:43:42] <RhinosF1>	 elukey: that's perfectly in line with group2 wmf.2
[07:43:58] <RhinosF1>	 But why only those few servers?
[07:46:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:47:21] <elukey>	 !log depool mw1448 manually on the node (high cpu usage from php-fpm)
[07:47:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:37] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:47:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22925 and previous config saved to /var/cache/conftool/dbconfig/20220322-074748-marostegui.json
[07:47:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:53] <elukey>	 depooling it causes the cpu usage to drop
[07:48:55] <icinga-wm>	 RECOVERY - PHP opcache health on mw1448 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:49:29] <elukey>	 !log restart php-fpm on mw1448 - high cpu usage right after yesterday's deployment at 21 UTC
[07:49:31] <RhinosF1>	 That alerted during the train
[07:49:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:40] <RhinosF1>	 why wasn't it checked after scap
[07:49:49] <RhinosF1>	 (Which should have restarted anyway)
[07:50:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire   - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:50:21] <elukey>	 urbanecm: I just restarted php-fpm on mw1448, I want to check how it behaves with some requests
[07:50:46] <urbanecm>	 Ack. Take the time needed :)
[07:51:49] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:51:51] <elukey>	 for the curious, the opcache stats before the restart are https://phabricator.wikimedia.org/P22926
[07:52:07] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:52:21] <elukey>	 so it seems an issue with opcache
[07:53:00] <elukey>	 !log restart php-fpm on mw1449 - opcache full after deployment
[07:53:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:49] <icinga-wm>	 RECOVERY - PHP opcache health on mw1449 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[07:53:53] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:54:13] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:55:27] <elukey>	 urbanecm: you can proceed from my point of view, metrics are good now
[07:55:31] <elukey>	 lemme know how it goes
[07:55:34] <urbanecm>	 elukey: thanks! Syncing
[07:57:03] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.2/extensions/GrowthExperiments/modules/ext.growthExperiments.MentorDashboard/MenteeOverview/MenteeOverviewPresets.js: 84877bd: MenteeOverviewPresets.getUsersToShow: Fix typo (T304353) (duration: 00m 49s)
[07:57:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:57:07] <elukey>	 RhinosF1: to answer your question see https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#PHP7_opcache_health - we have a daily systemd timer that checks the opcache status, in this case the deployment caused some increase in usage and two appservers were waiting for a run of the timer to get php-fpm restarted (this is my understanding)
[07:57:07] <stashbot>	 T304353: PHP Warning: preg_match() expects parameter 2 to be string, array given - https://phabricator.wikimedia.org/T304353
[07:57:16] <urbanecm>	 elukey: everything went just fine now
[07:57:18] <urbanecm>	 thanks again
[07:57:21] <elukey>	 super
[07:57:32] <urbanecm>	 !log UTC morning backport window completed
[07:57:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:13] <RhinosF1>	 elukey: doesn't scap also do it during the train if high
[07:59:24] <RhinosF1>	 (It did alert then, not sure why releng didn't react)
[08:00:05] <jouncebot>	 dancy, hashar, brennen, dduvall, jeena, and jnuche: Your horoscope predicts another unfortunate 🚂🧪Trainsperiment Week Deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T0800).
[08:00:16] <urbanecm>	 hashar: jnuche: fyi T304353 errors should no longer happen
[08:00:24] <urbanecm>	 (I just synced the fix for it and merged to wmf.3)
[08:00:49] <jnuche>	 urbanecm: thanks!
[08:00:50] <elukey>	 RhinosF1: it may increase right after a deployment, this is why we have the timer, and the alerts probably fell through the cracks (it happens, nothing major)
[08:02:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127', diff saved to https://phabricator.wikimedia.org/P22927 and previous config saved to /var/cache/conftool/dbconfig/20220322-080253-marostegui.json
[08:02:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:18] <wikibugs>	 (CR) Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/pcc-worker1001/34467/alert1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/772448 (https://phabricator.wikimedia.org/T304323) (owner: Filippo Giunchedi)
[08:05:01] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:05:03] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:07:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1132 some more weight T301879', diff saved to https://phabricator.wikimedia.org/P22928 and previous config saved to /var/cache/conftool/dbconfig/20220322-080713-marostegui.json
[08:07:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:17] <stashbot>	 T301879: Test MariaDB 10.6 on Bullseye - https://phabricator.wikimedia.org/T301879
[08:10:35] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:10:52] <wikibugs>	 (PS1) Elukey: role::kafka::logging: add PKI migration settings [puppet] - https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130)
[08:12:20] <wikibugs>	 (CR) Elukey: [C: +2] Set bullseye + overlayfs for kubernetes1008 [puppet] - https://gerrit.wikimedia.org/r/772686 (https://phabricator.wikimedia.org/T300744) (owner: Elukey)
[08:14:11] <hashar>	 good morning
[08:14:17] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:15:48] <hashar>	 urbanecm: thanks for the patch!
[08:16:31] <wikibugs>	 (PS1) Marostegui: Revert "db1175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/772667
[08:17:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P22929 and previous config saved to /var/cache/conftool/dbconfig/20220322-081702-root.json
[08:17:05] <wikibugs>	 SRE, SRE Observability, observability, Sustainability (Incident Followup): Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (fgiunchedi) I thought about this a little bit, perhaps the easiest to start with would be to revert the following reviews:  * ht...
[08:17:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:10] <urbanecm>	 np hashar :
[08:17:33] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:17:35] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/772667 (owner: Marostegui)
[08:17:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1127 (T298557)', diff saved to https://phabricator.wikimedia.org/P22930 and previous config saved to /var/cache/conftool/dbconfig/20220322-081758-marostegui.json
[08:18:00] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[08:18:01] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1101.eqiad.wmnet with reason: Maintenance
[08:18:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:03] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[08:18:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1101:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22931 and previous config saved to /var/cache/conftool/dbconfig/20220322-081806-marostegui.json
[08:18:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:19:35] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1008.eqiad.wmnet with OS bullseye
[08:19:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:09] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:22:23] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34468/console" [puppet] - https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[08:23:58] <wikibugs>	 SRE, SRE-swift-storage, Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (MatthewVernon) @Aklapper I don't think that's right: ` mvernon@cumin1001:~$ sudo cumin O:swift::storage 'id swift' #[...] ===== NODE GROUP =====...
[08:24:34] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34469/console" [puppet] - https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[08:24:47] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:25:52] <wikibugs>	 (PS2) Elukey: role::kafka::logging: add PKI migration settings [puppet] - https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130)
[08:26:45] <wikibugs>	 (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34470/console" [puppet] - https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: Elukey)
[08:28:25] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:28:25] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:29:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubernetes1008.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:30:33] <wikibugs>	 (PS3) Elukey: role::kafka::logging: add PKI migration settings [puppet] - https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130)
[08:31:13] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:31:35] <elukey>	 (downtiming k8s alerts for 1008, reimage in progress)
[08:32:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P22932 and previous config saved to /var/cache/conftool/dbconfig/20220322-083206-root.json
[08:32:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:21] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:35:05] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1008.eqiad.wmnet with reason: host reimage
[08:35:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:45] <wikibugs>	 SRE, SRE-swift-storage, Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (Aklapper) Eh, thanks (and sorry). In that case, this task should depend on whatever task is about decommissioning all the old swift backends that still u...
[08:36:57] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:37:49] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1008.eqiad.wmnet with reason: host reimage
[08:37:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:46] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:44:58] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:47:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P22933 and previous config saved to /var/cache/conftool/dbconfig/20220322-084710-root.json
[08:47:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:00] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:49:30] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:49:56] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1008.eqiad.wmnet with OS bullseye
[08:49:58] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubernetes1008.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[08:49:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:18] <wikibugs>	 SRE, SRE-swift-storage, Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (MatthewVernon) I think the newest host with the old id is ms-be2056, which arrived on  2019-09-18, so we won't be decommissioning the last of these nodes...
[08:51:33] <wikibugs>	 (PS1) Jaime Nuche: testwikis wikis to 1.39.0-wmf.3  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/772789
[08:51:35] <wikibugs>	 (CR) Jaime Nuche: [C: +2] testwikis wikis to 1.39.0-wmf.3  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/772789 (owner: Jaime Nuche)
[08:52:20] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.39.0-wmf.3  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/772789 (owner: Jaime Nuche)
[08:52:26] <logmsgbot>	 !log jnuche@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.3  refs T300203
[08:52:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:30] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[08:53:20] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[08:53:36] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[08:55:11] <wikibugs>	 SRE, SRE Observability, observability, Patch-For-Review, and 2 others: Unquoted URL parameter - https://phabricator.wikimedia.org/T304323 (fgiunchedi)
[08:55:23] <wikibugs>	 SRE, SRE Observability, observability, Sustainability (Incident Followup), User-fgiunchedi: Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (fgiunchedi)
[08:59:19] <wikibugs>	 SRE, SRE Observability, observability, Sustainability (Incident Followup), User-fgiunchedi: Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (Volans) I agree with this direction, as long as all the involved parties that were adding them are aware of...
[08:59:46] <XioNoX>	 !log drmrs propagate LVS med to core routers
[08:59:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P22934 and previous config saved to /var/cache/conftool/dbconfig/20220322-090214-root.json
[09:02:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:50] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:03:38] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:03:47] <wikibugs>	 (CR) Volans: [C: +1] "LGTM, is unlikely but this could cause some alert to fire, and that's a good thing :)" [puppet] - https://gerrit.wikimedia.org/r/772448 (https://phabricator.wikimedia.org/T304323) (owner: Filippo Giunchedi)
[09:08:44] <wikibugs>	 (PS1) JMeybohm: Renew certificates for appservers and apiservers [puppet] - https://gerrit.wikimedia.org/r/772790 (https://phabricator.wikimedia.org/T304237)
[09:09:10] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:11:56] <dcausse>	 !log restarted blazegraph on wdqs2002 (deadlocked)
[09:11:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:06] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:14:15] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] Renew certificates for appservers and apiservers [puppet] - https://gerrit.wikimedia.org/r/772790 (https://phabricator.wikimedia.org/T304237) (owner: JMeybohm)
[09:17:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1175 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P22935 and previous config saved to /var/cache/conftool/dbconfig/20220322-091718-root.json
[09:17:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:21:10] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:22:28] <wikibugs>	 (CR) JMeybohm: [C: +2] Renew certificates for appservers and apiservers [puppet] - https://gerrit.wikimedia.org/r/772790 (https://phabricator.wikimedia.org/T304237) (owner: JMeybohm)
[09:23:58] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:24:23] <Lucas_WMDE>	 huh, Wikibase wmf.3 is missing a backport that we did for wmf.1 and wmf.2
[09:24:33] <Lucas_WMDE>	 and that (I think?) was also merged on master
[09:24:43] <Lucas_WMDE>	 I don’t know how it dropped out of wmf.3…
[09:24:55] <Lucas_WMDE>	 oh wait, sorry, nevermind. it is in there
[09:24:59] <Lucas_WMDE>	 all good :)
[09:25:12] <mmandere>	 !log depool cp1077 for reimage - T290005
[09:25:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:25:16] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:25:16] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[09:28:06] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:28:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22936 and previous config saved to /var/cache/conftool/dbconfig/20220322-092830-marostegui.json
[09:28:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:36] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[09:28:40] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:29:48] <wikibugs>	 (PS1) JMeybohm: Renew certificates for appservers and apiservers [puppet] - https://gerrit.wikimedia.org/r/772792 (https://phabricator.wikimedia.org/T304237)
[09:29:51] <wikibugs>	 (CR) MMandere: [C: +2] site: Reimage cp1077 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/772431 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[09:30:47] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] Renew certificates for appservers and apiservers [puppet] - https://gerrit.wikimedia.org/r/772792 (https://phabricator.wikimedia.org/T304237) (owner: JMeybohm)
[09:31:34] <wikibugs>	 (CR) JMeybohm: [C: +2] Renew certificates for appservers and apiservers [puppet] - https://gerrit.wikimedia.org/r/772792 (https://phabricator.wikimedia.org/T304237) (owner: JMeybohm)
[09:34:13] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp1077.eqiad.wmnet with OS buster
[09:34:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:22] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp1077.eqiad.wmnet with OS buster
[09:34:24] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:40:32] <wikibugs>	 SRE, SRE-swift-storage, Patch-For-Review: 'swift' user/group IDs should be consistent across the fleet - https://phabricator.wikimedia.org/T123918 (Aklapper) Open→Stalled Let's do that
[09:43:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P22937 and previous config saved to /var/cache/conftool/dbconfig/20220322-094335-marostegui.json
[09:43:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:01] <wikibugs>	 (PS1) MMandere: site: Reimage cp1079 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/772793 (https://phabricator.wikimedia.org/T290005)
[09:44:36] <wikibugs>	 (PS2) Filippo Giunchedi: nagios: quote check_http url/string parameters [puppet] - https://gerrit.wikimedia.org/r/772448 (https://phabricator.wikimedia.org/T304323)
[09:45:06] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] "Thanks Volans for the review!" [puppet] - https://gerrit.wikimedia.org/r/772448 (https://phabricator.wikimedia.org/T304323) (owner: Filippo Giunchedi)
[09:46:02] <logmsgbot>	 !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cloudcontrol1005.wikimedia.org with reason: dcaro testing backups
[09:46:04] <logmsgbot>	 !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cloudcontrol1005.wikimedia.org with reason: dcaro testing backups
[09:46:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:16] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:48:46] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:50:18] <jinxer-wm>	 (CertAlmostExpired) firing: (2) Certificate for api-https:443 is about to expire   - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[09:50:20] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[09:51:07] <logmsgbot>	 !log mmandere@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1077.eqiad.wmnet with reason: host reimage
[09:51:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:35] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[09:53:22] <icinga-wm>	 RECOVERY - LVS appservers-https codfw port 443/tcp - Main MediaWiki application server cluster- appservers.svc.codfw.wmnet -https- IPv4 #page on appservers.svc.codfw.wmnet is OK: OK - Certificate appservers-rw.discovery.wmnet will expire on Mon 06 Jul 2026 02:13:19 PM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[09:54:34] <logmsgbot>	 !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.3  refs T300203 (duration: 62m 07s)
[09:54:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:38] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[09:54:58] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1077.eqiad.wmnet with reason: host reimage
[09:55:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:15] <wikibugs>	 (Abandoned) Urbanecm: MenteeOverviewPresets.getUsersToShow: Fix typo [extensions/GrowthExperiments] (wmf/1.39.0-wmf.1) - https://gerrit.wikimedia.org/r/772483 (https://phabricator.wikimedia.org/T304353) (owner: Urbanecm)
[09:58:17] <icinga-wm>	 PROBLEM - PHP opcache health on mw1406 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[09:58:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317', diff saved to https://phabricator.wikimedia.org/P22938 and previous config saved to /var/cache/conftool/dbconfig/20220322-095841-marostegui.json
[09:58:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:01] <icinga-wm>	 PROBLEM - Docker registry health on registry1004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 143 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Docker
[09:59:01] <icinga-wm>	 PROBLEM - Docker registry health on registry2004 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 143 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Docker
[10:00:55] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:00:57] <icinga-wm>	 PROBLEM - PHP opcache health on mw1426 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:00:59] <icinga-wm>	 PROBLEM - PHP opcache health on mw1411 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:01:13] <wikibugs>	 (PS1) Jaime Nuche: group0 wikis to 1.39.0-wmf.3  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/772794
[10:01:16] <wikibugs>	 (CR) Jaime Nuche: [C: +2] group0 wikis to 1.39.0-wmf.3  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/772794 (owner: Jaime Nuche)
[10:01:19] <icinga-wm>	 PROBLEM - PHP opcache health on mw1404 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:01:39] <icinga-wm>	 PROBLEM - PHP opcache health on mw1354 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:01:45] <icinga-wm>	 PROBLEM - PHP opcache health on mw1361 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:01:54] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.39.0-wmf.3  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/772794 (owner: Jaime Nuche)
[10:02:27] <icinga-wm>	 PROBLEM - PHP opcache health on mw1322 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:02:59] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:03:01] <icinga-wm>	 PROBLEM - PHP opcache health on mw1365 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:03:13] <icinga-wm>	 PROBLEM - PHP opcache health on mw1454 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:03:13] <icinga-wm>	 PROBLEM - PHP opcache health on mw1385 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:03:32] <elukey>	 this is not really good
[10:03:37] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.3  refs T300203
[10:03:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:42] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[10:03:45] <elukey>	 hashar: --^
[10:04:01] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:04:32] <elukey>	 _joe_ jayme around ?
[10:04:43] <_joe_>	 yes, already looking
[10:04:44] <jayme>	 elukey: we are
[10:04:56] <hashar>	 jnuche: here :]
[10:04:57] <hashar>	 o/
[10:04:59] <_joe_>	 elukey: scap should restart the servers in a few minutes
[10:05:02] <elukey>	 it happened after yesterday's deployment as well, only two api appservers though
[10:05:08] <elukey>	 okok
[10:05:08] <hashar>	 we have finished promoting to group 0 
[10:05:11] <_joe_>	 when it finished sending the updates
[10:05:13] <hashar>	 I am in a google meet with Jaime
[10:05:13] <_joe_>	 so let's wait
[10:05:18] <jinxer-wm>	 (CertAlmostExpired) resolved: Certificate for api-https:443 is about to expire   - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[10:05:21] <elukey>	 didn't know it okok
[10:05:30] <elukey>	 hashar: ack thanks :)
[10:05:33] <icinga-wm>	 PROBLEM - PHP opcache health on mw1418 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:05:37] <hashar>	 don't we restart the php7.2 opcache on deployment?
[10:05:48] <_joe_>	 hashar: yes but only at the end of the rsync
[10:05:51] <_joe_>	 which isn't ideal
[10:06:07] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:06:14] <_joe_>	 hashar: maybe we needed to actually disable opcache revalidation for the "trainsperiment"
[10:06:31] <icinga-wm>	 PROBLEM - PHP opcache health on mw1424 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:06:31] <icinga-wm>	 PROBLEM - PHP opcache health on mw1434 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:06:45] <icinga-wm>	 PROBLEM - PHP opcache health on mw1450 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:06:54] <_joe_>	 yeah this is going to get bad soon
[10:06:55] <hashar>	 could it also be filled up by the old mw versions we no longer care about? I am not sure whether we cleaned them up
[10:07:05] <hashar>	 should we rollback?
[10:07:09] <icinga-wm>	 PROBLEM - PHP opcache health on mw1353 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:07:44] <_joe_>	 hashar: did scap finish?
[10:07:53] <_joe_>	 I'd wait for that
[10:07:54] <hashar>	 yes
[10:08:03] <_joe_>	 did it run check and restart of the appservers?
[10:08:25] <icinga-wm>	 PROBLEM - PHP opcache health on mw1420 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:08:30] <icinga-wm>	 PROBLEM - Etcd replication lag #page on conf2005 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 149 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Etcd
[10:08:53] * Emperor here
[10:08:58] * volans here
[10:09:05] <icinga-wm>	 PROBLEM - PHP opcache health on mw1345 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:09:05] <icinga-wm>	 PROBLEM - PHP opcache health on mw1351 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:09:07] <icinga-wm>	 PROBLEM - PHP opcache health on mw1380 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:09:07] <icinga-wm>	 PROBLEM - PHP opcache health on mw1405 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:09:07] <icinga-wm>	 PROBLEM - PHP opcache health on mw1320 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:09:08] <_joe_>	 ok, can someone look at the etcd replication thing?
[10:09:10] <jnuche>	 _joe_: scap finished deploying/promoting to group0, no idea what that implies for the appservers
[10:09:13] <_joe_>	 I have to work on appservers
[10:09:17] * volans looking at etcd
[10:09:24] <hashar>	 I would assume scap to have restarted the opcache
[10:09:31] <jnuche>	 should we rollback?
[10:09:35] <_joe_>	 no
[10:10:20] <_joe_>	 I'm not sure why the alerts are even firing tbh
[10:10:39] <Emperor>	 volans: I'm reading wikitech about etcd replication
[10:10:57] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:11:10] <Emperor>	 https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster#Replication does not fill me with joy
[10:11:46] <_joe_>	 can someone ack that alert?
[10:11:47] <icinga-wm>	 PROBLEM - PHP opcache health on mw1327 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:11:49] <icinga-wm>	 PROBLEM - PHP opcache health on mw1386 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:11:51] <icinga-wm>	 PROBLEM - PHP opcache health on mw1433 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:11:53] <_joe_>	 ok so
[10:12:00] <icinga-wm>	 RECOVERY - LVS appservers-https eqiad port 443/tcp - Main MediaWiki application server cluster- appservers.svc.eqiad.wmnet -https- IPv4 #page on appservers.svc.eqiad.wmnet is OK: OK - Certificate appservers-rw.discovery.wmnet will expire on Mon 06 Jul 2026 02:13:19 PM GMT +0000. https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:12:20] <_joe_>	 I am going to run a rolling restart of appservers
[10:12:35] <_joe_>	 volans: is replication actually running or not?
[10:12:58] <_joe_>	 because that tells me how I should do the rolling restart
[10:13:30] <hashar>	 if I look at https://grafana.wikimedia.org/d/GuHySj3mz/mediawiki-php-service?orgId=1&from=now-2d&to=now  
[10:13:43] <volans>	 _joe_: the /lag endpoint returns -1, the codfw cluster is healthy (from etcdctl), I'm checking the replication process now
[10:13:46] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1101:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22939 and previous config saved to /var/cache/conftool/dbconfig/20220322-101346-marostegui.json
[10:13:47] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[10:13:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance
[10:13:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:51] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[10:13:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22940 and previous config saved to /var/cache/conftool/dbconfig/20220322-101354-marostegui.json
[10:13:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:57] <hashar>	 the used memory / number of keys seem to get flushed from time to time  since yesterday
[10:13:59] <icinga-wm>	 PROBLEM - LVS datahubsearch eqiad port 9200/tcp - Search cluster serving DataHub IPv4 on datahubsearch.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 495 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:13:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:07] <hashar>	 so I am guessing we are now overflowing the opcache 
[10:14:28] <hashar>	 https://grafana.wikimedia.org/d/GuHySj3mz/mediawiki-php-service?orgId=1&from=now-2d&to=now&viewPanel=34
[10:14:31] <volans>	 etcdmirror is logging things
[10:14:31] <volans>	 Mar 22 10:13:54 conf2005 etcdmirror-conftool-eqiad-wmnet[1440]: [etcd-mirror] INFO: Replicating key /conftool/v1/mediawiki-config/eqiad/dbconfig at index 460777
[10:14:34] <volans>	 so seems replicating
[10:14:42] <_joe_>	 volans: yes replication works
[10:14:47] <_joe_>	 not sure what the page is about
[10:14:57] <_joe_>	 ok I'll work on the actual production problem
[10:15:17] <volans>	 _joe_: the check checks for
[10:15:18] <volans>	 check_http_url_for_regexp_on_port!conf2005.codfw.wmnet!8000!/lag!'^(-[1-9]|[0-5][^0-9]+)'
[10:15:36] <volans>	 and that endpoint currently returns -1
[10:15:51] <jayme>	 "HTTP/1.1 200 OK - pattern not found" is thrown by docker registry servers as well...maybe something in the check changed
[10:15:53] <volans>	 but this might be an artifact of the added quotes to URL
[10:15:55] <volans>	 checking
[10:16:00] <jayme>	 +1
[10:16:07] <Emperor>	 etcdctl cluster-health says OK (sorry, I'm starting from ~0 knowledge here)
[10:16:13] <icinga-wm>	 RECOVERY - PHP opcache health on mw1433 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:16:37] <volans>	 yes I think this might be an artifact of added quotes to the URL parameter in icinga command, I'm checking
[10:16:57] <Emperor>	 ack
[10:16:59] <icinga-wm>	 PROBLEM - PHP opcache health on mw1314 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:17:01] <icinga-wm>	 PROBLEM - PHP opcache health on mw1333 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:17:01] <icinga-wm>	 PROBLEM - PHP opcache health on mw1343 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:17:03] <icinga-wm>	 PROBLEM - PHP opcache health on mw1371 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:17:05] <icinga-wm>	 PROBLEM - PHP opcache health on mw1394 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:17:23] <icinga-wm>	 PROBLEM - Docker registry health on registry1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 143 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Docker
[10:17:23] <icinga-wm>	 PROBLEM - Docker registry health on registry2003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 143 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Docker
[10:17:37] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:18:04] <volans>	 _joe_: confirmed etcd is all good, I'm sending a patch to fix the check
[10:18:04] <_joe_>	 ok I understood what the problem is
[10:18:16] <_joe_>	 volans: what happened?
[10:18:18] <hashar>	 _joe_: should we clean up the old mediawiki versions?
[10:18:27] <volans>	 double quoting, one on the check definition, one on the commands
[10:18:36] <_joe_>	 hashar: that doesn't matter about that
[10:18:43] <_joe_>	 volans: ok who changed that?
[10:18:51] <icinga-wm>	 RECOVERY - PHP opcache health on mw1424 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:18:51] <_joe_>	 what changed that I mean
[10:18:52] <hashar>	 cause the opcache is only filled when files are being read, isn't it?
[10:19:13] <Emperor>	 _joe_: we changed some quoting on Sunday when looking at the cert expiry page
[10:19:13] <_joe_>	 hashar: correct
[10:19:14] <RhinosF1>	 _joe_: https://github.com/wikimedia/puppet/commit/033278f474e09e1ef2d24ceced220c0673e2b840
[10:19:18] <volans>	 _joe_: filippo's patch to fix the unquoted URL parameter earlier
[10:19:43] <icinga-wm>	 PROBLEM - PHP opcache health on mw1313 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:19:45] <icinga-wm>	 PROBLEM - PHP opcache health on mw1326 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:19:49] <icinga-wm>	 PROBLEM - PHP opcache health on mw1400 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:20:32] <Emperor>	 To clarify: do we think all these pages are in fact the quoting issue, or is there also something unhappy?
[10:20:37] <icinga-wm>	 RECOVERY - PHP opcache health on mw1354 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:20:37] <_joe_>	 hashar: can you paste me somewhere the output of your scap command?
[10:20:43] <Emperor>	 (sorry to still be asking the stupid questions)
[10:20:47] <hashar>	 jnuche is running the train
[10:20:52] <_joe_>	 jnuche then
[10:21:02] <_joe_>	 because I'm not sure why the restart didn't happen.
[10:21:07] <jayme>	 volans: are you fixing that on the caller-side?
[10:21:28] <jnuche>	 _joe_ one sec
[10:21:39] <wikibugs>	 (PS1) Volans: icinga: remove quotes from ereg parameter [puppet] - https://gerrit.wikimedia.org/r/772802
[10:21:52] <RhinosF1>	 _joe_: I don't think it did yesterday either as there was still some with issues from yesterday that elukey had to restart this morning
[10:21:59] <volans>	 patch here ^^^ Emperor
[10:22:09] <jayme>	 ah
[10:22:16] <volans>	 jayme: no, on the command, because there are multiple callers and some with many escapes
[10:22:19] <_joe_>	 !log running check-and-restart on mw-eqiad-appservers
[10:22:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:22] <jayme>	 Emperor: some are due to that, but opcache is different
[10:22:24] <volans>	 so the quick fix is to revert the added quote
[10:22:29] <volans>	 the TODO is to do it properly later
[10:22:34] <jayme>	 yes
[10:22:36] <icinga-wm>	 RECOVERY - PHP opcache health on mw1327 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:22:43] <jayme>	 wanted to point that out :)
[10:22:48] <jnuche>	 https://usercontent.irccloud-cdn.com/file/RlXEwrXK/trainsperiment-tues.log
[10:22:58] <jnuche>	 _joe_: ^^
[10:23:02] <icinga-wm>	 RECOVERY - PHP opcache health on mw1351 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:23:04] <icinga-wm>	 RECOVERY - PHP opcache health on mw1405 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:23:04] <icinga-wm>	 RECOVERY - PHP opcache health on mw1320 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:23:15] <_joe_>	 jnuche: please use phabricator's pastes
[10:23:24] <_joe_>	 so we can refer to them in tasks
[10:23:38] <icinga-wm>	 RECOVERY - PHP opcache health on mw1365 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:23:46] <wikibugs>	 (PS2) Volans: icinga: remove quotes from ereg parameter [puppet] - https://gerrit.wikimedia.org/r/772802 (https://phabricator.wikimedia.org/T304323)
[10:23:48] <wikibugs>	 (CR) JMeybohm: [C: +1] icinga: remove quotes from ereg parameter [puppet] - https://gerrit.wikimedia.org/r/772802 (https://phabricator.wikimedia.org/T304323) (owner: Volans)
[10:23:49] <_joe_>	 jnuche: wait so the sync-apaches is still not finished?
[10:23:50] <wikibugs>	 (CR) MVernon: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/772802 (https://phabricator.wikimedia.org/T304323) (owner: Volans)
[10:24:02] <godog>	 I am in transit, my apologies for the disruption :(
[10:24:03] <wikibugs>	 (CR) Volans: [V: +2 C: +2] icinga: remove quotes from ereg parameter [puppet] - https://gerrit.wikimedia.org/r/772802 (https://phabricator.wikimedia.org/T304323) (owner: Volans)
[10:24:10] <TheresNoTime>	 (got 5 mins left on the ack by the way, just FYI)
[10:24:12] <icinga-wm>	 RECOVERY - PHP opcache health on mw1326 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:24:12] <icinga-wm>	 RECOVERY - PHP opcache health on mw1353 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:24:30] <_joe_>	 TheresNoTime: which ack?
[10:24:39] * volans running puppet on alert1001
[10:24:40] <jnuche>	 _joe_: no, it finished, it seems the file didn't flush
[10:24:54] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:24:58] <TheresNoTime>	 `alertname="PHP opcache health"`, someone asked for the alert to be ack'd?
[10:25:02] <_joe_>	 jnuche: please post a complete log to phabricator
[10:25:08] <_joe_>	 TheresNoTime: not that one :)
[10:25:12] <jnuche>	 _joe_: on it
[10:25:15] <TheresNoTime>	 oh, sorry _joe_ 
[10:25:31] <TheresNoTime>	 (removed)
[10:26:07] <_joe_>	 TheresNoTime: sorry, where did you "ack" it?
[10:26:22] <icinga-wm>	 RECOVERY - PHP opcache health on mw1385 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:26:24] <icinga-wm>	 RECOVERY - PHP opcache health on mw1434 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:26:32] <TheresNoTime>	 s/ack/silence at alerts.wikimedia.org
[10:26:34] <icinga-wm>	 PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: The command defined for service Gerrit JSON does not exist https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[10:26:38] <icinga-wm>	 PROBLEM - PHP opcache health on mw1447 is CRITICAL: CRITICAL: opcache full on php 7.2. https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:26:44] <icinga-wm>	 RECOVERY - PHP opcache health on mw1345 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:26:54] <_joe_>	 !log running check-restart-php on api appservers
[10:26:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:56] <Emperor>	 volans: I've resolved the etcd incident in VO
[10:27:00] <hashar>	 the `script` command refuses to flush the log :]
[10:27:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (6) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[10:27:22] <volans>	 Emperor: ack, it should recover shortly and would have done that automatically
[10:27:27] <volans>	 but that's ok too :)
[10:27:27] <RhinosF1>	 volans: is the gerrit alert another monitoring issue
[10:27:54] <wikibugs>	 (PS3) Jgiannelos: Remove unused wgKartographerDfltStyle after tegola roll out [mediawiki-config] - https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T298249)
[10:27:58] <icinga-wm>	 RECOVERY - PHP opcache health on mw1386 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:28:00] <volans>	 the docker registry should be the same
[10:28:10] <icinga-wm>	 RECOVERY - PHP opcache health on mw1447 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:28:11] <jayme>	 yes. So is datahubsearch
[10:28:12] <icinga-wm>	 RECOVERY - PHP opcache health on mw1343 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:28:18] <icinga-wm>	 RECOVERY - PHP opcache health on mw1380 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:28:21] <_joe_>	 hashar, jnuche: to be clear, scap *should have* restarted php-fpm
[10:28:22] <icinga-wm>	 RECOVERY - Docker registry health on registry2003 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Docker
[10:28:31] <jayme>	 Gerrit could be something different, though
[10:28:39] <volans>	 RhinosF1: the Gerrit JSON one?
[10:28:50] <RhinosF1>	 volans: yes
[10:28:54] <hashar>	 _joe_: yeah that is my expectation. jnuche script output doesn't have the full output most probably cause script has output buffering 
[10:28:58] <icinga-wm>	 RECOVERY - PHP opcache health on mw1394 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:29:10] <hashar>	 maybe the scap logs in kibana have some details. I am digging there
[10:29:14] <volans>	 yes it's similar, has quotes in the URL parameter on the caller side
[10:29:17] <volans>	 fixing thx
[10:29:21] <Emperor>	 looking at gerrit dashboards, so far nothing obvious
[10:30:40] <icinga-wm>	 RECOVERY - PHP opcache health on mw1415 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:30:40] <icinga-wm>	 RECOVERY - PHP opcache health on mw1411 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:30:46] <jnuche>	 _joe_: https://phabricator.wikimedia.org/P22941
[10:30:48] <icinga-wm>	 RECOVERY - LVS datahubsearch eqiad port 9200/tcp - Search cluster serving DataHub IPv4 on datahubsearch.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 495 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[10:30:48] <icinga-wm>	 RECOVERY - Docker registry health on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Docker
[10:30:48] <icinga-wm>	 RECOVERY - Docker registry health on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Docker
[10:30:49] <icinga-wm>	 RECOVERY - Etcd replication lag #page on conf2005 is OK: HTTP OK: HTTP/1.1 200 OK - 149 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Etcd
[10:30:49] <icinga-wm>	 RECOVERY - Docker registry health on registry2004 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Docker
[10:31:03] <wikibugs>	 (PS1) Volans: icinga: avoid double quoting for the URL [puppet] - https://gerrit.wikimedia.org/r/772806 (https://phabricator.wikimedia.org/T304323)
[10:31:08] <volans>	 RhinosF1: ^^^
[10:31:23] <hashar>	 Mar 22, 2022 @ 09:54:25 sync-world Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 86 host(s)
[10:31:23] <hashar>	 Mar 22, 2022 @ 10:03:28 sync-wikiversions Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 86 host(s)
[10:31:36] <wikibugs>	 (CR) RhinosF1: [C: +1] icinga: avoid double quoting for the URL [puppet] - https://gerrit.wikimedia.org/r/772806 (https://phabricator.wikimedia.org/T304323) (owner: Volans)
[10:31:56] <_joe_>	 10:03:28 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 86 host(s)
[10:32:00] <_joe_>	 ok this is the issue
[10:32:05] <RhinosF1>	 volans: looks ok
[10:32:05] <_joe_>	 it just ran on 86 hosts
[10:32:08] <_joe_>	 no idea why
[10:32:11] <hashar>	 yeah no idea why
[10:32:12] <hashar>	 ahah
[10:32:16] <wikibugs>	 (CR) Volans: [C: +2] icinga: avoid double quoting for the URL [puppet] - https://gerrit.wikimedia.org/r/772806 (https://phabricator.wikimedia.org/T304323) (owner: Volans)
[10:32:43] <Emperor>	 the check for gerrit still looks to have too many quotes
[10:32:46] <icinga-wm>	 RECOVERY - PHP opcache health on mw1400 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:32:46] <icinga-wm>	 RECOVERY - PHP opcache health on mw1418 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:32:57] <volans>	 Emperor: see the patch just merged
[10:32:59] <wikibugs>	 (PS1) David Caro: Removed mdipietro and added vrook [puppet] - https://gerrit.wikimedia.org/r/772807
[10:33:01] <hashar>	 _joe_: have you manually restarted php on all app servers?
[10:33:12] <_joe_>	 hashar: yes
[10:33:27] <_joe_>	 hashar: with this scap bug unsolved, we can't proceed further.
[10:33:34] <Emperor>	 volans: as ever you are ahead of me :)
[10:33:41] <wikibugs>	 (CR) RhinosF1: [C: -1] "took already has a data.yaml entry" [puppet] - https://gerrit.wikimedia.org/r/772807 (owner: David Caro)
[10:34:06] <RhinosF1>	 dcaro: that data.yaml is completely wrong
[10:34:28] <_joe_>	 hashar: will you open a task or should I?
[10:34:49] <RhinosF1>	 a) you're adding rook to absented there b) we never replace people c) rook is already in data.yaml
[10:36:07] <hashar>	 _joe_: please do :)
[10:36:20] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:36:23] <_joe_>	 hashar: frankly, I'd prefer if you did own the issue.
[10:36:30] <_joe_>	 but ok
[10:36:37] <hashar>	 looks like that issue has been there for a while. March 7th had 86 hosts,  March 3rd 91 hosts
[10:36:42] <_joe_>	 how do I make it a blocker for all trains this week?
[10:36:50] <icinga-wm>	 RECOVERY - PHP opcache health on mw1314 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:37:05] <hashar>	 _joe_: I will file it :)
[10:37:17] <hashar>	 I don't want you to be burdened by too many tasks! :D
[10:37:26] <Emperor>	 volans: since you're working on icinga, I see it's complaining about config errors
[10:37:26] <RhinosF1>	 _joe_: it's all one task for this week
[10:37:26] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1175: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/772482 (owner: Marostegui)
[10:37:59] <volans>	 Emperor: that's for dcaro
[10:38:00] <volans>	 Error: Could not find any contact matching 'mdipietro' (config file '/etc/icinga/objects/contactgroups.cfg', starting on line 67)
[10:38:27] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (JMeybohm)
[10:38:28] <icinga-wm>	 RECOVERY - PHP opcache health on mw1426 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:38:46] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:39:04] <icinga-wm>	 RECOVERY - PHP opcache health on mw1404 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:39:26] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[10:39:48] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (JMeybohm) Certs have been renewed (with cergen managed ones). Thanks @Joe for pairing!
[10:40:15] <wikibugs>	 (CR) David Caro: Removed mdipietro and added vrook (1 comment) [puppet] - https://gerrit.wikimedia.org/r/772807 (owner: David Caro)
[10:40:59] <wikibugs>	 (CR) David Caro: Removed mdipietro and added vrook (1 comment) [puppet] - https://gerrit.wikimedia.org/r/772807 (owner: David Caro)
[10:41:19] <logmsgbot>	 !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1077.eqiad.wmnet with OS buster
[10:41:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:23] <dcaro>	 volans: yep, just changed that
[10:41:26] <icinga-wm>	 RECOVERY - PHP opcache health on mw1361 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:41:27] <wikibugs>	 SRE, Traffic, Patch-For-Review, Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp1077.eqiad.wmnet with OS buster com...
[10:41:32] <RhinosF1>	 dcaro: if you're removing an entry from
[10:41:35] <RhinosF1>	 Data.yaml
[10:41:44] <RhinosF1>	 Or replacing someone's shell name, it's probably wrong
[10:41:50] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: deploy_to_mwdebug.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:42:10] <icinga-wm>	 RECOVERY - PHP opcache health on mw1313 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:42:10] <icinga-wm>	 RECOVERY - PHP opcache health on mw1333 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:42:37] <_joe_>	 jayme: can you take a look at deploy_to_mwdebug ?
[10:42:59] <dcaro>	 RhinosF1: can you elaborate on what's the right thing?
[10:43:21] <RhinosF1>	 dcaro: if you're adding a new person to data.yaml, you add a new entry
[10:43:32] <wikibugs>	 (CR) MSantos: [C: +1] "This is ready to go. LGTM." [mediawiki-config] - https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T298249) (owner: Jgiannelos)
[10:43:36] <dcaro>	 RhinosF1: what if I'm replacing a person?
[10:43:38] <RhinosF1>	 And move your now left worker to absent, drop their ssh key and add them to absented
[10:43:41] <dcaro>	 (renaming)
[10:43:41] <RhinosF1>	 dcaro: you don't
[10:43:50] <volans>	 dcaro: and vrook should not be in the absented group IMHO
[10:43:57] <RhinosF1>	 we don't replace shell accounts in data.yaml
[10:44:05] <RhinosF1>	 because a new staff member joined
[10:44:10] <RhinosF1>	 And old left
[10:44:14] <volans>	 that's different though
[10:44:16] <icinga-wm>	 RECOVERY - PHP opcache health on mw1371 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:44:46] <_joe_>	 RhinosF1: there's context you're clearly missing.
[10:44:58] <RhinosF1>	 dcaro: https://github.com/wikimedia/puppet/commit/115af1f6971775168cb49fc21c5809c280badbcb was done ages ago for the same user though
[10:45:04] <RhinosF1>	 They already have a shell account
[10:45:14] <icinga-wm>	 RECOVERY - PHP opcache health on mw1454 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:45:23] <_joe_>	 RhinosF1: please let's stop discussing this here.
[10:45:57] <Emperor>	 +1 there's quite a lot of noise right now
[10:46:12] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:46:19] <mmandere>	 !log pool cp1077 with HAProxy as TLS termination layer - T290005
[10:46:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:46:24] <stashbot>	 T290005: Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005
[10:46:42] <icinga-wm>	 RECOVERY - PHP opcache health on mw1420 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:48:18] <icinga-wm>	 RECOVERY - PHP opcache health on mw1406 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:48:46] <jayme>	 _joe_: yes
[10:49:00] <icinga-wm>	 RECOVERY - PHP opcache health on mw1322 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:49:50] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:51:08] <icinga-wm>	 RECOVERY - PHP opcache health on mw1450 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[10:51:10] <jayme>	 _joe_: seems broken since friday
[10:51:51] <hashar>	 filed as https://phabricator.wikimedia.org/T304414
[10:52:05] <_joe_>	 jayme: uhh since after I fixed it?
[10:52:22] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[10:52:26] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:52:33] <_joe_>	 jayme: anyways, can you take a look and fix it? I have to work on other stuff
[10:52:33] <jayme>	 _joe_: not sure when exactly that was. error file is from 2022-03-18T14:43:22.703871
[10:52:39] <jayme>	 sure, sure
[10:54:02] <wikibugs>	 (PS2) David Caro: Removed mdipietro and added vrook [puppet] - https://gerrit.wikimedia.org/r/772807
[10:54:10] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:55:38] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_wikidata-updateQueryServiceLag.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:56:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[10:56:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:56:28] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:56:38] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2003.codfw.wmnet, wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:56:43] <wikibugs>	 SRE, Infrastructure-Foundations, serviceops, Patch-For-Review: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (Volans) Thanks! I think we can now destroy the ones in the Puppet CA mentioned in T304237#7790839 at this point.
[10:57:02] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2003.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2001.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2003.codfw.wmnet, wdqs2004.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[10:57:05] <dcausse>	 ^ looking
[10:58:00] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:59:26] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[10:59:50] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[11:00:08] <wikibugs>	 SRE, SRE Observability, observability, Patch-For-Review, and 2 others: Unquoted URL parameter - https://phabricator.wikimedia.org/T304323 (Volans) Unfortunately this had some follow up alert (some expected) due to double quoting, done both in the caller and the command definition. I think we shou...
[11:00:50] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/772807 (owner: David Caro)
[11:01:16] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:38] <wikibugs>	 (CR) David Caro: [C: +2] Removed mdipietro and added vrook [puppet] - https://gerrit.wikimedia.org/r/772807 (owner: David Caro)
[11:01:56] <wikibugs>	 (CR) David Caro: [C: +2] Removed mdipietro and added vrook (2 comments) [puppet] - https://gerrit.wikimedia.org/r/772807 (owner: David Caro)
[11:02:44] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:03:31] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[11:03:32] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[11:03:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:03:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:32] <wikibugs>	 (CR) Jgiannelos: Remove unused wgKartographerDfltStyle after tegola roll out [mediawiki-config] - https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T298249) (owner: Jgiannelos)
[11:07:38] <icinga-wm>	 RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: OK - Certificate gerrit.wikimedia.org will expire on Sat 28 May 2022 08:33:22 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring
[11:08:00] <volans>	 RhinosF1: here the recovery you asked for ^
[11:08:17] <RhinosF1>	 volans: :), thanks for looking into it
[11:09:27] <wikibugs>	 (PS3) Majavah: metricsinfra: Use prometheus-configurator [puppet] - https://gerrit.wikimedia.org/r/763664 (https://phabricator.wikimedia.org/T286299)
[11:09:50] <jayme>	 _joe_: fyi the failing release did not get ready because "Readiness probe failed: HTTP probe failed with statuscode: 503" - should be good now
[11:10:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[11:10:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:06] <_joe_>	 jayme: uh
[11:10:12] <_joe_>	 that's pretty bad though :P
[11:10:22] <jayme>	 yeah...it rolled back ofc
[11:10:27] <_joe_>	 also, imagine we have to do a release during an outage
[11:10:35] <_joe_>	 sigh.
[11:11:44] <jayme>	 in that case, we should potentially force it. But it's part of the deployment strategy of k8s to not continue when new pods don't come up healthy
[11:13:06] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[11:15:25] <wikibugs>	 (PS3) Ladsgroup: idp: Open up orchestrator to cumin host, take IV [puppet] - https://gerrit.wikimedia.org/r/771866 (https://phabricator.wikimedia.org/T281249) (owner: Jbond)
[11:16:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22942 and previous config saved to /var/cache/conftool/dbconfig/20220322-111607-marostegui.json
[11:16:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:16:16] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[11:22:44] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:25:11] <wikibugs>	 (PS1) AikoChou: ml-services: update draft/article quality docker image [deployment-charts] - https://gerrit.wikimedia.org/r/772811 (https://phabricator.wikimedia.org/T300270)
[11:27:14] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:28:44] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:29:22] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:29:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1100 for reboot', diff saved to https://phabricator.wikimedia.org/P22943 and previous config saved to /var/cache/conftool/dbconfig/20220322-112931-marostegui.json
[11:29:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1123 for reboot', diff saved to https://phabricator.wikimedia.org/P22944 and previous config saved to /var/cache/conftool/dbconfig/20220322-113003-marostegui.json
[11:30:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:01] <marostegui>	 !log Reboot db1100 and db1123 for kernel upgrade before master swap
[11:31:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P22945 and previous config saved to /var/cache/conftool/dbconfig/20220322-113113-marostegui.json
[11:31:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:33:44] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:35:54] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:36:16] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:36:42] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:40:51] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 10%: After reboot', diff saved to https://phabricator.wikimedia.org/P22946 and previous config saved to /var/cache/conftool/dbconfig/20220322-114051-root.json
[11:40:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:41:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 10%: After reboot', diff saved to https://phabricator.wikimedia.org/P22948 and previous config saved to /var/cache/conftool/dbconfig/20220322-114102-root.json
[11:41:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:20] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:46:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317', diff saved to https://phabricator.wikimedia.org/P22949 and previous config saved to /var/cache/conftool/dbconfig/20220322-114618-marostegui.json
[11:46:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:48] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:48:49] <hashar>	 _joe_: I think the issue is the `appserver` dsh group, which is empty. It is generated from a hiera value having `service: apache2` but apparently that is now using `nginx`
[11:49:08] <hashar>	 scap to all servers works though, because it uses another group: `mediawiki-installation`
[11:49:22] <hashar>	 my debug digging is in https://phabricator.wikimedia.org/T304414#7796144 and the following comment
[11:49:35] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (cmooney) Also @valerio.bozzolan you should feel free to email the IPs to noc@wikimedia.org if you wish to avoid putting them here wh...
[11:49:35] <hashar>	 essentially /etc/dsh/group/appserver is empty
[11:49:39] <_joe_>	 hashar: yeah that's probably it, I was sure I did change it when we removed the cluster
[11:49:45] <hashar>	 so we do not restart php opcache there
[11:49:48] <_joe_>	 hashar: yeah I'll fix that
[11:50:01] <_joe_>	 hashar: although some of the servers having issues were apis
[11:50:07] <_joe_>	 so I guess there's more going on
[11:50:19] <hashar>	 with https://gerrit.wikimedia.org/r/c/operations/puppet/+/767203 you have updated mediawiki-installation but haven't updated the appserver group
[11:50:31] <_joe_>	 yeah I was looking at that exactly
[11:50:39] <_joe_>	 it's an easy fix thankfully
[11:51:07] <hashar>	 and I have no idea why we run the opcache restart against hosts of `appserver,api_appserver,jobrunner,testserver,parsoid_php`  
[11:51:08] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (valerio.bozzolan)
[11:51:15] <hashar>	 instead of all the ones from `mediawiki-installation` 
[11:51:20] <hashar>	 maybe cause of dumps host 
[11:51:22] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (valerio.bozzolan) I've added all the details in a nice private Paste visible to you (P22947) and added it in the Task description. T...
[11:51:26] <hashar>	 anyway issue found ;]
[11:51:28] <_joe_>	 hashar: so that we run in parallel on multiple clusters
[11:51:41] <_joe_>	 instead of running sequentially through all mw servers
[11:51:51] <_joe_>	 we can run on 10% of each cluster safely
[11:51:59] <hashar>	 ah maybe
[11:52:09] <_joe_>	 instead of being forced to run on 10% of the smallest cluster to be safe
[11:52:22] <hashar>	 anyway problem solved!  I am going to have lunch and we will resume the train :]
[11:53:13] <hashar>	 jnuche: I have found the issue. The list of servers to restart php opcache on is incomplete ^
[11:53:17] <_joe_>	 hashar: yeah gimme the time to fix the issue :)
[11:53:30] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:53:31] <hashar>	 s/apache2/nginx/ !
[11:53:40] <hashar>	 I am getting lunch &
[11:54:33] <wikibugs>	 SRE, Traffic: Remove image check on Varnish Dockerized Test Environment - https://phabricator.wikimedia.org/T303794 (MMandere) Open→Resolved
[11:55:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 25%: After reboot', diff saved to https://phabricator.wikimedia.org/P22950 and previous config saved to /var/cache/conftool/dbconfig/20220322-115557-root.json
[11:56:00] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (valerio.bozzolan)
[11:56:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 25%: After reboot', diff saved to https://phabricator.wikimedia.org/P22951 and previous config saved to /var/cache/conftool/dbconfig/20220322-115606-root.json
[11:56:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22952 and previous config saved to /var/cache/conftool/dbconfig/20220322-120123-marostegui.json
[12:01:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[12:01:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[12:01:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:29] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[12:01:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:36] <wikibugs>	 SRE, Analytics, Data-Engineering: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (Ottomata) This would be easier if {T276972} was done, but it doesn't look like there's enthusiasm for it.  I'd love to be able to automate ingestion fro...
[12:04:28] <wikibugs>	 (PS1) Cathal Mooney: Re-enable direct path to Seabone / Telecom Italia in Eqiad [homer/public] - https://gerrit.wikimedia.org/r/772815
[12:04:30] <wikibugs>	 (PS1) Ladsgroup: Enable WRITE BOTH for templatelinks normalization in wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/772816 (https://phabricator.wikimedia.org/T299421)
[12:04:44] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:05:53] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Re-enable direct path to Seabone / Telecom Italia in Eqiad [homer/public] - https://gerrit.wikimedia.org/r/772815 (owner: Cathal Mooney)
[12:06:19] <wikibugs>	 (CR) Cathal Mooney: [V: +2 C: +2] Re-enable direct path to Seabone / Telecom Italia in Eqiad [homer/public] - https://gerrit.wikimedia.org/r/772815 (owner: Cathal Mooney)
[12:08:23] <Amir1>	 jouncebot: nowandnext
[12:08:23] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 51 minute(s)
[12:08:23] <jouncebot>	 In 0 hour(s) and 51 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T1300)
[12:08:42] <wikibugs>	 (CR) Ladsgroup: [C: +2] Enable WRITE BOTH for templatelinks normalization in wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/772816 (https://phabricator.wikimedia.org/T299421) (owner: Ladsgroup)
[12:09:05] <wikibugs>	 (CR) EllenR: "This looks good; however I like having the tags (T123456) for the various changes. It is very helpful to understand why particular pieces " [mediawiki-config] - https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: Eigyan)
[12:09:24] <wikibugs>	 (Merged) jenkins-bot: Enable WRITE BOTH for templatelinks normalization in wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/772816 (https://phabricator.wikimedia.org/T299421) (owner: Ladsgroup)
[12:09:47] <wikibugs>	 (CR) EllenR: [C: +1] "Sorry, forgot to get the code review number in -" [mediawiki-config] - https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: Eigyan)
[12:11:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 50%: After reboot', diff saved to https://phabricator.wikimedia.org/P22953 and previous config saved to /var/cache/conftool/dbconfig/20220322-121101-root.json
[12:11:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:11:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 50%: After reboot', diff saved to https://phabricator.wikimedia.org/P22954 and previous config saved to /var/cache/conftool/dbconfig/20220322-121110-root.json
[12:11:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:10] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:12:12] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:772816|Enable WRITE BOTH for templatelinks normalization in wikitech (T299421)]] (duration: 01m 41s)
[12:12:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:16] <stashbot>	 T299421: Turn on write both in production for templatelinks normalization - https://phabricator.wikimedia.org/T299421
[12:13:26] <wikibugs>	 (PS1) Ladsgroup: Enable WRITE BOTH on rest of s6 for templatelinks normalization [mediawiki-config] - https://gerrit.wikimedia.org/r/772817 (https://phabricator.wikimedia.org/T299421)
[12:14:52] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (DMburugu) I approve the request.
[12:15:02] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:15:26] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:15:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[12:15:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:49] <wikibugs>	 (PS1) Cathal Mooney: Revert "Re-enable direct path to Seabone / Telecom Italia in Eqiad" [homer/public] - https://gerrit.wikimedia.org/r/772668
[12:16:04] <marostegui>	 !log dbmaint s8@eqiad T300992
[12:16:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:16:07] <wikibugs>	 (PS2) Ladsgroup: Enable WRITE BOTH on rest of s6 for templatelinks normalization [mediawiki-config] - https://gerrit.wikimedia.org/r/772817 (https://phabricator.wikimedia.org/T299421)
[12:16:08] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:16:08] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[12:17:31] <wikibugs>	 SRE, SRE Observability, observability, Sustainability (Incident Followup), User-fgiunchedi: Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (jbond) +1 i think the -C change was mostly introduced by me and happy for it to be reverted, other options...
[12:17:34] <marostegui>	 !log dbmaint s5@eqiad T300992
[12:17:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:14] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Revert "Re-enable direct path to Seabone / Telecom Italia in Eqiad" [homer/public] - https://gerrit.wikimedia.org/r/772668 (owner: Cathal Mooney)
[12:18:38] <wikibugs>	 (CR) Cathal Mooney: [V: +2 C: +2] Revert "Re-enable direct path to Seabone / Telecom Italia in Eqiad" [homer/public] - https://gerrit.wikimedia.org/r/772668 (owner: Cathal Mooney)
[12:18:53] <marostegui>	 !log dbmaint s6@eqiad T300992
[12:18:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[12:19:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[12:19:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:19:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:14] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (cmooney) Thanks for the info @valerio.bozzolan   It seems the return traffic to that address was routing out of our network to Telia...
[12:20:40] <wikibugs>	 (CR) Ladsgroup: [C: +2] Enable WRITE BOTH on rest of s6 for templatelinks normalization [mediawiki-config] - https://gerrit.wikimedia.org/r/772817 (https://phabricator.wikimedia.org/T299421) (owner: Ladsgroup)
[12:21:15] <marostegui>	 !log dbmaint s7@eqiad T300992
[12:21:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:21:19] <stashbot>	 T300992: Add linter_template and linter_tag columns to the Linter table - https://phabricator.wikimedia.org/T300992
[12:21:23] <wikibugs>	 (Merged) jenkins-bot: Enable WRITE BOTH on rest of s6 for templatelinks normalization [mediawiki-config] - https://gerrit.wikimedia.org/r/772817 (https://phabricator.wikimedia.org/T299421) (owner: Ladsgroup)
[12:23:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[12:23:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:36] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:772817|Enable WRITE BOTH on rest of s6 for templatelinks normalization (T299421)]] (duration: 00m 54s)
[12:24:38] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:24:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:41] <stashbot>	 T299421: Turn on write both in production for templatelinks normalization - https://phabricator.wikimedia.org/T299421
[12:24:49] <marostegui>	 !log dbmaint s3@eqiad T300600
[12:24:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:55] <stashbot>	 T300600: Upgrade s3 to Bullseye - https://phabricator.wikimedia.org/T300600
[12:26:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 75%: After reboot', diff saved to https://phabricator.wikimedia.org/P22955 and previous config saved to /var/cache/conftool/dbconfig/20220322-122605-root.json
[12:26:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:26:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 75%: After reboot', diff saved to https://phabricator.wikimedia.org/P22956 and previous config saved to /var/cache/conftool/dbconfig/20220322-122613-root.json
[12:26:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:27:02] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: openstack: networktests: discard even more hostkey checking stuff [puppet] - https://gerrit.wikimedia.org/r/772818 (https://phabricator.wikimedia.org/T304420)
[12:28:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[12:28:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:30:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1132 after testing', diff saved to https://phabricator.wikimedia.org/P22957 and previous config saved to /var/cache/conftool/dbconfig/20220322-123056-marostegui.json
[12:30:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[12:32:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[12:32:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:50] <wikibugs>	 (CR) Aklapper: "Please abandon if this is not wanted/needed anymore" [deployment-charts] - https://gerrit.wikimedia.org/r/748734 (owner: Varac)
[12:33:14] <wikibugs>	 (CR) JMeybohm: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/770556 (https://phabricator.wikimedia.org/T290966) (owner: JMeybohm)
[12:33:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[12:33:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:35:08] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 16 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:36:45] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[12:36:46] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[12:36:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1123 (re)pooling @ 100%: After reboot', diff saved to https://phabricator.wikimedia.org/P22958 and previous config saved to /var/cache/conftool/dbconfig/20220322-124109-root.json
[12:41:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:18] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1100 (re)pooling @ 100%: After reboot', diff saved to https://phabricator.wikimedia.org/P22959 and previous config saved to /var/cache/conftool/dbconfig/20220322-124117-root.json
[12:41:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:22] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[12:41:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[12:41:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:00] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: openstack: networktests: discard even more hostkey checking stuff [puppet] - https://gerrit.wikimedia.org/r/772818 (https://phabricator.wikimedia.org/T304420)
[12:44:18] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:44:24] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:45:18] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (cmooney) Ok I've emailed Seabone/TI NOC now, hopefully they come back with something meaningful.  There isn't a whole lot more we ca...
[12:51:04] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[12:51:06] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2121.codfw.wmnet with reason: Maintenance
[12:51:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:07] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 10 hosts with reason: Maintenance
[12:51:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:14] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 10 hosts with reason: Maintenance
[12:51:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:52] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:52:05] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[12:52:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1122.eqiad.wmnet with reason: Maintenance
[12:52:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:52:20] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:54:28] <moritzm>	 !log installing 5.10.103 kernels on servers running a kernel from buster backports T303179
[12:54:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:54:44] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:54:46] <wikibugs>	 SRE, Znuny, serviceops, Patch-For-Review: Move VTRS db passwords to a different hiera location - https://phabricator.wikimedia.org/T303272 (jbond) The change has been made on the private repo  ` git show b9303238                                                                              [12:52:...
[12:55:50] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:56:05] <wikibugs>	 (PS1) Jbond: vtrs: move password to profile name space [labs/private] - https://gerrit.wikimedia.org/r/772821 (https://phabricator.wikimedia.org/T303272)
[12:56:47] <wikibugs>	 (CR) Jbond: [V: +2 C: +2] vtrs: move password to profile name space [labs/private] - https://gerrit.wikimedia.org/r/772821 (https://phabricator.wikimedia.org/T303272) (owner: Jbond)
[12:56:56] <hashar>	 o/
[12:56:58] <hashar>	 backkk
[12:57:20] <_joe_>	 hashar: sorry I got diverted by other stuff, will do the patch now
[12:57:37] <hashar>	 :D
[12:58:00] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:58:20] <hashar>	 the expected outcome is that `/etc/dsh/group/appserver` should have hosts defined
[12:58:33] <wikibugs>	 (PS5) Jbond: mariadb: Reference the actual VRTS passwords in the m2 grants file. [puppet] - https://gerrit.wikimedia.org/r/764744 (https://phabricator.wikimedia.org/T303272) (owner: Kormat)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T1300).
[13:00:05] <jouncebot>	 nemo-yiannis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:27] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34472/console" [puppet] - https://gerrit.wikimedia.org/r/764744 (https://phabricator.wikimedia.org/T303272) (owner: Kormat)
[13:01:24] <Lucas_WMDE>	 o/
[13:01:28] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:01:32] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:01:33] <wikibugs>	 (CR) Jbond: [V: +1 C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/764744 (https://phabricator.wikimedia.org/T303272) (owner: Kormat)
[13:02:47] <wikibugs>	 (PS1) Giuseppe Lavagetto: scap: fix dsh targets for php restarts [puppet] - https://gerrit.wikimedia.org/r/772822 (https://phabricator.wikimedia.org/T304414)
[13:03:16] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:03:31] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] scap: fix dsh targets for php restarts [puppet] - https://gerrit.wikimedia.org/r/772822 (https://phabricator.wikimedia.org/T304414) (owner: Giuseppe Lavagetto)
[13:04:02] <wikibugs>	 (PS6) Ladsgroup: mariadb: DRY username of wikiuser to hiera [puppet] - https://gerrit.wikimedia.org/r/770890
[13:06:32] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:07:52] <wikibugs>	 SRE, SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (jbond)
[13:07:56] <wikibugs>	 SRE, SRE-Access-Requests: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (jbond) @thcipriani are you able to approve @TThoabala membership of the deployment group  @Tchanders Sounds good to me, I'll get all the approvals in place and create the change...
[13:08:23] <wikibugs>	 (PS7) Ladsgroup: mariadb: DRY username of wikiuser to hiera [puppet] - https://gerrit.wikimedia.org/r/770890
[13:08:32] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] mariadb: DRY username of wikiuser to hiera [puppet] - https://gerrit.wikimedia.org/r/770890 (owner: Ladsgroup)
[13:08:42] <_joe_>	 hashar: fixed
[13:08:56] <_joe_>	 thanks for the analysis and sorry for the issue arising in the first place :/
[13:08:58] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:10:06] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:10:08] <hashar>	 _joe_: it happens :D
[13:10:32] <hashar>	 there are so many layers of config it is hard to figure it out entirely
[13:10:42] <hashar>	 conftool / hiera / dsh files / scap itself etc
[13:11:21] <wikibugs>	 (PS1) Jbond: admin: add tsepothoabala to deployment [puppet] - https://gerrit.wikimedia.org/r/772823 (https://phabricator.wikimedia.org/T303398)
[13:11:26] <hashar>	 if we ran the php opcache restart via scap, we surely would have noticed it
[13:11:52] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 10 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:12:56] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:13:00] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:14:50] <wikibugs>	 (CR) Jbond: [C: -1] "I'll -1 this until TsepoThoabala returns" [puppet] - https://gerrit.wikimedia.org/r/772823 (https://phabricator.wikimedia.org/T303398) (owner: Jbond)
[13:15:10] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:16:26] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (jbond) Open→Stalled Change to stalled until TsepoThoabala returns
[13:19:25] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudgw2002-dev.codfw.wmnet
[13:19:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:28] <logmsgbot>	 !log aborrero@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudgw2002-dev.codfw.wmnet
[13:19:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:01] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudgw2002-dev.codfw.wmnet
[13:20:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:20:35] <wikibugs>	 (PS30) Btullis: Add helm charts and a helmfile configuration for datahub [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454)
[13:21:09] <wikibugs>	 (PS16) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117)
[13:21:14] <hashar>	 we are promoting 1.39.0-wmf.3 to group 1
[13:21:44] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected: https://puppet-compiler.wmflabs.org/pcc-worker1001/34473/" [puppet] - https://gerrit.wikimedia.org/r/772818 (https://phabricator.wikimedia.org/T304420) (owner: Arturo Borrero Gonzalez)
[13:22:32] <wikibugs>	 (PS1) Jcrespo: Initial release [software/mediabackups] - https://gerrit.wikimedia.org/r/772824
[13:23:00] <wikibugs>	 (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[13:23:16] <wikibugs>	 (PS1) Jaime Nuche: group1 wikis to 1.39.0-wmf.3  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/772826
[13:23:18] <wikibugs>	 (CR) Jaime Nuche: [C: +2] group1 wikis to 1.39.0-wmf.3  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/772826 (owner: Jaime Nuche)
[13:23:58] <wikibugs>	 (CR) Btullis: Add helm charts and a helmfile configuration for datahub (12 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/764375 (https://phabricator.wikimedia.org/T301454) (owner: Btullis)
[13:24:08] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.39.0-wmf.3  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/772826 (owner: Jaime Nuche)
[13:24:14] <wikibugs>	 (PS1) Majavah: update cloud-vps bastion ip to bastion-eqiad1-03 (bullseye) [puppet] - https://gerrit.wikimedia.org/r/772827
[13:24:26] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:25:37] <wikibugs>	 (PS2) Jcrespo: Initial release [software/mediabackups] - https://gerrit.wikimedia.org/r/772824 (https://phabricator.wikimedia.org/T276445)
[13:25:54] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2002-dev.codfw.wmnet
[13:25:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:02] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudgw2001-dev.codfw.wmnet
[13:26:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:07] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.3  refs T300203
[13:26:10] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:26:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:12] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[13:26:38] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:27:00] <logmsgbot>	 !log jnuche@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.3  refs T300203 (duration: 00m 52s)
[13:27:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:52] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudmetrics1003.eqiad.wmnet
[13:27:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:04] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:29:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:29:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:06] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:30:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:30:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:30:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:31:09] <wikibugs>	 (PS1) Ottomata: Add 2 new alerts for data-engineering gobblin [alerts] - https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503)
[13:31:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:31:56] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (cmooney) @valerio.bozzolan the affected users are direct Telecom Italia customers is that correct?  It certainly wouldn't hurt if th...
[13:32:38] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add 2 new alerts for data-engineering gobblin [alerts] - https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: Ottomata)
[13:33:00] <wikibugs>	 (CR) MVernon: swift: deploy swift_ring_manager to one node per cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[13:33:58] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw2001-dev.codfw.wmnet
[13:34:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:52] <wikibugs>	 (PS2) Ottomata: Add 2 new alerts for data-engineering gobblin [alerts] - https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503)
[13:35:06] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:35:40] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudmetrics1003.eqiad.wmnet
[13:35:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:50] <jnuche>	 promoting  1.39.0-wmf.3 to group 2 now
[13:36:09] <wikibugs>	 (PS1) Jaime Nuche: all wikis to 1.39.0-wmf.3  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/772830
[13:36:11] <wikibugs>	 (CR) Jaime Nuche: [C: +2] all wikis to 1.39.0-wmf.3  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/772830 (owner: Jaime Nuche)
[13:36:26] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudmetrics1004.eqiad.wmnet
[13:36:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:49] <wikibugs>	 (Merged) jenkins-bot: all wikis to 1.39.0-wmf.3  refs T300203 [mediawiki-config] - https://gerrit.wikimedia.org/r/772830 (owner: Jaime Nuche)
[13:37:01] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add 2 new alerts for data-engineering gobblin [alerts] - https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: Ottomata)
[13:37:36] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:38:54] <wikibugs>	 (PS3) Jcrespo: Initial release [software/mediabackups] - https://gerrit.wikimedia.org/r/772824 (https://phabricator.wikimedia.org/T276445)
[13:39:55] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.3  refs T300203
[13:40:05] <jnuche>	 _joe_: 13:39:27 Running '/usr/local/sbin/check-and-restart-php php7.2-fpm 100' on 347 host(s)
[13:40:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:09] <hashar>	 \o/
[13:40:10] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[13:40:15] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet
[13:40:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:16] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:41:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:41] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[13:41:43] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[13:41:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1181 (T298557)', diff saved to https://phabricator.wikimedia.org/P22960 and previous config saved to /var/cache/conftool/dbconfig/20220322-134148-marostegui.json
[13:41:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:52] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[13:42:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:42:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:42:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:15] <wikibugs>	 (PS4) Jgiannelos: Remove unused wgKartographerDfltStyle after tegola roll out [mediawiki-config] - https://gerrit.wikimedia.org/r/772428 (https://phabricator.wikimedia.org/T298249)
[13:43:18] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:43:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:43:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:46] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:44:07] <wikibugs>	 (PS3) Ottomata: Add 2 new alerts for data-engineering gobblin [alerts] - https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503)
[13:44:34] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudmetrics1004.eqiad.wmnet
[13:44:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:44:43] <wikibugs>	 (PS1) Jcrespo: mediabackups: Install package and its dependencies through .deb [puppet] - https://gerrit.wikimedia.org/r/772831 (https://phabricator.wikimedia.org/T300020)
[13:45:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] mediabackups: Install package and its dependencies through .deb [puppet] - https://gerrit.wikimedia.org/r/772831 (https://phabricator.wikimedia.org/T300020) (owner: Jcrespo)
[13:46:11] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet
[13:46:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:46:34] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add 2 new alerts for data-engineering gobblin [alerts] - https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: Ottomata)
[13:47:57] <wikibugs>	 (PS2) Jcrespo: mediabackups: Install package and its dependencies through .deb [puppet] - https://gerrit.wikimedia.org/r/772831 (https://phabricator.wikimedia.org/T300020)
[13:49:00] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:49:26] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:49:49] <wikibugs>	 (PS3) Ssingh: P:icinga: add profile for performance tweaking [puppet] - https://gerrit.wikimedia.org/r/771610 (https://phabricator.wikimedia.org/T303593)
[13:50:04] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:51:19] <wikibugs>	 (CR) Ssingh: P:icinga: add profile for performance tweaking (1 comment) [puppet] - https://gerrit.wikimedia.org/r/771610 (https://phabricator.wikimedia.org/T303593) (owner: Ssingh)
[13:51:32] <wikibugs>	 (CR) Ssingh: "rebased and added bug #, no code change." [puppet] - https://gerrit.wikimedia.org/r/771610 (https://phabricator.wikimedia.org/T303593) (owner: Ssingh)
[13:52:07] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet
[13:52:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:56] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:56:09] <wikibugs_>	 (CR) Andrew Bogott: [C: +2] update cloud-vps bastion ip to bastion-eqiad1-03 (bullseye) [puppet] - https://gerrit.wikimedia.org/r/772827 (owner: Majavah)
[13:57:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (6) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[13:57:04] <wikibugs>	 SRE, SRE-OnFire (FY2021/2022-Q3), Infrastructure-Foundations, SRE Observability (FY2021/2022-Q3): Implement an accurate and easy to understand status page for all wikis - https://phabricator.wikimedia.org/T202061 (Elitre) >>! In T202061#7774033, @CDanis wrote: > @lmata yeah, sorry, that's been on...
[13:58:31] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet
[13:58:38] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:58:40] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:59:06] <wikibugs>	 (CR) Elukey: [C: +1] hiera: add dummy tokens for ML staging k8s setup [labs/private] - https://gerrit.wikimedia.org/r/772430 (owner: Klausman)
[14:01:08] <wikibugs>	 (CR) Jcrespo: [C: -1] "Not until a definitive (even if first iteration) package is ready and uploaded." [puppet] - https://gerrit.wikimedia.org/r/772831 (https://phabricator.wikimedia.org/T300020) (owner: Jcrespo)
[14:01:31] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] update cloud-vps bastion ip to bastion-eqiad1-03 (bullseye) [puppet] - https://gerrit.wikimedia.org/r/772827 (owner: Majavah)
[14:03:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298557)', diff saved to https://phabricator.wikimedia.org/P22961 and previous config saved to /var/cache/conftool/dbconfig/20220322-140331-marostegui.json
[14:07:10] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:07:12] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:09:48] <wikibugs>	 SRE, serviceops, Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (Krinkle)
[14:10:00] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:10:31] <wikibugs>	 SRE, serviceops, Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (Krinkle)
[14:11:03] <wikibugs>	 SRE, serviceops, Wikimedia-production-error: PHP7 corruption reports in 2020-2022 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (Krinkle)
[14:11:08] <wikibugs>	 (CR) Andrew Bogott: [C: +1] openstack: networktests: discard even more hostkey checking stuff [puppet] - https://gerrit.wikimedia.org/r/772818 (https://phabricator.wikimedia.org/T304420) (owner: Arturo Borrero Gonzalez)
[14:12:22] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:13:52] <wikibugs>	 (PS4) Ottomata: Add 2 new alerts for data-engineering gobblin [alerts] - https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503)
[14:15:45] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1 C: +2] openstack: networktests: discard even more hostkey checking stuff [puppet] - https://gerrit.wikimedia.org/r/772818 (https://phabricator.wikimedia.org/T304420) (owner: Arturo Borrero Gonzalez)
[14:16:14] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add 2 new alerts for data-engineering gobblin [alerts] - https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: Ottomata)
[14:18:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P22962 and previous config saved to /var/cache/conftool/dbconfig/20220322-141836-marostegui.json
[14:18:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:19:27] <wikibugs>	 (PS1) Ladsgroup: mariadb: Change wikiuser to wikiuser2022 [puppet] - https://gerrit.wikimedia.org/r/772834
[14:19:31] <wikibugs>	 (PS5) Ottomata: Add 2 new alerts for data-engineering gobblin [alerts] - https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503)
[14:19:54] <wikibugs>	 (PS4) Klausman: hiera: add dummy tokens for ML staging k8s setup [labs/private] - https://gerrit.wikimedia.org/r/772430
[14:19:59] <wikibugs>	 (CR) Klausman: [C: +2] hiera: add dummy tokens for ML staging k8s setup [labs/private] - https://gerrit.wikimedia.org/r/772430 (owner: Klausman)
[14:21:50] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add 2 new alerts for data-engineering gobblin [alerts] - https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: Ottomata)
[14:21:56] <wikibugs>	 (CR) Klausman: [V: +2 C: +2] hiera: add dummy tokens for ML staging k8s setup [labs/private] - https://gerrit.wikimedia.org/r/772430 (owner: Klausman)
[14:22:02] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: Change wikiuser to wikiuser2022 [puppet] - https://gerrit.wikimedia.org/r/772834 (owner: Ladsgroup)
[14:22:16] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3226 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[14:22:27] <wikibugs>	 (CR) Ladsgroup: [C: +2] mariadb: Change wikiuser to wikiuser2022 [puppet] - https://gerrit.wikimedia.org/r/772834 (owner: Ladsgroup)
[14:23:01] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/771610 (https://phabricator.wikimedia.org/T303593) (owner: Ssingh)
[14:25:02] <wikibugs>	 (PS6) MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117)
[14:25:20] <wikibugs>	 (PS3) Elukey: Initial debianization of istio-cni [debs/istio] - https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612)
[14:26:07] <wikibugs>	 (PS6) Ottomata: Add 2 new alerts for data-engineering gobblin [alerts] - https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503)
[14:27:05] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (cmooney) Hmm ok.  I can see in the traceroute it now makes it a few hops further: ` cmooney@re0.cr2-eqiad> traceroute wait 1 no-reso...
[14:27:10] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:27:15] <wikibugs>	 (CR) Filippo Giunchedi: swift: deploy swift_ring_manager to one node per cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[14:27:58] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3871 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[14:28:19] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add 2 new alerts for data-engineering gobblin [alerts] - https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: Ottomata)
[14:29:26] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 15 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:29:32] <wikibugs>	 SRE, Cloud-VPS, Infrastructure-Foundations, netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (cmooney) Hmm ok.  I can see in the traceroute it now makes it a few hops further: ` cmooney@re0.cr2-eqiad> traceroute wait 1 no-reso...
[14:30:13] <wikibugs>	 SRE, ops-eqsin, DC-Ops: Q2(Need By: TBD) rack/setup/install new mr1-eqsin - https://phabricator.wikimedia.org/T294872 (ayounsi) The SRX300 is ready to be put in production.  Because the way it was staged, it will need a small config change (renumber irb.900 from 10.132.128.3 to 10.132.128.1) for devi...
[14:33:38] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.03226 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[14:33:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P22963 and previous config saved to /var/cache/conftool/dbconfig/20220322-143341-marostegui.json
[14:33:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:35:22] <wikibugs>	 SRE, Wikimedia-Site-requests, Chinese-Sites: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (Stang)
[14:35:36] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:35:38] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:35:45] <wikibugs>	 (PS7) MVernon: puppetmaster: rsync swift rings from each cluster's ring manager [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117)
[14:36:06] <wikibugs>	 SRE, Wikimedia-Site-requests, Chinese-Sites: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (Stang) Consensus reached hundreds of years ago, removing tag
[14:37:59] <wikibugs>	 (PS1) Hashar: Merge tag 'v3.3.10' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - https://gerrit.wikimedia.org/r/772838 (https://phabricator.wikimedia.org/T304226)
[14:38:13] <wikibugs>	 (CR) Hashar: [C: +2] Merge tag 'v3.3.10' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - https://gerrit.wikimedia.org/r/772838 (https://phabricator.wikimedia.org/T304226) (owner: Hashar)
[14:38:22] <wikibugs>	 (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[14:40:32] <wikibugs>	 (PS7) Ottomata: Add 2 new alerts for data-engineering gobblin [alerts] - https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503)
[14:40:53] <wikibugs>	 (PS1) David Caro: wmcs.backy2: add link to the runbook for backup_vms [puppet] - https://gerrit.wikimedia.org/r/772839 (https://phabricator.wikimedia.org/T304408)
[14:41:28] <wikibugs>	 (PS17) MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117)
[14:44:47] <wikibugs>	 (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[14:45:36] <wikibugs>	 (CR) MVernon: swift: deploy swift_ring_manager to one node per cluster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: MVernon)
[14:45:56] <wikibugs>	 (CR) Ssingh: [C: +1] site: Reimage cp1079 as cache::text_haproxy [puppet] - https://gerrit.wikimedia.org/r/772793 (https://phabricator.wikimedia.org/T290005) (owner: MMandere)
[14:46:26] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:47:05] <wikibugs>	 (03PS18) 10MVernon: swift: deploy swift_ring_manager to one node per cluster [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117)
[14:48:44] <wikibugs>	 (03Merged) 10jenkins-bot: Merge tag 'v3.3.10' into wmf/stable-3.3 [software/gerrit] (wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/772838 (https://phabricator.wikimedia.org/T304226) (owner: 10Hashar)
[14:48:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T298557)', diff saved to https://phabricator.wikimedia.org/P22964 and previous config saved to /var/cache/conftool/dbconfig/20220322-144847-marostegui.json
[14:48:48] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[14:48:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[14:48:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:52] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[14:48:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298557)', diff saved to https://phabricator.wikimedia.org/P22965 and previous config saved to /var/cache/conftool/dbconfig/20220322-144855-marostegui.json
[14:48:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:18] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon)
[14:49:48] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:49:50] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:50:43] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] P:icinga: add profile for performance tweaking [puppet] - 10https://gerrit.wikimedia.org/r/771610 (https://phabricator.wikimedia.org/T303593) (owner: 10Ssingh)
[14:53:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: 10Ottomata)
[14:53:23] <wikibugs>	 (03PS1) 10Ayounsi: Setup new mr1-eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/772843 (https://phabricator.wikimedia.org/T294872)
[14:53:51] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Add 2 new alerts for data-engineering gobblin [alerts] - 10https://gerrit.wikimedia.org/r/772829 (https://phabricator.wikimedia.org/T286503) (owner: 10Ottomata)
[14:54:13] <wikibugs>	 (03PS2) 10Ayounsi: Setup new mr1-eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/772843 (https://phabricator.wikimedia.org/T294872)
[14:54:54] <wikibugs>	 (03PS6) 10JMeybohm: Switch service type to ClusterIP in case Ingress is enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/770556 (https://phabricator.wikimedia.org/T290966)
[14:57:26] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:57:42] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstac: networktests: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/772844
[14:58:47] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] openstac: networktests: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/772844 (owner: 10Arturo Borrero Gonzalez)
[14:59:25] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM modulo testing the CR chain in Pontoon" [puppet] - 10https://gerrit.wikimedia.org/r/769941 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon)
[14:59:33] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM modulo testing the CR chain in Pontoon" [puppet] - 10https://gerrit.wikimedia.org/r/769942 (https://phabricator.wikimedia.org/T265117) (owner: 10MVernon)
[14:59:47] <wikibugs>	 (03PS11) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195)
[15:00:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I'll let Cole vote though since I'm not super familiar with the changes, idea LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/772788 (https://phabricator.wikimedia.org/T300130) (owner: 10Elukey)
[15:00:44] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:01:06] <wikibugs>	 (03PS1) 10Hashar: Update Gerrit to v3.3.10 [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/772846 (https://phabricator.wikimedia.org/T304226)
[15:01:40] <wikibugs>	 (03PS12) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195)
[15:02:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (6) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:02:19] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] "git fat works!" [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/772846 (https://phabricator.wikimedia.org/T304226) (owner: 10Hashar)
[15:02:42] <wikibugs>	 (03Merged) 10jenkins-bot: Update Gerrit to v3.3.10 [software/gerrit] (deploy/wmf/stable-3.3) - 10https://gerrit.wikimedia.org/r/772846 (https://phabricator.wikimedia.org/T304226) (owner: 10Hashar)
[15:05:12] <wikibugs>	 (03Abandoned) 10SBassett: admin: replace existing ssh key for sbassett [puppet] - 10https://gerrit.wikimedia.org/r/772410 (https://phabricator.wikimedia.org/T304319) (owner: 10SBassett)
[15:06:22] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@967b0d7]: Gerrit to 3.3.10 on gerrit2001 T304226
[15:06:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:27] <stashbot>	 T304226: Gerrit security release 3.3.10 - https://phabricator.wikimedia.org/T304226
[15:06:35] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@967b0d7]: Gerrit to 3.3.10 on gerrit2001 T304226 (duration: 00m 12s)
[15:06:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:06:39] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1003.eqiad.wmnet with OS bullseye
[15:06:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:08:05] <wikibugs>	 (03PS9) 10Btullis: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi)
[15:08:52] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/2 UP : 3 v2 P2P interfaces vs. 2 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:10:41] <hashar>	 !log Upgrading and starting Gerrit on gerrit2001 (replica)
[15:10:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:05] <hashar>	 jouncebot: now
[15:13:05] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 46 minute(s)
[15:13:46] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@967b0d7]: Gerrit to 3.3.10 on gerrit1001 T304226
[15:13:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:13:50] <stashbot>	 T304226: Gerrit security release 3.3.10 - https://phabricator.wikimedia.org/T304226
[15:13:56] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@967b0d7]: Gerrit to 3.3.10 on gerrit1001 T304226 (duration: 00m 10s)
[15:13:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:14:14] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:14:31] <hashar>	 !log Stopping Gerrit for security update T304226
[15:14:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:04] <hashar>	 !log Gerrit 3.3.10 up and running T304226
[15:17:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:19:30] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:21:22] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10Papaul) I asked @Cmjohnson to connect cloudvirt1024 to asw2-b4 yesterday for testing, the result was the same  ` Failed to load ld...
[15:21:46] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Add dummy secrets for datahub deployment [labs/private] - 10https://gerrit.wikimedia.org/r/771563 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis)
[15:22:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1174 (re)pooling @ 10%: After reboot', diff saved to https://phabricator.wikimedia.org/P22967 and previous config saved to /var/cache/conftool/dbconfig/20220322-152247-root.json
[15:22:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:02] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[15:25:04] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[15:25:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:09] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T298557)', diff saved to https://phabricator.wikimedia.org/P22968 and previous config saved to /var/cache/conftool/dbconfig/20220322-152508-marostegui.json
[15:25:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:15] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[15:26:08] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 14 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:26:34] <wikibugs>	 (03PS1) 10Klausman: hiera: Add k8s dummy tokens for ML staging env [labs/private] - 10https://gerrit.wikimedia.org/r/772866 (https://phabricator.wikimedia.org/T302195)
[15:29:42] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1003.eqiad.wmnet with reason: host reimage
[15:29:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:29:54] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] hiera: Add k8s dummy tokens for ML staging env [labs/private] - 10https://gerrit.wikimedia.org/r/772866 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman)
[15:30:00] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] hiera: Add k8s dummy tokens for ML staging env [labs/private] - 10https://gerrit.wikimedia.org/r/772866 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman)
[15:30:04] <wikibugs>	 (03CR) 10Btullis: karapace: add karapace role (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi)
[15:30:05] <wikibugs>	 (03PS13) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195)
[15:30:44] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Add a kubeconfig configuration for datahub [puppet] - 10https://gerrit.wikimedia.org/r/771407 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis)
[15:30:46] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:32:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (6) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:32:08] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:32:14] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Feel free to ping me when you're ready to merge/deploy this (if you feel like you want somebody around)." [deployment-charts] - 10https://gerrit.wikimedia.org/r/771409 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis)
[15:33:13] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1003.eqiad.wmnet with reason: host reimage
[15:33:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:35:12] <wikibugs>	 (03PS10) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562)
[15:36:21] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:36:32] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add a kubeconfig configuration for datahub [puppet] - 10https://gerrit.wikimedia.org/r/771407 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis)
[15:38:03] <wikibugs>	 (03PS11) 10Razzi: karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562)
[15:38:15] <wikibugs>	 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Ladsgroup) The code to do db switchover is https://github.com/wikimedia...
[15:38:55] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:39:26] <wikibugs>	 (03CR) 10Btullis: [V: 03+2 C: 03+2] Add dummy secrets for datahub deployment [labs/private] - 10https://gerrit.wikimedia.org/r/771563 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis)
[15:39:51] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:42:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr)
[15:42:19] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.7419 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[15:42:29] <wikibugs>	 (03PS14) 10Klausman: hiera: Add ML staging k8s ctrl node config [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195)
[15:43:22] <wikibugs>	 (03PS1) 10Cathal Mooney: Add new Analytics subnets to static Capirca net definitions [homer/public] - 10https://gerrit.wikimedia.org/r/772868 (https://phabricator.wikimedia.org/T299758)
[15:43:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298557)', diff saved to https://phabricator.wikimedia.org/P22969 and previous config saved to /var/cache/conftool/dbconfig/20220322-154349-marostegui.json
[15:43:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:54] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[15:45:02] <wikibugs>	 10SRE, 10SRE Observability, 10observability, 10Sustainability (Incident Followup), 10User-fgiunchedi: Most Icinga http checks ignore the URL parameter - https://phabricator.wikimedia.org/T304321 (10fgiunchedi) Thank you for the feedback!   >>! In T304321#7795644, @Volans wrote: > I agree with this direct...
[15:45:18] <wikibugs>	 10SRE-tools, 10DBA, 10Infrastructure-Foundations, 10Patch-For-Review, and 2 others: Create or modify an existing tool that quickly shows the db replication status in case of master failure - https://phabricator.wikimedia.org/T281249 (10Marostegui) That's the main thing and what {T196366} also needs. The di...
[15:46:10] <wikibugs>	 (03PS1) 10Filippo Giunchedi: nagios_common: remove -C from check_http [puppet] - 10https://gerrit.wikimedia.org/r/772869 (https://phabricator.wikimedia.org/T304321)
[15:47:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (6) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:47:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10Cmjohnson) @Vgutierrez the new DIMM is here, please let me know when I can make the swap
[15:48:06] <wikibugs>	 (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34482/console" [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi)
[15:48:23] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/4 UP : 5 v2 P2P interfaces vs. 4 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:50:37] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:50:59] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi)
[15:51:02] <wikibugs>	 (03CR) 10Razzi: [V: 03+1 C: 03+2] karapace: add karapace role [puppet] - 10https://gerrit.wikimedia.org/r/771419 (https://phabricator.wikimedia.org/T301562) (owner: 10Razzi)
[15:51:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:53:01] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3548 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[15:53:12] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add a namespace for datahub [deployment-charts] - 10https://gerrit.wikimedia.org/r/771409 (https://phabricator.wikimedia.org/T303049) (owner: 10Btullis)
[15:54:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:54:44] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1003.eqiad.wmnet with OS bullseye
[15:54:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:05] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[15:56:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:56:56] <wikibugs>	 (03PS1) 10Sergio Gimeno: beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825)
[15:57:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (5) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[15:57:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno)
[15:58:31] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:58:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P22970 and previous config saved to /var/cache/conftool/dbconfig/20220322-155854-marostegui.json
[15:58:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:16] <wikibugs>	 (03PS2) 10Sergio Gimeno: beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825)
[15:59:36] <moritzm>	 !log imported jvmquake 1.0.1 for stretch/buster (JDK8) and bullseye (JDK11)
[15:59:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:04] <jouncebot>	 jbond and rzl: Time to snap out of that daydream and deploy Puppet request window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T1600).
[16:00:04] <jouncebot>	 taavi: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:11] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno)
[16:00:15] <rzl>	 taavi: 👋 looking
[16:00:46] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1003.eqiad.wmnet with OS bullseye
[16:00:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:01:19] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:02:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (5) Elasticsearch instance cloudelastic1001-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:02:05] <taavi>	 o/ hey rzl 
[16:02:12] <rzl>	 _joe_: if you're around, do you have a moment to look at https://gerrit.wikimedia.org/r/724049 for the puppet request window? I want to make sure your -1 is addressed
[16:02:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10Cmjohnson) 05Open→03Resolved Received the DIMM and replaced it, resolving this task
[16:02:43] <rzl>	 taavi: looking in the meantime but I'll be a sec to chew through these regexes myself :)
[16:03:00] <wikibugs>	 (03PS3) 10Sergio Gimeno: beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825)
[16:03:21] <taavi>	 I know my patch may not be the exact fit to this window per https://wikitech.wikimedia.org/wiki/Puppet_request_window#What_kind_of_patches_can_go_through_Puppet_request_windows? but I don't see any other way to push that patch forward, sorry :-/
[16:03:43] <_joe_>	 rzl: oof
[16:03:49] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno)
[16:03:53] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.06452 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[16:04:03] <_joe_>	 rzl: not really time actually to re-vet that
[16:04:37] <_joe_>	 taavi: sadly our team is down to 3 people and a bit thin on resources; if people need us to merge patches that are not immediate blockers they'll have to wait.
[16:05:02] <wikibugs>	 (03PS1) 10Klausman: labs: Add dummy keyfile for ML staging k8s in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/772871 (https://phabricator.wikimedia.org/T302195)
[16:05:59] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] labs: Add dummy keyfile for ML staging k8s in codfw [labs/private] - 10https://gerrit.wikimedia.org/r/772871 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman)
[16:07:02] <rzl>	 _joe_: ack, thanks for checking -- taavi: sorry, I probably can't get this merged in the puppet window but I'll keep it on my radar, and give it a proper review as soon as time permits
[16:07:09] <wikibugs>	 (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34485/console" [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman)
[16:07:09] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudnet1003.eqiad.wmnet with OS bullseye
[16:07:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:26] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1003.eqiad.wmnet with OS bullseye
[16:07:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:36] <taavi>	 :/ fair enough, thanks anyways
[16:07:54] <_joe_>	 taavi: I'll put that patch at the end of my current queue of smaller things I can do in the leftover time though
[16:08:17] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:08:20] <taavi>	 <3
[16:08:24] <wikibugs>	 (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34486/console" [puppet] - 10https://gerrit.wikimedia.org/r/772417 (https://phabricator.wikimedia.org/T302195) (owner: 10Klausman)
[16:09:14] <logmsgbot>	 !log btullis@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[16:09:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:09:19] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:11:33] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: cp1085 memory errors on DIMM A5 - https://phabricator.wikimedia.org/T303183 (10Cmjohnson)
[16:11:48] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[16:11:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:25] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[16:13:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:46] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1003.eqiad.wmnet with reason: host reimage
[16:13:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P22971 and previous config saved to /var/cache/conftool/dbconfig/20220322-161359-marostegui.json
[16:14:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:11] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] Initial debianization of istio-cni [debs/istio] - 10https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey)
[16:15:06] <wikibugs>	 10SRE, 10Analytics, 10Data-Engineering: Also intake Network Error Logging events into the Analytics Data Lake - https://phabricator.wikimedia.org/T304373 (10jbond) p:05Triage→03Medium
[16:15:35] <wikibugs>	 10SRE, 10Thumbor, 10Service-deployment-requests: New Service Request Wikimedia-Thumbor - https://phabricator.wikimedia.org/T304436 (10WDoranWMF)
[16:15:49] <wikibugs>	 10SRE, 10Thumbor, 10Service-deployment-requests: New Service Request Wikimedia-Thumbor - https://phabricator.wikimedia.org/T304436 (10WDoranWMF)
[16:16:44] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[16:16:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:51] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1003.eqiad.wmnet with reason: host reimage
[16:16:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:25] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudnet1003.eqiad.wmnet with OS bullseye
[16:17:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:42] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[16:17:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:06] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudnet1003.eqiad.wmnet with OS bullseye
[16:18:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:20] <icinga-wm>	 PROBLEM - Check systemd state on karapace1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:56] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1003.eqiad.wmnet with reason: host reimage
[16:18:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:19:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Seems sane (to the extent possible :-), two nits inline." [debs/istio] - 10https://gerrit.wikimedia.org/r/771670 (https://phabricator.wikimedia.org/T297612) (owner: 10Elukey)
[16:19:40] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (10BTullis) 05Open→03Resolved
[16:19:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1158 - https://phabricator.wikimedia.org/T303910 (10Cmjohnson) 05Open→03Resolved The SSD has been replaced and is rebuilding.
[16:20:13] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Data-Engineering-Kanban: Requesting access to DataEngineering Team Resources for NOkafor - https://phabricator.wikimedia.org/T303516 (10BTullis)
[16:22:33] <wikibugs>	 (03PS2) 10Zabe: Stop writing to $wmfDatacenter(s) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771469 (https://phabricator.wikimedia.org/T45956)
[16:22:45] <wikibugs>	 (03PS3) 10Zabe: Stop writing to wmf* constants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/768255 (https://phabricator.wikimedia.org/T45956)
[16:23:40] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1003.eqiad.wmnet with reason: host reimage
[16:23:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:54] <logmsgbot>	 !log razzi@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on karapace1001.eqiad.wmnet with reason: Setting up karapace for the first time
[16:27:56] <logmsgbot>	 !log razzi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on karapace1001.eqiad.wmnet with reason: Setting up karapace for the first time
[16:27:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:04] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T298557)', diff saved to https://phabricator.wikimedia.org/P22972 and previous config saved to /var/cache/conftool/dbconfig/20220322-162904-marostegui.json
[16:29:06] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[16:29:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[16:29:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:09] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[16:29:09] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[16:29:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:12] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[16:29:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T298557)', diff saved to https://phabricator.wikimedia.org/P22973 and previous config saved to /var/cache/conftool/dbconfig/20220322-162917-marostegui.json
[16:29:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:04] <icinga-wm>	 RECOVERY - Check systemd state on karapace1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:23] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10BBlack) Arzhel and I discussed this a bit, and we're going to add a few more countries manually for now before proceeding with the esams-resiliency...
[16:30:49] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1003.eqiad.wmnet with OS bullseye
[16:30:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:13] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netops, 10Patch-For-Review: drmrs: initial geodns configuration - https://phabricator.wikimedia.org/T304089 (10BBlack)
[16:33:41] <wikibugs>	 (03PS1) 10BBlack: map Portugal to drmrs [dns] - 10https://gerrit.wikimedia.org/r/772876 (https://phabricator.wikimedia.org/T304089)
[16:35:14] <ebernhardson>	 !log T303548 start wikidatawiki reindexing on eqiad codfw and cloudelastic cirrus clusters
[16:35:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:18] <stashbot>	 T303548: CirrusSearchIndexTooOld - https://phabricator.wikimedia.org/T303548
[16:39:04] <icinga-wm>	 PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:39:10] <wikibugs>	 (03PS1) 10Btullis: Reenable the sflow job [puppet] - 10https://gerrit.wikimedia.org/r/772877 (https://phabricator.wikimedia.org/T302263)
[16:40:32] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (10KFrancis) @jbond I am confirming the signed NDA.  Please proceed with the access request.  Thanks!
[16:41:20] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] map Portugal to drmrs [dns] - 10https://gerrit.wikimedia.org/r/772876 (https://phabricator.wikimedia.org/T304089) (owner: 10BBlack)
[16:42:01] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[17:06:11] <wikibugs>	 (03PS1) 10Jcrespo: mediabackups: Add reference key for file decryption on recovery config [puppet] - 10https://gerrit.wikimedia.org/r/772885 (https://phabricator.wikimedia.org/T300020)
[17:07:24] <wikibugs>	 (03PS2) 10Jcrespo: mediabackups: Add reference to key for decryption on recovery config too [puppet] - 10https://gerrit.wikimedia.org/r/772885 (https://phabricator.wikimedia.org/T300020)
[17:07:41] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to releaser for MarkAHershberger - https://phabricator.wikimedia.org/T302287 (10jbond) >>! In T302287#7797378, @KFrancis wrote: > @jbond I am confirming the signed NDA.  Please proceed with the access request.  Thanks!  thanks :)  @MarkAHershberger  as a voluntee...
[17:08:19] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09677 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[17:08:29] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mediabackups: Add reference to key for decryption on recovery config too [puppet] - 10https://gerrit.wikimedia.org/r/772885 (https://phabricator.wikimedia.org/T300020) (owner: 10Jcrespo)
[17:09:19] <taavi>	 jouncebot: nowandnext
[17:09:20] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 50 minute(s)
[17:09:20] <jouncebot>	 In 0 hour(s) and 50 minute(s): 🚂🧪Trainsperiment Week Deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T1800)
[17:09:26] * taavi deploys a sec patch
[17:10:34] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1004.eqiad.wmnet with reason: host reimage
[17:10:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:51] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34487/console" [puppet] - 10https://gerrit.wikimedia.org/r/772877 (https://phabricator.wikimedia.org/T302263) (owner: 10Btullis)
[17:14:19] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1004.eqiad.wmnet with reason: host reimage
[17:14:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:12] <taavi>	 !log deploy security patch for T304354
[17:15:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:17:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P22979 and previous config saved to /var/cache/conftool/dbconfig/20220322-171748-marostegui.json
[17:17:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:18:20] <wikibugs>	 (03PS7) 10Eigyan: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956)
[17:18:34] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ayounsi) So on the `Failed to load ldlinux.c32`:  I got cloudvirt1024 to boot the debian installer using: ` install1003:~$ cat /etc/dhcp/automatio...
[17:19:36] <wikibugs>	 (03CR) 10Eigyan: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan)
[17:20:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[17:20:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:35] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[17:21:36] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[17:21:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[17:22:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:07] <brennen>	 !log trainsperiment (T300203): with 1.39.0-wmf.3 on all wikis, we're paused for a planned catchup window - nothing to do at the moment, we'll deploy 1.39.0-wmf.4 tomorrow (2022-03-23).
[17:25:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:11] <stashbot>	 T300203: 🧪🚂 Trainsperiment Week: 1.39.0-wmf.1, 1.39.0-wmf.2, 1.39.0-wmf.3, 1.39.0-wmf.4 deployment blockers - https://phabricator.wikimedia.org/T300203
[17:25:17] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10jbond)
[17:25:24] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10jbond) @Sgs the analytics users group is now deprecated.  i believe you will need analytics-privatedata-users with kerberos access, @Ottomata should be able to both confirm and approve this.  P...
[17:25:43] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add new Analytics subnets to static Capirca net definitions [homer/public] - 10https://gerrit.wikimedia.org/r/772868 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney)
[17:25:47] <wikibugs>	 (03PS4) 10Jcrespo: Initial release of mediabackups software [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772824 (https://phabricator.wikimedia.org/T276445)
[17:26:01] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Reenable the sflow job [puppet] - 10https://gerrit.wikimedia.org/r/772877 (https://phabricator.wikimedia.org/T302263) (owner: 10Btullis)
[17:26:41] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10jbond) p:05Triage→03Medium
[17:32:22] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (10jbond) @KFrancis yes please, this still needs an NDA; the previous ticket relates to signing L2
[17:32:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T298557)', diff saved to https://phabricator.wikimedia.org/P22980 and previous config saved to /var/cache/conftool/dbconfig/20220322-173253-marostegui.json
[17:32:55] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[17:32:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[17:32:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:32:59] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[17:33:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22981 and previous config saved to /var/cache/conftool/dbconfig/20220322-173301-marostegui.json
[17:33:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:33:44] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (10valerio.bozzolan) Maybe totally unrelated, but maybe yes:  https://lists.wikimedia.org/hyperkitty/list/cloud@lists.wikimedia.org/thr...
[17:33:54] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Add new Analytics subnets to static Capirca net definitions [homer/public] - 10https://gerrit.wikimedia.org/r/772868 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney)
[17:34:23] <wikibugs>	 (03Merged) 10jenkins-bot: Add new Analytics subnets to static Capirca net definitions [homer/public] - 10https://gerrit.wikimedia.org/r/772868 (https://phabricator.wikimedia.org/T299758) (owner: 10Cathal Mooney)
[17:43:13] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to nda/logstash for User:TheDJ - https://phabricator.wikimedia.org/T304120 (10KFrancis) @TheDJ Please send your personal email and mailing address to me at kfrancis@wikimedia.org and I'll put together the agreement.  Thank you!
[17:45:38] <wikibugs>	 (03PS8) 10Jdlrobson: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan)
[17:45:42] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan)
[17:47:16] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1004.eqiad.wmnet with OS bullseye
[17:47:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:50:51] <logmsgbot>	 !log dcausse@deploy1002 Started scap: (no justification provided)
[17:50:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:51:53] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10Andrew) In case this is an additional data point: I just reimaged cloudnet1003 and cloudnet1004 without any pxe or image issues.
[17:51:55] <taavi>	 dcausse: ^^ hey, what's going on with that full scap?
[17:52:20] <dcausse>	 taavi: just wanted to deploy /srv/deployment/wikimedia/discovery/analytics
[17:52:48] <taavi>	 umh that's not going to do it, `scap sync-file` and `scap sync-world` are mediawiki specific
[17:53:09] <taavi>	 you're likely looking for `scap deploy`
[17:53:35] <dcausse>	 yes my bad totally messed that up
[17:54:05] <dcausse>	 cancelled it (it just Finished l10n-update)
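The distinction taavi draws above (`scap sync-file` and `scap sync-world` are MediaWiki-specific, while other repositories go through `scap deploy`) can be sketched roughly as follows. This is only an illustration: `pick_scap_command` and the `mediawiki_roots` argument are hypothetical names, not part of scap, and the example MediaWiki path in the usage note is made up.

```python
from pathlib import PurePosixPath
from typing import Iterable

# Subcommands named in the conversation above; the mapping is illustrative only.
MEDIAWIKI_COMMAND = "sync-world"
GENERIC_COMMAND = "deploy"


def pick_scap_command(repo: str, mediawiki_roots: Iterable[str]) -> str:
    """Return the scap subcommand that fits the checkout at `repo`.

    `mediawiki_roots` is a hypothetical collection of directories holding
    MediaWiki checkouts; every other path is treated as a `scap deploy` repo.
    """
    path = PurePosixPath(repo)
    for root in mediawiki_roots:
        root_path = PurePosixPath(root)
        # Match the root itself or anything nested below it.
        if path == root_path or root_path in path.parents:
            return MEDIAWIKI_COMMAND
    return GENERIC_COMMAND
```

With a made-up MediaWiki root such as `/srv/example-mediawiki`, calling `pick_scap_command("/srv/deployment/wikimedia/discovery/analytics", ["/srv/example-mediawiki"])` selects `deploy`, which matches what dcausse ran next.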
[17:55:43] <wikibugs>	 10SRE, 10Cloud-VPS, 10Infrastructure-Foundations, 10netops: Route problems from some gateways of Italy to WMCloud and Toolforge - https://phabricator.wikimedia.org/T304416 (10RhinosF1) That wasn't sent until way after your issues started nor were fixed.
[17:55:54] <logmsgbot>	 !log dcausse@deploy1002 Started deploy [wikimedia/discovery/analytics@c4d0736]: (no justification provided)
[17:55:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:50] <wikibugs>	 (03PS1) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909
[18:00:02] <wikibugs>	 (03PS2) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909
[18:00:05] <jouncebot>	 dancy, hashar, brennen, dduvall, jeena, and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for 🚂🧪Trainsperiment Week Deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T1800).
[18:01:10] <logmsgbot>	 !log dcausse@deploy1002 Finished deploy [wikimedia/discovery/analytics@c4d0736]: (no justification provided) (duration: 05m 16s)
[18:01:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:29] <wikibugs>	 (03CR) 10ArielGlenn: [C: 03+1] "Sounds great to me." [puppet] - 10https://gerrit.wikimedia.org/r/772335 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[18:04:01] <wikibugs>	 (03PS5) 10Jcrespo: Initial release of mediabackups software [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772824 (https://phabricator.wikimedia.org/T276445)
[18:04:11] <wikibugs>	 (03PS3) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909
[18:05:24] <wikibugs>	 (03PS4) 10Sergio Gimeno: beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825)
[18:05:26] <wikibugs>	 (03PS6) 10Jcrespo: Initial release of mediabackups software [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772824 (https://phabricator.wikimedia.org/T276445)
[18:06:00] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] Reenable the sflow job [puppet] - 10https://gerrit.wikimedia.org/r/772877 (https://phabricator.wikimedia.org/T302263) (owner: 10Btullis)
[18:06:59] <wikibugs>	 (03PS4) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909
[18:08:25] <wikibugs>	 (03CR) 10Jcrespo: [V: 03+2 C: 03+2] Initial release of mediabackups software [software/mediabackups] - 10https://gerrit.wikimedia.org/r/772824 (https://phabricator.wikimedia.org/T276445) (owner: 10Jcrespo)
[18:09:23] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10Ottomata) Approved.  But, @sgs can you edit the description and describe a little more what access you need?  See https://wikitech.wikimedia.org/wiki/Analytics/Data_access#What_access_should_I...
[18:13:54] <wikibugs>	 (03PS3) 10Jcrespo: mediabackups: Install package and its dependencies through .deb [puppet] - 10https://gerrit.wikimedia.org/r/772831 (https://phabricator.wikimedia.org/T300020)
[18:14:44] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] mediabackups: Install package and its dependencies through .deb [puppet] - 10https://gerrit.wikimedia.org/r/772831 (https://phabricator.wikimedia.org/T300020) (owner: 10Jcrespo)
[18:19:03] <wikibugs>	 (03PS1) 10Razzi: karapace: use karapace included python; set hostname [puppet] - 10https://gerrit.wikimedia.org/r/772912
[18:19:57] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10Sgs)
[18:20:48] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10Sgs) >>! In T304361#7797589, @jbond wrote: > @Sgs the analytics users group is now deprecated.  i believe you will need analytics-privatedata-users with kerberos access, @Ottomata should be ab...
[18:21:17] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: [WiP] Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342
[18:22:35] <wikibugs>	 (03PS5) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909
[18:25:20] <wikibugs>	 (03PS2) 10Razzi: karapace: use karapace included python; set hostname [puppet] - 10https://gerrit.wikimedia.org/r/772912 (https://phabricator.wikimedia.org/T301565)
[18:25:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22982 and previous config saved to /var/cache/conftool/dbconfig/20220322-182531-marostegui.json
[18:25:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:25:37] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[18:26:14] <wikibugs>	 (03CR) 10Razzi: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34492/console" [puppet] - 10https://gerrit.wikimedia.org/r/772912 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi)
[18:28:05] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence-Backup, 10media-backups, and 3 others: WMF media storage must be adequately backed up - https://phabricator.wikimedia.org/T262668 (10jcrespo)
[18:28:46] <wikibugs>	 10SRE, 10Data-Persistence-Backup, 10Goal, 10Patch-For-Review: Create a first release of the media backups automation tools - https://phabricator.wikimedia.org/T276445 (10jcrespo) 05Open→03Resolved Done: * https://phabricator.wikimedia.org/diffusion/OSMB/history/master/;v0.1 * https://github.com/wikimed...
[18:29:01] <wikibugs>	 (03PS6) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909
[18:30:03] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34493/console" [puppet] - 10https://gerrit.wikimedia.org/r/772909 (owner: 10Elukey)
[18:30:52] <razzi>	 !log remove old karapace1001 known hosts following reimage: `razzi@puppetmaster1001:~$ ssh-keygen -f "/etc/ssh/ssh_known_hosts" -R "karapace1001.eqiad.wmnet"`
[18:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:32:02] <wikibugs>	 (03PS7) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909
[18:32:46] <elukey>	 razzi: what are you trying to do? We shouldn't change that file on puppet masters
[18:33:48] <elukey>	 it gets populated after each puppet run, but in general it doesn't need to be touched
[18:33:52] <razzi>	 elukey: I reimaged a virtual machine following https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#Virtual_hosts, and it was giving a warning since it had the old hostname
[18:34:54] <icinga-wm>	 PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:35:26] <elukey>	 razzi: what warning are you talking about? (to understand what the problem is)
[18:35:39] <elukey>	 you need to clean the old host tls certificate first
[18:35:58] <elukey>	 and sign the new one after using install console
[18:36:16] <elukey>	 https://wikitech.wikimedia.org/wiki/Ganeti#Reinstall_/_Reimage_a_VM
[18:36:32] <elukey>	 and https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Manual_installation
[18:37:32] <razzi>	 elukey: here's the paste of where I got the warning, it was upon the /usr/local/bin/install_console https://phabricator.wikimedia.org/P22983
[18:40:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P22984 and previous config saved to /var/cache/conftool/dbconfig/20220322-184037-marostegui.json
[18:40:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:42:18] <icinga-wm>	 RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[18:42:18] <elukey>	 razzi: it is not a big problem if you get that warning, the root console is available. But you need to clean the old puppet host cert first, then use install console + run puppet (that generates a new csr from the vm to the puppetmaster), sign it on the puppetmaster to accept the new key, and finally you'll be able to run puppet on install console
[18:43:18] <elukey>	 eventually the new fingerprint will be available for the new node
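The order elukey describes for a VM reimage can be written down as a small checklist. This is a minimal sketch under stated assumptions: the step wording paraphrases the messages above, and `REIMAGE_STEPS` and `next_step` are hypothetical names, not an existing cookbook or tool.

```python
from typing import Optional, Sequence

# Ordered steps paraphrased from the conversation above (illustrative wording).
REIMAGE_STEPS = (
    "clean the old puppet host cert",
    "open install console and run puppet (generates a new CSR)",
    "sign the new cert on the puppetmaster",
    "run puppet again from install console",
)


def next_step(done: Sequence[str]) -> Optional[str]:
    """Return the next step to perform, or None when all steps are done.

    Raises ValueError if `done` does not follow the order in REIMAGE_STEPS.
    """
    if len(done) > len(REIMAGE_STEPS):
        raise ValueError("more steps reported than exist in the checklist")
    for expected, actual in zip(REIMAGE_STEPS, done):
        if expected != actual:
            raise ValueError(f"out of order: expected {expected!r}, got {actual!r}")
    if len(done) == len(REIMAGE_STEPS):
        return None
    return REIMAGE_STEPS[len(done)]
```

Keeping the steps in a tuple makes the ordering explicit: signing on the puppetmaster only makes sense once the first puppet run on the VM has produced a CSR.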
[18:44:30] <wikibugs>	 (03PS8) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909
[18:47:31] <wikibugs>	 (03CR) 10Herron: "Shall we give this a try?" [puppet] - 10https://gerrit.wikimedia.org/r/769749 (https://phabricator.wikimedia.org/T300119) (owner: 10Herron)
[18:49:04] <wikibugs>	 (03PS9) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909
[18:50:04] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34496/console" [puppet] - 10https://gerrit.wikimedia.org/r/772909 (owner: 10Elukey)
[18:51:26] <wikibugs>	 (03Abandoned) 10Jcrespo: WIP [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/665383 (owner: 10Jcrespo)
[18:51:57] <wikibugs>	 10SRE, 10Traffic-Icebox, 10Patch-For-Review: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 (10ssingh) >>! In T303593#7796742, @gerritbot wrote: > Change 771610 **merged** by Ssingh: > %%%[operations/puppet@production] P:icinga: add prof...
[18:52:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase-dev1006 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:53:07] <wikibugs>	 (03Abandoned) 10Jcrespo: [QIP] Add second prototype to handle File metadata directly from the db [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/637769 (https://phabricator.wikimedia.org/T264189) (owner: 10Jcrespo)
[18:53:12] <wikibugs>	 (03PS10) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909
[18:54:02] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34497/console" [puppet] - 10https://gerrit.wikimedia.org/r/772909 (owner: 10Elukey)
[18:54:26] <wikibugs>	 (03Abandoned) 10Jcrespo: Add 4 line naive prototype for downloading all images from a wiki [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/636007 (https://phabricator.wikimedia.org/T264189) (owner: 10Jcrespo)
[18:55:21] <wikibugs>	 (03Abandoned) 10Jcrespo: POC: Testing interfacing with swift to gather metadata [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/638665 (https://phabricator.wikimedia.org/T264189) (owner: 10Jcrespo)
[18:55:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P22985 and previous config saved to /var/cache/conftool/dbconfig/20220322-185542-marostegui.json
[18:55:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:57:10] <razzi>	 elukey: ok yeah I did all those steps; I didn't realize the new fingerprint would update automatically. The new server is online
[18:57:10] <razzi>	 I added my understanding to https://wikitech.wikimedia.org/wiki/Server_Lifecycle/Reimage#Virtual_hosts, let me know how that looks
[18:59:40] <elukey>	 razzi: it is a bit generic, I'd suggest reviewing all the bits involved and adding a more precise explanation (nothing big)
[19:00:52] <elukey>	 (I mean familiarizing yourself with how the host fingerprints are populated etc.)
[19:01:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase-dev1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:01:24] <wikibugs>	 (03PS5) 10Giuseppe Lavagetto: Introduce requestctl [software/conftool] - 10https://gerrit.wikimedia.org/r/772342 (https://phabricator.wikimedia.org/T302471)
[19:01:57] <wikibugs>	 (03CR) 10Razzi: [V: 03+1] "Small changes that should fix karapace1001." [puppet] - 10https://gerrit.wikimedia.org/r/772912 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi)
[19:02:53] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[19:02:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:04] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bu...
[19:04:17] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bullseye
[19:04:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:04:27] <wikibugs>	 (03PS11) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909
[19:04:29] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1025.eqiad.wmnet with O...
[19:06:42] <wikibugs>	 (03CR) 10MewOphaswongse: [C: 03+1] beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno)
[19:07:32] <wikibugs>	 (03PS12) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909
[19:10:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T298557)', diff saved to https://phabricator.wikimedia.org/P22986 and previous config saved to /var/cache/conftool/dbconfig/20220322-191049-marostegui.json
[19:10:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:10:56] <stashbot>	 T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557
[19:13:22] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to stat1007 for sgimeno - https://phabricator.wikimedia.org/T304361 (10Ottomata) +1 sounds good.  Approved!
[19:14:11] <wikibugs>	 (03CR) 10Ottomata: [C: 03+1] karapace: use karapace included python; set hostname [puppet] - 10https://gerrit.wikimedia.org/r/772912 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi)
[19:14:45] <wikibugs>	 (03PS13) 10Elukey: WIP - Refactor Calico's CNI plugin config [puppet] - 10https://gerrit.wikimedia.org/r/772909
[19:20:01] <wikibugs>	 (03CR) 10Razzi: [V: 03+1 C: 03+2] karapace: use karapace included python; set hostname [puppet] - 10https://gerrit.wikimedia.org/r/772912 (https://phabricator.wikimedia.org/T301565) (owner: 10Razzi)
[19:23:15] <wikibugs>	 (03PS1) 10RLazarus: slo: Move most of the text panel content to a description field, so it can be overridden [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/772923 (https://phabricator.wikimedia.org/T302842)
[19:24:52] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] beta, testwiki: enable testing of topics match mode for GLAM events (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno)
[19:31:55] <wikibugs>	 10SRE, 10Traffic-Icebox, 10Patch-For-Review: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 (10herron) >>! In T303593#7797822, @ssingh wrote: > - as per the commit message above, "We first start by setting interface::rps to the alerting_...
[19:32:07] <wikibugs>	 10SRE, 10Data-Engineering: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (10Ottomata)
[19:34:09] <wikibugs>	 10SRE, 10Data-Engineering: Create conda .deb and docker image - https://phabricator.wikimedia.org/T304450 (10Ottomata) @MoritzMuehlenhoff advice?  Can I import [[ https://docs.conda.io/projects/conda/en/latest/user-guide/install/rpm-debian.html | conda's official .deb ]] into our apt repo, or would you prefer...
[19:36:34] <icinga-wm>	 RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:45:04] <icinga-wm>	 PROBLEM - SSH on thumbor2004.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:46:36] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4839 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[19:47:45] <RhinosF1>	 That keeps going off
[19:50:06] <RhinosF1>	 _joe_: irc says you've been active a few minutes ago, is ^ worth a task? That's 3rd time I can remember it going off today
[19:59:33] <eigyan>	 greetings all!
[20:00:05] <jouncebot>	 RoanKattouw and Urbanecm: Dear deployers, time to do the UTC late backport window deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220322T2000).
[20:00:05] <jouncebot>	 eigyan, jandrewniak, and mewoph: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:01:10] <mewoph>	 👋
[20:01:16] <eigyan>	 ✔️
[20:01:38] <jan_drewniak>	 👋
[20:05:41] <jan_drewniak>	 Is there anyone around to do the deploys? Or should we do it ourselves?
[20:06:24] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3871 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[20:07:59] <jan_drewniak>	 eigyan, mewoph, I can do the deploy if RoanKattouw and  Urbanecm are not around
[20:08:24] <RoanKattouw>	 I'm here if you want me to do it
[20:08:29] <urbanecm>	 I am, but i didn't see the ping
[20:08:30] <RoanKattouw>	 Sorry, was just getting back from lunch
[20:08:49] <urbanecm>	 jan_drewniak: if you're comfortable deploying, feel free to, otherwise me or Roan can do it :)
[20:09:02] <eigyan>	 hello all, I am happy with any decision made
[20:09:16] <eigyan>	 I am prepared to watch with vigor :)
[20:10:04] <eigyan>	 I have only attended one deploy training so far with many more to come :)
[20:10:05] <jan_drewniak>	 urbanecm: I'm comfortable copy & pasting some bash commands :P but since you guys do this every day, I'll leave it to the pros :)
[20:10:18] <urbanecm>	 Okay :))
[20:10:19] <eigyan>	 :)
[20:10:25] <urbanecm>	 In that case, I can deploy today
[20:10:39] <eigyan>	 urbanecm spoke from a pro!
[20:10:52] <eigyan>	 ^spoken
[20:13:08] <wikibugs>	 10SRE, 10Commons, 10MediaWiki-File-management, 10RESTBase-API, and 4 others: RFC: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847 (10LGoto)
[20:14:17] <urbanecm>	 eigyan: hello, can you please clarify why hewiki => true is removed from wmgUseQuickSurveys at https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/772433?
[20:14:39] <eigyan>	 Yes I can urbanecm
[20:15:20] <eigyan>	 Per J Robson's code review that was a redundant piece of code, it is mentioned in the patch
[20:15:25] <eigyan>	 to be removed
[20:15:33] <eigyan>	 per his suggestion
[20:16:13] <eigyan>	 it appears the wmgUseQuickSurveys value is set higher upstream
[20:16:23] <urbanecm>	 ah, ok, makes sense
[20:16:28] <wikibugs>	 (03PS9) 10Urbanecm: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan)
[20:16:32] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan)
[20:16:43] <eigyan>	 cool urbanecm
[20:17:01] <urbanecm>	 eigyan: since it is a beta-only patch, it will be deployed to beta automatically within ~30 minutes (if not, feel free to ping me and I can investigate)
[20:17:10] <icinga-wm>	 PROBLEM - SSH on db2090.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:17:22] <eigyan>	 excellent, thanks urbanecm
[20:17:25] <wikibugs>	 (03Merged) 10jenkins-bot: [wmf-config]: Deploy Safety Survey to EN, ES wikis on BETA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772433 (https://phabricator.wikimedia.org/T303956) (owner: 10Eigyan)
[20:18:31] <wikibugs>	 (03PS2) 10Urbanecm: Enable EventGate logging for WikipediaPortal schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772507 (https://phabricator.wikimedia.org/T271163) (owner: 10Jdrewniak)
[20:18:47] <urbanecm>	 jan_drewniak: your patch is next :). Will you be able to test it at a debug srv?
[20:19:36] <jan_drewniak>	 urbanecm: I don't think so, it's enabling an event-logging schema, so I'll test it by sending events after it's deployed.
[20:20:00] <ottomata>	 you can test on a debug server
[20:20:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:20:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:08] <ottomata>	 but it is also pretty safe to just deploy
[20:20:09] <ottomata>	 since it is new
[20:20:16] <urbanecm>	 okay
[20:20:30] <ottomata>	 i should probably add docs in event platform on wikitech on how to do that! :)
[20:20:33] <urbanecm>	 in that case I'll just sync it and let jan_drewniak test it later on
[20:20:35] <ottomata>	 ya
[20:20:36] <urbanecm>	 ottomata: would be great :)
[20:20:44] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.04839 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[20:20:48] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Enable EventGate logging for WikipediaPortal schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772507 (https://phabricator.wikimedia.org/T271163) (owner: 10Jdrewniak)
[20:20:53] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:20:54] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:20:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:21:37] <wikibugs>	 (03Merged) 10jenkins-bot: Enable EventGate logging for WikipediaPortal schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772507 (https://phabricator.wikimedia.org/T271163) (owner: 10Jdrewniak)
[20:21:42] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:21:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:23:15] <wikibugs>	 (03PS5) 10Urbanecm: beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno)
[20:23:19] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno)
[20:24:03] <wikibugs>	 (03Merged) 10jenkins-bot: beta, testwiki: enable testing of topics match mode for GLAM events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772870 (https://phabricator.wikimedia.org/T301825) (owner: 10Sergio Gimeno)
[20:24:08] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 17caf0359b99b69c0b3e0d7a5fa2f5c7fb7464ef: Enable EventGate logging for WikipediaPortal schema (T271163) (duration: 01m 54s)
[20:24:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:24:15] <stashbot>	 T271163: TranslationRecommendation* Schemas Event Platform Migration - https://phabricator.wikimedia.org/T271163
[20:24:17] <urbanecm>	 jan_drewniak: should be live!
[20:24:31] <urbanecm>	 mewoph: your patch is at mwdebug1001. can you have a look?
[20:24:48] <mewoph>	 checking now
[20:25:02] <jan_drewniak>	 urbanecm: great! thanks
[20:25:09] <urbanecm>	 happy to help :)
[20:26:19] <jan_drewniak>	 and ottomata: just tested an event, getting 201 so I think that's good :) 
[20:26:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:26:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:28:14] <mewoph>	 urbanecm: lgtm thanks!
[20:28:22] <urbanecm>	 mewoph: syncing, thanks for checking
[20:28:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Jclark-ctr) cloudstore1010  B7 U41 port12 cableid #5014 cloudstore1011  C4 U1 port23. cableid #20220273
[20:29:00] <urbanecm>	 mewoph: fyi, the beta part will be deployed automatically within ~30 minutes (if not, please let me know and I can investigate)
[20:29:17] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Jclark-ctr)
[20:29:47] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ce18d4eeb255349e27163d5e5472fbe21c320322: testwiki: enable testing of topics match mode for GLAM events (T301825) (duration: 01m 06s)
[20:29:49] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, and 2 others: Q3:(Need By: TBD) rack/setup/install 2 new labstore hosts - https://phabricator.wikimedia.org/T302981 (10Jclark-ctr) a:05Jclark-ctr→03Cmjohnson
[20:29:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:52] <stashbot>	 T301825: Account creation: add toggle to enable AND selection of interest topics - https://phabricator.wikimedia.org/T301825
[20:29:56] <urbanecm>	 mewoph: and the testwiki part is live now
[20:30:05] <urbanecm>	 anyone anything else to deploy?
[20:30:10] <ottomata>	 jan_drewniak: that's good!
[20:30:11] <ottomata>	 let's see!
[20:31:06] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:31:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:31:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:09] <ottomata>	 oh jan_drewniak  the WP code is not out yet, right?  this was just the config change?
[20:31:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:40] <jan_drewniak>	 ottomata: that's true, just testing it by sending the event from my local.
[20:32:01] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:32:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:06] <icinga-wm>	 PROBLEM - Juniper virtual chassis ports on asw2-b-eqiad is CRITICAL: CRIT: Down: 1 Unknown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status
[20:32:34] <jan_drewniak>	 With the expected payload that I have in the portals patch. I'll deploy the portal change tomorrow though.
[20:32:36] <urbanecm>	 !log UTC late backport window done
[20:32:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:32:41] <ottomata>	 jan_drewniak:  awesome, but that's good!
[20:33:11] <ottomata>	 once the code is out and looking good, i can finalize the migration process in the backend
[20:33:14] <ottomata>	 :)
[20:34:08] <jan_drewniak>	 ottomata: cool. the event from my local  won't show up because it's from non-wikimedia domain right? 
[20:34:27] <ottomata>	 hmmmm, it won't show up in the event table, iirc, but it is in kafka
[20:34:41] <ottomata>	 you can consume from kafka and grep something for your event
[20:34:46] <ottomata>	 then produce and see
[20:35:27] <ottomata>	 jan_drewniak:  do you have access to a stat box?
[20:36:01] <jan_drewniak>	 ottomata: honestly I haven't looked at that stuff in ages, I probably don't even have access right now.
[20:36:05] <ottomata>	 okay
[20:36:16] <ottomata>	 i'll grep for you, what's the event you are posting?
[20:37:01] <urbanecm>	 jan_drewniak: you're still in the access group though, so I think you should be able to ssh to stat1004.eqiad.wmnet, for instance
[20:37:03] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:37:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:17] <ottomata>	 jan_drewniak:  if you can ssh to stat1004 i'll give you a command to grep
[20:37:25] <ottomata>	 it'll be like https://wikitech.wikimedia.org/wiki/Kafka#Consume
[20:37:46] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye
[20:37:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS b...
[20:37:57] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:37:59] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:38:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:38:47] <jan_drewniak>	 ottomata: I'm sending a request like this 
[20:38:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:38:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:39:04] <jan_drewniak>	 https://www.irccloud.com/pastebin/B0GyjMg8/
[20:39:37] <ottomata>	 ok gonna grep for c5860eb99af6d7d9
[20:39:43] <jan_drewniak>	 that's via postman (I like my guis).
[20:40:10] <ottomata>	 ok jan_drewniak post again plz
[20:40:13] <ottomata>	 i'm grepping :)
[20:40:58] <ottomata>	 perfect jan_drewniak  i see it!
[20:41:14] <wikibugs>	 10SRE, 10Traffic, 10Wikipedia-iOS-App-Backlog, 10iOS-app-Bugs: Wikipedia iOS apps sending harmful bursts of traffic synchronized to the top of the hour, especially at 22:00 UTC - https://phabricator.wikimedia.org/T264881 (10LGoto) 05Open→03Resolved a:03LGoto
[20:41:23] <wikibugs>	 10SRE, 10Wikidata, 10serviceops, 10wdwb-tech: Hourly read spikes against s8 resulting in occasional user-visible latency & error spikes - https://phabricator.wikimedia.org/T264821 (10LGoto)
[20:42:12] <jan_drewniak>	 and thanks for reminding me urbanecm, I do still have access to the stats boxes (like stat1004.eqiad.wmnet)
[20:42:22] <urbanecm>	 no problem :)
[20:42:36] <ottomata>	 fwiw, this is the command I ran:
[20:42:52] <ottomata>	 kafkacat -C -u -b kafka-jumbo1001.eqiad.wmnet:9092 -t eventlogging_WikipediaPortal | grep --line-buffered c5860eb99af6d7d9 | jq .
[20:43:44] <jan_drewniak>	 ottomata: thanks! I see it too
[20:44:01] <ottomata>	 nice
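(Editorial note) The check above pipes the topic through `grep` and `jq` to spot one test event by a unique marker. The same filtering step could be sketched in Python as below. This is a minimal illustration, not what was run in the channel: `filter_events` is a hypothetical helper, and it assumes the consumer prints one JSON object per line, which is how the `kafkacat ... | jq .` pipeline treats the stream.

```python
import json
from typing import Iterable, Iterator


def filter_events(lines: Iterable[str], needle: str) -> Iterator[dict]:
    """Yield parsed JSON objects from lines that contain `needle`.

    Mirrors `grep <needle> | jq .`: a cheap substring match first,
    then JSON parsing only for the lines that matched.
    """
    for line in lines:
        if needle not in line:
            continue  # same effect as grep dropping non-matching lines
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON output instead of crashing
        if isinstance(event, dict):
            yield event
```

To use it on a live stream, the consumer output could be fed in via `sys.stdin` and each yielded event printed with `json.dumps(event, indent=2)`. The substring match is deliberately loose, so the marker should be something unlikely to occur in other events, such as the `c5860eb99af6d7d9` value used above.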
[20:45:12] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10Andrew) @Jclark-ctr just swapped the network cables and now I see:   ` Lifecycle Controller: Done   No PXE-capable device available....
[20:45:18] <eigyan>	 I have verified my changes, thanks urbanecm
[20:45:24] <urbanecm>	 happy to help
[20:58:28] <wikibugs>	 (03PS2) 10RLazarus: slo: Move most of the text panel content to a description field, so it can be overridden [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/772923 (https://phabricator.wikimedia.org/T302842)
[21:05:58] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye
[21:06:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:04] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bulls...
[21:12:04] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic1083 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:04] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic1076 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:04] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic1073 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:04] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic1075 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:04] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic1068 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:12:05] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic1057 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[21:15:17] <wikibugs>	 (03PS1) 10Ryan Kemper: Revert "elastic: fix cirrus settings check false negative" [puppet] - 10https://gerrit.wikimedia.org/r/772893
[21:15:40] <wikibugs>	 (03PS2) 10Ryan Kemper: Revert "elastic: fix cirrus settings check false negative" [puppet] - 10https://gerrit.wikimedia.org/r/772893 (https://phabricator.wikimedia.org/T301511)
[21:16:17] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Revert "elastic: fix cirrus settings check false negative" [puppet] - 10https://gerrit.wikimedia.org/r/772893 (https://phabricator.wikimedia.org/T301511) (owner: 10Ryan Kemper)
[21:17:59] <wikibugs>	 (03PS3) 10Ryan Kemper: elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/772893 (https://phabricator.wikimedia.org/T301511)
[21:18:25] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1025.eqiad.wmnet with OS bullseye
[21:18:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:18:35] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bu...
[21:18:58] <icinga-wm>	 RECOVERY - SSH on db2090.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:19:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q2:(Need By: TBD) rack/setup/install cloudvirt1047.eqiad.wmnet - https://phabricator.wikimedia.org/T293391 (10Papaul) @Jclark-ctr swapped the cable and now the server NIC 1 is connected to the right switch port   ` papaul@cloudsw2-d5-eqiad> .....
[21:20:22] <wikibugs>	 (03CR) 10Bking: [C: 03+1] elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/772893 (https://phabricator.wikimedia.org/T301511) (owner: 10Ryan Kemper)
[21:21:56] <wikibugs>	 (03PS4) 10Ryan Kemper: elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/772893 (https://phabricator.wikimedia.org/T301511)
[21:21:58] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bullseye
[21:22:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:22:08] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye
[21:22:33] <wikibugs>	 (03PS5) 10Ryan Kemper: elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/772893 (https://phabricator.wikimedia.org/T301511)
[21:23:36] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/772893 (https://phabricator.wikimedia.org/T301511) (owner: 10Ryan Kemper)
[21:26:14] <icinga-wm>	 PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[21:29:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T300775)', diff saved to https://phabricator.wikimedia.org/P22989 and previous config saved to /var/cache/conftool/dbconfig/20220322-212939-marostegui.json
[21:29:41] <wikibugs>	 (03PS1) 10Andrew Bogott: Changes to reuse-labvirt.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/772932
[21:29:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:46] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[21:31:09] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Changes to reuse-labvirt.cfg partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/772932 (owner: 10Andrew Bogott)
[21:31:15] <wikibugs>	 (03PS1) 10Ryan Kemper: Revert "Revert "elastic: fix cirrus settings check false negative"" [puppet] - 10https://gerrit.wikimedia.org/r/772894
[21:32:39] <wikibugs>	 (03PS2) 10Ryan Kemper: elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/772894 (https://phabricator.wikimedia.org/T301511)
[21:33:24] <wikibugs>	 (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] elastic: fix cirrus settings check false negative [puppet] - 10https://gerrit.wikimedia.org/r/772894 (https://phabricator.wikimedia.org/T301511) (owner: 10Ryan Kemper)
[21:33:56] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1026.eqiad.wmnet with OS bullseye
[21:33:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:08] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1026.eqiad.wmnet with O...
[21:35:27] <ryankemper>	 !log T301511 Fixed elastic* eqiad cross-cluster search settings (see https://phabricator.wikimedia.org/T301511#7798267) to resolve the `ElasticSearch setting check` alerts in eqiad
[21:35:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:35:31] <stashbot>	 T301511: Address false negatives in Elasticsearch cross-cluster monitoring checks - https://phabricator.wikimedia.org/T301511
[21:38:59] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10Papaul) Downgrade NIC firmware on cloudvirt1025 and cloudvirt1026 from 22.00.07.60 to 21.60.22.11 fixed the  `` Failed to load ldlinux.c32''  ` is...
[21:39:32] <logmsgbot>	 !log pt1979@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1026.eqiad.wmnet with OS bullseye
[21:39:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:39:43] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bu...
[21:40:35] <Jdlrobson>	 FYI @cjming and I are running some database maintenance scripts so if you see any slight changes in https://grafana.wikimedia.org/d/GpL5R8CGz/mysql-query-rate?orgId=1&from=now-15m&to=now&viewPanel=14 that's to be expected.
[21:44:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P22990 and previous config saved to /var/cache/conftool/dbconfig/20220322-214445-marostegui.json
[21:44:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:09] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1025.eqiad.wmnet with OS bullseye
[21:45:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:17] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye executed...
[21:46:23] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1026.eqiad.wmnet with OS bullseye
[21:46:25] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1025.eqiad.wmnet with OS bullseye
[21:46:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:46:31] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye
[21:46:36] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye
[21:58:45] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1026.eqiad.wmnet with reason: host reimage
[21:58:46] <wikibugs>	 (03PS1) 10Daniel Kinzler: Set MW_USE_CONFIG_SCHEMA constant of the file use-config-schema exists. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/772937 (https://phabricator.wikimedia.org/T304460)
[21:58:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:15] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1025.eqiad.wmnet with reason: host reimage
[21:59:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:29] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye
[21:59:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:41] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with O...
[21:59:50] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P22991 and previous config saved to /var/cache/conftool/dbconfig/20220322-215950-marostegui.json
[21:59:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:23] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1026.eqiad.wmnet with reason: host reimage
[22:02:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:34] <wikibugs>	 (PS1) RLazarus: envoy: Remove v2 config API support [puppet] - https://gerrit.wikimedia.org/r/772938 (https://phabricator.wikimedia.org/T303770)
[22:03:20] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (8) node(s) change every puppet run: build2001, cloudcontrol1003, cloudcontrol1004, cloudvirt1025, cp1085, deploy1002, deploy2002, ms-be1068 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[22:04:15] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on cloudvirt1025.eqiad.wmnet with reason: host reimage
[22:04:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:10] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 66 probes of 676 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:09:20] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9400 on elastic2047 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2048.codfw.wmnet:9700] does not match [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2049.codfw.wmnet:9700] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:09:20] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9600 on elastic2049 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:09:20] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9200 on elastic2025 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:09:20] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9200 on elastic2042 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:09:21] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9400 on elastic2042 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2048.codfw.wmnet:9700] does not match [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2049.codfw.wmnet:9700] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:09:21] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9200 on elastic2031 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:09:21] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9600 on elastic2027 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:09:22] <icinga-wm>	 PROBLEM - ElasticSearch setting check - 9600 on elastic2029 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:09:23] <ryankemper>	 !log T301511 Forcing recheck of codfw cirrus setting check
[22:09:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:09:27] <stashbot>	 T301511: Address false negatives in Elasticsearch cross-cluster monitoring checks - https://phabricator.wikimedia.org/T301511
[22:10:02] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin2002 is CRITICAL: CRITICAL: the following (8) node(s) change every puppet run: build2001, cloudcontrol1003, cloudcontrol1004, cloudvirt1025, cp1085, deploy1002, deploy2002, ms-be1068 https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[22:11:27] <wikibugs>	 (CR) RLazarus: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34501/console" [puppet] - https://gerrit.wikimedia.org/r/772938 (https://phabricator.wikimedia.org/T303770) (owner: RLazarus)
[22:11:36] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch setting check - 9200 on elastic2025 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:11:36] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch setting check - 9600 on elastic2027 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:11:36] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch setting check - 9600 on elastic2029 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:11:36] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch setting check - 9200 on elastic2031 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:11:37] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch setting check - 9200 on elastic2042 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:11:37] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch setting check - 9400 on elastic2042 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2048.codfw.wmnet:9700] does not match [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2049.codfw.wmnet:9700] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:11:37] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch setting check - 9400 on elastic2047 is CRITICAL: CRITICAL - [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2048.codfw.wmnet:9700] does not match [elastic2027.codfw.wmnet:9700, elastic2029.codfw.wmnet:9700, elastic2049.codfw.wmnet:9700] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:11:38] <icinga-wm>	 ACKNOWLEDGEMENT - ElasticSearch setting check - 9600 on elastic2049 is CRITICAL: CRITICAL - [elastic2038.codfw.wmnet:9500, elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500] does not match [elastic2042.codfw.wmnet:9500, elastic2047.codfw.wmnet:9500, elastic2052.codfw.wmnet:9500] for .(cluster Ryan Kemper https://phabricator.wikimedia.org/T301511 https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:11:43] <ryankemper>	 (sorry for the noise)
[22:13:37] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 59 probes of 676 (alerts on 65) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[22:14:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T300775)', diff saved to https://phabricator.wikimedia.org/P22992 and previous config saved to /var/cache/conftool/dbconfig/20220322-221455-marostegui.json
[22:14:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[22:14:58] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1141.eqiad.wmnet with reason: Maintenance
[22:14:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:00] <stashbot>	 T300775: Add tl_target_id column to templatelinks - https://phabricator.wikimedia.org/T300775
[22:15:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T300775)', diff saved to https://phabricator.wikimedia.org/P22993 and previous config saved to /var/cache/conftool/dbconfig/20220322-221503-marostegui.json
[22:15:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:56] <ryankemper>	 !log T301511 Mutated cirrus codfw cluster settings to what [I think] they should be, see https://phabricator.wikimedia.org/T301511#7798415; forcing re-check
[22:20:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:02] <stashbot>	 T301511: Address false negatives in Elasticsearch cross-cluster monitoring checks - https://phabricator.wikimedia.org/T301511
[22:21:52] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1026.eqiad.wmnet with OS bullseye
[22:21:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:01] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1026.eqiad.wmnet with OS bullseye completed...
[22:22:04] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (MoritzMuehlenhoff) >>! In T303776#7798337, @Papaul wrote: > Downgrade NIC firmware on cloudvrit1025 and cloudvirt1026 from 22.00.07.60 to 21.60.22...
[22:22:24] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic2049 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:22:24] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic2047 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:22:24] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9200 on elastic2025 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:22:24] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9200 on elastic2042 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:22:24] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9400 on elastic2042 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:22:25] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9200 on elastic2031 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:22:25] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic2029 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:22:26] <icinga-wm>	 RECOVERY - ElasticSearch setting check - 9600 on elastic2027 is OK: OK - All good! https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:22:32] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1025.eqiad.wmnet with OS bullseye
[22:22:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:22:40] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1025.eqiad.wmnet with OS bullseye completed...
[22:24:35] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye
[22:24:37] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[22:24:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:42] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye
[22:24:43] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bullseye
[22:24:44] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[22:24:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:48] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye
[22:24:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:24:51] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bullseye executed...
[22:24:58] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed...
[22:25:29] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[22:25:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:25:36] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye
[22:26:08] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bullseye
[22:26:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:16] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bullseye
[22:26:26] <icinga-wm>	 RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:27:14] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (Andrew) Open→Resolved These hosts are now reimaged and running VMs.  Thanks for all the attention everyone!
[22:27:41] <logmsgbot>	 !log pt1979@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye
[22:27:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:27:54] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bu...
[22:34:00] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1024.eqiad.wmnet DHCP problems - https://phabricator.wikimedia.org/T303773 (Andrew) Last run:   ` CLIENT MAC ADDR: B0 26 28 29 5D F0  GUID: 4C4C4544-005A-5910-805A-C4C04F515032 CLIENT IP: 10.64.20.43  MASK: 255.255.255.0  DHCP IP:...
[22:41:30] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1047.eqiad.wmnet with OS bullseye
[22:41:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:34] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[22:41:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:38] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1047.eqiad.wmnet with OS bullseye executed...
[22:41:41] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed...
[22:46:57] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[22:47:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:06] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye
[22:50:32] <icinga-wm>	 RECOVERY - SSH on thumbor2004.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:04:00] <wikibugs>	 (PS2) Esanders: Disable autotopicsub user option by default [mediawiki-config] - https://gerrit.wikimedia.org/r/771872 (https://phabricator.wikimedia.org/T297966)
[23:11:05] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[23:11:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:11:13] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): cloudvirt1025 and cloudvirt1026 fail to pxe boot - https://phabricator.wikimedia.org/T303776 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bullseye executed...
[23:23:21] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1024.eqiad.wmnet with OS bullseye
[23:23:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:23:33] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with O...
[23:35:25] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1024.eqiad.wmnet with reason: host reimage
[23:35:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:03] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1024.eqiad.wmnet with reason: host reimage
[23:39:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:25] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: [WIP] Requesting access to deployment group for TThoabala - https://phabricator.wikimedia.org/T303398 (thcipriani) >>! In T303398#7796390, @jbond wrote: > @thcipriani are you able to approve @TThoabala membership of the deployment group  Approved!
[23:56:32] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1024.eqiad.wmnet with OS bullseye
[23:56:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:56:42] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, and 2 others: cloudvirt1016.eqiad.wmnet and cloudvirt1017.eqiad.wmnet fail to PXE boot - https://phabricator.wikimedia.org/T303296 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host cloudvirt1024.eqiad.wmnet with OS bu...
[23:59:40] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting