[00:00:20] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4044.ulsfo.wmnet with OS bullseye
[00:00:33] <wikibugs>	 ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901770 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4044.ulsfo.wmnet with OS bullseye
[00:00:51] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1046793 (owner: TrainBranchBot)
[00:02:35] <logmsgbot>	 !log zabe@deploy1002 Finished scap: T366649 (duration: 15m 16s)
[00:02:39] <stashbot>	 T366649: Create an 'Universal Code of Conduct Coordinating Committee (U4C)' private wiki - https://phabricator.wikimedia.org/T366649
[00:03:21] <wikibugs>	 (PS1) Zabe: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1046800
[00:03:22] <wikibugs>	 (CR) Zabe: [C:+2] Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1046800 (owner: Zabe)
[00:04:16] <wikibugs>	 (Merged) jenkins-bot: Update interwiki cache [mediawiki-config] - https://gerrit.wikimedia.org/r/1046800 (owner: Zabe)
[00:04:51] <logmsgbot>	 !log zabe@deploy1002 Started scap: Update interwiki cache
[00:05:34] <zabe>	 !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=u4cwiki --cluster=all 2>&1 | tee /tmp/u4c.UpdateSearchIndexConfig.log # T366649
[00:05:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:12] <logmsgbot>	 !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4044.ulsfo.wmnet with OS bullseye
[00:10:20] <wikibugs>	 ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901798 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4044.ulsfo.wmnet with OS bullseye execu...
[00:10:27] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4044.ulsfo.wmnet with OS bullseye
[00:10:34] <wikibugs>	 ops-ulsfo, DC-Ops, Traffic, Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901799 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4044.ulsfo.wmnet with OS bullseye
[00:13:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P65124 and previous config saved to /var/cache/conftool/dbconfig/20240618-001316-ladsgroup.json
[00:13:40] <wikibugs>	 SRE, SRE-swift-storage, Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839 (Zabe) NEW
[00:14:09] <wikibugs>	 SRE, SRE-swift-storage, Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839#9901810 (Zabe)
[00:18:54] <logmsgbot>	 !log zabe@deploy1002 Finished scap: Update interwiki cache (duration: 14m 03s)
[00:24:51] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:25:43] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 2.873 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:28:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T352010)', diff saved to https://phabricator.wikimedia.org/P65125 and previous config saved to /var/cache/conftool/dbconfig/20240618-002823-ladsgroup.json
[00:28:29] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[00:29:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops, serviceops: hw troubleshooting: firmware upgrade for mw1359.eqiad.wmnet, mw1364.eqiad.wmnet, mw1365.eqiad.wmnet, mw1412.eqiad.wmnet - https://phabricator.wikimedia.org/T367766#9901830 (Jclark-ctr) @clement_goubert did you need just idrac updated we can do that easily.    B...
[00:31:27] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage
[00:34:56] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage
[00:50:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T364069)', diff saved to https://phabricator.wikimedia.org/P65126 and previous config saved to /var/cache/conftool/dbconfig/20240618-005054-marostegui.json
[00:50:59] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[00:57:08] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4044.ulsfo.wmnet with OS bullseye
[00:57:18] <wikibugs>	 ops-ulsfo, DC-Ops, Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901859 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4044.ulsfo.wmnet with OS bullseye completed: - cp4044 (**PASS...
[01:06:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P65127 and previous config saved to /var/cache/conftool/dbconfig/20240618-010601-marostegui.json
[01:07:55] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.10 [core] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1046803 (https://phabricator.wikimedia.org/T361404)
[01:07:57] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.10 [core] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1046803 (https://phabricator.wikimedia.org/T361404) (owner: TrainBranchBot)
[01:10:48] <logmsgbot>	 !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4044.ulsfo.wmnet
[01:11:54] <wikibugs>	 ops-ulsfo, DC-Ops, Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901869 (BCornwall)
[01:17:51] <wikibugs>	 (PS4) Scott French: mediawiki: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978)
[01:17:51] <wikibugs>	 (PS2) Scott French: mediawiki: enable securityContext in all canaries [deployment-charts] - https://gerrit.wikimedia.org/r/1046692 (https://phabricator.wikimedia.org/T362978)
[01:17:51] <wikibugs>	 (PS2) Scott French: mediawiki: enable securityContext everywhere [deployment-charts] - https://gerrit.wikimedia.org/r/1046693 (https://phabricator.wikimedia.org/T362978)
[01:21:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P65128 and previous config saved to /var/cache/conftool/dbconfig/20240618-012109-marostegui.json
[01:21:59] <wikibugs>	 (PS1) BCornwall: hiera: Unify ulsfo trafficserver storage elements [puppet] - https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891)
[01:24:36] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2948/console" [puppet] - https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) (owner: BCornwall)
[01:31:11] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.10 [core] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1046803 (https://phabricator.wikimedia.org/T361404) (owner: TrainBranchBot)
[01:36:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T364069)', diff saved to https://phabricator.wikimedia.org/P65129 and previous config saved to /var/cache/conftool/dbconfig/20240618-013616-marostegui.json
[01:36:19] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[01:36:22] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[01:36:32] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[01:36:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T364069)', diff saved to https://phabricator.wikimedia.org/P65130 and previous config saved to /var/cache/conftool/dbconfig/20240618-013639-marostegui.json
[01:40:01] <wikibugs>	 (CR) Scott French: "Alright, I think I've figured out how to make CI render the right diffs here." [deployment-charts] - https://gerrit.wikimedia.org/r/1046692 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[01:40:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[01:45:15] <jinxer-wm>	 FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[01:55:48] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0200)
[02:00:15] <jinxer-wm>	 RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[02:38:46] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:56:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[02:58:46] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0300)
[03:01:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[03:01:51] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.43.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1046807 (https://phabricator.wikimedia.org/T361404)
[03:01:52] <wikibugs>	 (CR) TrainBranchBot: [C:+2] testwikis wikis to 1.43.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1046807 (https://phabricator.wikimedia.org/T361404) (owner: TrainBranchBot)
[03:02:29] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.43.0-wmf.10 [mediawiki-config] - https://gerrit.wikimedia.org/r/1046807 (https://phabricator.wikimedia.org/T361404) (owner: TrainBranchBot)
[03:03:00] <logmsgbot>	 !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.10  refs T361404
[03:03:05] <stashbot>	 T361404: 1.43.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T361404
[03:07:58] <icinga-wm_>	 PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv4: Connect - Telxius, AS12956/IPv6: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:08:18] <icinga-wm_>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:08:18] <icinga-wm_>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:08:48] <icinga-wm_>	 PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:51:31] <jinxer-wm>	 FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:55:48] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:56:31] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:58:46] <jinxer-wm>	 RESOLVED: [2x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0400)
[04:01:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:01:57] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.10  refs T361404 (duration: 58m 57s)
[04:02:04] <stashbot>	 T361404: 1.43.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T361404
[04:02:52] <logmsgbot>	 !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.7 (duration: 02m 50s)
[04:20:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s4 T367378
[04:20:45] <stashbot>	 T367378: Switchover s4 master (db1160 -> db1238) - https://phabricator.wikimedia.org/T367378
[04:20:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1238 with weight 0 T367378', diff saved to https://phabricator.wikimedia.org/P65131 and previous config saved to /var/cache/conftool/dbconfig/20240618-042054-marostegui.json
[04:21:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s4 T367378
[04:21:40] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db1238 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1042595 (https://phabricator.wikimedia.org/T367378) (owner: Gerrit maintenance bot)
[04:23:45] <wikibugs>	 (PS1) Marostegui: db1201: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1046808
[04:34:35] <wikibugs>	 (CR) Marostegui: [C:+2] db1201: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1046808 (owner: Marostegui)
[04:34:41] <wikibugs>	 SRE, collaboration-services, DBA: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9902003 (Marostegui) There is a problem before we can even check the grants, there's no connection between those two hosts and the proxies. I guess a FW rules needs to be added somewhere:  ` root@l...
[04:47:28] <marostegui>	 !log Starting s4 eqiad failover from db1160 to db1238 - T367378
[04:47:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:47:33] <stashbot>	 T367378: Switchover s4 master (db1160 -> db1238) - https://phabricator.wikimedia.org/T367378
[04:47:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T367378', diff saved to https://phabricator.wikimedia.org/P65132 and previous config saved to /var/cache/conftool/dbconfig/20240618-044747-marostegui.json
[04:48:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1238 to s4 primary and set section read-write T367378', diff saved to https://phabricator.wikimedia.org/P65133 and previous config saved to /var/cache/conftool/dbconfig/20240618-044806-marostegui.json
[04:48:41] <wikibugs>	 (PS2) Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1042596 (https://phabricator.wikimedia.org/T367378)
[04:49:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1160 T367378', diff saved to https://phabricator.wikimedia.org/P65134 and previous config saved to /var/cache/conftool/dbconfig/20240618-044908-root.json
[04:49:23] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1042596 (https://phabricator.wikimedia.org/T367378) (owner: Gerrit maintenance bot)
[04:49:24] <wikibugs>	 (CR) Marostegui: [V:+2 C:+2] wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/1042596 (https://phabricator.wikimedia.org/T367378) (owner: Gerrit maintenance bot)
[04:51:44] <wikibugs>	 (PS1) Marostegui: db1160: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1046809
[04:51:48] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Long schema change
[04:51:50] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Long schema change
[04:52:09] <wikibugs>	 (CR) Marostegui: [C:+2] db1160: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1046809 (owner: Marostegui)
[04:54:52] <marostegui>	 !log dbmaint eqiad s4 deploy schema change on db1160 T364299
[04:54:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:54:57] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[04:57:23] <marostegui>	 !log dbmaint eqiad s2 deploy schema change on db2207 T364299
[04:57:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:43] <marostegui>	 !log dbmaint codfw s5 deploy schema change on db2213 T364299
[05:00:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:48] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[05:15:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T364069)', diff saved to https://phabricator.wikimedia.org/P65135 and previous config saved to /var/cache/conftool/dbconfig/20240618-051517-marostegui.json
[05:15:22] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[05:23:46] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:30:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P65136 and previous config saved to /var/cache/conftool/dbconfig/20240618-053024-marostegui.json
[05:33:22] <wikibugs>	 (PS1) KartikMistry: Content Translation: Adjust the Machine translation limit for Telugu WP from 70% to 75% [mediawiki-config] - https://gerrit.wikimedia.org/r/1046810 (https://phabricator.wikimedia.org/T367838)
[05:38:19] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1046810 (https://phabricator.wikimedia.org/T367838) (owner: KartikMistry)
[05:44:53] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.hosts.decommission for hosts db2102.codfw.wmnet
[05:45:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P65137 and previous config saved to /var/cache/conftool/dbconfig/20240618-054531-marostegui.json
[05:50:44] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.dns.netbox
[05:53:29] <logmsgbot>	 !log jynus@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2102.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin2002"
[05:54:35] <icinga-wm_>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:55:05] <icinga-wm_>	 RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:55:21] <icinga-wm_>	 RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 11, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:55:23] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2102.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin2002"
[05:55:23] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[05:55:24] <logmsgbot>	 !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2102.codfw.wmnet
[05:56:35] <icinga-wm_>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and arnaudb: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0600)
[06:00:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T364069)', diff saved to https://phabricator.wikimedia.org/P65138 and previous config saved to /var/cache/conftool/dbconfig/20240618-060038-marostegui.json
[06:00:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[06:00:46] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[06:00:53] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[06:01:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T364069)', diff saved to https://phabricator.wikimedia.org/P65139 and previous config saved to /var/cache/conftool/dbconfig/20240618-060100-marostegui.json
[06:02:43] <wikibugs>	 (PS1) Jcrespo: mariadb: Remove all remaining puppet references to db2102 [puppet] - https://gerrit.wikimedia.org/r/1046812 (https://phabricator.wikimedia.org/T366892)
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:20:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:20:48] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:21:55] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[06:21:57] <wikibugs>	 (CR) Jcrespo: [C:+2] mariadb: Remove all remaining puppet references to db2102 [puppet] - https://gerrit.wikimedia.org/r/1046812 (https://phabricator.wikimedia.org/T366892) (owner: Jcrespo)
[06:23:54] <wikibugs>	 ops-codfw, DC-Ops, decommission-hardware: decommission db2102.codw.wmnet - https://phabricator.wikimedia.org/T366892#9902128 (jcrespo) a:jcrespo→None
[06:31:24] <wikibugs>	 (CR) Ayounsi: dnsbox: announce ntp-[abc].anycast.wmnet (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1046685 (https://phabricator.wikimedia.org/T366360) (owner: Ssingh)
[06:52:35] <logmsgbot>	 !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1240.eqiad.wmnet with reason: data reload
[06:52:49] <logmsgbot>	 !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1240.eqiad.wmnet with reason: data reload
[06:54:09] <wikibugs>	 (CR) Muehlenhoff: [C:+1] "LGTM (I'll also send a patch to move this to firewall::service when the migration is completed)" [puppet] - https://gerrit.wikimedia.org/r/1046785 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney)
[06:56:23] <icinga-wm_>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:56:23] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:00:04] <jouncebot>	 Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0700).
[07:00:04] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:03] <kart_>	 here
[07:01:48] <wikibugs>	 (CR) Ayounsi: "That's nice ! it's great to see data being removed from those yaml files !" [homer/public] - https://gerrit.wikimedia.org/r/1046737 (https://phabricator.wikimedia.org/T366360) (owner: Ssingh)
[07:02:42] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1046810 (https://phabricator.wikimedia.org/T367838) (owner: KartikMistry)
[07:03:20] <wikibugs>	 (Merged) jenkins-bot: Content Translation: Adjust the Machine translation limit for Telugu WP from 70% to 75% [mediawiki-config] - https://gerrit.wikimedia.org/r/1046810 (https://phabricator.wikimedia.org/T367838) (owner: KartikMistry)
[07:04:20] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:1046810|Content Translation: Adjust the Machine translation limit for Telugu WP from 70% to 75% (T367838)]]
[07:04:24] <stashbot>	 T367838: Adjust the Machine translation limit for Telugu Wikipedia from 70% to 75% - https://phabricator.wikimedia.org/T367838
[07:04:45] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 40 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:08:32] <wikibugs>	 (PS6) Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - https://gerrit.wikimedia.org/r/868392
[07:09:15] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:1046810|Content Translation: Adjust the Machine translation limit for Telugu WP from 70% to 75% (T367838)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:10:11] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Ganeti, Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9902174 (MoritzMuehlenhoff) >>! In T367071#9882394, @Jclark-ctr wrote: > @MoritzMuehlenhoff   after replacing failed drive  looked like it might boot but still fails....
[07:10:37] <wikibugs>	 (CR) Jcrespo: "After rebase, those changes have already committed by someone else :-( :-) :-| . Only the heartbeat changes are left." [puppet] - https://gerrit.wikimedia.org/r/868392 (owner: Jcrespo)
[07:11:06] <logmsgbot>	 !log kartik@deploy1002 kartik: Continuing with sync
[07:12:46] <marostegui>	 !log dbmaint codfw s4 deploy schema change 
[07:12:47] <wikibugs>	 (03CR) 10Jcrespo: [C:04-1] "I believe this is missing the new replication user. @Ladsgroup" [puppet] - 10https://gerrit.wikimedia.org/r/868392 (owner: 10Jcrespo)
[07:12:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:51] <marostegui>	 !log dbmaint codfw s4 deploy schema change  T367261
[07:12:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:55] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[07:15:43] <marostegui>	 !log dbmaint eqiad s5 deploy schema change on primary master T364069
[07:15:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:15:47] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[07:19:29] <kart_>	 I'll also deploy cxserver since there are no other config/backport patches in the queue.
[07:19:43] <wikibugs>	 (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2024-06-13-045621-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042603 (https://phabricator.wikimedia.org/T364122) (owner: 10KartikMistry)
[07:19:43] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 31 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:20:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:20:33] <wikibugs>	 (03Merged) 10jenkins-bot: Update cxserver to 2024-06-13-045621-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042603 (https://phabricator.wikimedia.org/T364122) (owner: 10KartikMistry)
[07:20:56] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1046810|Content Translation: Adjust the Machine translation limit for Telugu WP from 70% to 75% (T367838)]] (duration: 16m 36s)
[07:21:01] <stashbot>	 T367838: Adjust the Machine translation limit for Telugu Wikipedia from 70% to 75% - https://phabricator.wikimedia.org/T367838
[07:21:50] <kart_>	 seems backport failing with: "07:20:56 backport failed: <CalledProcessError> Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=kartik', 'Backport for [[gerrit:1046810|Content Translation: Adjust the Machine translation limit for Telugu WP from 70% to 75% (T367838)]]']' returned non-zero exit status 1."
[07:21:55] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[07:24:39] <kart_>	 but change seems applied..
[07:25:59] <icinga-wm_>	 RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 9438 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[07:26:08] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply
[07:26:30] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[07:26:34] <effie>	 jouncebot: now
[07:26:34] <jouncebot>	 For the next 0 hour(s) and 33 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0700)
[07:26:37] <effie>	 jouncebot: next
[07:26:37] <jouncebot>	 In 0 hour(s) and 33 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0800)
[07:28:02] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[07:28:10] <wikibugs>	 SRE, collaboration-services, DBA: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9902206 (Marostegui) a:Ladsgroup→eoghan
[07:28:34] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[07:29:11] <kart_>	 effie: I'm deploying cxserver now, since there weren't any more backport/config patches..
[07:29:29] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[07:29:46] <effie>	 kart_: thank you!
[07:30:02] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[07:31:30] <kart_>	 !log Updated cxserver to 2024-06-13-045621-production (T364122, T138401)
[07:31:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:36] <stashbot>	 T364122: In zgh.wikipedia Content Translation use machine translation with MinT Translation with tzm code - https://phabricator.wikimedia.org/T364122
[07:31:36] <stashbot>	 T138401: Replace jsduck with JSDoc3 across all Wikimedia code bases - https://phabricator.wikimedia.org/T138401
[07:31:41] <kart_>	 effie: I'm done.
[07:31:46] <effie>	 cheers 
[07:33:03] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad
[07:35:49] <wikibugs>	 (03CR) 10Slyngshede: "@dzahn@wikimedia.org Yes, I just checked on the servers as well and the CAS version of the Gitlab services have been removed. Very nice :-" [puppet] - 10https://gerrit.wikimedia.org/r/1043247 (https://phabricator.wikimedia.org/T320390) (owner: 10Dzahn)
[07:35:51] <effie>	 jnuche: I am rebooting about 23 k8s nodes, I expect not to delay the train much 
[07:36:13] <jnuche>	 effie: ack, thx for the headsup
[07:38:21] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 455.92 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:40:51] <marostegui>	 !log dbmaint codfw s7 deploy schema change on codfw master T364069
[07:40:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:56] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[07:42:21] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 55.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[07:42:21] <wikibugs>	 (03CR) 10Volans: "For the skip of rebooted hosts if not too urgent you could wait for https://phabricator.wikimedia.org/T366797" [cookbooks] - 10https://gerrit.wikimedia.org/r/1046780 (https://phabricator.wikimedia.org/T367592) (owner: 10Ryan Kemper)
[07:43:25] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] "AFAIK, this was only useful for Cassandra. Druid connection time was not an issue, so +1! Yay to less hacks :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046753 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French)
[07:44:27] <wikibugs>	 (03PS2) 10Muehlenhoff: irc.w.o: Add support for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1046659 (https://phabricator.wikimedia.org/T331702)
[07:46:43] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 40 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:47:36] <wikibugs>	 06SRE, 06collaboration-services, 06DBA: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9902233 (10eoghan) That's right -- we'll be doing that as part of the maintenance work later today:  https://gerrit.wikimedia.org/r/c/operations/puppet/+/1046785 https://phabricator.wikimedia.org/T36...
[07:51:45] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 25 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:52:29] <wikibugs>	 (03PS5) 10Ayounsi: Prepare for netbox-dev [puppet] - 10https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275)
[07:52:29] <wikibugs>	 (03PS6) 10Ayounsi: Netbox 4: JOBRESULT_RETENTION -> JOB_RETENTION [puppet] - 10https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275)
[07:52:29] <wikibugs>	 (03PS2) 10Ayounsi: Rename ganeti-netbox-sync.py to ganeti_netbox_sync.py [puppet] - 10https://gerrit.wikimedia.org/r/1039697
[07:52:51] <wikibugs>	 (CR) Ayounsi: Prepare for netbox-dev (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[07:53:20] <wikibugs>	 (03PS1) 10Slyngshede: Upgrade to CAS 7.0.4.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047013 (https://phabricator.wikimedia.org/T367487)
[07:53:37] <wikibugs>	 06SRE, 06collaboration-services, 06DBA: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9902239 (10Marostegui) Yes, we have that RW and RO users in other services.
[07:56:51] <moritzm>	 !log uploaded python-irc 8.5.3+dfsg-4+wmf1 to apt.wikimedia.org T331702
[07:56:53] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[07:56:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:56] <stashbot>	 T331702: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702
[07:59:42] <effie>	 jnuche: I will ping you when I am done, alright?
[07:59:45] <wikibugs>	 (03PS2) 10Slyngshede: Upgrade to CAS 7.0.4.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047013 (https://phabricator.wikimedia.org/T367487)
[08:00:04] <jnuche>	 effie: ok
[08:00:05] <jouncebot>	 jnuche and brennen: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0800).
[08:02:34] <wikibugs>	 (03PS3) 10Slyngshede: Upgrade to CAS 7.0.4.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047013 (https://phabricator.wikimedia.org/T367487)
[08:04:23] <wikibugs>	 (03PS4) 10Slyngshede: Upgrade to CAS 7.0.4.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047013 (https://phabricator.wikimedia.org/T367487)
[08:07:00] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+1] "I would that nobody knows it because we never had the opportunity to check that. So far those services only connect with Druid and, when s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046753 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French)
[08:07:53] <wikibugs>	 (03CR) 10Santiago Faci: [C:03+1] "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046753 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French)
[08:09:21] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9902257 (10ABran-WMF)
[08:12:52] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9902282 (10ABran-WMF)
[08:13:34] <wikibugs>	 (03PS1) 10KartikMistry: testwiki: Enable MinT for Wikipedia readers MVP on a Igbo Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047014 (https://phabricator.wikimedia.org/T367852)
[08:21:58] <wikibugs>	 (03PS10) 10Arnaudb: mariadb: bugfixes mysql_legacy [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496)
[08:24:53] <effie>	 jnuche: last 3 reboots
[08:25:48] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:28:33] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1046659 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff)
[08:29:26] <XioNoX>	 !log deploy pfw policy update 1718644831 - T367796
[08:29:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:48] <effie>	 jnuche: go ahead
[08:31:06] <effie>	 sorry for the delay, it has been hard finding enough time to do this 
[08:31:14] <jnuche>	 effie: thanks! I'll start the train deployment in a couple of minutes
[08:31:17] <jnuche>	 no prob
[08:34:51] <marostegui>	 !log dbmaint eqiad s6 deploy schema change on eqiad master T364069
[08:34:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:56] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[08:35:45] <icinga-wm_>	 PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:36:47] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:37:45] <icinga-wm_>	 RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:38:29] <wikibugs>	 (CR) Vgutierrez: [C:-1] varnish: show better error for 429s (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1041705 (https://phabricator.wikimedia.org/T354718) (owner: CDobbins)
[08:38:47] <wikibugs>	 (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047018 (https://phabricator.wikimedia.org/T361404)
[08:38:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047018 (https://phabricator.wikimedia.org/T361404) (owner: 10TrainBranchBot)
[08:39:37] <wikibugs>	 (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047018 (https://phabricator.wikimedia.org/T361404) (owner: 10TrainBranchBot)
[08:40:15] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad
[08:41:45] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:41:57] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet
[08:43:05] <fabfur>	 !log cp4037 currently depooled and puppet disabled for T367756
[08:43:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:10] <stashbot>	 T367756: Upgrade ulsfo hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756
[08:44:39] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] irc.w.o: Add support for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1046659 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff)
[08:45:10] <logmsgbot>	 !log hashar@deploy1002 Started deploy [integration/docroot@7a92240]: doc: Add mwseaql Rust crate
[08:45:17] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [integration/docroot@7a92240]: doc: Add mwseaql Rust crate (duration: 00m 07s)
[08:46:10] <wikibugs>	 ops-codfw, SRE, Data-Platform-SRE, DC-Ops: Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598#9902380 (Gehel) p:Triage→High
[08:47:44] <icinga-wm_>	 PROBLEM - Host db1165 #page is DOWN: PING CRITICAL - Packet loss = 100%
[08:47:56] <icinga-wm_>	 RECOVERY - Host db1165 #page is UP: PING WARNING - Packet loss = 66%, RTA = 314.72 ms
[08:48:34] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.06.17 - 2024.07.07): Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598#9902391 (10Gehel)
[08:49:24] <arnaudb>	 weird false positive
[08:50:01] <hnowlan>	 here 
[08:50:43] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1160: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1047019
[08:50:58] <hnowlan>	 arnaudb: db1165 you mean? 
[08:50:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65140 and previous config saved to /var/cache/conftool/dbconfig/20240618-085057-root.json
[08:51:05] <arnaudb>	 yep
[08:51:19] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert "db1160: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1047019 (owner: 10Marostegui)
[08:51:25] <arnaudb>	 it looks like it has hardware issues, will downtime it
[08:51:44] <hnowlan>	 thanks 
[08:51:48] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1165.eqiad.wmnet with reason: repl issues
[08:51:49] <logmsgbot>	 !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 7 days, 0:00:00 on db1165.eqiad.wmnet with reason: repl issues
[08:51:55] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1165.eqiad.wmnet with reason: hardware issues
[08:52:07] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1165.eqiad.wmnet with reason: hardware issues
[08:52:11] <Amir1>	 thanks arnaudb 
[08:52:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1165 depool to troubleshoot hardware issues', diff saved to https://phabricator.wikimedia.org/P65141 and previous config saved to /var/cache/conftool/dbconfig/20240618-085254-arnaudb.json
[08:53:48] <logmsgbot>	 !log jnuche@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.10  refs T361404
[08:53:53] <stashbot>	 T361404: 1.43.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T361404
[08:54:19] <icinga-wm_>	 RECOVERY - ircecho bot process on irc2002 is OK: PROCS OK: 1 process with command name python2, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho
[08:57:32] <wikibugs>	 ops-eqiad, DBA, DC-Ops: db1165 network flapping issues - https://phabricator.wikimedia.org/T367854 (ABran-WMF) NEW
[08:57:39] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:58:34] <wikibugs>	 ops-eqiad, DBA, DC-Ops: db1165 network flapping issues - https://phabricator.wikimedia.org/T367854#9902429 (ABran-WMF) Open→In progress
[08:59:17] <icinga-wm_>	 PROBLEM - ircecho bot process on irc1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name python2, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho
[09:01:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:01:49] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:03:13] <wikibugs>	 (03PS1) 10Muehlenhoff: mw-irc: Fix installation of Prometheus Python client package [puppet] - 10https://gerrit.wikimedia.org/r/1047021 (https://phabricator.wikimedia.org/T331702)
[09:03:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mw-irc: Fix installation of Prometheus Python client package [puppet] - 10https://gerrit.wikimedia.org/r/1047021 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff)
[09:03:38] <wikibugs>	 (03PS4) 10Gehel: cloudelastic: enable IPIP for LVS [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T367511) (owner: 10Bking)
[09:04:15] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9902459 (10eoghan)
[09:05:55] <wikibugs>	 (03PS2) 10Muehlenhoff: mw-irc: Fix installation of Prometheus Python client package [puppet] - 10https://gerrit.wikimedia.org/r/1047021 (https://phabricator.wikimedia.org/T331702)
[09:05:58] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[09:06:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65142 and previous config saved to /var/cache/conftool/dbconfig/20240618-090603-root.json
[09:06:49] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:08:08] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good." [dns] - 10https://gerrit.wikimedia.org/r/1046699 (https://phabricator.wikimedia.org/T367768) (owner: 10Brouberol)
[09:08:37] <icinga-wm_>	 PROBLEM - Host acmechief2002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:08:43] <icinga-wm_>	 PROBLEM - Host logstash2023 is DOWN: PING CRITICAL - Packet loss = 100%
[09:08:49] <icinga-wm_>	 PROBLEM - Host ncredir2001 is DOWN: PING CRITICAL - Packet loss = 100%
[09:08:49] <icinga-wm_>	 PROBLEM - Host netboxdb2002 is DOWN: PING CRITICAL - Packet loss = 100%
[09:09:42] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "Ouch, yeah...sounds plausible. Nice find!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046692 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[09:10:48] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service logstash2023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash2023:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:10:49] <icinga-wm_>	 PROBLEM - ganeti-noded running on ganeti2029 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[09:10:52] <marostegui>	 !log dbmaint eqiad s4 deploy schema change  T367261
[09:10:53] <wikibugs>	 (03PS3) 10Brouberol: dse-k8s: setup a discovery record for all deployed applications [dns] - 10https://gerrit.wikimedia.org/r/1046699 (https://phabricator.wikimedia.org/T367768)
[09:10:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:56] <stashbot>	 T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[09:12:14] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] dse-k8s: setup a discovery record for all deployed applications [dns] - 10https://gerrit.wikimedia.org/r/1046699 (https://phabricator.wikimedia.org/T367768) (owner: 10Brouberol)
[09:12:21] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046693 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[09:13:37] <moritzm>	 !log rebooting ganeti2029 
[09:13:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:57] <icinga-wm_>	 PROBLEM - Host ganeti2029 is DOWN: PING CRITICAL - Packet loss = 100%
[09:15:01] <wikibugs>	 (CR) Vgutierrez: cloudelastic: enable IPIP for LVS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T367511) (owner: Bking)
[09:18:23] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] cephadm: install lvm2 on all target nodes, not just osds [puppet] - 10https://gerrit.wikimedia.org/r/1043809 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[09:18:36] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1181 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1047022 (https://phabricator.wikimedia.org/T367857)
[09:18:40] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: wmnet: Update s7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1047023 (https://phabricator.wikimedia.org/T367857)
[09:18:43] <icinga-wm_>	 RECOVERY - Host ganeti2029 is UP: PING OK - Packet loss = 0%, RTA = 30.22 ms
[09:18:46] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service ganeti2029:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:18:49] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 38 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:18:51] <icinga-wm_>	 RECOVERY - ganeti-noded running on ganeti2029 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti
[09:19:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: firmware upgrade for mw1359.eqiad.wmnet, mw1364.eqiad.wmnet, mw1365.eqiad.wmnet, mw1412.eqiad.wmnet - https://phabricator.wikimedia.org/T367766#9902528 (10Clement_Goubert) Yes, idrac should be enough, thank you.
[09:20:29] <icinga-wm_>	 RECOVERY - Host logstash2023 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms
[09:20:37] <icinga-wm_>	 RECOVERY - Host acmechief2002 is UP: PING OK - Packet loss = 0%, RTA = 30.51 ms
[09:20:37] <icinga-wm_>	 RECOVERY - Host netboxdb2002 is UP: PING OK - Packet loss = 0%, RTA = 30.63 ms
[09:20:39] <icinga-wm_>	 RECOVERY - Host ncredir2001 is UP: PING OK - Packet loss = 0%, RTA = 30.58 ms
[09:20:48] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service ganeti2029:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:21:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65143 and previous config saved to /var/cache/conftool/dbconfig/20240618-092108-root.json
[09:23:28] <jinxer-wm>	 FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:23:46] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service ganeti2029:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:23:46] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:23:51] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 31 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:26:01] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839#9902539 (10MatthewVernon) Have the swift containers been generated for these wikis? I can't find any obviously-matching ones.
[09:27:03] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 69, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:27:35] <wikibugs>	 (03CR) 10Klausman: [C:03+1] "Yes, we would like to keep the alert, and for now, the threshold/duration should be good. We will see if we need to tune it, and then make" [alerts] - 10https://gerrit.wikimedia.org/r/1046781 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French)
[09:27:47] <moritzm>	 !log arm keyholder on acmechief2002
[09:27:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] mw-irc: Fix installation of Prometheus Python client package [puppet] - 10https://gerrit.wikimedia.org/r/1047021 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff)
[09:29:32] <wikibugs>	 (03CR) 10MVernon: [C:03+2] cephadm: install lvm2 on all target nodes, not just osds [puppet] - 10https://gerrit.wikimedia.org/r/1043809 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[09:31:53] <effie>	 jnuche: ping me please after the train is done 
[09:33:28] <jinxer-wm>	 RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:36:01] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839#9902607 (10MatthewVernon) ...further to @Ladsgroup's comment elsewhere, if the intention is that these wikis all have local upload disabled, t...
[09:36:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65144 and previous config saved to /var/cache/conftool/dbconfig/20240618-093614-root.json
[09:40:05] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:41:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc2002.wikimedia.org
[09:41:30] <wikibugs>	 06SRE, 06Infrastructure-Foundations: DRBD kernel error on ganeti2031 led to kernel hang - https://phabricator.wikimedia.org/T348730#9902622 (10MoritzMuehlenhoff) Happened once more on ganeti2029 today. We're gradually moving nodes to Bookworm (the routed cluster and magru cluster are already running it and the...
[09:45:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2002.wikimedia.org
[09:48:24] <wikibugs>	 (03CR) 10Kamila Součková: service: add basic config for shellbox-video (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043724 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[09:48:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1002.wikimedia.org
[09:50:06] <jnuche>	 effie: train is done
[09:50:16] <effie>	 cheers!
[09:50:20] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: management and main interface down for mw2321.codfw.wmnet - https://phabricator.wikimedia.org/T367702#9902628 (10Clement_Goubert) Yes, it is only for the `docker_pull_k8s` step, for which failures are not critical unless a lot of hosts fail it...
[09:51:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65145 and previous config saved to /var/cache/conftool/dbconfig/20240618-095119-root.json
[09:51:23] <icinga-wm_>	 RECOVERY - ircecho bot process on irc1002 is OK: PROCS OK: 1 process with command name python2, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho
[09:52:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1002.wikimedia.org
[09:52:52] <wikibugs>	 (03CR) 10Btullis: Initial import of ceph-csi-rbd chart for inspection (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis)
[09:53:21] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker1019.eqiad.wmnet|wikikube-worker1020.eqiad.wmnet|wikikube-worker1021.eqiad.wmnet),cluster=kubernetes,service=kubesvc
[09:55:51] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:57:39] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[09:59:57] <wikibugs>	 (03PS16) 10Btullis: Add a values file for the ceph-csi plugin on dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031589 (https://phabricator.wikimedia.org/T327259)
[10:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1000)
[10:00:17] <wikibugs>	 (03PS11) 10Btullis: Deploy the ceph-csi-rbd chart to dse-k8s with default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028938 (https://phabricator.wikimedia.org/T364472)
[10:00:26] <wikibugs>	 (03PS7) 10Btullis: Add WMF customisations to the upstream ceph-csi-rbd chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472)
[10:00:51] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 26 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:00:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1046613 (https://phabricator.wikimedia.org/T366525) (owner: 10Slyngshede)
[10:01:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:03:01] <wikibugs>	 (03PS17) 10Btullis: Add a values file for the ceph-csi plugin on dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031589 (https://phabricator.wikimedia.org/T327259)
[10:03:16] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] mediawiki-image-download: Drop to 90% [puppet] - 10https://gerrit.wikimedia.org/r/1039619 (https://phabricator.wikimedia.org/T366778) (owner: 10Alexandros Kosiaris)
[10:04:23] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "looks good, changes should be applied first on the realservers before restarting pybal on lvs1020 and lvs1018" [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T367511) (owner: 10Bking)
[10:04:44] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3066.esams.wmnet
[10:05:19] <fabfur>	 !log cp3066 currently depooled and puppet disabled for T367756
[10:05:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:05:23] <stashbot>	 T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756
[10:05:40] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] mw-on-k8s: Deploy statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043704 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert)
[10:05:47] <Amir1>	 eoghan: shall we start?
[10:06:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65146 and previous config saved to /var/cache/conftool/dbconfig/20240618-100624-root.json
[10:08:06] <eoghan>	 Amir1: Yep! Just getting myself set up here. I suggest #wikimedia-sre-collab to keep the noise out of here, that ok with you? 
[10:08:17] <Amir1>	 sure
[10:08:55] <eoghan>	 Heads up, we're going to start the mailman migration to new hardware now, details can be found here: https://phabricator.wikimedia.org/T367521 
[10:08:57] <Amir1>	 hnowlan: note to oncall: We (I mean mostly eoghan, I'm just for emotional support) are migrating mailman to new hw and sw
[10:09:11] <Amir1>	 downtime of two hours
[10:09:12] <hnowlan>	 thanks for letting me know! 
[10:09:27] <wikibugs>	 (03CR) 10JMeybohm: Add WMF customisations to the upstream ceph-csi-rbd chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis)
[10:09:49] <wikibugs>	 (03CR) 10JMeybohm: [V:03+2 C:03+2] Allow multiple update files in one go [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/643912 (owner: 10JMeybohm)
[10:10:27] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+2] lists: Block incoming email on lists hosts during mailman migration [puppet] - 10https://gerrit.wikimedia.org/r/1043799 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[10:14:18] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lists[1001,1004,2001].wikimedia.org with reason: Mailman migration
[10:14:34] <logmsgbot>	 !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lists[1001,1004,2001].wikimedia.org with reason: Mailman migration
[10:14:49] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9902692 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f70cad25-fba3-40c1-a3c3-abe8534eca40) set by eogha...
[10:14:57] <wikibugs>	 (03PS2) 10Hnowlan: service: add basic config for shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1043724 (https://phabricator.wikimedia.org/T357309)
[10:16:08] <wikibugs>	 06SRE, 07SRE-Unowned, 10Wikimedia-IRC-RC-Server: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702#9902698 (10MoritzMuehlenhoff) Bullseye-based servers are up and running, one can connect to irc1002.wikimedia.org and irc2002.wikimedia.org the same way as for irc1001/irc2001....
[10:17:41] <wikibugs>	 (03PS1) 10Fabfur: hiera: enable benthos on cp3066 [puppet] - 10https://gerrit.wikimedia.org/r/1047029 (https://phabricator.wikimedia.org/T367756)
[10:18:32] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9902700 (10eoghan)
[10:19:05] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-mcrouter: add ClusterIP for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047030 (https://phabricator.wikimedia.org/T346690)
[10:21:06] <wikibugs>	 (03CR) 10Hnowlan: service: add basic config for shellbox-video (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043724 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[10:21:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65147 and previous config saved to /var/cache/conftool/dbconfig/20240618-102130-root.json
[10:21:48] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047029 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[10:22:06] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mw-on-k8s: Deploy statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043704 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert)
[10:22:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good, let's give this a shot. I'll upload the secondary openjdk-21 build in a few, then we can attempt a build." [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047013 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[10:22:42] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-mcrouter: add ClusterIP for codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047032 (https://phabricator.wikimedia.org/T346690)
[10:22:55] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] SSH Key mgmt: Ensure that keys are trimmed [software/bitu] - 10https://gerrit.wikimedia.org/r/1046613 (https://phabricator.wikimedia.org/T366525) (owner: 10Slyngshede)
[10:23:03] <wikibugs>	 (03Merged) 10jenkins-bot: mw-on-k8s: Deploy statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043704 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert)
[10:23:17] <claime>	 jouncebot: nowandnext
[10:23:17] <jouncebot>	 For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1000)
[10:23:17] <jouncebot>	 In 1 hour(s) and 36 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1200)
[10:23:53] <wikibugs>	 (03PS1) 10MVernon: Move moss-fe{1,2}001 back to apus cluster [puppet] - 10https://gerrit.wikimedia.org/r/1047033 (https://phabricator.wikimedia.org/T279621)
[10:24:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T364069)', diff saved to https://phabricator.wikimedia.org/P65148 and previous config saved to /var/cache/conftool/dbconfig/20240618-102418-marostegui.json
[10:24:21] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 48.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:24:23] <wikibugs>	 (03CR) 10Cathal Mooney: Include vlans with an IRB int in device vlans even if not on L2 port (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney)
[10:24:23] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[10:24:27] <wikibugs>	 (03Merged) 10jenkins-bot: SSH Key mgmt: Ensure that keys are trimmed [software/bitu] - 10https://gerrit.wikimedia.org/r/1046613 (https://phabricator.wikimedia.org/T366525) (owner: 10Slyngshede)
[10:24:40] <wikibugs>	 (03PS1) 10Ladsgroup: prometheus: Change footer icon ping url [puppet] - 10https://gerrit.wikimedia.org/r/1047034 (https://phabricator.wikimedia.org/T256190)
[10:27:11] <wikibugs>	 (03CR) 10Ladsgroup: [C:04-1] "That's actually not the right url and will be removed too. I need to wait a week before pushing the correct url." [puppet] - 10https://gerrit.wikimedia.org/r/1047034 (https://phabricator.wikimedia.org/T256190) (owner: 10Ladsgroup)
[10:27:29] <logmsgbot>	 !log cgoubert@deploy1002 Started scap: Deploy statsd exporter - T365265
[10:27:34] <stashbot>	 T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265
[10:29:18] <wikibugs>	 (03CR) 10EoghanGaffney: "I don't believe this is the case, I think that it only acts on those IPs if they're set -- for example puppet runs correctly on lists1004/" [puppet] - 10https://gerrit.wikimedia.org/r/1036610 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[10:29:26] <wikibugs>	 (03CR) 10EoghanGaffney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036610 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[10:30:39] <logmsgbot>	 !log cgoubert@deploy1002 Finished scap: Deploy statsd exporter - T365265 (duration: 03m 39s)
[10:30:41] <moritzm>	 !log upload openjdk-21 21.0.3+9-2~deb12u2  for bookworm/wikimedia (secondary rebuild on build2001 following the initial bootstrap build) https://phabricator.wikimedia.org/T367487
[10:30:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:31:20] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[10:31:21] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[10:31:50] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[10:32:04] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[10:32:10] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[10:32:12] <wikibugs>	 (03CR) 10Fabfur: [C:04-2] "Do not merge until haproxy is upgraded to 2.8.10 on the impacted hosts and benthos configuration is using rfc5424 syslog format" [puppet] - 10https://gerrit.wikimedia.org/r/1047029 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[10:32:19] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[10:32:22] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+2] lists: Migrate mailman primary host from lists1001 -> lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/1036610 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[10:32:26] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[10:32:28] <wikibugs>	 (03PS1) 10Ladsgroup: mariadb: Update code doc for replication grants [puppet] - 10https://gerrit.wikimedia.org/r/1047036
[10:32:35] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[10:32:38] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[10:32:47] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[10:32:55] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[10:33:06] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[10:33:11] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[10:33:20] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[10:33:31] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply
[10:33:41] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply
[10:33:45] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-jobrunner: apply
[10:33:54] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply
[10:35:21] <wikibugs>	 (03PS2) 10Cathal Mooney: Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348)
[10:37:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney)
[10:38:35] <wikibugs>	 (03PS2) 10Hnowlan: DNM: Add shellbox-video vars/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309)
[10:39:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] DNM: Add shellbox-video vars/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[10:39:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P65149 and previous config saved to /var/cache/conftool/dbconfig/20240618-103925-marostegui.json
[10:39:28] <wikibugs>	 (03PS2) 10Ladsgroup: mariadb: Update code doc for replication grants [puppet] - 10https://gerrit.wikimedia.org/r/1047036
[10:39:33] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Update code doc for replication grants [puppet] - 10https://gerrit.wikimedia.org/r/1047036 (owner: 10Ladsgroup)
[10:42:36] <wikibugs>	 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9902761 (10SGupta-WMF) @Scott_French I am waiting for final go ahead from QA ....
[10:43:05] <wikibugs>	 (03PS3) 10Hnowlan: DNM: Add shellbox-video vars/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309)
[10:45:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] DNM: Add shellbox-video vars/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[10:47:14] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9902781 (10eoghan)
[10:48:10] <wikibugs>	 (03CR) 10EoghanGaffney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1046785 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[10:48:47] <marostegui>	 !log dbmaint codfw s2 deploy schema change  T364069
[10:48:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:52] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[10:49:17] <wikibugs>	 (03PS1) 10Fabfur: hiera: upgrade haproxy to 2.8 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756)
[10:49:24] <wikibugs>	 (03PS1) 10Brouberol: ATS: replace service by discovery record for datahub-next [puppet] - 10https://gerrit.wikimedia.org/r/1047040 (https://phabricator.wikimedia.org/T367768)
[10:49:26] <wikibugs>	 (03PS1) 10Brouberol: ATS: replace service by discovery record for all DSE services [puppet] - 10https://gerrit.wikimedia.org/r/1047041 (https://phabricator.wikimedia.org/T367768)
[10:49:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hiera: upgrade haproxy to 2.8 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[10:51:02] <wikibugs>	 (03PS2) 10Fabfur: hiera: upgrade haproxy to 2.8 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756)
[10:51:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hiera: upgrade haproxy to 2.8 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[10:51:26] <wikibugs>	 (03PS1) 10Effie Mouzeli: mw-debug: point mediawiki to mw-mcrouter's clusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047043 (https://phabricator.wikimedia.org/T346690)
[10:51:28] <wikibugs>	 (03Abandoned) 10Hnowlan: conftool: Remove thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1005728 (owner: 10Alexandros Kosiaris)
[10:52:32] <wikibugs>	 (03CR) 10Hnowlan: shellbox-video: initial helmfile configuration (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková)
[10:54:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P65150 and previous config saved to /var/cache/conftool/dbconfig/20240618-105432-marostegui.json
[10:56:01] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "LGTM, except I have zero clue about the LVS part" [puppet] - 10https://gerrit.wikimedia.org/r/1043724 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[10:56:54] <wikibugs>	 (03PS1) 10Muehlenhoff: idp::build: Install Java 21 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1047044 (https://phabricator.wikimedia.org/T367487)
[10:57:56] <wikibugs>	 (03PS2) 10Effie Mouzeli: mw-mcrouter: add ClusterIP for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047030 (https://phabricator.wikimedia.org/T346690)
[10:58:00] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3066.esams.wmnet
[10:58:14] <fabfur>	 !log cp3066 repooled and puppet enabled (T367756)
[10:58:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:19] <stashbot>	 T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756
[10:58:50] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047044 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff)
[10:59:15] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] mw-mcrouter: add ClusterIP for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047030 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[11:00:53] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 41 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:01:18] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-test-presto1001.eqiad.wmnet
[11:01:59] <wikibugs>	 (03PS3) 10Fabfur: hiera: upgrade haproxy to 2.8 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756)
[11:02:04] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: mw-mcrouter: add ClusterIP for codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047032 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[11:02:47] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mw-mcrouter: add ClusterIP for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047030 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[11:03:03] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "thanks for cleaning up my TODOs, greatly appreciated :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková)
[11:03:33] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[11:03:39] <wikibugs>	 (03Merged) 10jenkins-bot: mw-mcrouter: add ClusterIP for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047030 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[11:04:12] <akosiaris>	 !next
[11:05:04] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1208.eqiad.wmnet with reason: Upgrading to bookworm
[11:05:12] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-presto1001.eqiad.wmnet
[11:05:17] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1208.eqiad.wmnet with reason: Upgrading to bookworm
[11:05:23] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:05:41] <akosiaris>	 we'll be changing the kubernetes service IPs for mcrouter in eqiad and codfw
[11:05:57] <wikibugs>	 (03PS2) 10Muehlenhoff: idp::build: Install Java 21 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1047044 (https://phabricator.wikimedia.org/T367487)
[11:07:51] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[11:07:57] <wikibugs>	 (03PS3) 10Cathal Mooney: Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348)
[11:08:18] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[11:08:23] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[11:08:27] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] mw-mcrouter: add ClusterIP for codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047032 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[11:08:30] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-test-ui1001.eqiad.wmnet
[11:09:38] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047044 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff)
[11:09:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T364069)', diff saved to https://phabricator.wikimedia.org/P65151 and previous config saved to /var/cache/conftool/dbconfig/20240618-110939-marostegui.json
[11:09:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[11:09:44] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[11:09:49] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host db1208.eqiad.wmnet with OS bookworm
[11:09:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[11:10:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T364069)', diff saved to https://phabricator.wikimedia.org/P65152 and previous config saved to /var/cache/conftool/dbconfig/20240618-111001-marostegui.json
[11:12:13] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-ui1001.eqiad.wmnet
[11:13:08] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply
[11:13:14] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply
[11:13:25] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-test-druid1001.eqiad.wmnet
[11:13:27] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] kask: add mesh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan)
[11:13:59] <wikibugs>	 (03PS4) 10Clément Goubert: wikikube: Use conftool for scap docker_pull_k8s [puppet] - 10https://gerrit.wikimedia.org/r/1047031 (https://phabricator.wikimedia.org/T367862)
[11:14:06] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[11:14:16] <wikibugs>	 (03Merged) 10jenkins-bot: kask: add mesh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan)
[11:14:28] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[11:15:49] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1047044 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff)
[11:15:56] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] mariadb: Update code doc for replication grants [puppet] - 10https://gerrit.wikimedia.org/r/1047036 (owner: 10Ladsgroup)
[11:16:08] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Upgrade to CAS 7.0.4.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047013 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[11:16:11] <wikibugs>	 (03CR) 10Slyngshede: [V:03+2 C:03+2] Upgrade to CAS 7.0.4.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047013 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[11:16:11] <wikibugs>	 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9902871 (10Clement_Goubert)
[11:16:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] idp::build: Install Java 21 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1047044 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff)
[11:16:37] <wikibugs>	 (03PS7) 10Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392
[11:16:50] <wikibugs>	 (03CR) 10Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392 (owner: 10Jcrespo)
[11:18:54] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-druid1001.eqiad.wmnet
[11:20:55] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 33 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:22:23] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-test-master1002.eqiad.wmnet
[11:22:26] <jinxer-wm>	 FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[11:23:07] <wikibugs>	 (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 100% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047046 (https://phabricator.wikimedia.org/T362323)
[11:24:17] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1208.eqiad.wmnet with reason: host reimage
[11:25:28] <wikibugs>	 (03PS1) 10Clément Goubert: trafficserver: move 95% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1047047 (https://phabricator.wikimedia.org/T362323)
[11:26:54] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1208.eqiad.wmnet with reason: host reimage
[11:27:55] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 43 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:28:57] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-master1002.eqiad.wmnet
[11:29:35] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[11:29:43] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[11:29:45] <wikibugs>	 (03PS1) 10EoghanGaffney: lists: Remove service IPs from lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/1047049 (https://phabricator.wikimedia.org/T367521)
[11:29:58] <wikibugs>	 (03CR) 10EoghanGaffney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047049 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney)
[11:31:57] <wikibugs>	 (03CR) 10Jelto: "interface::alias will probably fail if we are aliasing the same address multiple times?" [puppet] - 10https://gerrit.wikimedia.org/r/1047049 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney)
[11:32:59] <wikibugs>	 (03CR) 10EoghanGaffney: "As with the comment in line above, it's a no-op when the variables are unset" [puppet] - 10https://gerrit.wikimedia.org/r/1047049 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney)
[11:33:01] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "this looks good with the default in $list_outbound_ips (I missed them)" [puppet] - 10https://gerrit.wikimedia.org/r/1047049 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney)
[11:33:05] <wikibugs>	 (03PS1) 10Effie Mouzeli: mediawiki: switch to using the mw-mcrouter daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047050 (https://phabricator.wikimedia.org/T346690)
[11:33:55] <wikibugs>	 (03PS1) 10Muehlenhoff: idp::build: Make the rsync setup depend on the OS [puppet] - 10https://gerrit.wikimedia.org/r/1047051 (https://phabricator.wikimedia.org/T367487)
[11:34:07] <wikibugs>	 (03CR) 10CI reject: [V:04-1] idp::build: Make the rsync setup depend on the OS [puppet] - 10https://gerrit.wikimedia.org/r/1047051 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff)
[11:34:10] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+2] lists: Remove service IPs from lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/1047049 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney)
[11:34:29] <wikibugs>	 (03PS2) 10Muehlenhoff: idp::build: Make the rsync setup depend on the OS [puppet] - 10https://gerrit.wikimedia.org/r/1047051 (https://phabricator.wikimedia.org/T367487)
[11:34:45] <wikibugs>	 (03PS1) 10Hnowlan: kask: don't allocate service port twice when using mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047052 (https://phabricator.wikimedia.org/T363996)
[11:35:01] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: mw-debug: point mediawiki to mw-mcrouter's clusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047043 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[11:35:10] <akosiaris_>	 eqiad mw-mcrouter has been recreated with the new hardcoded service IP btw, that above is to use it ^
[11:35:40] <wikibugs>	 (03PS2) 10Effie Mouzeli: mediawiki: switch eqiad to use the mw-mcrouter daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047050 (https://phabricator.wikimedia.org/T346690)
[11:36:22] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] mw-web, mw-api-ext: Raise replicas for 100% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047046 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert)
[11:36:36] <wikibugs>	 (03PS1) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487)
[11:37:22] <wikibugs>	 (03CR) 10Hnowlan: [C:04-1] trafficserver: move 95% of traffic to mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047047 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert)
[11:37:23] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[11:37:23] <wikibugs>	 (03PS2) 10Clément Goubert: trafficserver: move 100% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1047047 (https://phabricator.wikimedia.org/T362323)
[11:37:23] <wikibugs>	 (03CR) 10Clément Goubert: trafficserver: move 100% of traffic to mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047047 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert)
[11:37:53] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] kask: don't allocate service port twice when using mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047052 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan)
[11:39:54] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] kask: don't allocate service port twice when using mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047052 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan)
[11:39:57] <effie>	 jouncebot: now
[11:39:57] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 20 minute(s)
[11:40:01] <effie>	 jouncebot: next
[11:40:01] <jouncebot>	 In 0 hour(s) and 19 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1200)
[11:40:04] <wikibugs>	 (03PS1) 10EoghanGaffney: lists: Update DNS records to use host IP for lists1004 [dns] - 10https://gerrit.wikimedia.org/r/1047054 (https://phabricator.wikimedia.org/T367521)
[11:40:10] <marostegui>	 !log Rename ipblocks table on db1169 (enwiki) T367632
[11:40:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:14] <stashbot>	 T367632: Drop ipblocks in production - https://phabricator.wikimedia.org/T367632
[11:40:35] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] "🚀🚀🚀" [puppet] - 10https://gerrit.wikimedia.org/r/1047047 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert)
[11:41:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] lists: Update DNS records to use host IP for lists1004 [dns] - 10https://gerrit.wikimedia.org/r/1047054 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney)
[11:41:12] <wikibugs>	 (03CR) 10Slyngshede: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1047051 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff)
[11:41:12] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Pause s3/db1240 snapshots until load completes [puppet] - 10https://gerrit.wikimedia.org/r/1047055 (https://phabricator.wikimedia.org/T367162)
[11:41:16] <wikibugs>	 (03PS1) 10Btullis: Update the contactgroups for all wdqs and wcqs servers [puppet] - 10https://gerrit.wikimedia.org/r/1047056 (https://phabricator.wikimedia.org/T365881)
[11:41:30] <wikibugs>	 (03PS3) 10Effie Mouzeli: mediawiki: switch eqiad to use the mw-mcrouter daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047050 (https://phabricator.wikimedia.org/T346690)
[11:41:50] <wikibugs>	 (03PS2) 10EoghanGaffney: lists: Update DNS records to use host IP for lists1004 [dns] - 10https://gerrit.wikimedia.org/r/1047054 (https://phabricator.wikimedia.org/T367521)
[11:41:56] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Pause s3/db1240 snapshots until load completes [puppet] - 10https://gerrit.wikimedia.org/r/1047055 (https://phabricator.wikimedia.org/T367162)
[11:42:10] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[11:42:27] <marostegui>	 !log Delete ipblocks table on clouddb2002-dev (labtestwiki) T367632
[11:42:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:43] <wikibugs>	 (03CR) 10CI reject: [V:04-1] lists: Update DNS records to use host IP for lists1004 [dns] - 10https://gerrit.wikimedia.org/r/1047054 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney)
[11:42:53] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 27 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:43:10] <wikibugs>	 (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2949/console" [puppet] - 10https://gerrit.wikimedia.org/r/1047056 (https://phabricator.wikimedia.org/T365881) (owner: 10Btullis)
[11:43:15] <wikibugs>	 (03PS3) 10Jcrespo: dbbackups: Pause s3/db1240 snapshots until load completes [puppet] - 10https://gerrit.wikimedia.org/r/1047055 (https://phabricator.wikimedia.org/T367162)
[11:43:57] <wikibugs>	 (03PS2) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487)
[11:44:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[11:44:35] <wikibugs>	 (03CR) 10Jcrespo: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/1047055 (https://phabricator.wikimedia.org/T367162) (owner: 10Jcrespo)
[11:45:17] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] mediawiki: switch eqiad to use the mw-mcrouter daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047050 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[11:45:20] <wikibugs>	 (03PS1) 10Dreamrimmer: Add VL namespace alias to Azerbaijani Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047057 (https://phabricator.wikimedia.org/T367264)
[11:45:31] <wikibugs>	 (03PS3) 10EoghanGaffney: lists: Update DNS records to use host IP for lists1004 [dns] - 10https://gerrit.wikimedia.org/r/1047054 (https://phabricator.wikimedia.org/T367521)
[11:46:04] <wikibugs>	 (03Abandoned) 10Alexandros Kosiaris: mw-mcrouter: Switch helmfile.d to use the newer cache module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038860 (owner: 10Alexandros Kosiaris)
[11:46:10] <wikibugs>	 (03PS3) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487)
[11:46:32] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[11:46:41] <wikibugs>	 (03PS1) 10Hashar: Point its-phabricator to stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1047058
[11:47:01] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki: switch eqiad to use the mw-mcrouter daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047050 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[11:47:27] <wikibugs>	 (03CR) 10Jcrespo: [C:03+2] dbbackups: Pause s3/db1240 snapshots until load completes [puppet] - 10https://gerrit.wikimedia.org/r/1047055 (https://phabricator.wikimedia.org/T367162) (owner: 10Jcrespo)
[11:47:29] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047057 (https://phabricator.wikimedia.org/T367264) (owner: 10Dreamrimmer)
[11:47:58] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1047054 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney)
[11:48:02] <wikibugs>	 (03PS1) 10Jcrespo: Revert "dbbackups: Pause s3/db1240 snapshots until load completes" [puppet] - 10https://gerrit.wikimedia.org/r/1047059
[11:48:14] <wikibugs>	 (03CR) 10Jcrespo: [C:04-1] "Not yet." [puppet] - 10https://gerrit.wikimedia.org/r/1047059 (owner: 10Jcrespo)
[11:48:31] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+2] lists: Switch DB firewall rules to use primary host variable [puppet] - 10https://gerrit.wikimedia.org/r/1046785 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[11:48:36] <wikibugs>	 (03PS4) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487)
[11:48:51] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1208.eqiad.wmnet with OS bookworm
[11:48:53] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+2] lists: Update DNS records to use host IP for lists1004 [dns] - 10https://gerrit.wikimedia.org/r/1047054 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney)
[11:49:15] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] kask: don't allocate service port twice when using mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047052 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan)
[11:50:12] <wikibugs>	 (03Merged) 10jenkins-bot: kask: don't allocate service port twice when using mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047052 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan)
[11:50:32] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.dns.wipe-cache lists.wikimedia.org on all recursors
[11:50:35] <logmsgbot>	 !log eoghan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) lists.wikimedia.org on all recursors
[11:51:51] <wikibugs>	 (03PS1) 10Muehlenhoff: Deprecate system::role for DE test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1047060
[11:53:32] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:53:40] <wikibugs>	 (03PS4) 10Dreamrimmer: maiwiki: Remove 'CA' namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031533 (https://phabricator.wikimedia.org/T363667)
[11:53:52] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[11:54:00] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[11:54:12] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:55:13] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031533 (https://phabricator.wikimedia.org/T363667) (owner: 10Dreamrimmer)
[11:56:49] <wikibugs>	 (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[11:57:16] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] idp::build: Make the rsync setup depend on the OS [puppet] - 10https://gerrit.wikimedia.org/r/1047051 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff)
[11:58:00] <effie>	 !log Slowly pointing mediawiki in eqiad to mw-mcrouter daemonset - T346690
[11:58:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:05] <stashbot>	 T346690: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690
[11:58:18] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[11:59:42] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[11:59:50] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[11:59:51] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[12:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1200)
[12:00:54] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[12:01:11] <denisse>	 !incidents
[12:01:11] <sirenbot>	 4757 (ACKED)  Host db1165 (paged) - PING  - Packet loss = 100%
[12:03:53] <wikibugs>	 (03PS1) 10Slyngshede: R:idp_test MPIC went away. [labs/private] - 10https://gerrit.wikimedia.org/r/1047062
[12:04:11] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[12:04:46] <wikibugs>	 (03CR) 10Slyngshede: "Triggers PCC error, due to the remaining service configuration being missing." [labs/private] - 10https://gerrit.wikimedia.org/r/1047062 (owner: 10Slyngshede)
[12:04:48] <topranks>	 !log adding Netbox-generated IPv6 DNS records for wikikube-worker, mw and parse hosts
[12:04:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:08] <wikibugs>	 (03PS1) 10Muehlenhoff: cas::build: Fix creation of build directory [puppet] - 10https://gerrit.wikimedia.org/r/1047063 (https://phabricator.wikimedia.org/T367487)
[12:05:25] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add IPv6 records for mw, parse and wikikube-worker hosts - cmooney@cumin1002"
[12:05:30] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[12:05:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] R:idp_test MPIC went away. [labs/private] - 10https://gerrit.wikimedia.org/r/1047062 (owner: 10Slyngshede)
[12:06:05] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] R:idp_test MPIC went away. [labs/private] - 10https://gerrit.wikimedia.org/r/1047062 (owner: 10Slyngshede)
[12:06:08] <wikibugs>	 (03CR) 10Slyngshede: [V:03+2 C:03+2] R:idp_test MPIC went away. [labs/private] - 10https://gerrit.wikimedia.org/r/1047062 (owner: 10Slyngshede)
[12:06:34] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add IPv6 records for mw, parse and wikikube-worker hosts - cmooney@cumin1002"
[12:06:34] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[12:07:09] <wikibugs>	 (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[12:08:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] cas::build: Fix creation of build directory [puppet] - 10https://gerrit.wikimedia.org/r/1047063 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff)
[12:14:19] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9903045 (10ABran-WMF)
[12:14:33] <icinga-wm_>	 PROBLEM - mailman3_runners on lists1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:14:37] <icinga-wm_>	 PROBLEM - mailman3 on lists1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman3/bin/master https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:14:43] <icinga-wm_>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:14:43] <icinga-wm_>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:14:50] <eoghan>	 Mailman errors are me, silencing again for a bit.
[12:15:02] <logmsgbot>	 !log eoghan@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on lists[1001,1004,2001].wikimedia.org with reason: Mailman migration
[12:15:17] <logmsgbot>	 !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lists[1001,1004,2001].wikimedia.org with reason: Mailman migration
[12:15:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: ferm.service on kubernetes2056:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:15:30] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9903046 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=33783771-f385-4d8a-9005-972d...
[12:16:27] <wikibugs>	 (03PS5) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487)
[12:18:09] <wikibugs>	 (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[12:18:35] <icinga-wm_>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 1.007 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:18:35] <icinga-wm_>	 RECOVERY - mailman3 on lists1004 is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman3/bin/master https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:19:51] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] "to be consistent with the general hiera structure please use ulsfo/profile/cache/haproxy.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[12:20:25] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: ferm.service on kubernetes2056:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:20:27] <wikibugs>	 (03CR) 10Muehlenhoff: P:idp Allow upgrade to Tomcat 10. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[12:21:27] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] Deprecate system::role for DE test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1047060 (owner: 10Muehlenhoff)
[12:22:26] <moritzm>	 !log rebalance ganeti eqiad/D following reboots
[12:22:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:32] <wikibugs>	 (03PS1) 10Slyngshede: P:idp::build Add fakeroot build dependency. [puppet] - 10https://gerrit.wikimedia.org/r/1047064 (https://phabricator.wikimedia.org/T367487)
[12:23:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:idp::build Add fakeroot build dependency. [puppet] - 10https://gerrit.wikimedia.org/r/1047064 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[12:24:59] <wikibugs>	 (03CR) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[12:25:25] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:28:58] <wikibugs>	 (03PS1) 10Effie Mouzeli: Revert "mediawiki: switch eqiad to use the mw-mcrouter daemonset" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047065
[12:29:06] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2003.codfw.wmnet
[12:30:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:31:35] <icinga-wm_>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52197 bytes in 0.240 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:31:45] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+2] Revert "mediawiki: switch eqiad to use the mw-mcrouter daemonset" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047065 (owner: 10Effie Mouzeli)
[12:32:39] <wikibugs>	 (03PS1) 10Slyngshede: Update Debian package dependencies for CAS 7.X [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487)
[12:33:04] <wikibugs>	 (03Abandoned) 10Slyngshede: P:idp::build Add fakeroot build dependency. [puppet] - 10https://gerrit.wikimedia.org/r/1047064 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[12:33:25] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "mediawiki: switch eqiad to use the mw-mcrouter daemonset" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047065 (owner: 10Effie Mouzeli)
[12:33:41] <wikibugs>	 (03PS4) 10Fabfur: hiera: upgrade haproxy to 2.8 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756)
[12:33:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] P:idp Allow upgrade to Tomcat 10. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[12:34:27] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[12:34:42] <wikibugs>	 (03CR) 10Fabfur: "ack tnx" [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[12:35:14] <wikibugs>	 (03PS6) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487)
[12:35:16] <wikibugs>	 (03CR) 10Stevemunene: [C:03+1] ATS: replace service by discovery record for datahub-next [puppet] - 10https://gerrit.wikimedia.org/r/1047040 (https://phabricator.wikimedia.org/T367768) (owner: 10Brouberol)
[12:35:25] <jinxer-wm>	 FIRING: [6x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:35:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[12:35:58] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2003.codfw.wmnet
[12:36:42] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[12:37:25] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 43.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:40:11] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[12:40:25] <jinxer-wm>	 FIRING: [7x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:42:28] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: upgrade haproxy to 2.8 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[12:42:29] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9903125 (10eoghan)
[12:42:32] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[12:42:35] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[12:42:52] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9903129 (10eoghan)
[12:42:56] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[12:42:58] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[12:43:21] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to 2.8 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur)
[12:44:05] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+2] lists: Allow mail to be received on lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/1046786 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[12:45:25] <jinxer-wm>	 FIRING: [13x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:46:37] <wikibugs>	 (03CR) 10Jforrester: "We could change to the powered-by-Wikimedia one that won't change?" [puppet] - 10https://gerrit.wikimedia.org/r/1047034 (https://phabricator.wikimedia.org/T256190) (owner: 10Ladsgroup)
[12:47:05] <wikibugs>	 (03CR) 10Elukey: Prepare for netbox-dev (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[12:47:05] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[12:47:23] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Set fifo-log-demux prometheus port for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1047070 (https://phabricator.wikimedia.org/T364383)
[12:47:24] <fabfur>	 !log upgrade haproxy to v2.8.10 on all ulsfo cp hosts (T367756)
[12:47:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:28] <stashbot>	 T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756
[12:48:06] <logmsgbot>	 !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[12:49:00] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047070 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez)
[12:49:18] <wikibugs>	 (03PS2) 10Slyngshede: Update Debian package dependencies for CAS 7.X [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487)
[12:49:26] <wikibugs>	 (03PS1) 10Marostegui: Revert^2 "db1160: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1047071
[12:49:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1160', diff saved to https://phabricator.wikimedia.org/P65155 and previous config saved to /var/cache/conftool/dbconfig/20240618-124945-root.json
[12:50:25] <jinxer-wm>	 FIRING: [14x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:50:28] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] Revert^2 "db1160: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1047071 (owner: 10Marostegui)
[12:51:07] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo
[12:51:51] <marostegui>	 !log Deploy schema change on old s4 eqiad master db1160 dbmaint T364069
[12:51:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:55] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[12:52:06] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo
[12:52:57] <wikibugs>	 (03CR) 10Muehlenhoff: Update Debian package dependencies for CAS 7.X (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[12:53:31] <vgutierrez>	 !log disable puppet on A:cp-eqsin before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047070 - T364383
[12:53:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:36] <stashbot>	 T364383: Update fifo_log_demux puppet module to support new parameters - https://phabricator.wikimedia.org/T364383
[12:53:55] <wikibugs>	 (03CR) 10Muehlenhoff: Update Debian package dependencies for CAS 7.X (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[12:54:46] <wikibugs>	 (03PS1) 10Ssingh: install_server: update NTP server anycast address for d-i [puppet] - 10https://gerrit.wikimedia.org/r/1047073 (https://phabricator.wikimedia.org/T366360)
[12:55:17] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] hiera: Set fifo-log-demux prometheus port for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1047070 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez)
[12:55:25] <jinxer-wm>	 FIRING: [15x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:55:56] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2950/console" [puppet] - 10https://gerrit.wikimedia.org/r/1047073 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh)
[12:56:06] <wikibugs>	 (03PS1) 10Ssingh: wikimedia.org: switch ntp.$site to ntp-a.anycast.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1047074 (https://phabricator.wikimedia.org/T366360)
[12:56:48] <vgutierrez>	 !log rolling upgrade on A:cp-eqsin to fifo-log-demux 0.7.5 - T364383
[12:56:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:15] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mcrouter: Temporarily disable in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047075 (https://phabricator.wikimedia.org/T346690)
[12:58:53] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on wikikube-ctrl1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[12:59:15] <wikibugs>	 (03PS3) 10Ssingh: config/common: update list of ntp_servers to use anycast NTP servers [homer/public] - 10https://gerrit.wikimedia.org/r/1046737 (https://phabricator.wikimedia.org/T366360)
[12:59:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[12:59:26] <wikibugs>	 (03CR) 10Ssingh: config/common: update list of ntp_servers to use anycast NTP servers (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1046737 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1300).
[13:00:05] <jouncebot>	 DreamRimmer: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:10] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Update Debian package dependencies for CAS 7.X [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[13:00:18] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] Update Debian package dependencies for CAS 7.X (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[13:00:22] <wikibugs>	 (03CR) 10Slyngshede: [V:03+2 C:03+2] Update Debian package dependencies for CAS 7.X [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede)
[13:00:25] <jinxer-wm>	 FIRING: [14x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:01:03] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] mcrouter: Temporarily disable in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047075 (https://phabricator.wikimedia.org/T346690) (owner: 10Alexandros Kosiaris)
[13:01:05] <wikibugs>	 (03CR) 10Ssingh: "Thanks for doing that! CR updated for the comment below. This will be merged later but I wanted to get the reviews in first." [homer/public] - 10https://gerrit.wikimedia.org/r/1046737 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh)
[13:01:35] <Lucas_WMDE>	 o/
[13:01:50] <wikibugs>	 (03CR) 10Arnaudb: mariadb: bugfixes mysql_legacy (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: 10Arnaudb)
[13:02:21] <DreamRimmer>	 I am around
[13:02:27] <wikibugs>	 (03Merged) 10jenkins-bot: mcrouter: Temporarily disable in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047075 (https://phabricator.wikimedia.org/T346690) (owner: 10Alexandros Kosiaris)
[13:02:48] <wikibugs>	 (03PS6) 10Ayounsi: Prepare for netbox-dev [puppet] - 10https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275)
[13:02:50] <Lucas_WMDE>	 any other deployers around? I have a meeting in 30 minutes, so I’m not sure I’ll be able to deploy both changes
[13:04:00] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-coord1004.eqiad.wmnet
[13:04:55] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: sync
[13:05:03] <Lucas_WMDE>	 well, let’s start with the azwiktionary namespace alias then
[13:05:22] <akosiaris>	 Lucas_WMDE: gimme 30 seconds, I am disabling the mcrouter stuff in codfw
[13:05:25] <jinxer-wm>	 FIRING: [14x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:05:28] <Lucas_WMDE>	 akosiaris: ack
[13:05:34] <Lucas_WMDE>	 the patch needs an update anyway, I just noticed
[13:06:05] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:04-1] Add VL namespace alias to Azerbaijani Wiktionary (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047057 (https://phabricator.wikimedia.org/T367264) (owner: 10Dreamrimmer)
[13:06:12] <Lucas_WMDE>	 DreamRimmer: ^
[13:06:20] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: sync
[13:06:21] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: sync
[13:06:37] <icinga-wm_>	 RECOVERY - Host elastic2099 is UP: PING WARNING - Packet loss = 80%, RTA = 30.36 ms
[13:07:01] <DreamRimmer>	 yeah
[13:07:11] <Lucas_WMDE>	 (#RandomWikiLove for diffConfig, amazing feature to have :))
[13:07:35] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: sync
[13:07:36] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: sync
[13:07:47] <wikibugs>	 (03PS1) 10Vgutierrez: hiera,openldap::replica: Enable IPIP on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1047076 (https://phabricator.wikimedia.org/T367861)
[13:07:57] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: sync
[13:07:58] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: sync
[13:08:28] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047076 (https://phabricator.wikimedia.org/T367861) (owner: 10Vgutierrez)
[13:08:46] <akosiaris>	 doing mw-web now, should be done pretty soon
[13:09:14] <wikibugs>	 (03PS1) 10Jforrester: Use isEnumType in selector and isCustomEnum for creating literals [extensions/WikiLambda] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1047077 (https://phabricator.wikimedia.org/T367159)
[13:09:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:09:24] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: sync
[13:09:25] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: sync
[13:10:24] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1004.eqiad.wmnet
[13:10:25] <jinxer-wm>	 FIRING: [14x] SystemdUnitFailed: ferm.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:10:45] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: sync
[13:11:15] <wikibugs>	 (03CR) 10Ssingh: dnsbox: announce ntp-[abc].anycast.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1046685 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh)
[13:12:22] <akosiaris>	 Lucas_WMDE: I am done
[13:12:30] <akosiaris>	 thanks for your patience
[13:12:32] <eoghan>	 hnowlan: We're finished with mailman now, FYI
[13:12:38] <Lucas_WMDE>	 np, I’m still waiting for the new patch set anyway :)
[13:13:01] <icinga-wm_>	 PROBLEM - Host elastic2099 is DOWN: PING CRITICAL - Packet loss = 100%
[13:13:58] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: mw-mcrouter: add ClusterIP for codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047032 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[13:14:15] <jinxer-wm>	 RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:14:29] <wikibugs>	 (03PS2) 10Dreamrimmer: Add VL namespace alias to Azerbaijani Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047057 (https://phabricator.wikimedia.org/T367264)
[13:14:52] <wikibugs>	 (03CR) 10Vgutierrez: "we need to depool ldap-ro & ldap-ro-ssl on codfw before proceeding with this CR" [puppet] - 10https://gerrit.wikimedia.org/r/1047076 (https://phabricator.wikimedia.org/T367861) (owner: 10Vgutierrez)
[13:15:15] <wikibugs>	 (03CR) 10Dreamrimmer: Add VL namespace alias to Azerbaijani Wiktionary (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047057 (https://phabricator.wikimedia.org/T367264) (owner: 10Dreamrimmer)
[13:15:25] <jinxer-wm>	 FIRING: [18x] SystemdUnitFailed: ferm.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:15:42] <DreamRimmer>	 Lucas_WMDE: done
[13:15:54] <hnowlan>	 eoghan: ack, thanks! 
[13:15:57] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "Cool, thx for the explanation" [puppet] - 10https://gerrit.wikimedia.org/r/1046685 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh)
[13:16:16] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-mariadb1002.eqiad.wmnet
[13:16:17] <Lucas_WMDE>	 looking
[13:16:20] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[13:16:22] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[13:16:36] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] "lgtm !" [homer/public] - 10https://gerrit.wikimedia.org/r/1046737 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh)
[13:16:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047057 (https://phabricator.wikimedia.org/T367264) (owner: 10Dreamrimmer)
[13:16:50] <Lucas_WMDE>	 let’s see if it finishes within 14 minutes…
[13:17:08] <Lucas_WMDE>	 (and I need to remember to also run that maintenance script afterwards)
[13:17:14] <Lucas_WMDE>	 (namespaceDupes)
[13:17:18] <wikibugs>	 (03Merged) 10jenkins-bot: Add VL namespace alias to Azerbaijani Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047057 (https://phabricator.wikimedia.org/T367264) (owner: 10Dreamrimmer)
[13:17:33] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T367648#9903242 (10phaultfinder)
[13:17:35] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] mw-mcrouter: add ClusterIP for codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047032 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[13:17:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1047057|Add VL namespace alias to Azerbaijani Wiktionary (T367264)]]
[13:17:54] <stashbot>	 T367264: Add "VL" namespace alias to Azerbaijani Wiktionary - https://phabricator.wikimedia.org/T367264
[13:18:34] <wikibugs>	 (03Merged) 10jenkins-bot: mw-mcrouter: add ClusterIP for codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047032 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[13:19:52] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.remove-downtime for db1208.eqiad.wmnet
[13:19:53] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1208.eqiad.wmnet
[13:20:25] <jinxer-wm>	 FIRING: [17x] SystemdUnitFailed: ferm.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:21:09] <wikibugs>	 (03PS4) 10Cathal Mooney: Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348)
[13:21:25] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[13:22:22] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, dreamrimmer: Backport for [[gerrit:1047057|Add VL namespace alias to Azerbaijani Wiktionary (T367264)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:22:41] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-mariadb1002.eqiad.wmnet
[13:22:58] <Lucas_WMDE>	 DreamRimmer: can you test?
[13:23:01] <Lucas_WMDE>	 (looks good to me so far)
[13:23:15] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney)
[13:23:16] <DreamRimmer>	 doing
[13:23:32] <DreamRimmer>	 working
[13:23:40] <DreamRimmer>	 go for it
[13:23:45] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, dreamrimmer: Continuing with sync
[13:23:46] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:23:47] <Lucas_WMDE>	 ok!
[13:25:13] <wikibugs>	 (03PS8) 10Btullis: Add WMF customisations to the upstream ceph-csi-rbd chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472)
[13:25:13] <wikibugs>	 (03PS12) 10Btullis: Deploy the ceph-csi-rbd chart to dse-k8s with default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028938 (https://phabricator.wikimedia.org/T364472)
[13:25:13] <wikibugs>	 (03PS18) 10Btullis: Add a values file for the ceph-csi plugin on dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031589 (https://phabricator.wikimedia.org/T327259)
[13:25:25] <jinxer-wm>	 FIRING: [15x] SystemdUnitFailed: ferm.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:26:47] <wikibugs>	 (03CR) 10Btullis: Add WMF customisations to the upstream ceph-csi-rbd chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis)
[13:28:03] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply
[13:28:10] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply
[13:28:53] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on wikikube-ctrl1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:29:30] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-db1002.eqiad.wmnet
[13:30:25] <jinxer-wm>	 FIRING: [9x] SystemdUnitFailed: ferm.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:31:29] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Partially revert "mcrouter: Temporarily disable in codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047080
[13:32:01] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Partially revert "mcrouter: Temporarily disable in codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047080
[13:32:25] <wikibugs>	 (03PS3) 10Alexandros Kosiaris: Partially revert "mcrouter: Temporarily disable in codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047080
[13:32:28] <wikibugs>	 (03PS1) 10Elukey: WIP: alternative proposal for netbox dev refactor [puppet] - 10https://gerrit.wikimedia.org/r/1047081
[13:33:57] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1047057|Add VL namespace alias to Azerbaijani Wiktionary (T367264)]] (duration: 16m 07s)
[13:34:02] <stashbot>	 T367264: Add "VL" namespace alias to Azerbaijani Wiktionary - https://phabricator.wikimedia.org/T367264
[13:34:12] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes azwiktionary --fix # T367264; 7 pages fixed, 10 links fixed
[13:34:15] <jinxer-wm>	 FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:34:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:18] <Lucas_WMDE>	 also, why did scap exit nonzero?
[13:34:31] <Lucas_WMDE>	 ah, mw2321 failed to docker pull. doesn’t matter then
[13:34:48] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2951/co" [puppet] - 10https://gerrit.wikimedia.org/r/1047081 (owner: 10Elukey)
[13:35:02] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] Partially revert "mcrouter: Temporarily disable in codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047080 (owner: 10Alexandros Kosiaris)
[13:35:09] * Lucas_WMDE afk
[13:35:18] <Lucas_WMDE>	 if someone else can deploy the other config change that’d be great…
[13:35:25] <jinxer-wm>	 FIRING: [14x] SystemdUnitFailed: ferm.service on kubernetes1025:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:35:43] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-db1002.eqiad.wmnet
[13:36:22] <DreamRimmer>	 Lucas_WMDE: Thanks for your valuable time :)
[13:36:34] <wikibugs>	 (03Merged) 10jenkins-bot: Partially revert "mcrouter: Temporarily disable in codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047080 (owner: 10Alexandros Kosiaris)
[13:37:05] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe1002.eqiad.wmnet with OS bookworm
[13:37:18] <wikibugs>	 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903331 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe1002.eqiad.wmnet with OS bookworm
[13:39:12] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete profile::java::java_8 [puppet] - 10https://gerrit.wikimedia.org/r/1047083
[13:39:15] <jinxer-wm>	 RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:39:55] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo
[13:40:11] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo
[13:40:11] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] Move moss-fe{1,2}001 back to apus cluster [puppet] - 10https://gerrit.wikimedia.org/r/1047033 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[13:40:20] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9903343 (10Jhancock.wm) The serial number is just a barcode. There's nothing else on that label. I've looked over the guide and I don't see anything in particular that s...
[13:40:25] <jinxer-wm>	 RESOLVED: [14x] SystemdUnitFailed: ferm.service on kubernetes1025:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:41:44] <wikibugs>	 (03PS2) 10Btullis: Update the contactgroups for all wdqs and wcqs servers [puppet] - 10https://gerrit.wikimedia.org/r/1047056 (https://phabricator.wikimedia.org/T365881)
[13:41:45] <wikibugs>	 (03PS1) 10Btullis: Remove conda repository from reprepro configuration [puppet] - 10https://gerrit.wikimedia.org/r/1047085 (https://phabricator.wikimedia.org/T364550)
[13:42:23] <wikibugs>	 06SRE, 06collaboration-services, 06DBA: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9903360 (10Ladsgroup) This is done I think but then maybe we should drop the grant on lists1001 then?
[13:43:40] <wikibugs>	 (03CR) 10Btullis: "Once this is removed, we will still have to cleanup reprepro by hand, as per: https://wikitech.wikimedia.org/wiki/Reprepro#Removing_a_comp" [puppet] - 10https://gerrit.wikimedia.org/r/1047085 (https://phabricator.wikimedia.org/T364550) (owner: 10Btullis)
[13:44:58] <wikibugs>	 (03PS1) 10Muehlenhoff: profile::java: Add support for Java 21 [puppet] - 10https://gerrit.wikimedia.org/r/1047086 (https://phabricator.wikimedia.org/T367487)
[13:45:15] <wikibugs>	 (03PS2) 10Elukey: WIP: alternative proposal for netbox dev refactor [puppet] - 10https://gerrit.wikimedia.org/r/1047081
[13:45:35] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392 (owner: 10Jcrespo)
[13:45:52] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[13:46:02] <wikibugs>	 06SRE, 06collaboration-services, 06DBA: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9903367 (10Marostegui) >>! In T367833#9903360, @Ladsgroup wrote: > This is done I think but then maybe we should drop the grant on lists1001 then?  +1 - we should review puppet grants in case we ment...
[13:46:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1047085 (https://phabricator.wikimedia.org/T364550) (owner: 10Btullis)
[13:46:17] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] mariadb: removes underscore on striker database name [puppet] - 10https://gerrit.wikimedia.org/r/1020709 (https://phabricator.wikimedia.org/T360149) (owner: 10Arnaudb)
[13:46:31] <wikibugs>	 (03PS1) 10Eevans: restbase: upgrade cluster to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/1047087 (https://phabricator.wikimedia.org/T350567)
[13:46:41] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2952/co" [puppet] - 10https://gerrit.wikimedia.org/r/1047081 (owner: 10Elukey)
[13:47:12] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[13:47:33] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Let's also remove modules/aptrepo/files/updates-keys/*_conda.gpg, though." [puppet] - 10https://gerrit.wikimedia.org/r/1047085 (https://phabricator.wikimedia.org/T364550) (owner: 10Btullis)
[13:47:48] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: sync
[13:47:55] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync
[13:49:13] <wikibugs>	 (03CR) 10Jcrespo: [C:03+1] mariadb: removes underscore on striker database name [puppet] - 10https://gerrit.wikimedia.org/r/1020709 (https://phabricator.wikimedia.org/T360149) (owner: 10Arnaudb)
[13:49:30] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[13:49:30] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe1002.eqiad.wmnet with OS bookworm
[13:49:39] <wikibugs>	 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903409 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe1002.eqiad.wmnet with OS bookworm executed with errors: - moss-fe1002 (...
[13:49:44] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047086 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff)
[13:49:54] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe1002.eqiad.wmnet with OS bookworm
[13:50:04] <wikibugs>	 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903416 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe1002.eqiad.wmnet with OS bookworm
[13:50:58] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[13:50:59] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[13:51:28] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[13:51:42] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] ATS: replace service by discovery record for datahub-next [puppet] - 10https://gerrit.wikimedia.org/r/1047040 (https://phabricator.wikimedia.org/T367768) (owner: 10Brouberol)
[13:52:22] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[13:52:22] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply
[13:52:22] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[13:52:22] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[13:52:23] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[13:52:26] <jinxer-wm>	 RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[13:52:29] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[13:52:29] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply
[13:52:29] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[13:52:58] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9903425 (10Eevans)
[13:52:59] <wikibugs>	 07Puppet, 06Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113#9903423 (10CDanis) I think the last step to do here is to validate that any rsync failures will get reported on IRC...
[13:53:53] <wikibugs>	 (03PS1) 10Muehlenhoff: Deprecate system::role for Cloud VPS-specific roles [puppet] - 10https://gerrit.wikimedia.org/r/1047090
[13:54:17] <wikibugs>	 (03PS2) 10Arnaudb: mariadb: prometheus config tweak for db1125 [puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278)
[13:54:19] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[13:54:26] <logmsgbot>	 !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[13:55:01] <wikibugs>	 (03CR) 10Arnaudb: "like Patchset 2?" [puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb)
[13:56:06] <wikibugs>	 (03CR) 10Ssingh: "Looks good, one comment inline:" [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall)
[13:57:23] <wikibugs>	 (03CR) 10Ladsgroup: "yup!" [puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb)
[13:57:43] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[13:57:43] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb)
[13:57:46] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[13:57:50] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: sync
[13:57:57] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync
[14:02:30] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be1001.eqiad.wmnet with OS bookworm
[14:02:44] <wikibugs>	 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903458 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be1001.eqiad.wmnet with OS bookworm
[14:03:23] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe1002.eqiad.wmnet with reason: host reimage
[14:05:19] <wikibugs>	 (03CR) 10Volans: "I've tried to explain my suggestion with some suggested edit. LMK if it's more clear now." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: 10Arnaudb)
[14:06:39] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe1002.eqiad.wmnet with reason: host reimage
[14:07:15] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Remove obsolete profile::java::java_8 [puppet] - 10https://gerrit.wikimedia.org/r/1047083 (owner: 10Muehlenhoff)
[14:08:50] <wikibugs>	 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9903462 (10Kgraessle)
[14:09:33] <swfrench-wmf>	 !log included conftool 3.0.0 into buster-wikimedia on apt.w.o for T365123
[14:09:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:38] <stashbot>	 T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123
[14:10:31] <wikibugs>	 (03PS3) 10JMeybohm: helmfile_psp: Remove seccomp/apparmor mutations from PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020313 (https://phabricator.wikimedia.org/T273507)
[14:11:09] <elukey>	 jayme: wow it is happening
[14:12:56] <wikibugs>	 (03CR) 10Elukey: "Eric should we do it or do we wait for the mesh changes?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1025791 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans)
[14:13:38] <wikibugs>	 (03PS1) 10EoghanGaffney: lists: Add symlink to /var/lib/mailman3 when using different root [puppet] - 10https://gerrit.wikimedia.org/r/1047094 (https://phabricator.wikimedia.org/T331706)
[14:15:05] <wikibugs>	 (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2953/co" [puppet] - 10https://gerrit.wikimedia.org/r/1047094 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[14:16:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete profile::java::java_8 [puppet] - 10https://gerrit.wikimedia.org/r/1047083 (owner: 10Muehlenhoff)
[14:16:50] <wikibugs>	 (03Abandoned) 10Vgutierrez: hiera: Set prometheus port on fifo-log-demux@cp4044 [puppet] - 10https://gerrit.wikimedia.org/r/1029213 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez)
[14:17:43] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[14:17:46] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[14:17:49] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: sync
[14:18:06] * Lucas_WMDE back fwiw
[14:18:58] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mw-web, mw-api-ext: Raise replicas for 100% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047046 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert)
[14:19:51] <wikibugs>	 (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Raise replicas for 100% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047046 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert)
[14:19:55] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1001.eqiad.wmnet with reason: host reimage
[14:20:16] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: sync
[14:20:18] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync
[14:20:26] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[14:20:35] <wikibugs>	 (03CR) 10Brennen Bearnes: [C:03+2] AVA: Check earlier if acting user is admin [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039766 (https://phabricator.wikimedia.org/T366811) (owner: 10Aklapper)
[14:20:37] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[14:20:38] <wikibugs>	 (03CR) 10Brennen Bearnes: [V:03+2 C:03+2] AVA: Check earlier if acting user is admin [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039766 (https://phabricator.wikimedia.org/T366811) (owner: 10Aklapper)
[14:20:44] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[14:20:51] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[14:21:09] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[14:21:28] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[14:21:34] <wikibugs>	 (03PS4) 10Aklapper: Count user transactions in Maniphest only in last two million rows [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039786 (https://phabricator.wikimedia.org/T366811)
[14:21:34] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[14:21:43] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[14:21:45] <wikibugs>	 (03CR) 10Brennen Bearnes: [C:03+2] Count user transactions in Maniphest only in last two million rows [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039786 (https://phabricator.wikimedia.org/T366811) (owner: 10Aklapper)
[14:21:46] <wikibugs>	 (03CR) 10Brennen Bearnes: [V:03+2 C:03+2] Count user transactions in Maniphest only in last two million rows [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039786 (https://phabricator.wikimedia.org/T366811) (owner: 10Aklapper)
[14:22:09] <wikibugs>	 (03PS2) 10Aklapper: Limit querying latest user transactions in Maniphest to recent IDs [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039791 (https://phabricator.wikimedia.org/T366811)
[14:22:10] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:22:15] <wikibugs>	 (03CR) 10Brennen Bearnes: [C:03+2] Limit querying latest user transactions in Maniphest to recent IDs [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039791 (https://phabricator.wikimedia.org/T366811) (owner: 10Aklapper)
[14:22:17] <wikibugs>	 (03CR) 10Brennen Bearnes: [V:03+2 C:03+2] Limit querying latest user transactions in Maniphest to recent IDs [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039791 (https://phabricator.wikimedia.org/T366811) (owner: 10Aklapper)
[14:22:39] <wikibugs>	 07Puppet, 06Data-Persistence, 10database-backups: Possible weird interaction between es backups and puppet runs leading to failures - https://phabricator.wikimedia.org/T367882#9903563 (10jcrespo)
[14:22:51] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1001.eqiad.wmnet with reason: host reimage
[14:23:05] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney)
[14:23:13] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.presto.reboot-workers for Presto an-presto cluster: Reboot Presto nodes
[14:23:42] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] "It is a short click for a man, a huge leap for mankind." [puppet] - 10https://gerrit.wikimedia.org/r/1047047 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert)
[14:23:50] <claime>	 Here we go people
[14:24:14] <claime>	 !log trafficserver: move 100% of traffic to mw-on-k8s - T362323
[14:24:14] <wikibugs>	 07Puppet, 06Data-Persistence, 10database-backups: Possible weird interaction between es backups and puppet runs leading to failures - https://phabricator.wikimedia.org/T367882#9903572 (10jcrespo)
[14:24:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:24:18] <stashbot>	 T362323: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323
[14:24:21] <Lucas_WMDE>	 :O :O :O
[14:24:27] * arnaudb holds his breath
[14:24:32] <jynus>	 wow
[14:24:33] <jynus>	 kudos
[14:24:34] <_joe_>	 claime: merged
[14:25:03] <claime>	 https://grafana.wikimedia.org/goto/FATzf8UIg?orgId=1
[14:25:14] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9903575 (10Jhancock.wm)
[14:25:19] <wikibugs>	 (03Merged) 10jenkins-bot: Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney)
[14:25:24] <claime>	 Then it's into the logs to find what's still calling the bare metal cluster :p
[14:25:40] <Lucas_WMDE>	 hehe
[14:25:42] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9903577 (10Jhancock.wm)
[14:26:45] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "looks good to me, please see inline comments" [puppet] - 10https://gerrit.wikimedia.org/r/1042278 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[14:27:19] <_joe_>	 claime: are you running puppet on the cp hosts or should I?
[14:27:44] <claime>	 _joe_: can do
[14:27:55] <_joe_>	 claime: no doing it myself
[14:28:06] <_joe_>	 I wanted to be sure I wasn't stepping on your toes
[14:28:07] <claime>	 jerk :p
[14:28:23] <claime>	 (I usually let it roll out on its own)
[14:29:31] <wikibugs>	 (03PS3) 10Arnaudb: mariadb: prometheus config tweak for db1125 [puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278)
[14:30:15] <wikibugs>	 (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb)
[14:31:42] * Lucas_WMDE watches line go up
[14:32:36] <_joe_>	 I prefer to watch the phys hosts line go down
[14:32:38] <_joe_>	 :D
[14:32:51] <claime>	 https://grafana.wikimedia.org/goto/W8wof8USR?orgId=1
[14:32:57] <claime>	 This one
[14:33:05] <_joe_>	 yep
[14:33:09] <_joe_>	 some baseline will remain
[14:33:11] <Lucas_WMDE>	 heh, looks much more significant there \o/
[14:33:14] <_joe_>	 and that's LVS checks
[14:33:35] <claime>	 yes, and also probably some remnants from somewhere internal
[14:33:44] <wikibugs>	 (03PS1) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/1047097
[14:34:19] <_joe_>	 claime: also there's at least one cp host with puppet disabled I'd say
[14:34:26] <wikibugs>	 (03PS3) 10Hnowlan: service: add basic config for shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1043724 (https://phabricator.wikimedia.org/T357309)
[14:34:26] <wikibugs>	 (03PS1) 10Hnowlan: services_proxy: add shellbox-video listener [puppet] - 10https://gerrit.wikimedia.org/r/1047098 (https://phabricator.wikimedia.org/T357309)
[14:34:32] <jinxer-wm>	 FIRING: [2x] UdpMxIrcEchoThroughput: irc1002:9221 has relayed less than 100 messages over past 5 minutes - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput
[14:34:38] <claime>	 oh?
[14:34:42] <wikibugs>	 (03PS1) 10Superzerocool: cswiki: adding throttle rule, removing old throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047099 (https://phabricator.wikimedia.org/T367858)
[14:34:43] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2003.codfw.wmnet with OS bookworm
[14:34:57] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9903608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ml-staging2003.codfw.wmnet with OS boo...
[14:34:58] <_joe_>	 uhhh
[14:35:17] <_joe_>	 I fear irc is related to k8s
[14:35:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] CHANGELOG: add changelogs for release v0.6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/1047097 (owner: 10Cathal Mooney)
[14:35:53] <hnowlan>	 that fired yesterday also I believe
[14:35:58] <claime>	 yeah
[14:36:18] <_joe_>	 oh maybe we're only sending phys hosts to irc1002?
[14:36:24] <_joe_>	 irc1001 seems to be fine
[14:36:42] <claime>	 the rules are here
[14:36:50] <sukhe>	 !log enabling puppet and running puppet agent on cp4037 
[14:36:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:23] <claime>	 _joe_: mediawiki-config says it's not active/active?
[14:37:37] <_joe_>	 claime: it shouldn't be from my memory, yes
[14:37:49] <wikibugs>	 (03PS2) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/1047097
[14:38:46] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:00] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9903617 (10klausman) Machine is drained and off, so you're free to reseat memory etc. Let me know when it's back (and what we might...
[14:39:12] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1040.eqiad.wmnet with reason: T365984
[14:39:17] <stashbot>	 T365984: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984
[14:39:26] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: T365984
[14:39:28] <logmsgbot>	 !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4046.ulsfo.wmnet
[14:39:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1040 depool - T365984', diff saved to https://phabricator.wikimedia.org/P65156 and previous config saved to /var/cache/conftool/dbconfig/20240618-143951-arnaudb.json
[14:39:53] <claime>	 I'm more surprised that irc1002 is sending messages at all actually
[14:40:04] <claime>	 because if it's not active active, and irc1001 is sending messages
[14:40:26] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047099 (https://phabricator.wikimedia.org/T367858) (owner: 10Superzerocool)
[14:40:31] <logmsgbot>	 !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ml-serve2001.codfw.wmnet with reason: Hardware maintenance for memory errors
[14:40:47] <logmsgbot>	 !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ml-serve2001.codfw.wmnet with reason: Hardware maintenance for memory errors
[14:41:03] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9903626 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ebd7c06d-d85d-4a91-a22b-6101091bac81) set by klausman@c...
[14:42:08] <claime>	 bare metal is now serving 45rps (excluding jobrunners because of videoscaling)
[14:42:10] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[14:42:57] <wikibugs>	 (03CR) 10Hashar: [C:03+2] Point its-phabricator to stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1047058 (owner: 10Hashar)
[14:43:55] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] service: add basic config for shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1043724 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[14:44:15] <jynus>	 !log reenable puppet on backup2002
[14:44:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:21] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be1001.eqiad.wmnet with OS bookworm
[14:44:27] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:44:31] <wikibugs>	 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903639 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-be1001.eqiad.wmnet with OS bookworm completed: - moss-be1001 (**PASS**)...
[14:45:16] <wikibugs>	 (03PS3) 10Hashar: wmf: `bazel test` our plugins [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1046783
[14:46:34] <wikibugs>	 (03CR) 10Hashar: [C:03+2] wmf: `bazel test` our plugins [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1046783 (owner: 10Hashar)
[14:46:37] <icinga-wm_>	 PROBLEM - Host ml-cache2001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:46:53] <icinga-wm_>	 PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:47:20] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:40:00 on lsw1-f7-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f7-eqiad
[14:47:34] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:40:00 on lsw1-f7-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f7-eqiad
[14:47:49] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9903652 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0039bfdd-84ad-4638-9b4c-c0c23984e401) set by cmooney...
[14:48:46] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:49:00] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bookworm
[14:49:01] <wikibugs>	 (03PS1) 10EoghanGaffney: lists: Change lists sync to use quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706)
[14:49:14] <wikibugs>	 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be1003.eqiad.wmnet with OS bookworm
[14:49:28] <wikibugs>	 (03PS2) 10BCornwall: hiera: Unify ulsfo trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891)
[14:49:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] hiera: Unify ulsfo trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall)
[14:50:14] <wikibugs>	 (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2954/co" [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[14:50:30] <wikibugs>	 (03Merged) 10jenkins-bot: Point its-phabricator to stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1047058 (owner: 10Hashar)
[14:51:26] <jinxer-wm>	 FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[14:53:41] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host htmldumper1001.eqiad.wmnet
[14:54:24] <wikibugs>	 (03PS3) 10BCornwall: hiera: Unify ulsfo trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891)
[14:55:30] <wikibugs>	 (03Merged) 10jenkins-bot: wmf: `bazel test` our plugins [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1046783 (owner: 10Hashar)
[14:55:35] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2956/console" [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall)
[14:55:48] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:56:26] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[14:56:26] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on lsw1-f7-eqiad,lsw1-f7-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f7-eqiad
[14:56:42] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lsw1-f7-eqiad,lsw1-f7-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f7-eqiad
[14:56:53] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9903691 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b16e0477-5d40-4e59-950e-09e82271c822) set by cmooney...
[14:57:19] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:35:00 on an-worker[1172-1174].eqiad.wmnet,es1040.eqiad.wmnet,ms-be1081.eqiad.wmnet with reason: JunOS upgrade lsw1-f7-eqiad
[14:57:26] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] hiera: Unify ulsfo trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall)
[14:57:36] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:35:00 on an-worker[1172-1174].eqiad.wmnet,es1040.eqiad.wmnet,ms-be1081.eqiad.wmnet with reason: JunOS upgrade lsw1-f7-eqiad
[14:57:44] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9903694 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=80e189d2-8757-4138-ad14-1e0cf5cfbbdb) set by cmooney...
[14:58:05] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9903696 (10kamila) 05In progress→03Stalled
[14:58:05] <wikibugs>	 (03PS2) 10Klausman: hiera/conftool/manifest: Add ml-staging2003 as a k8s GPU host [puppet] - 10https://gerrit.wikimedia.org/r/1042227 (https://phabricator.wikimedia.org/T357415)
[14:58:39] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:59:01] <claime>	 Well that's too bad httpbb, but it's not a problem anymore :P
[14:59:31] <icinga-wm_>	 RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 143, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:00:04] <logmsgbot>	 !log mforns@deploy1002 Started deploy [airflow-dags/analytics@4f7d29a]: (no justification provided)
[15:00:04] <jouncebot>	 eoghan, jelto, arnoldokoth, and mutante: I, the Bot under the Fountain, call upon thee, The Deployer, to do SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1500).
[15:00:08] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host htmldumper1001.eqiad.wmnet
[15:00:13] <topranks>	 !log rebooting lsw1-f7-eqiad to upgrade JunOS on switch T365984
[15:00:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:00:30] <stashbot>	 T365984: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984
[15:00:32] <logmsgbot>	 !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@4f7d29a]: (no justification provided) (duration: 00m 28s)
[15:01:22] <wikibugs>	 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9903731 (10Clement_Goubert)
[15:02:25] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+2] CHANGELOG: add changelogs for release v0.6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/1047097 (owner: 10Cathal Mooney)
[15:02:43] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: deploy llama3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047106 (https://phabricator.wikimedia.org/T354870)
[15:03:26] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update
[15:03:40] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update
[15:03:43] <wikibugs>	 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9903729 (10Clement_Goubert) {F55438321}  🚀🚀🚀
[15:03:45] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update
[15:03:58] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update
[15:04:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T364069)', diff saved to https://phabricator.wikimedia.org/P65157 and previous config saved to /var/cache/conftool/dbconfig/20240618-150416-marostegui.json
[15:04:21] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[15:04:28] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/1047097 (owner: 10Cathal Mooney)
[15:04:28] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@ebe3a94]: deploy phab2002 for T367775
[15:04:34] <stashbot>	 T367775: Deploy Phabricator/Phorge 2024-06-18 - https://phabricator.wikimedia.org/T367775
[15:05:05] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@ebe3a94]: deploy phab2002 for T367775 (duration: 00m 36s)
[15:05:17] <wikibugs>	 (03CR) 10Elukey: [C:03+1] profile::java: Add support for Java 21 [puppet] - 10https://gerrit.wikimedia.org/r/1047086 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff)
[15:05:28] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@ebe3a94]: deploy phab1004 for T367775
[15:06:16] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@ebe3a94]: deploy phab1004 for T367775 (duration: 00m 47s)
[15:06:49] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[15:06:50] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[15:06:57] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe1002.eqiad.wmnet with OS bookworm
[15:07:26] <logmsgbot>	 !log brennen@deploy1002 Started deploy [phabricator/deployment@ef680d8]: revert phab1004 after breakage for T367775
[15:07:41] <logmsgbot>	 !log brennen@deploy1002 Finished deploy [phabricator/deployment@ef680d8]: revert phab1004 after breakage for T367775 (duration: 00m 15s)
[15:07:47] <wikibugs>	 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903763 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe1002.eqiad.wmnet with OS bookworm completed: - moss-fe1002 (**WARN**)...
[15:07:49] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage
[15:08:01] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:08:07] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:08:13] <sukhe>	 ? expected?
[15:08:30] <wikibugs>	 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9903765 (10Ladsgroup) {meme, src=itshappening}
[15:08:39] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:09:31] <icinga-wm_>	 PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:09:58] <claime>	 sukhe: don't think so
[15:10:55] <claime>	 Emperor ?
[15:10:58] <jinxer-wm>	 FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:11:02] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:11:06] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[15:11:11] <claime>	 !incidents
[15:11:11] <sirenbot>	 4757 (ACKED)  Host db1165 (paged) - PING  - Packet loss = 100%
[15:11:11] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage
[15:11:11] <sirenbot>	 4758 (UNACKED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[15:11:14] <Amir1>	 here
[15:11:17] <hnowlan>	 here
[15:11:19] <claime>	 !ack 4758
[15:11:19] <sirenbot>	 4758 (ACKED)  ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad)
[15:11:19] <Amir1>	 !incidnts 
[15:11:28] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] gitlab-settings: add timer for configure-projects [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes)
[15:12:02] <denisse>	 Here.
[15:12:07] <claime>	 p99 jumped hard up
[15:12:33] <Emperor>	 titan are the thanos-software front-ends, and godog knows about them
[15:12:48] <denisse>	 Looking.
[15:12:55] <hnowlan>	 rx went from 24mb/s to 1.2GB/s
[15:13:02] <denisse>	 Emperor: godog is on vacation.
[15:13:04] <hnowlan>	 something is blasting it 
[15:13:44] <denisse>	 I think it may be a query.
[15:13:46] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:14:02] <claime>	 denisse: tell us if we can help/how
[15:14:40] <cdanis>	 it is very likely a very large query
[15:14:42] <cdanis>	 https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=thanos&var-instance=All&from=now-1h&to=now
[15:14:54] <arnaudb>	 oof
[15:14:54] <cdanis>	 https://i.imgur.com/FlnOPaV.png
[15:15:07] <wikibugs>	 (03PS4) 10JMeybohm: admin_ng: Add toggles for PSP to PSS migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020313 (https://phabricator.wikimedia.org/T273507)
[15:15:26] <wikibugs>	 (03CR) 10Klausman: [C:03+1] ml-services: deploy llama3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047106 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos)
[15:15:34] <cdanis>	 sorry I need to join a meeting
[15:15:55] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin_ng: Add toggles for PSP to PSS migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020313 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm)
[15:15:58] <jinxer-wm>	 RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:15:58] <hnowlan>	 titan1001's saturation has dropped off at least
[15:16:00] <hnowlan>	 ah 
[15:16:02] <denisse>	 I think it'll self resolve, let me see if I can see the contents of the query.
[15:16:24] <hnowlan>	 yeah page has resolved
[15:17:06] <Emperor>	 probably worth adding to T356788
[15:17:06] <stashbot>	 T356788: thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788
[15:17:22] <Emperor>	 (which I think is where we've been tracking things-that-kill-titan)
[15:17:44] <denisse>	 Emperor: good idea, let me add it.
[15:18:11] <hnowlan>	 oh titan1001 recovered because it OOMkilled :) 
[15:18:19] <hnowlan>	 at 15:11
[15:18:20] <Amir1>	 for having a name like titan it seems a bit fragile 
[15:18:37] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] shellbox-video: initial helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková)
[15:18:40] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9903792 (10cmooney) Switch is back online after upgrade, everything looks good at first glance.
[15:18:55] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: deploy llama3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047106 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos)
[15:19:03] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[15:19:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P65158 and previous config saved to /var/cache/conftool/dbconfig/20240618-151923-marostegui.json
[15:19:30] <icinga-wm_>	 RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:20:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P65159 and previous config saved to /var/cache/conftool/dbconfig/20240618-152031-arnaudb.json
[15:20:38] <stashbot>	 T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad	 - https://phabricator.wikimedia.org/T365983
[15:20:53] <wikibugs>	 (03CR) 10Klausman: [C:03+2] hiera/conftool/manifest: Add ml-staging2003 as a k8s GPU host [puppet] - 10https://gerrit.wikimedia.org/r/1042227 (https://phabricator.wikimedia.org/T357415) (owner: 10Klausman)
[15:21:10] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9903802 (10VRiley-WMF) Hey @Eevans This is correct. The backplane was replaced. At this stage we can move forward with a motherboard replacement if you wish. I will be pulling it from a different...
[15:21:22] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9903803 (10SCherukuwada) Manager approves.
[15:21:25] <Emperor>	 semi-serious question, should we wait longer before p.aging for those alerts, since titan does typically self-resolve after OOMing?
[15:21:37] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox-video: initial helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková)
[15:21:53] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: deploy llama3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047106 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos)
[15:21:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1047087 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans)
[15:21:56] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] hiera: Unify ulsfo trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall)
[15:22:50] <Gerges>	 Hi
[15:23:02] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[15:23:45] <Gerges>	 How do I modify Bridgebot repo?
[15:24:08] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9903811 (10MatthewVernon) ms swift looks good, thanks.
[15:24:39] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9903813 (10VRiley-WMF) Hey @RKemper would Thursday work for you? Around 12:00 EST?
[15:25:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9903814 (10RKemper) >>! In T367442#9903813, @VRiley-WMF wrote: > Hey @RKemper would Thursday work for you? Around 12:00 EST?  @VRiley-WMF Sounds great!
[15:25:34] <kamila_>	 Gerges: does https://wikitech.wikimedia.org/wiki/Tool:Bridgebot perhaps help?
[15:26:05] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9903816 (10klausman) It looks like the primary interface can't see the network device (the console shows "media test failure, check cable".  {F55438869}
[15:26:23] <Gerges>	 https://gitlab.wikimedia.org/toolforge-repos/bridgebot/-/merge_requests/7
[15:26:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:26:33] <Gerges>	 I made a merge request
[15:26:47] <Gerges>	 But I don't know if this is true or not
[15:29:12] <wikibugs>	 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9903823 (10BCornwall) 05In progress→03Resolved
[15:29:22] <wikibugs>	 (03PS5) 10JMeybohm: admin_ng: Add toggles for PSP to PSS migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020313 (https://phabricator.wikimedia.org/T273507)
[15:29:38] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:29:44] <wikibugs>	 (03PS2) 10EoghanGaffney: lists: Change lists sync to use quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706)
[15:30:11] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
[15:30:14] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
[15:30:19] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: sync
[15:30:27] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be1003.eqiad.wmnet with OS bookworm
[15:30:38] <wikibugs>	 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903833 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-be1003.eqiad.wmnet with OS bookworm completed: - moss-be1003 (**PASS**)...
[15:31:13] <kamila_>	 Gerges: I have no idea who maintains bridgebot, but my guess would be that it would help to link to your merge request on the phab task and/or point the maintainer at it
[15:31:34] <wikibugs>	 (03PS3) 10Elukey: WIP: alternative proposal for netbox dev refactor [puppet] - 10https://gerrit.wikimedia.org/r/1047081
[15:31:58] <icinga-wm_>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 38 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:32:07] <Gerges>	 bd808: hi
[15:32:10] <jinxer-wm>	 RESOLVED: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:32:25] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9903844 (10eoghan)
[15:32:33] <wikibugs>	 (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2957/co" [puppet] - 10https://gerrit.wikimedia.org/r/1047081 (owner: 10Elukey)
[15:32:49] <Gerges>	 https://gitlab.wikimedia.org/toolforge-repos/bridgebot/-/merge_requests/7
[15:33:10] <wikibugs>	 (03PS1) 10Brennen Bearnes: gitlab-settings: use v1.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/1047110 (https://phabricator.wikimedia.org/T355097)
[15:33:31] <wikibugs>	 (03CR) 10Jelto: [C:03+1] gitlab-settings: use v1.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/1047110 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes)
[15:33:47] <Gerges>	 bd808: Do you have merge privileges in the Bridgebot repository?
[15:33:51] <dancy>	 Gerges: bd808 might be the right person, but he's out this week.
[15:33:56] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2958/console" [puppet] - 10https://gerrit.wikimedia.org/r/1043979 (owner: 10BCornwall)
[15:34:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] WIP: alternative proposal for netbox dev refactor [puppet] - 10https://gerrit.wikimedia.org/r/1047081 (owner: 10Elukey)
[15:34:07] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gitlab-settings: use v1.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/1047110 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes)
[15:34:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P65161 and previous config saved to /var/cache/conftool/dbconfig/20240618-153430-marostegui.json
[15:34:49] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+2] lists: Change lists sync to use quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[15:35:07] <wikibugs>	 (03PS3) 10EoghanGaffney: lists: Change lists sync to use quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706)
[15:35:18] <wikibugs>	 (03PS2) 10BCornwall: acme-chief: Preparatory PyYAML formatting [puppet] - 10https://gerrit.wikimedia.org/r/1043979
[15:35:37] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 25%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P65162 and previous config saved to /var/cache/conftool/dbconfig/20240618-153537-arnaudb.json
[15:35:42] <stashbot>	 T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad	 - https://phabricator.wikimedia.org/T365983
[15:35:45] <wikibugs>	 (03PS7) 10Ayounsi: Netbox 4: JOBRESULT_RETENTION -> JOB_RETENTION [puppet] - 10https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275)
[15:35:57] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp3066.*} and A:cp
[15:36:00] <fabfur>	 @
[15:36:08] <fabfur>	 !log upgrade haproxy to v2.8.10 on cp3066 (T367756)
[15:36:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:36:12] <stashbot>	 T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756
[15:36:46] <wikibugs>	 (03PS4) 10Ayounsi: Prepare for netbox-dev [puppet] - 10https://gerrit.wikimedia.org/r/1047081 (https://phabricator.wikimedia.org/T336275) (owner: 10Elukey)
[15:36:53] <wikibugs>	 (03CR) 10Elukey: cli: modify get_distro_name to return the version id (031 comment) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) (owner: 10Elukey)
[15:36:57] <icinga-wm_>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 25 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[15:37:14] <wikibugs>	 (03CR) 10Cwhite: [C:03+2] grafana: Change synthetic performance test proxy endpoint. [puppet] - 10https://gerrit.wikimedia.org/r/1044292 (https://phabricator.wikimedia.org/T367488) (owner: 10Phedenskog)
[15:37:28] <wikibugs>	 (03Abandoned) 10Ayounsi: Prepare for netbox-dev [puppet] - 10https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi)
[15:37:31] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: haproxy: use HTTP healthcheck for the k8s api-server [puppet] - 10https://gerrit.wikimedia.org/r/1047113 (https://phabricator.wikimedia.org/T367389)
[15:37:48] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "please merge this one with puppet disabled on acme-chief hosts and check that it's a NOOP at acme-chief level on acmechief-test instances" [puppet] - 10https://gerrit.wikimedia.org/r/1043979 (owner: 10BCornwall)
[15:38:00] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp3066.*} and A:cp
[15:39:25] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp5030.*} and A:cp
[15:39:27] <fabfur>	 !log upgrade haproxy to v2.8.10 on cp5030,cp5032 (T367756)
[15:39:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:38] <wikibugs>	 (03PS1) 10Brennen Bearnes: gitlab-settings: update tag to 1.5.0 for configure-projects [puppet] - 10https://gerrit.wikimedia.org/r/1047114 (https://phabricator.wikimedia.org/T355097)
[15:41:44] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp5030.*} and A:cp
[15:42:01] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp5032.*} and A:cp
[15:42:09] <wikibugs>	 (03CR) 10Jelto: [C:03+1] gitlab-settings: update tag to 1.5.0 for configure-projects [puppet] - 10https://gerrit.wikimedia.org/r/1047114 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes)
[15:42:10] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[15:42:39] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gitlab-settings: update tag to 1.5.0 for configure-projects [puppet] - 10https://gerrit.wikimedia.org/r/1047114 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes)
[15:43:40] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "I love it, really nice!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1046734 (https://phabricator.wikimedia.org/T365372) (owner: 10Volans)
[15:43:42] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[15:44:03] <wikibugs>	 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9903888 (10Eevans) >>! In T362033#9903802, @VRiley-WMF wrote: > .... Is there a time you would like to proceed with this?  I have no time preference; I can be available any time this week.
[15:44:06] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp5032.*} and A:cp
[15:45:09] <wikibugs>	 (03CR) 10Volans: [C:03+2] redfish: simplify interface of Redfish classes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1046734 (https://phabricator.wikimedia.org/T365372) (owner: 10Volans)
[15:45:38] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[15:46:18] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[15:47:08] <wikibugs>	 07Puppet, 06Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113#9903900 (10Dzahn) How about adding a MAILTO to the timer and mail a specific list / team / group?  I think that ale...
[15:47:13] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[15:47:36] <wikibugs>	 (03CR) 10EoghanGaffney: [V:03+2 C:03+2] lists: Change lists sync to use quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[15:47:38] <swfrench-wmf>	 !log included conftool 3.0.0 into buster/bullseye/bookworm-wikimedia on apt.w.o for T365123
[15:47:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:43] <stashbot>	 T365123: Make dbctl check for depooled future masters  - https://phabricator.wikimedia.org/T365123
[15:48:10] <hnowlan>	 jouncebot: nowandnext
[15:48:10] <jouncebot>	 For the next 0 hour(s) and 11 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1500)
[15:48:10] <jouncebot>	 In 0 hour(s) and 11 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1600)
[15:48:43] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[15:49:25] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ml-staging2003
[15:49:34] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-staging2003
[15:49:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T364069)', diff saved to https://phabricator.wikimedia.org/P65163 and previous config saved to /var/cache/conftool/dbconfig/20240618-154938-marostegui.json
[15:49:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[15:49:44] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[15:49:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[15:50:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T364069)', diff saved to https://phabricator.wikimedia.org/P65164 and previous config saved to /var/cache/conftool/dbconfig/20240618-155000-marostegui.json
[15:50:26] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync
[15:50:42] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[15:50:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 50%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P65165 and previous config saved to /var/cache/conftool/dbconfig/20240618-155042-arnaudb.json
[15:50:51] <stashbot>	 T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad	 - https://phabricator.wikimedia.org/T365983
[15:51:09] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[15:51:09] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: sync
[15:52:19] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-video: apply
[15:52:47] <wikibugs>	 (03Merged) 10jenkins-bot: redfish: simplify interface of Redfish classes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1046734 (https://phabricator.wikimedia.org/T365372) (owner: 10Volans)
[15:52:55] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] statograph: Use k8s envoy metric for statuspage [puppet] - 10https://gerrit.wikimedia.org/r/1047115 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert)
[15:53:10] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: toolforge: haproxy: check the k8s api-server /healthz endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1047113 (https://phabricator.wikimedia.org/T367389)
[15:53:32] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s7
[15:53:36] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s2
[15:53:45] <wikibugs>	 (03PS1) 10MVernon: cephadm: limit mgr daemons to _admin-labelled hosts [puppet] - 10https://gerrit.wikimedia.org/r/1047117 (https://phabricator.wikimedia.org/T279621)
[15:54:00] <wikibugs>	 (03PS2) 10EoghanGaffney: lists: Add symlink to /var/lib/mailman3 when using different root [puppet] - 10https://gerrit.wikimedia.org/r/1047094 (https://phabricator.wikimedia.org/T331706)
[15:54:00] <wikibugs>	 (03PS1) 10EoghanGaffney: lists: Update rsync module path for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047118
[15:54:08] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: server fails to reboot for clouddb1018.eqiad.wmnet - https://phabricator.wikimedia.org/T367499#9903941 (10fnegri) Thanks @Jclark-ctr!  The host is now repooled.
[15:54:28] <wikibugs>	 (03CR) 10CI reject: [V:04-1] lists: Update rsync module path for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[15:55:00] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-staging2003.codfw.wmnet with OS bookworm
[15:55:08] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9903947 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ml-staging2003.codfw.wmnet with OS bookworm executed with errors...
[15:55:35] <wikibugs>	 (03PS2) 10EoghanGaffney: lists: Update rsync module path for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047118
[15:59:31] <wikibugs>	 (03PS6) 10Jdlrobson: Enable dark mode on more pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378)
[16:00:05] <jouncebot>	 jhathaway and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1600)
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:02:24] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply
[16:02:30] <wikibugs>	 (03PS2) 10Btullis: Remove conda repository from reprepro configuration [puppet] - 10https://gerrit.wikimedia.org/r/1047085 (https://phabricator.wikimedia.org/T364550)
[16:05:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P65166 and previous config saved to /var/cache/conftool/dbconfig/20240618-160548-arnaudb.json
[16:05:53] <stashbot>	 T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad	 - https://phabricator.wikimedia.org/T365983
[16:06:00] <wikibugs>	 (03CR) 10BCornwall: [C:03+2] "Ack" [puppet] - 10https://gerrit.wikimedia.org/r/1043979 (owner: 10BCornwall)
[16:11:17] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync
[16:11:20] <wikibugs>	 (03PS3) 10Btullis: Remove conda repository from reprepro configuration [puppet] - 10https://gerrit.wikimedia.org/r/1047085 (https://phabricator.wikimedia.org/T364550)
[16:11:25] <wikibugs>	 (03PS1) 10Clément Goubert: statograph: fix wiki response time query [puppet] - 10https://gerrit.wikimedia.org/r/1047121
[16:12:53] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Revert "Show experimental login popup links on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047122 (https://phabricator.wikimedia.org/T367891)
[16:13:10] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Revert "Show experimental login popup links on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047122 (https://phabricator.wikimedia.org/T367891)
[16:14:10] <wikibugs>	 (03PS2) 10Clément Goubert: statograph: fix wiki response time query [puppet] - 10https://gerrit.wikimedia.org/r/1047121
[16:14:52] <icinga-wm_>	 RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 75%, RTA = 119.44 ms
[16:15:24] <wikibugs>	 (03PS3) 10Clément Goubert: statograph: fix wiki response time query [puppet] - 10https://gerrit.wikimedia.org/r/1047121
[16:16:04] <wikibugs>	 (03PS1) 10Hnowlan: admin_ng: bump limits for shellbox-video [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047124 (https://phabricator.wikimedia.org/T357309)
[16:16:18] <icinga-wm_>	 PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100%
[16:16:55] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] statograph: fix wiki response time query [puppet] - 10https://gerrit.wikimedia.org/r/1047121 (owner: 10Clément Goubert)
[16:17:52] <wikibugs>	 (03PS3) 10EoghanGaffney: lists: Update rsync module path for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047118
[16:18:23] <wikibugs>	 (03CR) 10Eevans: [C:03+2] restbase: upgrade cluster to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/1047087 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans)
[16:19:12] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: sync
[16:19:49] <wikibugs>	 (03PS1) 10DLynch: Deploy references edit check to phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047125 (https://phabricator.wikimedia.org/T361843)
[16:20:54] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P65167 and previous config saved to /var/cache/conftool/dbconfig/20240618-162053-arnaudb.json
[16:21:06] <stashbot>	 T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad	 - https://phabricator.wikimedia.org/T365983
[16:21:26] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[16:22:58] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Upgrade to Java 11 — T350567 - eevans@cumin1002
[16:23:02] <stashbot>	 T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567
[16:23:15] <swfrench-wmf>	 !log depooled / pooled mw2441.codfw.wmnet to smoke-test python3-conftool for T365123
[16:23:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:20] <stashbot>	 T365123: Make dbctl check for depooled future masters  - https://phabricator.wikimedia.org/T365123
[16:23:47] <claime>	 !log resetting Wiki response time metric on wikimedia.statuspage.io following complete switch to k8s - T362323
[16:23:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:52] <stashbot>	 T362323: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323
[16:24:14] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047125 (https://phabricator.wikimedia.org/T361843) (owner: 10DLynch)
[16:24:50] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378) (owner: 10Jdlrobson)
[16:26:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:27:49] <wikibugs>	 (03CR) 10Eevans: [C:03+1] data-gateway: remove initialDelaySeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046752 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French)
[16:28:19] <wikibugs>	 (03PS1) 10DLynch: Turn on Visual Editor collab beta feature on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047131
[16:28:48] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047131 (owner: 10DLynch)
[16:29:07] <swfrench-wmf>	 !log conftool on cumin2002 updated to 3.0.0 for T365123
[16:29:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:29:11] <stashbot>	 T365123: Make dbctl check for depooled future masters  - https://phabricator.wikimedia.org/T365123
[16:29:38] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:31:31] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-worker1093.eqiad.wmnet with reason: T367825 hw maint
[16:31:36] <stashbot>	 T367825: hw troubleshooting: Multi-bit errors on DIMM_A2 for an-worker1093 - https://phabricator.wikimedia.org/T367825
[16:31:45] <logmsgbot>	 !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-worker1093.eqiad.wmnet with reason: T367825 hw maint
[16:32:07] <wikibugs>	 (03PS1) 10EoghanGaffney: stewards: Allow lists.wm.o to access the stewards rsync server [puppet] - 10https://gerrit.wikimedia.org/r/1047135
[16:34:55] <logmsgbot>	 !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s1
[16:35:27] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Remove conda repository from reprepro configuration [puppet] - 10https://gerrit.wikimedia.org/r/1047085 (https://phabricator.wikimedia.org/T364550) (owner: 10Btullis)
[16:39:19] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync
[16:39:29] <swfrench-wmf>	 !log validated dbctl 3.0.0 on cumin2002 (noop edit to note: on parsercache spare pc2014) for T365123
[16:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:39:34] <stashbot>	 T365123: Make dbctl check for depooled future masters  - https://phabricator.wikimedia.org/T365123
[16:42:35] <swfrench-wmf>	 !log conftool on puppetmaster2001 updated to 3.0.0 for T365123
[16:42:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:45:20] <swfrench-wmf>	 !log validated requestctl 3.0.0 find-ip (new read-only subcommand) on puppetmaster2001 for T365123
[16:47:01] <wikibugs>	 (03PS1) 10Clément Goubert: statograph: Use benthos query to save thanos [puppet] - 10https://gerrit.wikimedia.org/r/1047138 (https://phabricator.wikimedia.org/T362323)
[16:50:03] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.presto.reboot-workers (exit_code=0) for Presto an-presto cluster: Reboot Presto nodes
[16:51:10] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[16:51:30] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:52:37] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm. One question in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[16:55:12] <wikibugs>	 (03PS2) 10Clément Goubert: statograph: Use benthos query to save thanos [puppet] - 10https://gerrit.wikimedia.org/r/1047138 (https://phabricator.wikimedia.org/T367894)
[16:56:12] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "LGTM other than the CPU limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047124 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan)
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1700)
[17:12:10] <swfrench-wmf>	 !log updated conftool to 3.0.0 on remaining buster hosts in codfw for T365123
[17:12:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:15] <stashbot>	 T365123: Make dbctl check for depooled future masters  - https://phabricator.wikimedia.org/T365123
[17:13:41] <wikibugs>	 (03CR) 10Jdlrobson: [C:04-1] "Blocked until June 20th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378) (owner: 10Jdlrobson)
[17:13:47] <wikibugs>	 (03CR) 10CDanis: [C:03+2] statograph: Use benthos query to save thanos [puppet] - 10https://gerrit.wikimedia.org/r/1047138 (https://phabricator.wikimedia.org/T367894) (owner: 10Clément Goubert)
[17:14:43] <swfrench-wmf>	 !log updated conftool to 3.0.0 on remaining bookworm hosts in codfw for T365123
[17:14:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:15:50] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "this would have no effect on lists1001 and change the path on lists1004 to /srv/mailman3" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[17:16:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2003.codfw.wmnet with OS bookworm
[17:16:23] <swfrench-wmf>	 !log updated conftool to 3.0.0 on remaining bullseye hosts in codfw for T365123
[17:16:25] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9904356 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ml-staging2003.codfw.wmnet with OS bookworm
[17:16:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:09] <cdanis>	 !log resetting Wiki response time metric on wikimedia.statuspage.io following complete switch to k8s - T362323 T367894
[17:21:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:15] <stashbot>	 T362323: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323
[17:21:16] <stashbot>	 T367894: update status page latency for mw-on-k8s - https://phabricator.wikimedia.org/T367894
[17:23:46] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:28:59] <wikibugs>	 (03CR) 10Dreamrimmer: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047099 (https://phabricator.wikimedia.org/T367858) (owner: 10Superzerocool)
[17:31:48] <wikibugs>	 (03PS1) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047147
[17:34:01] <wikibugs>	 (03PS2) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047147
[17:34:07] <swfrench-wmf>	 !log updated conftool to 3.0.0 on buster hosts in eqiad for T365123
[17:34:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:34:11] <stashbot>	 T365123: Make dbctl check for depooled future masters  - https://phabricator.wikimedia.org/T365123
[17:35:10] <swfrench-wmf>	 !log updated conftool to 3.0.0 on bookworm hosts in eqiad for T365123
[17:35:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:28] <swfrench-wmf>	 !log updated conftool to 3.0.0 on bullseye hosts in eqiad for T365123
[17:37:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:38:14] <wikibugs>	 (03CR) 10Esanders: [C:03+1] Deploy references edit check to phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047125 (https://phabricator.wikimedia.org/T361843) (owner: 10DLynch)
[17:40:11] <wikibugs>	 (03PS1) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047148
[17:40:11] <wikibugs>	 (03PS1) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047149
[17:40:11] <wikibugs>	 (03PS1) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047150
[17:41:17] <wikibugs>	 (03PS2) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047150
[17:41:33] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] lists: Update rsync module path for quickdatacopy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[17:42:35] <wikibugs>	 (03CR) 10BCornwall: [V:03+1 C:03+2] hiera: Unify ulsfo trafficserver storage elements (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall)
[17:51:55] <wikibugs>	 (03CR) 10Dzahn: "let's first fix this one that seems related:" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[17:57:13] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:57:59] <icinga-wm_>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:58:09] <wikibugs>	 (03CR) 10BCornwall: "Thanks for that, Taavi. Is that to say that only wikimediacloud.org and wikimedia.cloud being blacklisted is good enough?" [puppet] - 10https://gerrit.wikimedia.org/r/1039849 (owner: 10Ncmonitor)
[17:58:29] <wikibugs>	 (03CR) 10Dzahn: lists: Change lists sync to use quickdatacopy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[17:58:45] <wikibugs>	 (03Abandoned) 10BCornwall: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1039849 (owner: 10Ncmonitor)
[17:59:38] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] lists: Update rsync module path for quickdatacopy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[18:00:05] <jouncebot>	 jnuche and brennen: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1800)
[18:00:42] <wikibugs>	 (03CR) 10BBlack: [C:03+1] conftool-data: add ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1046675 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh)
[18:01:05] <brennen>	 o/ nothing for this window.
[18:01:38] <wikibugs>	 (03CR) 10BBlack: [C:03+1] dnsbox: announce ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1046685 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh)
[18:03:10] <wikibugs>	 (03CR) 10BCornwall: "These are all handled but I'm noticing that markmonitor is returning punycode as having ns[0-2].wikimedia.org..." [dns] - 10https://gerrit.wikimedia.org/r/1040335 (owner: 10Ncmonitor)
[18:08:34] <wikibugs>	 (03PS4) 10Dzahn: lists: Update rsync module path for quickdatacopy, fix invalid unit name [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[18:09:39] <wikibugs>	 (03CR) 10Dzahn: lists: Update rsync module path for quickdatacopy, fix invalid unit name (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[18:11:59] <wikibugs>	 (03PS1) 10Jdlrobson: Fix codex link styles overriding other link styles [skins/Vector] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1047155 (https://phabricator.wikimedia.org/T367844)
[18:12:10] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [skins/Vector] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1047155 (https://phabricator.wikimedia.org/T367844) (owner: 10Jdlrobson)
[18:12:49] <wikibugs>	 (03CR) 10Dzahn: "with the additional change now it would mean a change on all servers.. I want to avoid that too... sigh" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[18:14:14] <wikibugs>	 (03CR) 10Dzahn: "this being inside a " if $primary_host " it surprises me this has an effect on lists1001 and lists2001" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[18:16:30] <wikibugs>	 (03CR) 10Muehlenhoff: "Or instead fix this in quickdatacopy by sanitising the name, have a look at what I added for in the firewall::service define, line 28 onwa" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[18:16:30] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-staging2003.codfw.wmnet with OS bookworm
[18:16:31] <logmsgbot>	 !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-staging2003.codfw.wmnet with OS bookworm
[18:17:05] <swfrench-wmf>	 !log updated conftool to 3.0.0 on hosts (cp,ncredir) in ulsfo for T365123
[18:17:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:17:10] <stashbot>	 T365123: Make dbctl check for depooled future masters  - https://phabricator.wikimedia.org/T365123
[18:17:46] <wikibugs>	 (03CR) 10Dzahn: "ah, nevermind, the rsync::quickdatacopy resource should of exist on both (all) machines, but then internal logic inside it decides what to" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[18:19:44] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2003.codfw.wmnet with OS bookworm
[18:23:05] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] stewards: Allow lists.wm.o to access the stewards rsync server [puppet] - 10https://gerrit.wikimedia.org/r/1047135 (owner: 10EoghanGaffney)
[18:23:14] <wikibugs>	 (03PS2) 10EoghanGaffney: stewards: Allow lists.wm.o to access the stewards rsync server [puppet] - 10https://gerrit.wikimedia.org/r/1047135
[18:25:48] <wikibugs>	 (03Abandoned) 10Dzahn: codesearch: add support for docker-ce on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn)
[18:27:03] <swfrench-wmf>	 !log updated conftool to 3.0.0 on hosts (cp,ncredir) in magru for T365123
[18:27:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:08] <stashbot>	 T365123: Make dbctl check for depooled future masters  - https://phabricator.wikimedia.org/T365123
[18:27:08] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] stewards: Allow lists.wm.o to access the stewards rsync server [puppet] - 10https://gerrit.wikimedia.org/r/1047135 (owner: 10EoghanGaffney)
[18:29:43] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Upgrade to Java 11 — T350567 - eevans@cumin1002
[18:29:48] <stashbot>	 T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567
[18:31:10] <wikibugs>	 (03PS1) 10Ahmon Dancy: mw-web: Add traindev environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047158
[18:31:26] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[18:33:12] <swfrench-wmf>	 !log updated conftool to 3.0.0 on hosts (cp,ncredir) in drmrs for T365123
[18:33:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:16] <stashbot>	 T365123: Make dbctl check for depooled future masters  - https://phabricator.wikimedia.org/T365123
[18:34:47] <jinxer-wm>	 FIRING: [2x] UdpMxIrcEchoThroughput: irc1002:9221 has relayed less than 100 messages over past 5 minutes - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput
[18:38:59] <swfrench-wmf>	 !log updated conftool to 3.0.0 on hosts (cp,ncredir) in eqsin for T365123
[18:39:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:39:03] <stashbot>	 T365123: Make dbctl check for depooled future masters  - https://phabricator.wikimedia.org/T365123
[18:40:17] <wikibugs>	 (03CR) 10Dzahn: [V:03+2] "unit started manually on lists1004, works fine" [puppet] - 10https://gerrit.wikimedia.org/r/1047135 (owner: 10EoghanGaffney)
[18:44:42] <swfrench-wmf>	 !log updated conftool to 3.0.0 on hosts (cp,ncredir) in esams for T365123
[18:44:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:44:47] <stashbot>	 T365123: Make dbctl check for depooled future masters  - https://phabricator.wikimedia.org/T365123
[18:46:46] <mutante>	 jinxer-wm: help
[18:49:52] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Upgrade to Java 11 — T350567 - eevans@cumin1002
[18:49:58] <stashbot>	 T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567
[18:53:11] <wikibugs>	 (03PS1) 10Dzahn: lists: fix invalid unit name for rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047160 (https://phabricator.wikimedia.org/T331706)
[18:53:33] <wikibugs>	 (03PS18) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[18:53:54] <wikibugs>	 (03CR) 10Dzahn: "follow-up created: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047160" [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[19:00:18] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "disabling puppet on lists*, then deploying one at a time" [puppet] - 10https://gerrit.wikimedia.org/r/1047160 (https://phabricator.wikimedia.org/T331706) (owner: 10Dzahn)
[19:01:34] <wikibugs>	 (03PS19) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[19:03:48] <wikibugs>	 (03PS1) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047161 (https://phabricator.wikimedia.org/T363001)
[19:13:01] <icinga-wm_>	 RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 33%, RTA = 30.35 ms
[19:15:48] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:17:45] <wikibugs>	 (03Abandoned) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047161 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[19:17:51] <mutante>	 !log lists1001 - systemctl reset-failed - clean up systemd state due to units not found anymore after migration - disable puppet and then deploy gerrit:1047160 on lists to fix invalid unit name -  T331706
[19:17:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:57] <stashbot>	 T331706: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706
[19:18:29] <icinga-wm_>	 PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100%
[19:18:30] <wikibugs>	 (03PS20) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[19:19:32] <wikibugs>	 (03PS5) 10Hashar: Merge branch 'stable-3.10' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419)
[19:26:54] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[19:26:55] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[19:29:40] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[19:29:41] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[19:30:14] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Merge branch 'stable-3.10' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419) (owner: 10Hashar)
[19:30:14] <wikibugs>	 (03PS21) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[19:30:50] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[19:31:10] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[19:32:46] <wikibugs>	 (03PS22) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[19:33:33] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[19:33:54] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[19:34:26] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] "on lists1001 - no change" [puppet] - 10https://gerrit.wikimedia.org/r/1047160 (https://phabricator.wikimedia.org/T331706) (owner: 10Dzahn)
[19:36:32] <wikibugs>	 (03PS5) 10Dzahn: lists: Update rsync module path for quickdatacopy, fix invalid unit name [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[19:36:39] <wikibugs>	 (03PS6) 10Dzahn: lists: Update rsync module path for quickdatacopy, fix invalid unit name [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[19:40:01] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-staging2003.codfw.wmnet with OS bookworm
[19:41:43] <wikibugs>	 (03PS23) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[19:42:26] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[19:42:38] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[19:42:54] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[19:43:27] <wikibugs>	 (03PS7) 10Dzahn: lists: Update rsync module path for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[19:44:48] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] lists: Update rsync module path for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[19:44:49] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] lists: Update rsync module path for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[19:48:40] <wikibugs>	 (03PS6) 10Hashar: Merge branch 'stable-3.10' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419)
[19:55:38] <wikibugs>	 (03PS24) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[19:55:41] <icinga-wm_>	 PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 197592184 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:56:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[19:56:41] <icinga-wm_>	 RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 108688 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:57:13] <wikibugs>	 (03PS25) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[19:58:19] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[19:59:14] <wikibugs>	 (03PS26) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[19:59:56] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[20:00:01] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T2000).
[20:00:05] <jouncebot>	 Superzerocool, kemayo, and jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:20] <Kemayo>	 o/
[20:00:35] <Superzerocool>	 hi!
[20:00:51] <jan_drewniak>	 o/ (I can do jdlrobson's patches today)
[20:01:15] <urbanecm>	 jan_drewniak: would it be ok if i deploy everything in interest of time?
[20:02:38] <urbanecm>	 let's start
[20:02:42] <jan_drewniak>	 urbanecm: you can leave mine for last and I can self-deploy, I have toyofuku shadowing me on a deployment today :)
[20:02:45] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] cswiki: adding throttle rule, removing old throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047099 (https://phabricator.wikimedia.org/T367858) (owner: 10Superzerocool)
[20:02:53] <urbanecm>	 jan_drewniak: ah, okay. sounds good then
[20:03:22] <wikibugs>	 (03Merged) 10jenkins-bot: cswiki: adding throttle rule, removing old throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047099 (https://phabricator.wikimedia.org/T367858) (owner: 10Superzerocool)
[20:03:28] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Deploy references edit check to phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047125 (https://phabricator.wikimedia.org/T361843) (owner: 10DLynch)
[20:04:09] <urbanecm>	 oh, we're releasing collab somewhere? cool!
[20:04:20] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy references edit check to phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047125 (https://phabricator.wikimedia.org/T361843) (owner: 10DLynch)
[20:04:35] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Turn on Visual Editor collab beta feature on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047131 (owner: 10DLynch)
[20:04:40] <wikibugs>	 (03PS2) 10DLynch: Turn on Visual Editor collab beta feature on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047131
[20:04:55] <Kemayo>	 Technically you can make it happen pretty much anywhere at the moment -- the thing that's gated away is the UI for actually *starting* a session. Once one is started, links to it should work regardless of your own feature-status.
[20:04:57] <wikibugs>	 (03CR) 10Urbanecm: [C:03+2] Turn on Visual Editor collab beta feature on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047131 (owner: 10DLynch)
[20:05:15] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047131 (owner: 10DLynch)
[20:05:36] <wikibugs>	 (03Merged) 10jenkins-bot: Turn on Visual Editor collab beta feature on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047131 (owner: 10DLynch)
[20:05:39] <urbanecm>	 Kemayo: but still, exposing the ui somewhere is very cool :)
[20:05:45] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] "lists1001 - no change" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[20:06:07] <Kemayo>	 It'll be good to get feedback on the UX / legal questions once people actually use it. And experience weird editing-conflicts that we've not managed to see ourselves yet. :D
[20:06:10] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1047099|cswiki: adding throttle rule, removing old throttle rule (T367858)]], [[gerrit:1047125|Deploy references edit check to phase 1 wikis (T361843)]], [[gerrit:1047131|Turn on Visual Editor collab beta feature on officewiki]]
[20:06:16] <stashbot>	 T367858: Lift IP cap on 2024-06-26 for Editathon Human Rights - cs.wikipedia - https://phabricator.wikimedia.org/T367858
[20:06:16] <stashbot>	 T361843: Make Edit Check (references) available to all newcomers at phase 1 Wikipedias - https://phabricator.wikimedia.org/T361843
[20:06:27] <urbanecm>	 mutante: can i bribe you to puppet merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1045211 please?
[20:07:01] <wikibugs>	 (03PS2) 10Urbanecm: admin: urbanecm's home: Update .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/1045211
[20:07:56] <wikibugs>	 (03CR) 10Majavah: [C:03+2] admin: urbanecm's home: Update .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/1045211 (owner: 10Urbanecm)
[20:08:08] <urbanecm>	 thanks taavi 
[20:08:12] <wikibugs>	 (03CR) 10Dzahn: [V:03+2 C:03+2] "lists2001 - /usr/local/sbin/sync-mailman-root-sync now pulls into /srv/mailman3/ and remote side offers /srv/mailman3 - manually started r" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney)
[20:08:50] <wikibugs>	 (03PS27) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[20:09:15] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[20:09:17] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[20:09:31] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[20:10:46] <logmsgbot>	 !log urbanecm@deploy1002 urbanecm, superzerocool, kemayo: Backport for [[gerrit:1047099|cswiki: adding throttle rule, removing old throttle rule (T367858)]], [[gerrit:1047125|Deploy references edit check to phase 1 wikis (T361843)]], [[gerrit:1047131|Turn on Visual Editor collab beta feature on officewiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:10:46] <logmsgbot>	 !log urbanecm@deploy1002 Sync cancelled.
[20:11:04] <urbanecm>	 Kemayo: can you test at mwdebug please?
[20:11:11] <Kemayo>	 Sure, just a second.
[20:11:38] <urbanecm>	 (both patches please)
[20:12:46] <wikibugs>	 (03PS28) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[20:13:36] <urbanecm>	 why sync cancelled...
[20:13:37] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[20:13:42] <urbanecm>	 i didn't cancel anything
[20:13:49] <urbanecm>	 restarting
[20:14:06] <logmsgbot>	 !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1047099|cswiki: adding throttle rule, removing old throttle rule (T367858)]], [[gerrit:1047125|Deploy references edit check to phase 1 wikis (T361843)]], [[gerrit:1047131|Turn on Visual Editor collab beta feature on officewiki]]
[20:14:06] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[20:14:15] <stashbot>	 T367858: Lift IP cap on 2024-06-26 for Editathon Human Rights - cs.wikipedia - https://phabricator.wikimedia.org/T367858
[20:14:16] <stashbot>	 T361843: Make Edit Check (references) available to all newcomers at phase 1 Wikipedias - https://phabricator.wikimedia.org/T361843
[20:14:50] <Kemayo>	 1047125 is working, but I can't persuade 1047131 to -- does mwdebug actually work on officewiki?
[20:15:05] <Kemayo>	 (I don't think I've ever done an officewiki-specific deployment before.)
[20:15:37] <urbanecm>	 Kemayo: it should work there
[20:15:58] <urbanecm>	 wikitech is the only exception (and i hope not for long)
[20:17:58] <urbanecm>	 i do see wgVisualEditorEnableCollabBeta is set to true at mwdebug
[20:17:58] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9905039 (10Dzahn) After a little follow-up fix rsync::quickdatacopy is now in use and copies both from and to new path /srv/mailman3 (and /v...
[20:18:38] <logmsgbot>	 !log urbanecm@deploy1002 kemayo, urbanecm, superzerocool: Backport for [[gerrit:1047099|cswiki: adding throttle rule, removing old throttle rule (T367858)]], [[gerrit:1047125|Deploy references edit check to phase 1 wikis (T361843)]], [[gerrit:1047131|Turn on Visual Editor collab beta feature on officewiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:19:08] <urbanecm>	 Kemayo: if it is not breaking something visibly, we might try deploying and see what happens afterwards? unless you object.
[20:19:27] <Daimona>	 Hey folks. Not sure if it's related to the ongoing deployment, but I was just told of a problem with EditCheck that is preventing edits, at least on testwiki. Still gathering some links and will file a task shortly, but I wanted to say it here first.
[20:20:39] <Kemayo>	 urbanecm: I'm fine with pushing it out and seeing if that helps
[20:20:41] <urbanecm>	 Daimona: my deployment did not reach anything non-debug
[20:20:49] <urbanecm>	 so it should not be related
[20:20:51] <urbanecm>	 but Kemayo would know more :)
[20:21:17] <Kemayo>	 There's some other stuff on the train that changed this week with edit check, so more details would be helpful.
[20:21:28] <wikibugs>	 (03CR) 10Dzahn: "I think we don't really need it anymore now. mailman_root is /srv/mailman3 on lists1004 and lists2001 and lists1001 is gone soon. unless t" [puppet] - 10https://gerrit.wikimedia.org/r/1047094 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[20:22:58] <logmsgbot>	 !log urbanecm@deploy1002 kemayo, urbanecm, superzerocool: Continuing with sync
[20:24:13] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[20:26:18] <Daimona>	 Task filed: T367920
[20:26:18] <stashbot>	 T367920: Cannot save edits in testwiki with VE: mw.editcheck.findAddedContentNeedingReference is not a function - https://phabricator.wikimedia.org/T367920
[20:29:15] <Daimona>	 I still haven't checked what wikis are affected and whether certain specific config is needed to reproduce, but for now I just wanted to file the task and get some more eyes on it.
[20:29:57] <Daimona>	 I guess it might also be a deployment blocker, but again, still checking the impact.
[20:30:29] <Kemayo>	 Got it, it's a problem with the stuff on the train. I will see if I can write a very quick patch.
[20:33:05] <logmsgbot>	 !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1047099|cswiki: adding throttle rule, removing old throttle rule (T367858)]], [[gerrit:1047125|Deploy references edit check to phase 1 wikis (T361843)]], [[gerrit:1047131|Turn on Visual Editor collab beta feature on officewiki]] (duration: 18m 59s)
[20:33:10] <stashbot>	 T367858: Lift IP cap on 2024-06-26 for Editathon Human Rights - cs.wikipedia - https://phabricator.wikimedia.org/T367858
[20:33:11] <stashbot>	 T361843: Make Edit Check (references) available to all newcomers at phase 1 Wikipedias - https://phabricator.wikimedia.org/T361843
[20:33:42] <urbanecm>	 okay, patch finished syncing
[20:33:53] <urbanecm>	 and i think that settles the first group?
[20:33:57] <urbanecm>	 so jan_drewniak, i think you can start
[20:34:56] <jan_drewniak>	 urbanecm: thanks! I'll get to it :) 
[20:35:02] <wikibugs>	 (03PS1) 10Hashar: Gerrit 3.10.x rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1047175 (https://phabricator.wikimedia.org/T367419)
[20:36:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046790 (https://phabricator.wikimedia.org/T367463) (owner: 10Jdlrobson)
[20:36:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1047155 (https://phabricator.wikimedia.org/T367844) (owner: 10Jdlrobson)
[20:41:27] <wikibugs>	 (03PS1) 10Dzahn: admin: add Audrey Penven to ldap_only (wmde/nda) [puppet] - 10https://gerrit.wikimedia.org/r/1047176 (https://phabricator.wikimedia.org/T367184)
[20:44:06] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9905129 (10Dzahn) Thanks @KFrancis! Can you please add Audrey to the 'NDA and MOU' spreadsheet?
[20:45:14] <wikibugs>	 (03CR) 10Scott French: [C:03+2] data-gateway: remove initialDelaySeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046752 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French)
[20:46:08] <wikibugs>	 (03Merged) 10jenkins-bot: data-gateway: remove initialDelaySeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046752 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French)
[20:47:16] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/data-gateway: apply
[20:47:19] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for Kashmiri Wikimedians User Group - https://phabricator.wikimedia.org/T367640#9905136 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup https://lists.wikimedia.org/postorius/lists/wikimedia-ks.lists.wikimedia.org
[20:47:22] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9905139 (10Dzahn) 05Stalled→03In progress
[20:47:27] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply
[20:48:08] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9905140 (10KFrancis) Done, thanks!
[20:49:14] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply
[20:49:32] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply
[20:49:39] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9905142 (10Dzahn) thanks Katie!  @AudreyPenven_WMDE All is ready, we just still need an approval from one of the WMDE engineering managers (https://wikitech.wikimedia.org...
[20:49:56] <wikibugs>	 (03CR) 10Dzahn: "pending WMDE engineering manager approval" [puppet] - 10https://gerrit.wikimedia.org/r/1047176 (https://phabricator.wikimedia.org/T367184) (owner: 10Dzahn)
[20:50:33] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply
[20:50:49] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply
[20:51:25] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists, 07Upstream: Unnecessary horizontal scrollbars - https://phabricator.wikimedia.org/T283028#9905147 (10Ladsgroup) There was a new version of mailman deployed today. I can't reproduce this anymore. Can you check @reedy?
[20:51:25] <jinxer-wm>	 FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[20:53:04] <wikibugs>	 (03PS2) 10Hashar: Gerrit 3.10.x rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1047175 (https://phabricator.wikimedia.org/T367419)
[20:53:35] * hashar sleeps
[20:55:31] <wikibugs>	 (03CR) 10Scott French: "Thank you both for the reviews! I'll be out tomorrow (Wednesday), but will aim to get this deployed on Thursday when I return." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046753 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French)
[20:59:53] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Upgrade to Java 11 — T350567 - eevans@cumin1002
[20:59:57] <stashbot>	 T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567
[21:02:30] <wikibugs>	 (03Merged) 10jenkins-bot: Improve responsive images and avoid for inline [core] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046790 (https://phabricator.wikimedia.org/T367463) (owner: 10Jdlrobson)
[21:02:33] <wikibugs>	 (03Merged) 10jenkins-bot: Fix codex link styles overriding other link styles [skins/Vector] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1047155 (https://phabricator.wikimedia.org/T367844) (owner: 10Jdlrobson)
[21:03:07] <logmsgbot>	 !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:1046790|Improve responsive images and avoid for inline (T367463)]], [[gerrit:1047155|Fix codex link styles overriding other link styles (T367844)]]
[21:03:13] <stashbot>	 T367463: Tables with images inside them appear at minuscule size or disappear due to responsive image CSS - https://phabricator.wikimedia.org/T367463
[21:03:14] <stashbot>	 T367844: Various buttons on Vector 2022 acquired unexpected link styling when hovered - https://phabricator.wikimedia.org/T367844
[21:03:39] <wikibugs>	 06SRE, 10SRE-Access-Requests: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757#9905192 (10Dzahn) Hi @MunizaA no problem, but we'll need a few more things from you for that.  Could you please use the template linked from https://wikitech.wikimedia.org/wiki/SRE/Producti...
[21:05:58] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9905205 (10Dzahn)
[21:07:48] <logmsgbot>	 !log jdrewniak@deploy1002 jdrewniak, jdlrobson: Backport for [[gerrit:1046790|Improve responsive images and avoid for inline (T367463)]], [[gerrit:1047155|Fix codex link styles overriding other link styles (T367844)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:07:48] <logmsgbot>	 !log jdrewniak@deploy1002 Sync cancelled.
[21:09:33] <logmsgbot>	 !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:1046790|Improve responsive images and avoid for inline (T367463)]], [[gerrit:1047155|Fix codex link styles overriding other link styles (T367844)]]
[21:09:39] <stashbot>	 T367463: Tables with images inside them appear at minuscule size or disappear due to responsive image CSS - https://phabricator.wikimedia.org/T367463
[21:09:39] <stashbot>	 T367844: Various buttons on Vector 2022 acquired unexpected link styling when hovered - https://phabricator.wikimedia.org/T367844
[21:12:31] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9905218 (10Dzahn) Hi @Kgraessle   in addition to your manager please get any of the following people to approve of this request here on the ticket.   `     a...
[21:12:32] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9905219 (10Dzahn)
[21:12:51] <Kemayo>	 Daimona: Okay, got what I think is a patch for it, just need to get some code review on it and we can unblock the train. https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1047180
[21:13:59] <logmsgbot>	 !log jdrewniak@deploy1002 jdlrobson, jdrewniak: Backport for [[gerrit:1046790|Improve responsive images and avoid for inline (T367463)]], [[gerrit:1047155|Fix codex link styles overriding other link styles (T367844)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:14:03] <Daimona>	 Thank you! I'd normally offer to take a look, but right now I'm struggling to keep my eyes open and I don't want to do damage.
[21:14:43] <wikibugs>	 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T367925 (10phaultfinder) 03NEW
[21:16:11] <logmsgbot>	 !log jdrewniak@deploy1002 jdlrobson, jdrewniak: Continuing with sync
[21:19:09] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to dumps-roots and to clouddumps*.eqiad.wmnet for xcollazo - https://phabricator.wikimedia.org/T367571#9905250 (10Dzahn) Hi @xcollazo   so clouddumps1001.eqiad.wmnet and clouddumps1002.eqiad.wmnet don't exist.  But clouddumps1001.wikimedia.org and clouddumps1002....
[21:20:12] <wikibugs>	 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9905262 (10Dzahn)
[21:20:41] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudvirt-wdqs100[1,2,3] - https://phabricator.wikimedia.org/T367773#9905260 (10VRiley-WMF) a:03VRiley-WMF
[21:21:14] <wikibugs>	 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9905264 (10Dzahn) tagging with data-engineering per the new process to request approval from group approvers
[21:21:20] <wikibugs>	 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9905266 (10Dzahn) it's an SRE access request, unrelated to LDAP. adjusting tags
[21:21:26] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[21:21:32] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9905267 (10Dzahn)
[21:22:51] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to dumps-roots and to clouddumps*.eqiad.wmnet for xcollazo - https://phabricator.wikimedia.org/T367571#9905272 (10BTullis) @xcollazo is already a member of analytics-admins: https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.ya...
[21:24:25] <wikibugs>	 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9905271 (10Dzahn) @WMDE-leszek Can we get approval here from WMDE management?
[21:24:55] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to dumps-roots and to clouddumps*.eqiad.wmnet for xcollazo - https://phabricator.wikimedia.org/T367571#9905280 (10BTullis) Mind you, membership of the `dumps-roots` group would give more privileges. Full root access: https://github.com/wikimedia/operations-puppet...
[21:25:30] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudvirt-wdqs100[1,2,3] - https://phabricator.wikimedia.org/T367773#9905281 (10VRiley-WMF)
[21:25:48] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:26:07] <logmsgbot>	 !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:1046790|Improve responsive images and avoid for inline (T367463)]], [[gerrit:1047155|Fix codex link styles overriding other link styles (T367844)]] (duration: 16m 33s)
[21:26:13] <stashbot>	 T367463: Tables with images inside them appear at minuscule size or disappear due to responsive image CSS - https://phabricator.wikimedia.org/T367463
[21:26:13] <stashbot>	 T367844: Various buttons on Vector 2022 acquired unexpected link styling when hovered - https://phabricator.wikimedia.org/T367844
[21:27:57] <jan_drewniak>	 Hey all, looks like the backport finished, but it did end with the following error (not sure why) 
[21:28:00] <jan_drewniak>	 backport failed: <CalledProcessError> Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=jdlrobson', 'Backport for [[gerrit:1046790|Improve responsive images and avoid for inline (T367463)]], [[gerrit:1047155|Fix codex link styles overriding other link styles (T367844)]]']' returned non-zero exit status 1.
[21:29:25] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to dumps-roots and to clouddumps*.eqiad.wmnet for xcollazo - https://phabricator.wikimedia.org/T367571#9905297 (10Dzahn) Ah, yes, confirmed. You already have clouddumps. And I also see the user on a host like dumpdata1004  or snapshot1010 where I assumed that's d...
[21:29:45] <thcipriani>	 jan_drewniak: hrm, was there any other backscroll in the output?
[21:29:49] <icinga-wm_>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:29:51] <icinga-wm_>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:31:20] <wikibugs>	 06SRE, 10LDAP-Access-Requests: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9905306 (10Dzahn) The update to approvers for WMDE would be in T367914
[21:31:26] <jinxer-wm>	 RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[21:32:43] <wikibugs>	 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudvirt-wdqs100[1,2,3] - https://phabricator.wikimedia.org/T367773#9905314 (10VRiley-WMF) 05Open→03Resolved
[21:34:43] <icinga-wm_>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52198 bytes in 3.078 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:34:43] <icinga-wm_>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.998 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:35:07] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T367925#9905318 (10VRiley-WMF) a:03VRiley-WMF
[21:35:21] <jan_drewniak>	 thcipriani: only a k8s host timeout :/ 
[21:35:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T367925#9905319 (10VRiley-WMF) Adjusted power cable. Power supply is back on.
[21:35:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T367925#9905321 (10VRiley-WMF) 05Open→03Resolved
[21:37:41] <Kemayo>	 urbanecm: I worked out what my beta feature issue was. I completely forgot about needing to add it to wgBetaFeaturesAllowList.
[21:38:16] <urbanecm>	 Kemayo: ahh. Makes sense. 
[21:38:59] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9905322 (10Ladsgroup) 05Open→03Resolved https://lists.wikimedia.org/postorius/lists/project-wikimoitree.lists.wikimedia.org
[21:40:16] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: MM3/postorius: takes too long to load - https://phabricator.wikimedia.org/T314247#9905328 (10Dzahn) Mailman migrated to a new server and a new version just now.  Did this get faster?
[21:44:18] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[21:50:31] <wikibugs>	 (03PS1) 10DLynch: Add Visual Editor collab beta feature to wgBetaFeaturesAllowList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047182
[21:50:48] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: MM3/postorius: takes too long to load - https://phabricator.wikimedia.org/T314247#9905359 (10Reedy) →14Duplicate dup:03T353891
[21:52:52] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#9905356 (10Reedy)
[21:53:04] <Kemayo>	 The joys of things that don't apply to local development config.
[21:53:37] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists, 07Performance Issue, 07Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#9905392 (10Dzahn) >>! In T353891#9684341, @fnegri wrote: > It's very slow for me as well, I hadn't opened it in a while but it was barely usable b...
[21:54:25] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[21:55:54] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[21:56:37] <wikibugs>	 (03PS29) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[21:56:57] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[21:57:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[22:03:01] <wikibugs>	 (03PS1) 10Dzahn: mailman3: remove buster support [puppet] - 10https://gerrit.wikimedia.org/r/1047184 (https://phabricator.wikimedia.org/T331706)
[22:05:04] <wikibugs>	 (03PS1) 10JHathaway: postfix: always send local mail to smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/1047185 (https://phabricator.wikimedia.org/T325406)
[22:05:31] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047185 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[22:07:03] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[22:09:40] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[22:11:34] <wikibugs>	 (03PS2) 10JHathaway: postfix: always send local mail to smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/1047185 (https://phabricator.wikimedia.org/T325406)
[22:11:49] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047185 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[22:18:08] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] postfix: always send local mail to smarthosts [puppet] - 10https://gerrit.wikimedia.org/r/1047185 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[22:19:45] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[22:20:55] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[22:30:28] <wikibugs>	 (03PS1) 10DLynch: findAddedContentNeedingReference was removed accidentally [extensions/VisualEditor] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1047188 (https://phabricator.wikimedia.org/T367920)
[22:31:00] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[22:34:47] <jinxer-wm>	 FIRING: [2x] UdpMxIrcEchoThroughput: irc1002:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput
[22:34:49] <wikibugs>	 (03PS1) 10Bking: analytics: allow dse-k8s pod network to reach an-db1001 [puppet] - 10https://gerrit.wikimedia.org/r/1047189 (https://phabricator.wikimedia.org/T363001)
[22:35:09] <wikibugs>	 (03PS2) 10Bking: analytics: allow dse-k8s pod network to reach an-db1001 [puppet] - 10https://gerrit.wikimedia.org/r/1047189 (https://phabricator.wikimedia.org/T363001)
[22:35:45] <wikibugs>	 (03PS1) 10JHathaway: postfix: fix path to aliases [puppet] - 10https://gerrit.wikimedia.org/r/1047190 (https://phabricator.wikimedia.org/T325406)
[22:35:55] <wikibugs>	 (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047190 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[22:36:23] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047189 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[22:37:16] <wikibugs>	 (03PS1) 10BCornwall: ncredir: Remove localized TLD redirects [puppet] - 10https://gerrit.wikimedia.org/r/1047191
[22:38:26] <wikibugs>	 (03CR) 10Jforrester: [C:03+1] Add Visual Editor collab beta feature to wgBetaFeaturesAllowList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047182 (owner: 10DLynch)
[22:39:00] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] postfix: fix path to aliases [puppet] - 10https://gerrit.wikimedia.org/r/1047190 (https://phabricator.wikimedia.org/T325406) (owner: 10JHathaway)
[22:39:16] <James_F>	 I'm going to deploy a minor train backport for Wikifunctions, and a more serious one for VE.
[22:39:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [extensions/WikiLambda] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1047077 (https://phabricator.wikimedia.org/T367159) (owner: 10Jforrester)
[22:39:34] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1047188 (https://phabricator.wikimedia.org/T367920) (owner: 10DLynch)
[22:40:57] <wikibugs>	 (03PS1) 10EoghanGaffney: lists: Change lists.wm.o A/AAAA records to CNAME and MX [dns] - 10https://gerrit.wikimedia.org/r/1047192
[22:41:14] <wikibugs>	 (03CR) 10Jforrester: [C:03+2] "Officewiki-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047182 (owner: 10DLynch)
[22:41:23] <wikibugs>	 (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2963/co" [puppet] - 10https://gerrit.wikimedia.org/r/1047191 (owner: 10BCornwall)
[22:41:54] <wikibugs>	 (03Merged) 10jenkins-bot: Add Visual Editor collab beta feature to wgBetaFeaturesAllowList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047182 (owner: 10DLynch)
[22:41:57] <wikibugs>	 (03CR) 10CI reject: [V:04-1] lists: Change lists.wm.o A/AAAA records to CNAME and MX [dns] - 10https://gerrit.wikimedia.org/r/1047192 (owner: 10EoghanGaffney)
[22:45:22] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+1] analytics: allow dse-k8s pod network to reach an-db1001 [puppet] - 10https://gerrit.wikimedia.org/r/1047189 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[22:45:48] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:45:49] <wikibugs>	 (03CR) 10Bking: [C:03+2] analytics: allow dse-k8s pod network to reach an-db1001 [puppet] - 10https://gerrit.wikimedia.org/r/1047189 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[22:45:56] <wikibugs>	 (03Merged) 10jenkins-bot: Use isEnumType in selector and isCustomEnum for creating literals [extensions/WikiLambda] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1047077 (https://phabricator.wikimedia.org/T367159) (owner: 10Jforrester)
[22:46:29] <James_F>	 Kemayo: (Hello here.)
[22:46:48] <James_F>	 I always forget how long VE patches take to land.
[22:47:27] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "nit: It affects both an-db100[1-2] although 1002 is currently the replica." [puppet] - 10https://gerrit.wikimedia.org/r/1047189 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking)
[22:47:33] <Kemayo>	 James_F: Need me to test them on debug before they go into the train branch?
[22:48:00] <James_F>	 Kemayo: No, I'm happy to test myself – but also happy to pause for you to approve, if you'd prefer.
[22:49:04] <Kemayo>	 James_F: Someone else testing sounds good overall. I did test myself when I wrote the patch, but more eyes and all.
[22:49:08] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[22:49:18] <James_F>	 +1
[22:49:36] <logmsgbot>	 !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[22:51:09] <wikibugs>	 (03PS2) 10EoghanGaffney: lists: Change lists.wm.o A/AAAA records to CNAME and MX [dns] - 10https://gerrit.wikimedia.org/r/1047192
[22:51:56] <wikibugs>	 (03PS2) 10Jdlrobson: Cleanup: Remove wgNavigationTimingSurveyName [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043880 (https://phabricator.wikimedia.org/T367128)
[22:52:05] <wikibugs>	 (03CR) 10CI reject: [V:04-1] lists: Change lists.wm.o A/AAAA records to CNAME and MX [dns] - 10https://gerrit.wikimedia.org/r/1047192 (owner: 10EoghanGaffney)
[22:57:15] <wikibugs>	 (03PS2) 10Jdlrobson: Enable dark mode on data table pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366373)
[23:03:54] <wikibugs>	 (03CR) 10EoghanGaffney: "From my perspective, we do need it as there were some parts of mailman-web that weren't respecting the different mailman root. We need to " [puppet] - 10https://gerrit.wikimedia.org/r/1047094 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[23:04:40] <wikibugs>	 (03Merged) 10jenkins-bot: findAddedContentNeedingReference was removed accidentally [extensions/VisualEditor] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1047188 (https://phabricator.wikimedia.org/T367920) (owner: 10DLynch)
[23:04:58] <James_F>	 Finally!
[23:05:16] <Kemayo>	 Under half an hour! It's pretty good today!
[23:05:31] <logmsgbot>	 !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1047077|Use isEnumType in selector and isCustomEnum for creating literals (T367159)]], [[gerrit:1047188|findAddedContentNeedingReference was removed accidentally (T367920)]]
[23:05:38] <stashbot>	 T367159: Unable to create converters using the UI as identity fields cannot be set - https://phabricator.wikimedia.org/T367159
[23:05:38] <stashbot>	 T367920: Cannot save edits in testwiki with VE: mw.editcheck.findAddedContentNeedingReference is not a function - https://phabricator.wikimedia.org/T367920
[23:08:17] <wikibugs>	 (03PS1) 10Scott French: drivers/etcd: only attempt to load existing configs [software/conftool] - 10https://gerrit.wikimedia.org/r/1047193 (https://phabricator.wikimedia.org/T367919)
[23:10:15] <logmsgbot>	 !log jforrester@deploy1002 jforrester, kemayo: Backport for [[gerrit:1047077|Use isEnumType in selector and isCustomEnum for creating literals (T367159)]], [[gerrit:1047188|findAddedContentNeedingReference was removed accidentally (T367920)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:12:48] <James_F>	 Kemayo: And it works: https://test.wikipedia.org/w/index.php?title=Test&diff=prev&oldid=599462
[23:12:50] <logmsgbot>	 !log jforrester@deploy1002 jforrester, kemayo: Continuing with sync
[23:13:06] <Kemayo>	 James_F: Excellent, thanks!
[23:13:24] <James_F>	 Kemayo: Also https://office.wikimedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures
[23:15:48] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:16:05] <Kemayo>	 James_F: Looks good there on 1002. Thanks again!
[23:16:13] <James_F>	 Success.
[23:16:22] <James_F>	 Now the second half hour wait, this time to sync.
[23:16:41] <Kemayo>	 🎉
[23:22:48] <logmsgbot>	 !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1047077|Use isEnumType in selector and isCustomEnum for creating literals (T367159)]], [[gerrit:1047188|findAddedContentNeedingReference was removed accidentally (T367920)]] (duration: 17m 16s)
[23:22:54] <stashbot>	 T367159: Unable to create converters using the UI as identity fields cannot be set - https://phabricator.wikimedia.org/T367159
[23:22:54] <stashbot>	 T367920: Cannot save edits in testwiki with VE: mw.editcheck.findAddedContentNeedingReference is not a function - https://phabricator.wikimedia.org/T367920
[23:22:54] <James_F>	 And done.
[23:30:26] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9905564 (10Papaul) @Jhancock.wm @RobH some information on this server.   **Information1**  The server came with 2 network add-on cards: - 1st card connected to slot A1 is...
[23:33:37] <icinga-wm_>	 RECOVERY - Host mw2321 is UP: PING WARNING - Packet loss = 77%, RTA = 30.33 ms
[23:35:33] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9905575 (10Papaul)
[23:36:54] <wikibugs>	 06SRE, 10Cassandra, 06Data-Persistence: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567#9905577 (10Eevans)
[23:38:27] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1047199
[23:38:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1047199 (owner: 10TrainBranchBot)
[23:58:16] <wikibugs>	 (03PS1) 10Jforrester: mathoid: Upgrade image from 2023-11-03-103402 to 2024-06-18-233457 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047201 (https://phabricator.wikimedia.org/T350004)
[23:58:54] <wikibugs>	 (03CR) 10Jforrester: "I can deploy this on Thursday, if needed." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047201 (https://phabricator.wikimedia.org/T350004) (owner: 10Jforrester)