[00:00:20] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4044.ulsfo.wmnet with OS bullseye [00:00:33] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901770 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4044.ulsfo.wmnet with OS bullseye [00:00:51] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1046793 (owner: 10TrainBranchBot) [00:02:35] !log zabe@deploy1002 Finished scap: T366649 (duration: 15m 16s) [00:02:39] T366649: Create an 'Universal Code of Conduct Coordinating Committee (U4C)' private wiki - https://phabricator.wikimedia.org/T366649 [00:03:21] (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046800 [00:03:22] (03CR) 10Zabe: [C:03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046800 (owner: 10Zabe) [00:04:16] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046800 (owner: 10Zabe) [00:04:51] !log zabe@deploy1002 Started scap: Update interwiki cache [00:05:34] !log zabe@mwmaint1002:~$ mwscript extensions/CirrusSearch/maintenance/UpdateSearchIndexConfig.php --wiki=u4cwiki --cluster=all 2>&1 | tee /tmp/u4c.UpdateSearchIndexConfig.log # T366649 [00:05:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:12] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4044.ulsfo.wmnet with OS bullseye [00:10:20] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901798 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4044.ulsfo.wmnet with 
OS bullseye execu... [00:10:27] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4044.ulsfo.wmnet with OS bullseye [00:10:34] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901799 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4044.ulsfo.wmnet with OS bullseye [00:13:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P65124 and previous config saved to /var/cache/conftool/dbconfig/20240618-001316-ladsgroup.json [00:13:40] 06SRE, 10SRE-swift-storage, 06Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839 (10Zabe) 03NEW [00:14:09] 06SRE, 10SRE-swift-storage, 06Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839#9901810 (10Zabe) [00:18:54] !log zabe@deploy1002 Finished scap: Update interwiki cache (duration: 14m 03s) [00:24:51] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:25:43] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 2.873 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:28:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T352010)', diff saved to https://phabricator.wikimedia.org/P65125 and previous config saved to /var/cache/conftool/dbconfig/20240618-002823-ladsgroup.json [00:28:29] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [00:29:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: firmware upgrade for mw1359.eqiad.wmnet, 
mw1364.eqiad.wmnet, mw1365.eqiad.wmnet, mw1412.eqiad.wmnet - https://phabricator.wikimedia.org/T367766#9901830 (10Jclark-ctr) @clement_goubert did you need just idrac updated we can do that easily. B... [00:31:27] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage [00:34:56] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4044.ulsfo.wmnet with reason: host reimage [00:50:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T364069)', diff saved to https://phabricator.wikimedia.org/P65126 and previous config saved to /var/cache/conftool/dbconfig/20240618-005054-marostegui.json [00:50:59] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [00:57:08] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4044.ulsfo.wmnet with OS bullseye [00:57:18] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901859 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4044.ulsfo.wmnet with OS bullseye completed: - cp4044 (**PASS... 
[01:06:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P65127 and previous config saved to /var/cache/conftool/dbconfig/20240618-010601-marostegui.json [01:07:55] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.43.0-wmf.10 [core] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1046803 (https://phabricator.wikimedia.org/T361404) [01:07:57] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.43.0-wmf.10 [core] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1046803 (https://phabricator.wikimedia.org/T361404) (owner: 10TrainBranchBot) [01:10:48] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4044.ulsfo.wmnet [01:11:54] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9901869 (10BCornwall) [01:17:51] (03PS4) 10Scott French: mediawiki: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042440 (https://phabricator.wikimedia.org/T362978) [01:17:51] (03PS2) 10Scott French: mediawiki: enable securityContext in all canaries [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046692 (https://phabricator.wikimedia.org/T362978) [01:17:51] (03PS2) 10Scott French: mediawiki: enable securityContext everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046693 (https://phabricator.wikimedia.org/T362978) [01:21:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P65128 and previous config saved to /var/cache/conftool/dbconfig/20240618-012109-marostegui.json [01:21:59] (03PS1) 10BCornwall: hiera: Unify ulsfo trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) [01:24:36] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2948/console" [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall) [01:31:11] (03Merged) 10jenkins-bot: Branch commit for wmf/1.43.0-wmf.10 [core] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1046803 (https://phabricator.wikimedia.org/T361404) (owner: 10TrainBranchBot) [01:36:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T364069)', diff saved to https://phabricator.wikimedia.org/P65129 and previous config saved to /var/cache/conftool/dbconfig/20240618-013616-marostegui.json [01:36:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [01:36:22] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [01:36:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1181.eqiad.wmnet with reason: Maintenance [01:36:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1181 (T364069)', diff saved to https://phabricator.wikimedia.org/P65130 and previous config saved to /var/cache/conftool/dbconfig/20240618-013639-marostegui.json [01:40:01] (03CR) 10Scott French: "Alright, I think I've figured out how to make CI render the right diffs here." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1046692 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [01:40:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:45:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [01:55:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0200) [02:00:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:38:46] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:56:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [02:58:46] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0300) [03:01:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [03:01:51] (03PS1) 10TrainBranchBot: testwikis wikis to 1.43.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046807 (https://phabricator.wikimedia.org/T361404) [03:01:52] (03CR) 10TrainBranchBot: [C:03+2] testwikis wikis to 1.43.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046807 (https://phabricator.wikimedia.org/T361404) (owner: 10TrainBranchBot) [03:02:29] (03Merged) 10jenkins-bot: testwikis wikis to 1.43.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046807 (https://phabricator.wikimedia.org/T361404) (owner: 10TrainBranchBot) [03:03:00] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.10 refs T361404 [03:03:05] T361404: 1.43.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T361404 [03:07:58] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv4: Connect - Telxius, 
AS12956/IPv6: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:08:18] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [03:08:18] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:08:48] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [03:51:31] FIRING: ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:55:48] FIRING: [2x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:56:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:58:46] RESOLVED: [2x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0400) [04:01:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:01:57] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.10 refs T361404 (duration: 58m 57s) [04:02:04] T361404: 1.43.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T361404 [04:02:52] !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.7 (duration: 02m 50s) [04:20:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s4 T367378 [04:20:45] T367378: Switchover s4 master (db1160 -> db1238) - https://phabricator.wikimedia.org/T367378 [04:20:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1238 with weight 0 T367378', diff saved to https://phabricator.wikimedia.org/P65131 and previous config saved to /var/cache/conftool/dbconfig/20240618-042054-marostegui.json [04:21:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s4 T367378 [04:21:40] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db1238 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1042595 (https://phabricator.wikimedia.org/T367378) (owner: 10Gerrit maintenance bot) [04:23:45] (03PS1) 10Marostegui: db1201: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1046808 [04:34:35] (03CR) 10Marostegui: [C:03+2] db1201: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1046808 (owner: 10Marostegui) [04:34:41] 06SRE, 06collaboration-services, 06DBA: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9902003 (10Marostegui) There is a problem before we 
can even check the grants, there's no connection between those two hosts and the proxies. I guess a FW rules needs to be added somewhere: ` root@l... [04:47:28] !log Starting s4 eqiad failover from db1160 to db1238 - T367378 [04:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:47:33] T367378: Switchover s4 master (db1160 -> db1238) - https://phabricator.wikimedia.org/T367378 [04:47:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T367378', diff saved to https://phabricator.wikimedia.org/P65132 and previous config saved to /var/cache/conftool/dbconfig/20240618-044747-marostegui.json [04:48:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1238 to s4 primary and set section read-write T367378', diff saved to https://phabricator.wikimedia.org/P65133 and previous config saved to /var/cache/conftool/dbconfig/20240618-044806-marostegui.json [04:48:41] (03PS2) 10Gerrit maintenance bot: wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1042596 (https://phabricator.wikimedia.org/T367378) [04:49:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1160 T367378', diff saved to https://phabricator.wikimedia.org/P65134 and previous config saved to /var/cache/conftool/dbconfig/20240618-044908-root.json [04:49:23] (03CR) 10Marostegui: [C:03+2] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1042596 (https://phabricator.wikimedia.org/T367378) (owner: 10Gerrit maintenance bot) [04:49:24] (03CR) 10Marostegui: [V:03+2 C:03+2] wmnet: Update s4-master alias [dns] - 10https://gerrit.wikimedia.org/r/1042596 (https://phabricator.wikimedia.org/T367378) (owner: 10Gerrit maintenance bot) [04:51:44] (03PS1) 10Marostegui: db1160: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1046809 [04:51:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Long schema 
change [04:51:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1160.eqiad.wmnet with reason: Long schema change [04:52:09] (03CR) 10Marostegui: [C:03+2] db1160: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1046809 (owner: 10Marostegui) [04:54:52] !log dbmaint eqiad s4 deploy schema change on db1160 T364299 [04:54:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:54:57] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [04:57:23] !log dbmaint eqiad s2 deploy schema change on db2207 T364299 [04:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:43] !log dbmaint codfw s5 deploy schema change on db2213 T364299 [05:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:00:48] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [05:15:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T364069)', diff saved to https://phabricator.wikimedia.org/P65135 and previous config saved to /var/cache/conftool/dbconfig/20240618-051517-marostegui.json [05:15:22] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [05:23:46] FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:30:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P65136 and previous config saved to /var/cache/conftool/dbconfig/20240618-053024-marostegui.json [05:33:22] (03PS1) 10KartikMistry: Content Translation: Adjust the Machine translation limit for Telugu WP from 70% to 75% [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1046810 (https://phabricator.wikimedia.org/T367838) [05:38:19] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046810 (https://phabricator.wikimedia.org/T367838) (owner: 10KartikMistry) [05:44:53] !log jynus@cumin2002 START - Cookbook sre.hosts.decommission for hosts db2102.codfw.wmnet [05:45:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P65137 and previous config saved to /var/cache/conftool/dbconfig/20240618-054531-marostegui.json [05:50:44] !log jynus@cumin2002 START - Cookbook sre.dns.netbox [05:53:29] !log jynus@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2102.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin2002" [05:54:35] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:55:05] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:55:21] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 11, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:55:23] !log jynus@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2102.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin2002" [05:55:23] !log jynus@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [05:55:24] !log jynus@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2102.codfw.wmnet [05:56:35] 
RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0600) [06:00:05] marostegui, Amir1, and arnaudb: May I have your attention please! Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0600) [06:00:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T364069)', diff saved to https://phabricator.wikimedia.org/P65138 and previous config saved to /var/cache/conftool/dbconfig/20240618-060038-marostegui.json [06:00:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [06:00:46] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [06:00:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [06:01:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T364069)', diff saved to https://phabricator.wikimedia.org/P65139 and previous config saved to /var/cache/conftool/dbconfig/20240618-060100-marostegui.json [06:02:43] (03PS1) 10Jcrespo: mariadb: Remove all remaining puppet references to db2102 [puppet] - 10https://gerrit.wikimedia.org/r/1046812 (https://phabricator.wikimedia.org/T366892) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - 
https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:20:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:20:48] FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:21:55] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:21:57] (03CR) 10Jcrespo: [C:03+2] mariadb: Remove all remaining puppet references to db2102 [puppet] - 10https://gerrit.wikimedia.org/r/1046812 (https://phabricator.wikimedia.org/T366892) (owner: 10Jcrespo) [06:23:54] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2102.codw.wmnet - https://phabricator.wikimedia.org/T366892#9902128 (10jcrespo) a:05jcrespo→03None [06:31:24] (03CR) 10Ayounsi: dnsbox: announce ntp-[abc].anycast.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1046685 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [06:52:35] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1240.eqiad.wmnet with reason: data reload [06:52:49] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1240.eqiad.wmnet with reason: data reload [06:54:09] (03CR) 10Muehlenhoff: [C:03+1] "LGTM (I'll 
also send a patch to move this to firewall::service when the migration is completed)" [puppet] - 10https://gerrit.wikimedia.org/r/1046785 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [06:56:23] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:56:23] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:00:04] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0700). [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:03] here [07:01:48] (03CR) 10Ayounsi: "That's nice ! it's great to see data being removed from those yaml files !" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1046737 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [07:02:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046810 (https://phabricator.wikimedia.org/T367838) (owner: 10KartikMistry) [07:03:20] (03Merged) 10jenkins-bot: Content Translation: Adjust the Machine translation limit for Telugu WP from 70% to 75% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046810 (https://phabricator.wikimedia.org/T367838) (owner: 10KartikMistry) [07:04:20] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1046810|Content Translation: Adjust the Machine translation limit for Telugu WP from 70% to 75% (T367838)]] [07:04:24] T367838: Adjust the Machine translation limit for Telugu Wikipedia from 70% to 75% - https://phabricator.wikimedia.org/T367838 [07:04:45] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 40 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:08:32] (03PS6) 10Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392 [07:09:15] !log kartik@deploy1002 kartik: Backport for [[gerrit:1046810|Content Translation: Adjust the Machine translation limit for Telugu WP from 70% to 75% (T367838)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:10:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9902174 (10MoritzMuehlenhoff) >>! In T367071#9882394, @Jclark-ctr wrote: > @MoritzMuehlenhoff after replacing failed drive looked like it might boot but still fails.... 
[07:10:37] (03CR) 10Jcrespo: "After rebase, those changes have already committed by someone else :-( :-) :-| . Only the heartbeat changes are left." [puppet] - 10https://gerrit.wikimedia.org/r/868392 (owner: 10Jcrespo) [07:11:06] !log kartik@deploy1002 kartik: Continuing with sync [07:12:46] !log dbmaint codfw s4 deploy schema change [07:12:47] (03CR) 10Jcrespo: [C:04-1] "I believe this is missing the new replication user. @Ladsgroup" [puppet] - 10https://gerrit.wikimedia.org/r/868392 (owner: 10Jcrespo) [07:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:51] !log dbmaint codfw s4 deploy schema change T367261 [07:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:55] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [07:15:43] !log dbmaint eqiad s5 deploy schema change on primary master T364069 [07:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:15:47] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [07:19:29] I'll also deploy cxserver since there is no other config/backport patches in the queue. 
[07:19:43] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2024-06-13-045621-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042603 (https://phabricator.wikimedia.org/T364122) (owner: 10KartikMistry) [07:19:43] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 31 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:20:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:20:33] (03Merged) 10jenkins-bot: Update cxserver to 2024-06-13-045621-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042603 (https://phabricator.wikimedia.org/T364122) (owner: 10KartikMistry) [07:20:56] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1046810|Content Translation: Adjust the Machine translation limit for Telugu WP from 70% to 75% (T367838)]] (duration: 16m 36s) [07:21:01] T367838: Adjust the Machine translation limit for Telugu Wikipedia from 70% to 75% - https://phabricator.wikimedia.org/T367838 [07:21:50] seems backport failing with: "07:20:56 backport failed: Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=kartik', 'Backport for [[gerrit:1046810|Content Translation: Adjust the Machine translation limit for Telugu WP from 70% to 75% (T367838)]]']' returned non-zero exit status 1." [07:21:55] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:24:39] but change seems applied.. 
[07:25:59] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 9438 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [07:26:08] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/cxserver: apply [07:26:30] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [07:26:34] jouncebot: now [07:26:34] For the next 0 hour(s) and 33 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0700) [07:26:37] jouncebot: next [07:26:37] In 0 hour(s) and 33 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0800) [07:28:02] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/cxserver: apply [07:28:10] 06SRE, 06collaboration-services, 06DBA: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9902206 (10Marostegui) a:05Ladsgroup→03eoghan [07:28:34] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [07:29:11] effie: I'm deploying cxserver now, since there weren't any more backport/config patches.. [07:29:29] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [07:29:46] kart_: thank you! 
[07:30:02] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [07:31:30] !log Updated cxserver to 2024-06-13-045621-production (T364122, T138401) [07:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:36] T364122: In zgh.wikipedia Content Translation use machine translation with MinT Translation with tzm code - https://phabricator.wikimedia.org/T364122 [07:31:36] T138401: Replace jsduck with JSDoc3 across all Wikimedia code bases - https://phabricator.wikimedia.org/T138401 [07:31:41] effie: I'm done. [07:31:46] cheers [07:33:03] !log jiji@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-eqiad [07:35:49] (03CR) 10Slyngshede: "@dzahn@wikimedia.org Yes, I just checked on the servers as well and the CAS version of the Gitlab services have been removed. Very nice :-" [puppet] - 10https://gerrit.wikimedia.org/r/1043247 (https://phabricator.wikimedia.org/T320390) (owner: 10Dzahn) [07:35:51] jnuche: I am rebooting about 23 k8s nodes, I expect not to delay the train much [07:36:13] effie: ack, thx for the headsup [07:38:21] PROBLEM - MariaDB Replica Lag: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 455.92 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:40:51] !log dbmaint codfw s7 deploy schema change on codfw master T364069 [07:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:56] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [07:42:21] RECOVERY - MariaDB Replica Lag: s4 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 55.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:42:21] (03CR) 10Volans: "For the skip of rebooted hosts if not too urgent you could wait for https://phabricator.wikimedia.org/T366797" [cookbooks] - 10https://gerrit.wikimedia.org/r/1046780 
(https://phabricator.wikimedia.org/T367592) (owner: 10Ryan Kemper) [07:43:25] (03CR) 10Brouberol: [C:03+1] "AFAIK, this was only useful for Cassandra. Druid connection time was not an issue, so +1! Yay to less hacks :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046753 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [07:44:27] (03PS2) 10Muehlenhoff: irc.w.o: Add support for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1046659 (https://phabricator.wikimedia.org/T331702) [07:46:43] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 40 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:47:36] 06SRE, 06collaboration-services, 06DBA: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9902233 (10eoghan) That's right -- we'll be doing that as part of the maintenance work later today: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1046785 https://phabricator.wikimedia.org/T36... 
[07:51:45] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 25 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [07:52:29] (03PS5) 10Ayounsi: Prepare for netbox-dev [puppet] - 10https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275) [07:52:29] (03PS6) 10Ayounsi: Netbox 4: JOBRESULT_RETENTION -> JOB_RETENTION [puppet] - 10https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) [07:52:29] (03PS2) 10Ayounsi: Rename ganeti-netbox-sync.py to ganeti_netbox_sync.py [puppet] - 10https://gerrit.wikimedia.org/r/1039697 [07:52:51] (03CR) 10Ayounsi: Prepare for netbox-dev (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [07:53:20] (03PS1) 10Slyngshede: Upgrade to CAS 7.0.4.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047013 (https://phabricator.wikimedia.org/T367487) [07:53:37] 06SRE, 06collaboration-services, 06DBA: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9902239 (10Marostegui) Yes, we have that RW and RO users in other services. [07:56:51] !log uploaded python-irc 8.5.3+dfsg-4+wmf1 to apt.wikimedia.org T331702 [07:56:53] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [07:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:56] T331702: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 [07:59:42] jnuche: I will ping you when I am done alright ? 
[07:59:45] (03PS2) 10Slyngshede: Upgrade to CAS 7.0.4.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047013 (https://phabricator.wikimedia.org/T367487) [08:00:04] effie: ok [08:00:05] jnuche and brennen: How many deployers does it take to do MediaWiki train - Utc-0+Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T0800). [08:02:34] (03PS3) 10Slyngshede: Upgrade to CAS 7.0.4.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047013 (https://phabricator.wikimedia.org/T367487) [08:04:23] (03PS4) 10Slyngshede: Upgrade to CAS 7.0.4.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047013 (https://phabricator.wikimedia.org/T367487) [08:07:00] (03CR) 10Santiago Faci: [C:03+1] "I would that nobody knows it because we never had the opportunity to check that. So far those services only connect with Druid and, when s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046753 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [08:07:53] (03CR) 10Santiago Faci: [C:03+1] "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046753 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [08:09:21] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9902257 (10ABran-WMF) [08:12:52] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9902282 (10ABran-WMF) [08:13:34] (03PS1) 10KartikMistry: testwiki: Enable MinT for Wikipedia readers MVP on a Igbo Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047014 (https://phabricator.wikimedia.org/T367852) [08:21:58] (03PS10) 10Arnaudb: mariadb: bugfixes mysql_legacy [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 
(https://phabricator.wikimedia.org/T367496) [08:24:53] jnuche: last 3 reboots [08:25:48] FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:28:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1046659 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [08:29:26] !log deploy pfw policy update 1718644831 - T367796 [08:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:48] jnuche: go ahead [08:31:06] sorry for the delay, it has been hard finding enough time to do this [08:31:14] effie: thanks! I'll start the train deployment in a couple of minutes [08:31:17] no prob [08:34:51] !log dbmaint eqiad s6 deploy schema change on eqiad master T364069 [08:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:56] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [08:35:45] PROBLEM - BGP status on lsw1-e2-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64601/IPv6: Connect - kubernetes-eqiad, AS64601/IPv4: Connect - kubernetes-eqiad https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:36:47] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:37:45] RECOVERY - BGP status on lsw1-e2-eqiad.mgmt is OK: BGP OK - up: 10, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:38:29] (03CR) 10Vgutierrez: [C:04-1] varnish: show better error for 429s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1041705 
(https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [08:38:47] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047018 (https://phabricator.wikimedia.org/T361404) [08:38:48] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047018 (https://phabricator.wikimedia.org/T361404) (owner: 10TrainBranchBot) [08:39:37] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047018 (https://phabricator.wikimedia.org/T361404) (owner: 10TrainBranchBot) [08:40:15] !log jiji@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-worker-eqiad [08:41:45] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [08:41:57] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [08:43:05] !log cp4037 currently depooled and puppet disabled for T367756 [08:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:10] T367756: Upgrade ulsfo hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756 [08:44:39] (03CR) 10Muehlenhoff: [C:03+2] irc.w.o: Add support for Bullseye [puppet] - 10https://gerrit.wikimedia.org/r/1046659 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [08:45:10] !log hashar@deploy1002 Started deploy [integration/docroot@7a92240]: doc: Add mwseaql Rust crate [08:45:17] !log hashar@deploy1002 Finished deploy [integration/docroot@7a92240]: doc: Add mwseaql Rust crate (duration: 00m 07s) [08:46:10] 10ops-codfw, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598#9902380 
(10Gehel) p:05Triage→03High [08:47:44] PROBLEM - Host db1165 #page is DOWN: PING CRITICAL - Packet loss = 100% [08:47:56] RECOVERY - Host db1165 #page is UP: PING WARNING - Packet loss = 66%, RTA = 314.72 ms [08:48:34] 10ops-codfw, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2024.06.17 - 2024.07.07): Elastic2099 unresponsive - https://phabricator.wikimedia.org/T367598#9902391 (10Gehel) [08:49:24] weird false positive [08:50:01] here [08:50:43] (03PS1) 10Marostegui: Revert "db1160: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1047019 [08:50:58] arnaudb: db1165 you mean? [08:50:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P65140 and previous config saved to /var/cache/conftool/dbconfig/20240618-085057-root.json [08:51:05] yep [08:51:19] (03CR) 10Marostegui: [C:03+2] Revert "db1160: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1047019 (owner: 10Marostegui) [08:51:25] it looks like it has hardware issues, will downtime it [08:51:44] thanks [08:51:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1165.eqiad.wmnet with reason: repl issues [08:51:49] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 7 days, 0:00:00 on db1165.eqiad.wmnet with reason: repl issues [08:51:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on db1165.eqiad.wmnet with reason: hardware issues [08:52:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on db1165.eqiad.wmnet with reason: hardware issues [08:52:11] thanks arnaudb [08:52:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1165 depool to troubleshoot hardware issues', diff saved to https://phabricator.wikimedia.org/P65141 and previous config saved to /var/cache/conftool/dbconfig/20240618-085254-arnaudb.json [08:53:48] !log jnuche@deploy1002 rebuilt and 
synchronized wikiversions files: group0 wikis to 1.43.0-wmf.10 refs T361404 [08:53:53] T361404: 1.43.0-wmf.10 deployment blockers - https://phabricator.wikimedia.org/T361404 [08:54:19] RECOVERY - ircecho bot process on irc2002 is OK: PROCS OK: 1 process with command name python2, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [08:57:32] 10ops-eqiad, 06DBA, 06DC-Ops: db1165 network flapping issues - https://phabricator.wikimedia.org/T367854 (10ABran-WMF) 03NEW [08:57:39] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:58:34] 10ops-eqiad, 06DBA, 06DC-Ops: db1165 network flapping issues - https://phabricator.wikimedia.org/T367854#9902429 (10ABran-WMF) 05Open→03In progress [08:59:17] PROBLEM - ircecho bot process on irc1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name python2, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [09:01:25] FIRING: SystemdUnitFailed: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:01:49] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:03:13] (03PS1) 10Muehlenhoff: mw-irc: Fix installation of Prometheus Python client package [puppet] - 10https://gerrit.wikimedia.org/r/1047021 (https://phabricator.wikimedia.org/T331702) [09:03:25] (03CR) 10CI reject: [V:04-1] mw-irc: Fix installation of Prometheus Python client package [puppet] - 
10https://gerrit.wikimedia.org/r/1047021 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [09:03:38] (03PS4) 10Gehel: cloudelastic: enable IPIP for LVS [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T367511) (owner: 10Bking) [09:04:15] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9902459 (10eoghan) [09:05:55] (03PS2) 10Muehlenhoff: mw-irc: Fix installation of Prometheus Python client package [puppet] - 10https://gerrit.wikimedia.org/r/1047021 (https://phabricator.wikimedia.org/T331702) [09:05:58] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [09:06:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P65142 and previous config saved to /var/cache/conftool/dbconfig/20240618-090603-root.json [09:06:49] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:08:08] (03CR) 10Btullis: [C:03+1] "Looks good." [dns] - 10https://gerrit.wikimedia.org/r/1046699 (https://phabricator.wikimedia.org/T367768) (owner: 10Brouberol) [09:08:37] PROBLEM - Host acmechief2002 is DOWN: PING CRITICAL - Packet loss = 100% [09:08:43] PROBLEM - Host logstash2023 is DOWN: PING CRITICAL - Packet loss = 100% [09:08:49] PROBLEM - Host ncredir2001 is DOWN: PING CRITICAL - Packet loss = 100% [09:08:49] PROBLEM - Host netboxdb2002 is DOWN: PING CRITICAL - Packet loss = 100% [09:09:42] (03CR) 10JMeybohm: [C:03+1] "Ouch, yeah...sounds plausible. Nice find!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1046692 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [09:10:48] FIRING: [2x] ProbeDown: Service logstash2023:443 has failed probes (http_logstash_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#logstash2023:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:10:49] PROBLEM - ganeti-noded running on ganeti2029 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [09:10:52] !log dbmaint eqiad s4 deploy schema change T367261 [09:10:53] (03PS3) 10Brouberol: dse-k8s: setup a discovery record for all deployed applications [dns] - 10https://gerrit.wikimedia.org/r/1046699 (https://phabricator.wikimedia.org/T367768) [09:10:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:56] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [09:12:14] (03CR) 10Brouberol: [C:03+2] dse-k8s: setup a discovery record for all deployed applications [dns] - 10https://gerrit.wikimedia.org/r/1046699 (https://phabricator.wikimedia.org/T367768) (owner: 10Brouberol) [09:12:21] (03CR) 10JMeybohm: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046693 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [09:13:37] !log rebooting ganeti2029 [09:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:57] PROBLEM - Host ganeti2029 is DOWN: PING CRITICAL - Packet loss = 100% [09:15:01] (03CR) 10Vgutierrez: cloudelastic: enable IPIP for LVS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T367511) (owner: 10Bking) [09:18:23] (03CR) 10Arnaudb: [C:03+1] cephadm: install lvm2 on all target nodes, not just osds [puppet] - 
10https://gerrit.wikimedia.org/r/1043809 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [09:18:36] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1181 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1047022 (https://phabricator.wikimedia.org/T367857) [09:18:40] (03PS1) 10Gerrit maintenance bot: wmnet: Update s7-master alias [dns] - 10https://gerrit.wikimedia.org/r/1047023 (https://phabricator.wikimedia.org/T367857) [09:18:43] RECOVERY - Host ganeti2029 is UP: PING OK - Packet loss = 0%, RTA = 30.22 ms [09:18:46] FIRING: [3x] ProbeDown: Service ganeti2029:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:18:49] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 38 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:18:51] RECOVERY - ganeti-noded running on ganeti2029 is OK: PROCS OK: 1 process with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [09:19:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: firmware upgrade for mw1359.eqiad.wmnet, mw1364.eqiad.wmnet, mw1365.eqiad.wmnet, mw1412.eqiad.wmnet - https://phabricator.wikimedia.org/T367766#9902528 (10Clement_Goubert) Yes, idrac should be enough, thank you. 
[09:20:29] RECOVERY - Host logstash2023 is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms [09:20:37] RECOVERY - Host acmechief2002 is UP: PING OK - Packet loss = 0%, RTA = 30.51 ms [09:20:37] RECOVERY - Host netboxdb2002 is UP: PING OK - Packet loss = 0%, RTA = 30.63 ms [09:20:39] RECOVERY - Host ncredir2001 is UP: PING OK - Packet loss = 0%, RTA = 30.58 ms [09:20:48] FIRING: [3x] ProbeDown: Service ganeti2029:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:21:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P65143 and previous config saved to /var/cache/conftool/dbconfig/20240618-092108-root.json [09:23:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:23:46] RESOLVED: [3x] ProbeDown: Service ganeti2029:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:23:46] FIRING: [3x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:23:51] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 31 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:26:01] 06SRE, 10SRE-swift-storage, 06Privacy Engineering: Check the permissions on the swift containers for 
the new private wikis - https://phabricator.wikimedia.org/T367839#9902539 (10MatthewVernon) Have the swift containers been generated for these wikis? I can't find any obviously-matching ones. [09:27:03] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 69, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:27:35] (03CR) 10Klausman: [C:03+1] "Yes, we would like to keep the alert, and for now, the threshold/duration should be good. We will see if we need to tune it, and then make" [alerts] - 10https://gerrit.wikimedia.org/r/1046781 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [09:27:47] !log arm keyholder on acmechief2002 [09:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:16] (03CR) 10Muehlenhoff: [C:03+2] mw-irc: Fix installation of Prometheus Python client package [puppet] - 10https://gerrit.wikimedia.org/r/1047021 (https://phabricator.wikimedia.org/T331702) (owner: 10Muehlenhoff) [09:29:32] (03CR) 10MVernon: [C:03+2] cephadm: install lvm2 on all target nodes, not just osds [puppet] - 10https://gerrit.wikimedia.org/r/1043809 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [09:31:53] jnuche: ping me please after the train is done [09:33:28] RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on acmechief2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [09:36:01] 06SRE, 10SRE-swift-storage, 06Privacy Engineering: Check the permissions on the swift containers for the new private wikis - https://phabricator.wikimedia.org/T367839#9902607 (10MatthewVernon) ...further to @Ladsgroup's comment elsewhere, if the intention is that these wikis all have local upload disabled, t... 
[09:36:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P65144 and previous config saved to /var/cache/conftool/dbconfig/20240618-093614-root.json [09:40:05] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:41:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc2002.wikimedia.org [09:41:30] 06SRE, 06Infrastructure-Foundations: DRBD kernel error on ganeti2031 led to kernel hang - https://phabricator.wikimedia.org/T348730#9902622 (10MoritzMuehlenhoff) Happened once more on ganeti2029 today. We're gradually moving nodes to Bookworm (the routed cluster and magru cluster are already running it and the... [09:45:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2002.wikimedia.org [09:48:24] (03CR) 10Kamila Součková: service: add basic config for shellbox-video (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043724 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [09:48:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1002.wikimedia.org [09:50:06] effie: train is done [09:50:16] cheers! [09:50:20] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: management and main interface down for mw2321.codfw.wmnet - https://phabricator.wikimedia.org/T367702#9902628 (10Clement_Goubert) Yes, it is only for the `docker_pull_k8s` step, for which failures are not critical unless a lot of hosts fail it... 
[09:51:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P65145 and previous config saved to /var/cache/conftool/dbconfig/20240618-095119-root.json [09:51:23] RECOVERY - ircecho bot process on irc1002 is OK: PROCS OK: 1 process with command name python2, regex args /usr/local/bin/udpmxircecho.py https://wikitech.wikimedia.org/wiki/Ircecho [09:52:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1002.wikimedia.org [09:52:52] (03CR) 10Btullis: Initial import of ceph-csi-rbd chart for inspection (035 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [09:53:21] !log cgoubert@cumin1002 conftool action : set/weight=10:pooled=yes; selector: name=(wikikube-worker1019.eqiad.wmnet|wikikube-worker1020.eqiad.wmnet|wikikube-worker1021.eqiad.wmnet),cluster=kubernetes,service=kubesvc [09:55:51] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [09:57:39] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:59:57] (03PS16) 10Btullis: Add a values file for the ceph-csi plugin on dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031589 (https://phabricator.wikimedia.org/T327259) [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1000) [10:00:17] (03PS11) 10Btullis: Deploy the ceph-csi-rbd chart to dse-k8s with default values [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1028938 (https://phabricator.wikimedia.org/T364472) [10:00:26] (03PS7) 10Btullis: Add WMF customisations to the upstream ceph-csi-rbd chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) [10:00:51] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 26 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [10:00:57] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1046613 (https://phabricator.wikimedia.org/T366525) (owner: 10Slyngshede) [10:01:25] RESOLVED: SystemdUnitFailed: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:03:01] (03PS17) 10Btullis: Add a values file for the ceph-csi plugin on dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031589 (https://phabricator.wikimedia.org/T327259) [10:03:16] (03CR) 10Alexandros Kosiaris: [C:03+2] mediawiki-image-download: Drop to 90% [puppet] - 10https://gerrit.wikimedia.org/r/1039619 (https://phabricator.wikimedia.org/T366778) (owner: 10Alexandros Kosiaris) [10:04:23] (03CR) 10Vgutierrez: [C:03+1] "looks good, changes should be applied first on the realservers before restarting pybal on lvs1020 and lvs1018" [puppet] - 10https://gerrit.wikimedia.org/r/1043302 (https://phabricator.wikimedia.org/T367511) (owner: 10Bking) [10:04:44] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp3066.esams.wmnet [10:05:19] !log cp3066 currently depooled and puppet disabled for T367756 [10:05:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:23] T367756: Upgrade 
hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756 [10:05:40] (03CR) 10JMeybohm: [C:03+1] mw-on-k8s: Deploy statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043704 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [10:05:47] eoghan: shall we start? [10:06:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P65146 and previous config saved to /var/cache/conftool/dbconfig/20240618-100624-root.json [10:08:06] Amir1: Yep! Just getting myself set up here. I suggest #wikimedia-sre-collab to keep the noise out of here, that ok with you? [10:08:17] sure [10:08:55] Heads up, we're going to start the mailman migration to new hardware now, details can be found here: https://phabricator.wikimedia.org/T367521 [10:08:57] hnowlan: note to oncall: We (I mean mostly eoghan, I'm just for emotional support) are migrating mailman to new hw and sw [10:09:11] downtime of two hours [10:09:12] thanks for letting me know! 
[10:09:27] (03CR) 10JMeybohm: Add WMF customisations to the upstream ceph-csi-rbd chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [10:09:49] (03CR) 10JMeybohm: [V:03+2 C:03+2] Allow multiple update files in one go [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/643912 (owner: 10JMeybohm) [10:10:27] (03CR) 10EoghanGaffney: [C:03+2] lists: Block incoming email on lists hosts during mailman migration [puppet] - 10https://gerrit.wikimedia.org/r/1043799 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [10:14:18] !log eoghan@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on lists[1001,1004,2001].wikimedia.org with reason: Mailman migration [10:14:34] !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lists[1001,1004,2001].wikimedia.org with reason: Mailman migration [10:14:49] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9902692 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=f70cad25-fba3-40c1-a3c3-abe8534eca40) set by eogha... [10:14:57] (03PS2) 10Hnowlan: service: add basic config for shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1043724 (https://phabricator.wikimedia.org/T357309) [10:16:08] 06SRE, 07SRE-Unowned, 10Wikimedia-IRC-RC-Server: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702#9902698 (10MoritzMuehlenhoff) Bullseye-based servers are up and running, one can connect to irc1002.wikimedia.org and irc2002.wikimedia.org the same way as for irc1001/irc2001.... 
[10:17:41] (03PS1) 10Fabfur: hiera: enable benthos on cp3066 [puppet] - 10https://gerrit.wikimedia.org/r/1047029 (https://phabricator.wikimedia.org/T367756) [10:18:32] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9902700 (10eoghan) [10:19:05] (03PS1) 10Effie Mouzeli: mw-mcrouter: add ClusterIP for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047030 (https://phabricator.wikimedia.org/T346690) [10:21:06] (03CR) 10Hnowlan: service: add basic config for shellbox-video (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1043724 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [10:21:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1160 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P65147 and previous config saved to /var/cache/conftool/dbconfig/20240618-102130-root.json [10:21:48] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047029 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [10:22:06] (03CR) 10Clément Goubert: [C:03+2] mw-on-k8s: Deploy statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043704 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [10:22:14] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, let's give this a shot. I'll upload the secondary openjdk-21 buildin a few, then we can attempt a build." 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047013 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [10:22:42] (03PS1) 10Effie Mouzeli: mw-mcrouter: add ClusterIP for codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047032 (https://phabricator.wikimedia.org/T346690) [10:22:55] (03CR) 10Slyngshede: [C:03+2] SSH Key mgmt: Ensure that keys are trimmed [software/bitu] - 10https://gerrit.wikimedia.org/r/1046613 (https://phabricator.wikimedia.org/T366525) (owner: 10Slyngshede) [10:23:03] (03Merged) 10jenkins-bot: mw-on-k8s: Deploy statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1043704 (https://phabricator.wikimedia.org/T365265) (owner: 10Clément Goubert) [10:23:17] jouncebot: nowandnext [10:23:17] For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1000) [10:23:17] In 1 hour(s) and 36 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1200) [10:23:53] (03PS1) 10MVernon: Move moss-fe{1,2}001 back to apus cluster [puppet] - 10https://gerrit.wikimedia.org/r/1047033 (https://phabricator.wikimedia.org/T279621) [10:24:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T364069)', diff saved to https://phabricator.wikimedia.org/P65148 and previous config saved to /var/cache/conftool/dbconfig/20240618-102418-marostegui.json [10:24:21] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 48.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:24:23] (03CR) 10Cathal Mooney: Include vlans with an IRB int in device vlans even if not on L2 port (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney) [10:24:23] T364069: Rebuild pagelinks tables 
- https://phabricator.wikimedia.org/T364069 [10:24:27] (03Merged) 10jenkins-bot: SSH Key mgmt: Ensure that keys are trimmed [software/bitu] - 10https://gerrit.wikimedia.org/r/1046613 (https://phabricator.wikimedia.org/T366525) (owner: 10Slyngshede) [10:24:40] (03PS1) 10Ladsgroup: prometheus: Change footer icon ping url [puppet] - 10https://gerrit.wikimedia.org/r/1047034 (https://phabricator.wikimedia.org/T256190) [10:27:11] (03CR) 10Ladsgroup: [C:04-1] "That's actually not the right url and will be removed too. I need to wait a week before pushing the correct url." [puppet] - 10https://gerrit.wikimedia.org/r/1047034 (https://phabricator.wikimedia.org/T256190) (owner: 10Ladsgroup) [10:27:29] !log cgoubert@deploy1002 Started scap: Deploy statsd exporter - T365265 [10:27:34] T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265 [10:29:18] (03CR) 10EoghanGaffney: "I don't believe this is the case, I think that it only acts on those IPs if they're set -- for example puppet runs correctly on lists1004/" [puppet] - 10https://gerrit.wikimedia.org/r/1036610 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [10:29:26] (03CR) 10EoghanGaffney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1036610 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [10:30:39] !log cgoubert@deploy1002 Finished scap: Deploy statsd exporter - T365265 (duration: 03m 39s) [10:30:41] !log upload openjdk-21 21.0.3+9-2~deb12u2 for bookworm/wikimedia (secondary rebuild on build2001 following the initial bootstrap build) https://phabricator.wikimedia.org/T367487 [10:30:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:20] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [10:31:21] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [10:31:50] !log cgoubert@deploy1002 helmfile [codfw] 
START helmfile.d/services/mw-api-int: apply [10:32:04] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [10:32:10] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [10:32:12] (03CR) 10Fabfur: [C:04-2] "Do not merge until haproxy is upgraded to 2.8.10 on the impacted hosts and benthos configuration is using rfc5424 syslog format" [puppet] - 10https://gerrit.wikimedia.org/r/1047029 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [10:32:19] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [10:32:22] (03CR) 10EoghanGaffney: [C:03+2] lists: Migrate mailman primary host from lists1001 -> lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/1036610 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [10:32:26] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [10:32:28] (03PS1) 10Ladsgroup: mariadb: Update code doc for replication grants [puppet] - 10https://gerrit.wikimedia.org/r/1047036 [10:32:35] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [10:32:38] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [10:32:47] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [10:32:55] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [10:33:06] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [10:33:11] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [10:33:20] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:33:31] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-jobrunner: apply [10:33:41] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-jobrunner: apply [10:33:45] !log cgoubert@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mw-jobrunner: apply [10:33:54] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-jobrunner: apply [10:35:21] (03PS2) 10Cathal Mooney: Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) [10:37:07] (03CR) 10CI reject: [V:04-1] Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney) [10:38:35] (03PS2) 10Hnowlan: DNM: Add shellbox-video vars/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309) [10:39:14] (03CR) 10CI reject: [V:04-1] DNM: Add shellbox-video vars/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [10:39:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P65149 and previous config saved to /var/cache/conftool/dbconfig/20240618-103925-marostegui.json [10:39:28] (03PS2) 10Ladsgroup: mariadb: Update code doc for replication grants [puppet] - 10https://gerrit.wikimedia.org/r/1047036 [10:39:33] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Update code doc for replication grants [puppet] - 10https://gerrit.wikimedia.org/r/1047036 (owner: 10Ladsgroup) [10:42:36] 06SRE, 06serviceops, 10Data Products (Data Products Sprint 15), 13Patch-For-Review, 07Service-deployment-requests: Commons Impact Metrics AQS 2.0 Deployment to Staging and Production - https://phabricator.wikimedia.org/T361835#9902761 (10SGupta-WMF) @Scott_French I am waiting for final go ahead from QA .... 
[10:43:05] (03PS3) 10Hnowlan: DNM: Add shellbox-video vars/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309) [10:45:07] (03CR) 10CI reject: [V:04-1] DNM: Add shellbox-video vars/config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1043812 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [10:47:14] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9902781 (10eoghan) [10:48:10] (03CR) 10EoghanGaffney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1046785 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [10:48:47] !log dbmaint codfw s2 deploy schema change T364069 [10:48:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:52] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [10:49:17] (03PS1) 10Fabfur: hiera: upgrade haproxy to 2.8 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) [10:49:24] (03PS1) 10Brouberol: ATS: replace service by discovery record for datahub-next [puppet] - 10https://gerrit.wikimedia.org/r/1047040 (https://phabricator.wikimedia.org/T367768) [10:49:26] (03PS1) 10Brouberol: ATS: replace service by discovery record for all DSE services [puppet] - 10https://gerrit.wikimedia.org/r/1047041 (https://phabricator.wikimedia.org/T367768) [10:49:38] (03CR) 10CI reject: [V:04-1] hiera: upgrade haproxy to 2.8 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [10:51:02] (03PS2) 10Fabfur: hiera: upgrade haproxy to 2.8 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) [10:51:22] (03CR) 10CI reject: [V:04-1] hiera: upgrade haproxy to 2.8 on ulsfo [puppet] - 
10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [10:51:26] (03PS1) 10Effie Mouzeli: mw-debug: point mediawiki to mw-mcrouter's clusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047043 (https://phabricator.wikimedia.org/T346690) [10:51:28] (03Abandoned) 10Hnowlan: conftool: Remove thumbor [puppet] - 10https://gerrit.wikimedia.org/r/1005728 (owner: 10Alexandros Kosiaris) [10:52:32] (03CR) 10Hnowlan: shellbox-video: initial helmfile configuration (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [10:54:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P65150 and previous config saved to /var/cache/conftool/dbconfig/20240618-105432-marostegui.json [10:56:01] (03CR) 10Kamila Součková: [C:03+1] "LGTM, except I have zero clue about the LVS part" [puppet] - 10https://gerrit.wikimedia.org/r/1043724 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [10:56:54] (03PS1) 10Muehlenhoff: idp::build: Install Java 21 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1047044 (https://phabricator.wikimedia.org/T367487) [10:57:56] (03PS2) 10Effie Mouzeli: mw-mcrouter: add ClusterIP for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047030 (https://phabricator.wikimedia.org/T346690) [10:58:00] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp3066.esams.wmnet [10:58:14] !log cp3066 repooled and puppet enabled (T367756) [10:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:19] T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756 [10:58:50] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047044 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) 
[10:59:15] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-mcrouter: add ClusterIP for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047030 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:00:53] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 41 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:01:18] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-test-presto1001.eqiad.wmnet [11:01:59] (03PS3) 10Fabfur: hiera: upgrade haproxy to 2.8 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) [11:02:04] (03PS2) 10Alexandros Kosiaris: mw-mcrouter: add ClusterIP for codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047032 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:02:47] (03CR) 10Effie Mouzeli: [C:03+2] mw-mcrouter: add ClusterIP for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047030 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:03:03] (03CR) 10Kamila Součková: [C:03+1] "thanks for cleaning up my TODOs, greatly appreciated :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [11:03:33] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [11:03:39] (03Merged) 10jenkins-bot: mw-mcrouter: add ClusterIP for eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047030 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:04:12] !next [11:05:04] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1208.eqiad.wmnet with reason: Upgrading to bookworm 
[11:05:12] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-presto1001.eqiad.wmnet [11:05:17] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1208.eqiad.wmnet with reason: Upgrading to bookworm [11:05:23] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:05:41] we 'll be change the kubernetes service IPs for mcrouter in eqiad and codfw [11:05:45] changing* [11:05:57] (03PS2) 10Muehlenhoff: idp::build: Install Java 21 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1047044 (https://phabricator.wikimedia.org/T367487) [11:07:51] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [11:07:57] (03PS3) 10Cathal Mooney: Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) [11:08:18] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [11:08:23] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [11:08:27] (03CR) 10Alexandros Kosiaris: [C:03+1] mw-mcrouter: add ClusterIP for codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047032 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:08:30] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-test-ui1001.eqiad.wmnet [11:09:38] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047044 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [11:09:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T364069)', diff saved to https://phabricator.wikimedia.org/P65151 and previous config saved to 
/var/cache/conftool/dbconfig/20240618-110939-marostegui.json [11:09:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [11:09:44] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [11:09:49] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host db1208.eqiad.wmnet with OS bookworm [11:09:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [11:10:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T364069)', diff saved to https://phabricator.wikimedia.org/P65152 and previous config saved to /var/cache/conftool/dbconfig/20240618-111001-marostegui.json [11:12:13] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-ui1001.eqiad.wmnet [11:13:08] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-mcrouter: apply [11:13:14] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-mcrouter: apply [11:13:25] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-test-druid1001.eqiad.wmnet [11:13:27] (03CR) 10Hnowlan: [C:03+2] kask: add mesh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [11:13:59] (03PS4) 10Clément Goubert: wikikube: Use conftool for scap docker_pull_k8s [puppet] - 10https://gerrit.wikimedia.org/r/1047031 (https://phabricator.wikimedia.org/T367862) [11:14:06] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [11:14:16] (03Merged) 10jenkins-bot: kask: add mesh configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [11:14:28] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [11:15:49] 
(03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1047044 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [11:15:56] (03CR) 10Jcrespo: [C:03+1] mariadb: Update code doc for replication grants [puppet] - 10https://gerrit.wikimedia.org/r/1047036 (owner: 10Ladsgroup) [11:16:08] (03CR) 10Slyngshede: [C:03+2] Upgrade to CAS 7.0.4.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047013 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [11:16:11] (03CR) 10Slyngshede: [V:03+2 C:03+2] Upgrade to CAS 7.0.4.1 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047013 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [11:16:11] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9902871 (10Clement_Goubert) [11:16:21] (03CR) 10Muehlenhoff: [C:03+2] idp::build: Install Java 21 on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1047044 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [11:16:37] (03PS7) 10Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392 [11:16:50] (03CR) 10Jcrespo: mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392 (owner: 10Jcrespo) [11:18:54] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-druid1001.eqiad.wmnet [11:20:55] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 33 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:22:23] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-test-master1002.eqiad.wmnet [11:22:26] 
FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [11:23:07] (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 100% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047046 (https://phabricator.wikimedia.org/T362323) [11:24:17] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1208.eqiad.wmnet with reason: host reimage [11:25:28] (03PS1) 10Clément Goubert: trafficserver: move 95% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1047047 (https://phabricator.wikimedia.org/T362323) [11:26:54] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1208.eqiad.wmnet with reason: host reimage [11:27:55] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 43 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:28:57] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-master1002.eqiad.wmnet [11:29:35] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [11:29:43] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [11:29:45] (03PS1) 10EoghanGaffney: lists: Remove service IPs from lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/1047049 (https://phabricator.wikimedia.org/T367521) [11:29:58] (03CR) 10EoghanGaffney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047049 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney) [11:31:57] (03CR) 10Jelto: "interface::alias will probably fail if we are aliasing the same address multiple 
times?" [puppet] - 10https://gerrit.wikimedia.org/r/1047049 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney) [11:32:59] (03CR) 10EoghanGaffney: "As with the comment in line above, it's a no-op when the variables are unset" [puppet] - 10https://gerrit.wikimedia.org/r/1047049 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney) [11:33:01] (03CR) 10Jelto: [C:03+1] "this looks good with the default in $list_outbound_ips (I missed them)" [puppet] - 10https://gerrit.wikimedia.org/r/1047049 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney) [11:33:05] (03PS1) 10Effie Mouzeli: mediawiki: switch to using the mw-mcrouter daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047050 (https://phabricator.wikimedia.org/T346690) [11:33:55] (03PS1) 10Muehlenhoff: idp::build: Make the rsync setup depend on the OS [puppet] - 10https://gerrit.wikimedia.org/r/1047051 (https://phabricator.wikimedia.org/T367487) [11:34:07] (03CR) 10CI reject: [V:04-1] idp::build: Make the rsync setup depend on the OS [puppet] - 10https://gerrit.wikimedia.org/r/1047051 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [11:34:10] (03CR) 10EoghanGaffney: [C:03+2] lists: Remove service IPs from lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/1047049 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney) [11:34:29] (03PS2) 10Muehlenhoff: idp::build: Make the rsync setup depend on the OS [puppet] - 10https://gerrit.wikimedia.org/r/1047051 (https://phabricator.wikimedia.org/T367487) [11:34:45] (03PS1) 10Hnowlan: kask: don't allocate service port twice when using mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047052 (https://phabricator.wikimedia.org/T363996) [11:35:01] (03Abandoned) 10Effie Mouzeli: mw-debug: point mediawiki to mw-mcrouter's clusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047043 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:35:10] 
eqiad mw-mcrouter has been recreated with the new hardcoded service IP btw, that above is to use it ^ [11:35:40] (03PS2) 10Effie Mouzeli: mediawiki: switch eqiad to use the mw-mcrouter daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047050 (https://phabricator.wikimedia.org/T346690) [11:36:22] (03CR) 10Hnowlan: [C:03+1] mw-web, mw-api-ext: Raise replicas for 100% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047046 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [11:36:36] (03PS1) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) [11:37:22] (03CR) 10Hnowlan: [C:04-1] trafficserver: move 95% of traffic to mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047047 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [11:37:23] (03CR) 10CI reject: [V:04-1] P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [11:37:23] (03PS2) 10Clément Goubert: trafficserver: move 100% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1047047 (https://phabricator.wikimedia.org/T362323) [11:37:23] (03CR) 10Clément Goubert: trafficserver: move 100% of traffic to mw-on-k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047047 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [11:37:53] (03CR) 10Clément Goubert: [C:03+1] kask: don't allocate service port twice when using mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047052 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [11:39:54] (03CR) 10Giuseppe Lavagetto: [C:03+1] kask: don't allocate service port twice when using mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047052 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [11:39:57] jouncebot: now 
[11:39:57] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [11:40:01] jouncebot: next [11:40:01] In 0 hour(s) and 19 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1200) [11:40:04] (03PS1) 10EoghanGaffney: lists: Update DNS records to use host IP for lists1004 [dns] - 10https://gerrit.wikimedia.org/r/1047054 (https://phabricator.wikimedia.org/T367521) [11:40:10] !log Rename ipblocks table on db1169 (enwiki) T367632 [11:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:14] T367632: Drop ipblocks in production - https://phabricator.wikimedia.org/T367632 [11:40:35] (03CR) 10Hnowlan: [C:03+1] "🚀🚀🚀" [puppet] - 10https://gerrit.wikimedia.org/r/1047047 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [11:41:00] (03CR) 10CI reject: [V:04-1] lists: Update DNS records to use host IP for lists1004 [dns] - 10https://gerrit.wikimedia.org/r/1047054 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney) [11:41:12] (03CR) 10Slyngshede: [C:03+1] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1047051 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [11:41:12] (03PS1) 10Jcrespo: dbbackups: Pause s3/db1240 snapshots until load completes [puppet] - 10https://gerrit.wikimedia.org/r/1047055 (https://phabricator.wikimedia.org/T367162) [11:41:16] (03PS1) 10Btullis: Update the contactgroups for all wdqs and wcqs servers [puppet] - 10https://gerrit.wikimedia.org/r/1047056 (https://phabricator.wikimedia.org/T365881) [11:41:30] (03PS3) 10Effie Mouzeli: mediawiki: switch eqiad to use the mw-mcrouter daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047050 (https://phabricator.wikimedia.org/T346690) [11:41:50] (03PS2) 10EoghanGaffney: lists: Update DNS records to use host IP for lists1004 [dns] - 10https://gerrit.wikimedia.org/r/1047054 (https://phabricator.wikimedia.org/T367521) [11:41:56] (03PS2) 10Jcrespo: dbbackups: Pause s3/db1240 snapshots until load completes [puppet] - 10https://gerrit.wikimedia.org/r/1047055 (https://phabricator.wikimedia.org/T367162) [11:42:10] FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [11:42:27] !log Delete ipblocks table on clouddb2002-dev (labtestwiki) T367632 [11:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:43] (03CR) 10CI reject: [V:04-1] lists: Update DNS records to use host IP for lists1004 [dns] - 10https://gerrit.wikimedia.org/r/1047054 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney) [11:42:53] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 27 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:43:10] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2949/console" [puppet] - 10https://gerrit.wikimedia.org/r/1047056 (https://phabricator.wikimedia.org/T365881) (owner: 10Btullis) [11:43:15] (03PS3) 10Jcrespo: dbbackups: Pause s3/db1240 snapshots until load completes [puppet] - 10https://gerrit.wikimedia.org/r/1047055 (https://phabricator.wikimedia.org/T367162) [11:43:57] (03PS2) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) [11:44:18] (03CR) 10CI reject: [V:04-1] P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [11:44:35] (03CR) 10Jcrespo: "FYI" [puppet] - 10https://gerrit.wikimedia.org/r/1047055 (https://phabricator.wikimedia.org/T367162) (owner: 10Jcrespo) [11:45:17] (03CR) 10Effie Mouzeli: [C:03+2] mediawiki: switch eqiad to use the mw-mcrouter daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047050 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:45:20] (03PS1) 10Dreamrimmer: Add VL namespace alias to Azerbaijani Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047057 (https://phabricator.wikimedia.org/T367264) [11:45:31] (03PS3) 10EoghanGaffney: lists: Update DNS records to use host IP for lists1004 [dns] - 10https://gerrit.wikimedia.org/r/1047054 (https://phabricator.wikimedia.org/T367521) [11:46:04] (03Abandoned) 10Alexandros Kosiaris: mw-mcrouter: Switch helmfile.d to use the newer cache module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038860 (owner: 10Alexandros Kosiaris) [11:46:10] (03PS3) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. 
[puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) [11:46:32] (03CR) 10CI reject: [V:04-1] P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [11:46:41] (03PS1) 10Hashar: Point its-phabricator to stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1047058 [11:47:01] (03Merged) 10jenkins-bot: mediawiki: switch eqiad to use the mw-mcrouter daemonset [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047050 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:47:27] (03CR) 10Jcrespo: [C:03+2] dbbackups: Pause s3/db1240 snapshots until load completes [puppet] - 10https://gerrit.wikimedia.org/r/1047055 (https://phabricator.wikimedia.org/T367162) (owner: 10Jcrespo) [11:47:29] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047057 (https://phabricator.wikimedia.org/T367264) (owner: 10Dreamrimmer) [11:47:58] (03CR) 10Jelto: [C:03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1047054 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney) [11:48:02] (03PS1) 10Jcrespo: Revert "dbbackups: Pause s3/db1240 snapshots until load completes" [puppet] - 10https://gerrit.wikimedia.org/r/1047059 [11:48:14] (03CR) 10Jcrespo: [C:04-1] "Not yet." [puppet] - 10https://gerrit.wikimedia.org/r/1047059 (owner: 10Jcrespo) [11:48:31] (03CR) 10EoghanGaffney: [C:03+2] lists: Switch DB firewall rules to use primary host variable [puppet] - 10https://gerrit.wikimedia.org/r/1046785 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [11:48:36] (03PS4) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. 
[puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) [11:48:51] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1208.eqiad.wmnet with OS bookworm [11:48:53] (03CR) 10EoghanGaffney: [C:03+2] lists: Update DNS records to use host IP for lists1004 [dns] - 10https://gerrit.wikimedia.org/r/1047054 (https://phabricator.wikimedia.org/T367521) (owner: 10EoghanGaffney) [11:49:15] (03CR) 10Hnowlan: [C:03+2] kask: don't allocate service port twice when using mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047052 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [11:50:12] (03Merged) 10jenkins-bot: kask: don't allocate service port twice when using mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047052 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [11:50:32] !log eoghan@cumin1002 START - Cookbook sre.dns.wipe-cache lists.wikimedia.org on all recursors [11:50:35] !log eoghan@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) lists.wikimedia.org on all recursors [11:51:51] (03PS1) 10Muehlenhoff: Deprecate system::role for DE test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1047060 [11:53:32] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:53:40] (03PS4) 10Dreamrimmer: maiwiki: Remove 'CA' namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031533 (https://phabricator.wikimedia.org/T363667) [11:53:52] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [11:54:00] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [11:54:12] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:55:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031533 (https://phabricator.wikimedia.org/T363667) (owner: 10Dreamrimmer) [11:56:49] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [11:57:16] (03CR) 10Muehlenhoff: [C:03+2] idp::build: Make the rsync setup depend on the OS [puppet] - 10https://gerrit.wikimedia.org/r/1047051 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [11:58:00] !log Slowly pointing mediawiki in eqiad to mw-mcrouter daemonset - T346690 [11:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:05] T346690: mcrouter daemonset on mw-on-k8s - https://phabricator.wikimedia.org/T346690 [11:58:18] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:59:42] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [11:59:50] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [11:59:51] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1200) [12:00:54] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [12:01:11] !incidents [12:01:11] 4757 (ACKED) Host db1165 (paged) - PING - Packet loss = 100% [12:03:53] (03PS1) 10Slyngshede: R:idp_test MPIC went away. [labs/private] - 10https://gerrit.wikimedia.org/r/1047062 [12:04:11] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:04:46] (03CR) 10Slyngshede: "Triggers PCC error, due to the remaining service configuration being missing." 
[labs/private] - 10https://gerrit.wikimedia.org/r/1047062 (owner: 10Slyngshede) [12:04:48] !log adding Netbox-generated IPv6 DNS records for wikikube-worker, mw and parse hosts [12:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:08] (03PS1) 10Muehlenhoff: cas::build: Fix creation of build directory [puppet] - 10https://gerrit.wikimedia.org/r/1047063 (https://phabricator.wikimedia.org/T367487) [12:05:25] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add IPv6 records for mw, parse and wikikube-worker hosts - cmooney@cumin1002" [12:05:30] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:05:35] (03CR) 10Muehlenhoff: [C:03+1] R:idp_test MPIC went away. [labs/private] - 10https://gerrit.wikimedia.org/r/1047062 (owner: 10Slyngshede) [12:06:05] (03CR) 10Slyngshede: [C:03+2] R:idp_test MPIC went away. [labs/private] - 10https://gerrit.wikimedia.org/r/1047062 (owner: 10Slyngshede) [12:06:08] (03CR) 10Slyngshede: [V:03+2 C:03+2] R:idp_test MPIC went away. 
[labs/private] - 10https://gerrit.wikimedia.org/r/1047062 (owner: 10Slyngshede) [12:06:34] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add IPv6 records for mw, parse and wikikube-worker hosts - cmooney@cumin1002" [12:06:34] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:07:09] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [12:08:33] (03CR) 10Muehlenhoff: [C:03+2] cas::build: Fix creation of build directory [puppet] - 10https://gerrit.wikimedia.org/r/1047063 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [12:14:19] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e2-eqiad - https://phabricator.wikimedia.org/T365994#9903045 (10ABran-WMF) [12:14:33] PROBLEM - mailman3_runners on lists1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:14:37] PROBLEM - mailman3 on lists1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman3/bin/master https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:14:43] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:14:43] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:14:50] Mailman errors are me, silencing again for a bit. 
[12:15:02] !log eoghan@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on lists[1001,1004,2001].wikimedia.org with reason: Mailman migration [12:15:17] !log eoghan@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lists[1001,1004,2001].wikimedia.org with reason: Mailman migration [12:15:25] FIRING: SystemdUnitFailed: ferm.service on kubernetes2056:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:15:30] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9903046 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=33783771-f385-4d8a-9005-972d... [12:16:27] (03PS5) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) [12:18:09] (03CR) 10Slyngshede: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [12:18:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 1.007 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:18:35] RECOVERY - mailman3 on lists1004 is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman3/bin/master https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:19:51] (03CR) 10Vgutierrez: [C:04-1] "to be consistent with the general hiera structure please use ulsfo/profile/cache/haproxy.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [12:20:25] FIRING: [2x] SystemdUnitFailed: ferm.service on kubernetes2056:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:20:27] (03CR) 10Muehlenhoff: P:idp Allow upgrade to Tomcat 10. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [12:21:27] (03CR) 10Brouberol: [C:03+1] Deprecate system::role for DE test cluster [puppet] - 10https://gerrit.wikimedia.org/r/1047060 (owner: 10Muehlenhoff) [12:22:26] !log rebalance ganeti eqiad/D following reboots [12:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:32] (03PS1) 10Slyngshede: P:idp::build Add fakeroot build dependency. [puppet] - 10https://gerrit.wikimedia.org/r/1047064 (https://phabricator.wikimedia.org/T367487) [12:23:54] (03CR) 10CI reject: [V:04-1] P:idp::build Add fakeroot build dependency. [puppet] - 10https://gerrit.wikimedia.org/r/1047064 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [12:24:59] (03CR) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [12:25:25] FIRING: [3x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:28:58] (03PS1) 10Effie Mouzeli: Revert "mediawiki: switch eqiad to use the mw-mcrouter daemonset" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047065 [12:29:06] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2003.codfw.wmnet [12:30:25] FIRING: [4x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:31:35] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52197 bytes in 0.240 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:31:45] (03CR) 10Effie Mouzeli: [C:03+2] Revert "mediawiki: switch eqiad to use the mw-mcrouter daemonset" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047065 (owner: 10Effie Mouzeli) [12:32:39] (03PS1) 10Slyngshede: Update Debian package dependencies for CAS 7.X [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487) [12:33:04] (03Abandoned) 10Slyngshede: P:idp::build Add fakeroot build dependency. 
[puppet] - 10https://gerrit.wikimedia.org/r/1047064 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [12:33:25] (03Merged) 10jenkins-bot: Revert "mediawiki: switch eqiad to use the mw-mcrouter daemonset" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047065 (owner: 10Effie Mouzeli) [12:33:41] (03PS4) 10Fabfur: hiera: upgrade haproxy to 2.8 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) [12:33:50] (03CR) 10Muehlenhoff: [C:03+1] P:idp Allow upgrade to Tomcat 10. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [12:34:27] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [12:34:42] (03CR) 10Fabfur: "ack tnx" [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [12:35:14] (03PS6) 10Slyngshede: P:idp Allow upgrade to Tomcat 10. [puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) [12:35:16] (03CR) 10Stevemunene: [C:03+1] ATS: replace service by discovery record for datahub-next [puppet] - 10https://gerrit.wikimedia.org/r/1047040 (https://phabricator.wikimedia.org/T367768) (owner: 10Brouberol) [12:35:25] FIRING: [6x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:35:37] (03CR) 10CI reject: [V:04-1] P:idp Allow upgrade to Tomcat 10. 
[puppet] - 10https://gerrit.wikimedia.org/r/1047053 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [12:35:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2003.codfw.wmnet [12:36:42] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [12:37:25] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 43.04 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:40:11] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [12:40:25] FIRING: [7x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:42:28] (03CR) 10Vgutierrez: [C:03+1] hiera: upgrade haproxy to 2.8 on ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [12:42:29] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9903125 (10eoghan) [12:42:32] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [12:42:35] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [12:42:52] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9903129 (10eoghan) [12:42:56] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [12:42:58] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [12:43:21] (03CR) 10Fabfur: [C:03+2] hiera: upgrade haproxy to 2.8 on 
ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1047039 (https://phabricator.wikimedia.org/T367756) (owner: 10Fabfur) [12:44:05] (03CR) 10EoghanGaffney: [C:03+2] lists: Allow mail to be received on lists1004 [puppet] - 10https://gerrit.wikimedia.org/r/1046786 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [12:45:25] FIRING: [13x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:46:37] (03CR) 10Jforrester: "We could change to the powered-by-Wikimedia one that won't change?" [puppet] - 10https://gerrit.wikimedia.org/r/1047034 (https://phabricator.wikimedia.org/T256190) (owner: 10Ladsgroup) [12:47:05] (03CR) 10Elukey: Prepare for netbox-dev (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [12:47:05] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:47:23] (03PS1) 10Vgutierrez: hiera: Set fifo-log-demux prometheus port for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1047070 (https://phabricator.wikimedia.org/T364383) [12:47:24] !log upgrade haproxy to v2.8.10 on all ulsfo cp hosts (T367756) [12:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:28] T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756 [12:48:06] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:49:00] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047070 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [12:49:18] (03PS2) 10Slyngshede: Update Debian package dependencies for CAS 7.X [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487) [12:49:26] 
(03PS1) 10Marostegui: Revert^2 "db1160: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1047071 [12:49:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1160', diff saved to https://phabricator.wikimedia.org/P65155 and previous config saved to /var/cache/conftool/dbconfig/20240618-124945-root.json [12:50:25] FIRING: [14x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:50:28] (03CR) 10Marostegui: [C:03+2] Revert^2 "db1160: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1047071 (owner: 10Marostegui) [12:51:07] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo [12:51:51] !log Deploy schema change on old s4 eqiad master db1160 dbmaint T364069 [12:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:55] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [12:52:06] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo [12:52:57] (03CR) 10Muehlenhoff: Update Debian package dependencies for CAS 7.X (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [12:53:31] !log disable puppet on A:cp-eqsin before merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047070 - T364383 [12:53:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:36] T364383: Update fifo_log_demux puppet module to support new parameters - https://phabricator.wikimedia.org/T364383 [12:53:55] (03CR) 10Muehlenhoff: Update Debian package dependencies for CAS 7.X (031 comment) [software/cas-overlay-template] - 
10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [12:54:46] (03PS1) 10Ssingh: install_server: update NTP server anycast address for d-i [puppet] - 10https://gerrit.wikimedia.org/r/1047073 (https://phabricator.wikimedia.org/T366360) [12:55:17] (03CR) 10Vgutierrez: [C:03+2] hiera: Set fifo-log-demux prometheus port for eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1047070 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [12:55:25] FIRING: [15x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:55:56] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2950/console" [puppet] - 10https://gerrit.wikimedia.org/r/1047073 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [12:56:06] (03PS1) 10Ssingh: wikimedia.org: switch ntp.$site to ntp-a.anycast.wmnet [dns] - 10https://gerrit.wikimedia.org/r/1047074 (https://phabricator.wikimedia.org/T366360) [12:56:48] !log rolling upgrade on A:cp-eqsin to fifo-log-demux 0.7.5 - T364383 [12:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:15] (03PS1) 10Alexandros Kosiaris: mcrouter: Temporarily disable in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047075 (https://phabricator.wikimedia.org/T346690) [12:58:53] PROBLEM - Check whether ferm is active by checking the default input chain on wikikube-ctrl1003 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:59:15] (03PS3) 10Ssingh: config/common: update list of ntp_servers to use anycast NTP servers [homer/public] - 
10https://gerrit.wikimedia.org/r/1046737 (https://phabricator.wikimedia.org/T366360) [12:59:21] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [12:59:26] (03CR) 10Ssingh: config/common: update list of ntp_servers to use anycast NTP servers (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/1046737 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1300). [13:00:05] DreamRimmer: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] (03CR) 10Slyngshede: [C:03+2] Update Debian package dependencies for CAS 7.X [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [13:00:18] (03CR) 10Slyngshede: [C:03+2] Update Debian package dependencies for CAS 7.X (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [13:00:22] (03CR) 10Slyngshede: [V:03+2 C:03+2] Update Debian package dependencies for CAS 7.X [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1047066 (https://phabricator.wikimedia.org/T367487) (owner: 10Slyngshede) [13:00:25] FIRING: [14x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:01:03] (03CR) 10Alexandros Kosiaris: 
[C:03+2] mcrouter: Temporarily disable in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047075 (https://phabricator.wikimedia.org/T346690) (owner: 10Alexandros Kosiaris) [13:01:05] (03CR) 10Ssingh: "Thanks for doing that! CR updated for the comment below. This will be merged later but I wanted to get the reviews in first." [homer/public] - 10https://gerrit.wikimedia.org/r/1046737 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [13:01:35] o/ [13:01:50] (03CR) 10Arnaudb: mariadb: bugfixes mysql_legacy (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: 10Arnaudb) [13:02:21] I am around [13:02:27] (03Merged) 10jenkins-bot: mcrouter: Temporarily disable in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047075 (https://phabricator.wikimedia.org/T346690) (owner: 10Alexandros Kosiaris) [13:02:48] (03PS6) 10Ayounsi: Prepare for netbox-dev [puppet] - 10https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275) [13:02:50] any other deployers around? 
I have a meeting in 30 minutes, so I’m not sure I’ll be able to deploy both changes [13:04:00] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-coord1004.eqiad.wmnet [13:04:55] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: sync [13:05:03] well, let’s start with the azwiktionary namespace alias then [13:05:22] Lucas_WMDE: gimme 30 seconds, I am disabling the mcrouter stuff in codfw [13:05:25] FIRING: [14x] SystemdUnitFailed: ferm.service on kubernetes1019:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:05:28] akosiaris: ack [13:05:34] the patch needs an update anyway, I just noticed [13:06:05] (03CR) 10Lucas Werkmeister (WMDE): [C:04-1] Add VL namespace alias to Azerbaijani Wiktionary (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047057 (https://phabricator.wikimedia.org/T367264) (owner: 10Dreamrimmer) [13:06:12] DreamRimmer: ^ [13:06:20] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: sync [13:06:21] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: sync [13:06:37] RECOVERY - Host elastic2099 is UP: PING WARNING - Packet loss = 80%, RTA = 30.36 ms [13:07:01] yeah [13:07:11] (#RandomWikiLove for diffConfig, amazing feature to have :)) [13:07:35] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: sync [13:07:36] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: sync [13:07:47] (03PS1) 10Vgutierrez: hiera,openldap::replica: Enable IPIP on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1047076 (https://phabricator.wikimedia.org/T367861) [13:07:57] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: sync [13:07:58] !log akosiaris@deploy1002 helmfile [codfw] START 
helmfile.d/services/mw-web: sync [13:08:28] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047076 (https://phabricator.wikimedia.org/T367861) (owner: 10Vgutierrez) [13:08:46] doing mw-web now, should be done pretty soon [13:09:14] (03PS1) 10Jforrester: Use isEnumType in selector and isCustomEnum for creating literals [extensions/WikiLambda] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1047077 (https://phabricator.wikimedia.org/T367159) [13:09:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:09:24] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: sync [13:09:25] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: sync [13:10:24] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-coord1004.eqiad.wmnet [13:10:25] FIRING: [14x] SystemdUnitFailed: ferm.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:10:45] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: sync [13:11:15] (03CR) 10Ssingh: dnsbox: announce ntp-[abc].anycast.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1046685 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [13:12:22] Lucas_WMDE: I am done [13:12:30] thanks for your patience [13:12:32] hnowlan: We're finished with mailman now, FYI [13:12:38] np, I’m still waiting for the new patch set anyway :) [13:13:01] PROBLEM - Host elastic2099 is DOWN: PING CRITICAL - Packet 
loss = 100% [13:13:58] (03PS3) 10Alexandros Kosiaris: mw-mcrouter: add ClusterIP for codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047032 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [13:14:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:14:29] (03PS2) 10Dreamrimmer: Add VL namespace alias to Azerbaijani Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047057 (https://phabricator.wikimedia.org/T367264) [13:14:52] (03CR) 10Vgutierrez: "we need to depool ldap-ro & ldap-ro-ssl on codfw before proceeding with this CR" [puppet] - 10https://gerrit.wikimedia.org/r/1047076 (https://phabricator.wikimedia.org/T367861) (owner: 10Vgutierrez) [13:15:15] (03CR) 10Dreamrimmer: Add VL namespace alias to Azerbaijani Wiktionary (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047057 (https://phabricator.wikimedia.org/T367264) (owner: 10Dreamrimmer) [13:15:25] FIRING: [18x] SystemdUnitFailed: ferm.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:15:42] Lucas_WMDE: done [13:15:54] eoghan: ack, thanks! 
[13:15:57] (03CR) 10Ayounsi: [C:03+1] "Cool, thx for the explanation" [puppet] - 10https://gerrit.wikimedia.org/r/1046685 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [13:16:16] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-mariadb1002.eqiad.wmnet [13:16:17] looking [13:16:20] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [13:16:22] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [13:16:36] (03CR) 10Ayounsi: [C:03+1] "lgtm !" [homer/public] - 10https://gerrit.wikimedia.org/r/1046737 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [13:16:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047057 (https://phabricator.wikimedia.org/T367264) (owner: 10Dreamrimmer) [13:16:50] let’s see if it finishes within 14 minutes… [13:17:08] (and I need to remember to also run that maintenance script afterwards) [13:17:14] (namespaceDupes) [13:17:18] (03Merged) 10jenkins-bot: Add VL namespace alias to Azerbaijani Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047057 (https://phabricator.wikimedia.org/T367264) (owner: 10Dreamrimmer) [13:17:33] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T367648#9903242 (10phaultfinder) [13:17:35] (03CR) 10Alexandros Kosiaris: [C:03+2] mw-mcrouter: add ClusterIP for codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047032 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [13:17:49] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1047057|Add VL namespace alias to Azerbaijani Wiktionary (T367264)]] [13:17:54] T367264: Add "VL" namespace alias to Azerbaijani Wiktionary - https://phabricator.wikimedia.org/T367264 [13:18:34] (03Merged) 10jenkins-bot: mw-mcrouter: add ClusterIP for codfw 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1047032 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [13:19:52] !log btullis@cumin1002 START - Cookbook sre.hosts.remove-downtime for db1208.eqiad.wmnet [13:19:53] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for db1208.eqiad.wmnet [13:20:25] FIRING: [17x] SystemdUnitFailed: ferm.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:21:09] (03PS4) 10Cathal Mooney: Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) [13:21:25] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 308.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:22:22] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, dreamrimmer: Backport for [[gerrit:1047057|Add VL namespace alias to Azerbaijani Wiktionary (T367264)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:22:41] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-mariadb1002.eqiad.wmnet [13:22:58] DreamRimmer: can you test? 
[13:23:01] (looks good to me so far) [13:23:15] (03CR) 10Ayounsi: [C:03+1] Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney) [13:23:16] doing [13:23:32] working [13:23:40] go for it [13:23:45] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde, dreamrimmer: Continuing with sync [13:23:46] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:23:47] ok! [13:25:13] (03PS8) 10Btullis: Add WMF customisations to the upstream ceph-csi-rbd chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) [13:25:13] (03PS12) 10Btullis: Deploy the ceph-csi-rbd chart to dse-k8s with default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028938 (https://phabricator.wikimedia.org/T364472) [13:25:13] (03PS18) 10Btullis: Add a values file for the ceph-csi plugin on dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031589 (https://phabricator.wikimedia.org/T327259) [13:25:25] FIRING: [15x] SystemdUnitFailed: ferm.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:26:47] (03CR) 10Btullis: Add WMF customisations to the upstream ceph-csi-rbd chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028932 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [13:28:03] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [13:28:10] !log akosiaris@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mw-mcrouter: apply [13:28:53] RECOVERY - Check whether ferm is active by checking the default input chain on wikikube-ctrl1003 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:29:30] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-db1002.eqiad.wmnet [13:30:25] FIRING: [9x] SystemdUnitFailed: ferm.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:31:29] (03PS1) 10Alexandros Kosiaris: Partially revert "mcrouter: Temporarily disable in codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047080 [13:32:01] (03PS2) 10Alexandros Kosiaris: Partially revert "mcrouter: Temporarily disable in codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047080 [13:32:25] (03PS3) 10Alexandros Kosiaris: Partially revert "mcrouter: Temporarily disable in codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047080 [13:32:28] (03PS1) 10Elukey: WIP: alternative proposal for netbox dev refactor [puppet] - 10https://gerrit.wikimedia.org/r/1047081 [13:33:57] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1047057|Add VL namespace alias to Azerbaijani Wiktionary (T367264)]] (duration: 16m 07s) [13:34:02] T367264: Add "VL" namespace alias to Azerbaijani Wiktionary - https://phabricator.wikimedia.org/T367264 [13:34:12] !log lucaswerkmeister-wmde@mwmaint1002:~$ mwscript namespaceDupes azwiktionary --fix # T367264; 7 pages fixed, 10 links fixed [13:34:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:18] also, why did scap exit nonzero? [13:34:31] ah, mw2321 failed to docker pull. doesn’t matter then [13:34:48] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2951/co" [puppet] - 10https://gerrit.wikimedia.org/r/1047081 (owner: 10Elukey) [13:35:02] (03CR) 10Alexandros Kosiaris: [C:03+2] Partially revert "mcrouter: Temporarily disable in codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047080 (owner: 10Alexandros Kosiaris) [13:35:09] * Lucas_WMDE afk [13:35:18] if someone else can deploy the other config change that’d be great… [13:35:25] FIRING: [14x] SystemdUnitFailed: ferm.service on kubernetes1025:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:35:43] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-db1002.eqiad.wmnet [13:36:22] Lucas_WMDE: Thanks for you valuable time :) [13:36:34] (03Merged) 10jenkins-bot: Partially revert "mcrouter: Temporarily disable in codfw" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047080 (owner: 10Alexandros Kosiaris) [13:37:05] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe1002.eqiad.wmnet with OS bookworm [13:37:18] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903331 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe1002.eqiad.wmnet with OS bookworm [13:39:12] (03PS1) 10Muehlenhoff: Remove obsolete profile::java::java_8 [puppet] - 10https://gerrit.wikimedia.org/r/1047083 
[13:39:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:39:55] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo [13:40:11] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo [13:40:11] (03CR) 10Arnaudb: [C:03+1] Move moss-fe{1,2}001 back to apus cluster [puppet] - 10https://gerrit.wikimedia.org/r/1047033 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [13:40:20] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9903343 (10Jhancock.wm) The serial number is just a barcode. There's nothing else on that label. I've looked over the guide and I don't see anything in particular that s... [13:40:25] RESOLVED: [14x] SystemdUnitFailed: ferm.service on kubernetes1025:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:41:44] (03PS2) 10Btullis: Update the contactgroups for all wdqs and wcqs servers [puppet] - 10https://gerrit.wikimedia.org/r/1047056 (https://phabricator.wikimedia.org/T365881) [13:41:45] (03PS1) 10Btullis: Remove conda repository from reprepro configuration [puppet] - 10https://gerrit.wikimedia.org/r/1047085 (https://phabricator.wikimedia.org/T364550) [13:42:23] 06SRE, 06collaboration-services, 06DBA: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9903360 (10Ladsgroup) This is done I think but then maybe we should drop the grant on lists1001 then? 
[13:43:40] (03CR) 10Btullis: "Once this is removed, we will still have to cleanup reprepro by hand, as per: https://wikitech.wikimedia.org/wiki/Reprepro#Removing_a_comp" [puppet] - 10https://gerrit.wikimedia.org/r/1047085 (https://phabricator.wikimedia.org/T364550) (owner: 10Btullis) [13:44:58] (03PS1) 10Muehlenhoff: profile::java: Add support for Java 21 [puppet] - 10https://gerrit.wikimedia.org/r/1047086 (https://phabricator.wikimedia.org/T367487) [13:45:15] (03PS2) 10Elukey: WIP: alternative proposal for netbox dev refactor [puppet] - 10https://gerrit.wikimedia.org/r/1047081 [13:45:35] (03CR) 10Arnaudb: [C:03+1] mariadb: Update production mysql grants with unix_socket & heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/868392 (owner: 10Jcrespo) [13:45:52] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [13:46:02] 06SRE, 06collaboration-services, 06DBA: Update grants for mailman - https://phabricator.wikimedia.org/T367833#9903367 (10Marostegui) >>! In T367833#9903360, @Ladsgroup wrote: > This is done I think but then maybe we should drop the grant on lists1001 then? +1 - we should review puppet grants in case we ment... 
[13:46:07] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1047085 (https://phabricator.wikimedia.org/T364550) (owner: 10Btullis) [13:46:17] (03CR) 10Arnaudb: [C:03+1] mariadb: removes underscore on striker database name [puppet] - 10https://gerrit.wikimedia.org/r/1020709 (https://phabricator.wikimedia.org/T360149) (owner: 10Arnaudb) [13:46:31] (03PS1) 10Eevans: restbase: upgrade cluster to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/1047087 (https://phabricator.wikimedia.org/T350567) [13:46:41] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2952/co" [puppet] - 10https://gerrit.wikimedia.org/r/1047081 (owner: 10Elukey) [13:47:12] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [13:47:33] (03CR) 10Muehlenhoff: [C:03+1] "Let's also remove modules/aptrepo/files/updates-keys/*_conda.gpg, though." [puppet] - 10https://gerrit.wikimedia.org/r/1047085 (https://phabricator.wikimedia.org/T364550) (owner: 10Btullis) [13:47:48] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: sync [13:47:55] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync [13:49:13] (03CR) 10Jcrespo: [C:03+1] mariadb: removes underscore on striker database name [puppet] - 10https://gerrit.wikimedia.org/r/1020709 (https://phabricator.wikimedia.org/T360149) (owner: 10Arnaudb) [13:49:30] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [13:49:30] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-fe1002.eqiad.wmnet with OS bookworm [13:49:39] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903409 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host 
moss-fe1002.eqiad.wmnet with OS bookworm executed with errors: - moss-fe1002 (... [13:49:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1047086 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [13:49:54] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-fe1002.eqiad.wmnet with OS bookworm [13:50:04] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903416 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-fe1002.eqiad.wmnet with OS bookworm [13:50:58] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [13:50:59] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [13:51:28] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [13:51:42] (03CR) 10Brouberol: [C:03+2] ATS: replace service by discovery record for datahub-next [puppet] - 10https://gerrit.wikimedia.org/r/1047040 (https://phabricator.wikimedia.org/T367768) (owner: 10Brouberol) [13:52:22] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [13:52:22] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [13:52:22] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [13:52:22] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [13:52:23] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [13:52:26] RESOLVED: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:52:29] !log akosiaris@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/mw-wikifunctions: apply [13:52:29] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [13:52:29] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [13:52:58] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1014 - https://phabricator.wikimedia.org/T362841#9903425 (10Eevans) [13:52:59] 07Puppet, 06Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113#9903423 (10CDanis) I think the last step to do here is to validate that any rsync failures will get reported on IRC... [13:53:53] (03PS1) 10Muehlenhoff: Deprecate system::role for Cloud VPS-specific roles [puppet] - 10https://gerrit.wikimedia.org/r/1047090 [13:54:17] (03PS2) 10Arnaudb: mariadb: prometheus config tweak for db1125 [puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278) [13:54:19] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [13:54:26] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [13:55:01] (03CR) 10Arnaudb: "like Patchset 2?" [puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [13:56:06] (03CR) 10Ssingh: "Looks good, one comment inline:" [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall) [13:57:23] (03CR) 10Ladsgroup: "yup!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [13:57:43] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [13:57:43] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [13:57:46] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [13:57:50] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: sync [13:57:57] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync [14:02:30] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be1001.eqiad.wmnet with OS bookworm [14:02:44] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903458 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be1001.eqiad.wmnet with OS bookworm [14:03:23] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-fe1002.eqiad.wmnet with reason: host reimage [14:05:19] (03CR) 10Volans: "I've tried to explain my suggestion with some suggested edit. LMK if it's more clear now." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1043753 (https://phabricator.wikimedia.org/T367496) (owner: 10Arnaudb) [14:06:39] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-fe1002.eqiad.wmnet with reason: host reimage [14:07:15] (03CR) 10Elukey: [C:03+1] Remove obsolete profile::java::java_8 [puppet] - 10https://gerrit.wikimedia.org/r/1047083 (owner: 10Muehlenhoff) [14:08:50] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9903462 (10Kgraessle) [14:09:33] !log included conftool 3.0.0 into buster-wikimedia on apt.w.o for T365123 [14:09:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:38] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123 [14:10:31] (03PS3) 10JMeybohm: helmfile_psp: Remove seccomp/apparmor mutations from PSP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020313 (https://phabricator.wikimedia.org/T273507) [14:11:09] jayme: wow it is happening [14:12:56] (03CR) 10Elukey: "Eric should we do it or do we wait for the mesh changes?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1025791 (https://phabricator.wikimedia.org/T352647) (owner: 10Eevans) [14:13:38] (03PS1) 10EoghanGaffney: lists: Add symlink to /var/lib/mailman3 when using different root [puppet] - 10https://gerrit.wikimedia.org/r/1047094 (https://phabricator.wikimedia.org/T331706) [14:15:05] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2953/co" [puppet] - 10https://gerrit.wikimedia.org/r/1047094 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [14:16:07] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete profile::java::java_8 [puppet] - 10https://gerrit.wikimedia.org/r/1047083 (owner: 10Muehlenhoff) [14:16:50] (03Abandoned) 10Vgutierrez: hiera: Set prometheus port on fifo-log-demux@cp4044 [puppet] - 10https://gerrit.wikimedia.org/r/1029213 (https://phabricator.wikimedia.org/T364383) (owner: 10Vgutierrez) [14:17:43] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [14:17:46] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [14:17:49] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: sync [14:18:06] * Lucas_WMDE back fwiw [14:18:58] (03CR) 10Clément Goubert: [C:03+2] mw-web, mw-api-ext: Raise replicas for 100% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047046 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [14:19:51] (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Raise replicas for 100% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047046 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [14:19:55] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1001.eqiad.wmnet with reason: host reimage [14:20:16] !log hnowlan@deploy1002 helmfile [staging] START 
helmfile.d/services/sessionstore: sync [14:20:18] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync [14:20:26] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [14:20:35] (03CR) 10Brennen Bearnes: [C:03+2] AVA: Check earlier if acting user is admin [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039766 (https://phabricator.wikimedia.org/T366811) (owner: 10Aklapper) [14:20:37] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [14:20:38] (03CR) 10Brennen Bearnes: [V:03+2 C:03+2] AVA: Check earlier if acting user is admin [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039766 (https://phabricator.wikimedia.org/T366811) (owner: 10Aklapper) [14:20:44] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [14:20:51] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [14:21:09] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [14:21:28] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [14:21:34] (03PS4) 10Aklapper: Count user transactions in Maniphest only in last two million rows [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039786 (https://phabricator.wikimedia.org/T366811) [14:21:34] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [14:21:43] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [14:21:45] (03CR) 10Brennen Bearnes: [C:03+2] Count user transactions in Maniphest only in last two million rows [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039786 (https://phabricator.wikimedia.org/T366811) (owner: 10Aklapper) [14:21:46] (03CR) 10Brennen Bearnes: [V:03+2 C:03+2] Count user transactions in Maniphest only in last two million rows 
[phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039786 (https://phabricator.wikimedia.org/T366811) (owner: 10Aklapper) [14:22:09] (03PS2) 10Aklapper: Limit querying latest user transactions in Maniphest to recent IDs [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039791 (https://phabricator.wikimedia.org/T366811) [14:22:10] RESOLVED: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:22:15] (03CR) 10Brennen Bearnes: [C:03+2] Limit querying latest user transactions in Maniphest to recent IDs [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039791 (https://phabricator.wikimedia.org/T366811) (owner: 10Aklapper) [14:22:17] (03CR) 10Brennen Bearnes: [V:03+2 C:03+2] Limit querying latest user transactions in Maniphest to recent IDs [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039791 (https://phabricator.wikimedia.org/T366811) (owner: 10Aklapper) [14:22:39] 07Puppet, 06Data-Persistence, 10database-backups: Possible weird interaction between es backups and puppet runs leading to failures - https://phabricator.wikimedia.org/T367882#9903563 (10jcrespo) [14:22:51] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1001.eqiad.wmnet with reason: host reimage [14:23:05] (03CR) 10Cathal Mooney: [C:03+2] Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney) [14:23:13] !log btullis@cumin1002 START - Cookbook sre.presto.reboot-workers for Presto an-presto cluster: 
Reboot Presto nodes [14:23:42] (03CR) 10Giuseppe Lavagetto: [C:03+2] "It is a short click for a man, a huge leap for mankind." [puppet] - 10https://gerrit.wikimedia.org/r/1047047 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [14:23:50] Here we go people [14:24:14] !log trafficserver: move 100% of traffic to mw-on-k8s - T362323 [14:24:14] 07Puppet, 06Data-Persistence, 10database-backups: Possible weird interaction between es backups and puppet runs leading to failures - https://phabricator.wikimedia.org/T367882#9903572 (10jcrespo) [14:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:18] T362323: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323 [14:24:21] :O :O :O [14:24:27] * arnaudb holds his breath [14:24:32] wow [14:24:33] kudos [14:24:34] <_joe_> claime: merged [14:25:03] https://grafana.wikimedia.org/goto/FATzf8UIg?orgId=1 [14:25:14] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9903575 (10Jhancock.wm) [14:25:19] (03Merged) 10jenkins-bot: Include vlans with an IRB int in device vlans even if not on L2 port [software/homer] - 10https://gerrit.wikimedia.org/r/1037773 (https://phabricator.wikimedia.org/T366348) (owner: 10Cathal Mooney) [14:25:24] Then it's into the logs to find what's still calling the bare metal cluster :p [14:25:40] hehe [14:25:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9903577 (10Jhancock.wm) [14:26:45] (03CR) 10Vgutierrez: [C:03+1] "looks good to me, please see inline comments" [puppet] - 10https://gerrit.wikimedia.org/r/1042278 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [14:27:19] <_joe_> claime: are you running puppet on the cp hosts or should I? 
[14:27:44] _joe_: can do [14:27:55] <_joe_> claime: no doing it myself [14:28:06] <_joe_> I wanted to be sure I wasn't stepping on your toes [14:28:07] jerk :p [14:28:23] (I usually let it roll out on its own) [14:29:31] (03PS3) 10Arnaudb: mariadb: prometheus config tweak for db1125 [puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278) [14:30:15] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1046712 (https://phabricator.wikimedia.org/T367278) (owner: 10Arnaudb) [14:31:42] * Lucas_WMDE watches line go up [14:32:36] <_joe_> I prefer to watch the phys hosts line go down [14:32:38] <_joe_> :D [14:32:51] https://grafana.wikimedia.org/goto/W8wof8USR?orgId=1 [14:32:57] This one [14:33:05] <_joe_> yep [14:33:09] <_joe_> some baseline will remain [14:33:11] heh, looks much more significant there \o/ [14:33:14] <_joe_> and that's LVS checks [14:33:35] yes, and also probably some remnants from somewhere internal [14:33:44] (03PS1) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/1047097 [14:34:19] <_joe_> claime: also there's at least one cp host with puppet disabled I'd say [14:34:26] (03PS3) 10Hnowlan: service: add basic config for shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1043724 (https://phabricator.wikimedia.org/T357309) [14:34:26] (03PS1) 10Hnowlan: services_proxy: add shellbox-video listener [puppet] - 10https://gerrit.wikimedia.org/r/1047098 (https://phabricator.wikimedia.org/T357309) [14:34:32] FIRING: [2x] UdpMxIrcEchoThroughput: irc1002:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput [14:34:38] oh? 
[14:34:42] (03PS1) 10Superzerocool: cswiki: adding throttle rule, removing old throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047099 (https://phabricator.wikimedia.org/T367858) [14:34:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2003.codfw.wmnet with OS bookworm [14:34:57] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9903608 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ml-staging2003.codfw.wmnet with OS boo... [14:34:58] <_joe_> uhhh [14:35:17] <_joe_> I fear irc is related to k8s [14:35:21] (03CR) 10CI reject: [V:04-1] CHANGELOG: add changelogs for release v0.6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/1047097 (owner: 10Cathal Mooney) [14:35:53] that fired yesterday also I believe [14:35:58] yeah [14:36:18] <_joe_> oh maybe we're only sending phys hosts to irc1002? [14:36:24] <_joe_> irc1001 seems to be fine [14:36:42] the rules are here [14:36:50] !log enabling puppet and running puppet agent on cp4037 [14:36:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:23] _joe_: mediawiki-config says it's not active/active? 
[14:37:37] <_joe_> claime: it shouldn't be from my memory, yes [14:37:49] (03PS2) 10Cathal Mooney: CHANGELOG: add changelogs for release v0.6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/1047097 [14:38:46] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:00] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9903617 (10klausman) Machine is drained and off, so you're free to reseat memory etc. Let me know when it's back (and what we might... [14:39:12] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1040.eqiad.wmnet with reason: T365984 [14:39:17] T365984: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984 [14:39:26] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: T365984 [14:39:28] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4046.ulsfo.wmnet [14:39:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1040 depool - T365984', diff saved to https://phabricator.wikimedia.org/P65156 and previous config saved to /var/cache/conftool/dbconfig/20240618-143951-arnaudb.json [14:39:53] I'm more surprised that irc1002 is sending messages at all actually [14:40:04] because if it's not active active, and irc1001 is sending messages [14:40:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047099 (https://phabricator.wikimedia.org/T367858) 
(owner: 10Superzerocool) [14:40:31] !log klausman@cumin2002 START - Cookbook sre.hosts.downtime for 4 days, 0:00:00 on ml-serve2001.codfw.wmnet with reason: Hardware maintenance for memory errors [14:40:47] !log klausman@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ml-serve2001.codfw.wmnet with reason: Hardware maintenance for memory errors [14:41:03] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9903626 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ebd7c06d-d85d-4a91-a22b-6101091bac81) set by klausman@c... [14:42:08] bare metal is now serving 45rps (excluding jobrunners because of videoscaling) [14:42:10] FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:42:57] (03CR) 10Hashar: [C:03+2] Point its-phabricator to stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1047058 (owner: 10Hashar) [14:43:55] (03CR) 10Hnowlan: [C:03+2] service: add basic config for shellbox-video [puppet] - 10https://gerrit.wikimedia.org/r/1043724 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [14:44:15] !log reenable puppet on backup2002 [14:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:21] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be1001.eqiad.wmnet with OS bookworm [14:44:27] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:44:31] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903639 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-be1001.eqiad.wmnet with OS bookworm completed: - moss-be1001 (**PASS**)... [14:45:16] (03PS3) 10Hashar: wmf: `bazel test` our plugins [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1046783 [14:46:34] (03CR) 10Hashar: [C:03+2] wmf: `bazel test` our plugins [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1046783 (owner: 10Hashar) [14:46:37] PROBLEM - Host ml-cache2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:46:53] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:47:20] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:40:00 on lsw1-f7-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f7-eqiad [14:47:34] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:40:00 on lsw1-f7-eqiad.mgmt with reason: prep JunOS upgrade lsw1-f7-eqiad [14:47:49] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9903652 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0039bfdd-84ad-4638-9b4c-c0c23984e401) set by cmooney... 
[14:48:46] FIRING: [2x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:49:00] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bookworm [14:49:01] (03PS1) 10EoghanGaffney: lists: Change lists sync to use quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706) [14:49:14] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903661 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host moss-be1003.eqiad.wmnet with OS bookworm [14:49:28] (03PS2) 10BCornwall: hiera: Unify ulsfo trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) [14:49:41] (03CR) 10CI reject: [V:04-1] hiera: Unify ulsfo trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall) [14:50:14] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2954/co" [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [14:50:30] (03Merged) 10jenkins-bot: Point its-phabricator to stable-3.9 [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1047058 (owner: 10Hashar) [14:51:26] FIRING: RoutinatorRsyncErrors: Routinator rsync fetching issue in eqiad - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:53:41] !log 
btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host htmldumper1001.eqiad.wmnet [14:54:24] (03PS3) 10BCornwall: hiera: Unify ulsfo trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) [14:55:30] (03Merged) 10jenkins-bot: wmf: `bazel test` our plugins [software/gerrit] (wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1046783 (owner: 10Hashar) [14:55:35] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2956/console" [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall) [14:55:48] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:56:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:56:26] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on lsw1-f7-eqiad,lsw1-f7-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f7-eqiad [14:56:42] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on lsw1-f7-eqiad,lsw1-f7-eqiad IPv6,ssw1-e1-eqiad.mgmt,ssw1-f1-eqiad.mgmt with reason: JunOS upgrade lsw1-f7-eqiad [14:56:53] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9903691 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=b16e0477-5d40-4e59-950e-09e82271c822) 
set by cmooney... [14:57:19] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:35:00 on an-worker[1172-1174].eqiad.wmnet,es1040.eqiad.wmnet,ms-be1081.eqiad.wmnet with reason: JunOS upgrade lsw1-f7-eqiad [14:57:26] (03CR) 10Ssingh: [C:03+1] hiera: Unify ulsfo trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall) [14:57:36] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:35:00 on an-worker[1172-1174].eqiad.wmnet,es1040.eqiad.wmnet,ms-be1081.eqiad.wmnet with reason: JunOS upgrade lsw1-f7-eqiad [14:57:44] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9903694 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=80e189d2-8757-4138-ad14-1e0cf5cfbbdb) set by cmooney... [14:58:05] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9903696 (10kamila) 05In progress→03Stalled [14:58:05] (03PS2) 10Klausman: hiera/conftool/manifest: Add ml-staging2003 as a k8s GPU host [puppet] - 10https://gerrit.wikimedia.org/r/1042227 (https://phabricator.wikimedia.org/T357415) [14:58:39] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:59:01] Well that's too bad httpbb, but it's not a problem anymore :P [14:59:31] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 143, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:00:04] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@4f7d29a]: (no justification provided) [15:00:04] eoghan, jelto, arnoldokoth, and mutante: I, the Bot under the Fountain, call upon thee, The 
Deployer, to do SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1500). [15:00:08] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host htmldumper1001.eqiad.wmnet [15:00:13] !log rebooting lsw1-f7-eqiad to upgrade JunOS on switch T365984 [15:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:30] T365984: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984 [15:00:32] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@4f7d29a]: (no justification provided) (duration: 00m 28s) [15:01:22] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9903731 (10Clement_Goubert) [15:02:25] (03CR) 10Cathal Mooney: [C:03+2] CHANGELOG: add changelogs for release v0.6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/1047097 (owner: 10Cathal Mooney) [15:02:43] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy llama3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047106 (https://phabricator.wikimedia.org/T354870) [15:03:26] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:03:40] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator/Phorge update [15:03:43] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9903729 (10Clement_Goubert) {F55438321} 🚀🚀🚀 [15:03:45] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:03:58] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
0:30:00 on phab2002.codfw.wmnet with reason: Phabricator/Phorge update [15:04:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T364069)', diff saved to https://phabricator.wikimedia.org/P65157 and previous config saved to /var/cache/conftool/dbconfig/20240618-150416-marostegui.json [15:04:21] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [15:04:28] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.6.6 [software/homer] - 10https://gerrit.wikimedia.org/r/1047097 (owner: 10Cathal Mooney) [15:04:28] !log brennen@deploy1002 Started deploy [phabricator/deployment@ebe3a94]: deploy phab2002 for T367775 [15:04:34] T367775: Deploy Phabricator/Phorge 2024-06-18 - https://phabricator.wikimedia.org/T367775 [15:05:05] !log brennen@deploy1002 Finished deploy [phabricator/deployment@ebe3a94]: deploy phab2002 for T367775 (duration: 00m 36s) [15:05:17] (03CR) 10Elukey: [C:03+1] profile::java: Add support for Java 21 [puppet] - 10https://gerrit.wikimedia.org/r/1047086 (https://phabricator.wikimedia.org/T367487) (owner: 10Muehlenhoff) [15:05:28] !log brennen@deploy1002 Started deploy [phabricator/deployment@ebe3a94]: deploy phab1004 for T367775 [15:06:16] !log brennen@deploy1002 Finished deploy [phabricator/deployment@ebe3a94]: deploy phab1004 for T367775 (duration: 00m 47s) [15:06:49] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [15:06:50] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [15:06:57] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-fe1002.eqiad.wmnet with OS bookworm [15:07:26] !log brennen@deploy1002 Started deploy [phabricator/deployment@ef680d8]: revert phab1004 after breakage for T367775 [15:07:41] !log brennen@deploy1002 Finished deploy [phabricator/deployment@ef680d8]: revert phab1004 after breakage for T367775 (duration: 00m 15s) 
[15:07:47] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903763 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-fe1002.eqiad.wmnet with OS bookworm completed: - moss-fe1002 (**WARN**)... [15:07:49] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage [15:08:01] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:08:07] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:08:13] ? expected? [15:08:30] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323#9903765 (10Ladsgroup) {meme, src=itshappening} [15:08:39] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:09:31] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:09:58] sukhe: don't think so [15:10:55] Emperor ? 
[15:10:58] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:11:02] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:11:06] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:11:11] !incidents [15:11:11] 4757 (ACKED) Host db1165 (paged) - PING - Packet loss = 100% [15:11:11] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage [15:11:11] 4758 (UNACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [15:11:14] here [15:11:17] here [15:11:19] !ack 4758 [15:11:19] 4758 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [15:11:19] !incidnts [15:11:28] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab-settings: add timer for configure-projects [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [15:12:02] Here. [15:12:07] p99 jumped hard up [15:12:33] titan are the thanos-software front-ends, and godog knows about them [15:12:48] Looking. [15:12:55] rx went from 24mb/s to 1.2GB/s [15:13:02] Emperor: godog is on vacation. [15:13:04] something is blasting it [15:13:44] I think it may be a query. 
[15:13:46] FIRING: [3x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:14:02] denisse: tell us if we can help/how [15:14:40] it is very likely a very large query [15:14:42] https://grafana.wikimedia.org/d/000000607/cluster-overview?orgId=1&var-site=eqiad&var-cluster=thanos&var-instance=All&from=now-1h&to=now [15:14:54] oof [15:14:54] https://i.imgur.com/FlnOPaV.png [15:15:07] (03PS4) 10JMeybohm: admin_ng: Add toggles for PSP to PSS migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020313 (https://phabricator.wikimedia.org/T273507) [15:15:26] (03CR) 10Klausman: [C:03+1] ml-services: deploy llama3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047106 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [15:15:34] sorry I need to join a meeting [15:15:55] (03CR) 10CI reject: [V:04-1] admin_ng: Add toggles for PSP to PSS migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020313 (https://phabricator.wikimedia.org/T273507) (owner: 10JMeybohm) [15:15:58] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:15:58] titan1001's saturation has dropped off at least [15:16:00] ah [15:16:02] I think it'll self resolve, let me see if I can see the contents of the query. [15:16:24] yeah page has resolved [15:17:06] probably worth adding to T356788 [15:17:06] T356788: thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 [15:17:22] (which I think is where we've been tracking things-that-kill-titan) [15:17:44] Emperor: good idea, let me add it. 
[15:18:11] oh titan1001 recovered because it OOMkilled :) [15:18:19] at 15:11 [15:18:20] for having a name like titan it seems a bit fragile [15:18:37] (03CR) 10Hnowlan: [C:03+2] shellbox-video: initial helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [15:18:40] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9903792 (10cmooney) Switch is back online after upgrade, everything looks good at first glance. [15:18:55] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: deploy llama3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047106 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [15:19:03] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [15:19:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P65158 and previous config saved to /var/cache/conftool/dbconfig/20240618-151923-marostegui.json [15:19:30] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:20:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 10%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P65159 and previous config saved to /var/cache/conftool/dbconfig/20240618-152031-arnaudb.json [15:20:38] T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983 [15:20:53] (03CR) 10Klausman: [C:03+2] hiera/conftool/manifest: Add ml-staging2003 as a k8s GPU host [puppet] - 
10https://gerrit.wikimedia.org/r/1042227 (https://phabricator.wikimedia.org/T357415) (owner: 10Klausman) [15:21:10] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9903802 (10VRiley-WMF) Hey @Eevans This is correct. The backplane was replaced. At this stage we can move forward with a motherboard replacement if you wish. I will be pulling it from a different... [15:21:22] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9903803 (10SCherukuwada) Manager approves. [15:21:25] semi-serious question, should we wait longer before p.aging for those alerts, since titan does typically self-resolve after OOMing? [15:21:37] (03Merged) 10jenkins-bot: shellbox-video: initial helmfile configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [15:21:53] (03Merged) 10jenkins-bot: ml-services: deploy llama3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047106 (https://phabricator.wikimedia.org/T354870) (owner: 10Ilias Sarantopoulos) [15:21:55] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1047087 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [15:21:56] (03CR) 10BCornwall: [V:03+1 C:03+2] hiera: Unify ulsfo trafficserver storage elements [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall) [15:22:50] Hi [15:23:02] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:23:45] How do I modify Bridgebot repo? [15:24:08] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9903811 (10MatthewVernon) ms swift looks good, thanks. 
[15:24:39] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9903813 (10VRiley-WMF) Hey @RKemper would Thursday work for you? Around 12:00 EST? [15:25:08] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: Multi-bit errors on DIMM_B1 for an-worker1085.eqiad.wmnet - https://phabricator.wikimedia.org/T367442#9903814 (10RKemper) >>! In T367442#9903813, @VRiley-WMF wrote: > Hey @RKemper would Thursday work for you? Around 12:00 EST? @VRiley-WMF Sounds great! [15:25:34] Gerges: does https://wikitech.wikimedia.org/wiki/Tool:Bridgebot perhaps help? [15:26:05] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9903816 (10klausman) It looks like the primary interface can't see the network device (the console shows "media test failure, check cable". {F55438869} [15:26:23] https://gitlab.wikimedia.org/toolforge-repos/bridgebot/-/merge_requests/7 [15:26:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:26:33] I made a merge request [15:26:47] But I don't know if this is true or not [15:29:12] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9903823 (10BCornwall) 05In progress→03Resolved [15:29:22] (03PS5) 10JMeybohm: admin_ng: Add toggles for PSP to PSS migration [deployment-charts] - 10https://gerrit.wikimedia.org/r/1020313 (https://phabricator.wikimedia.org/T273507) [15:29:38] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit 
httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:29:44] (03PS2) 10EoghanGaffney: lists: Change lists sync to use quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706) [15:30:11] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [15:30:14] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [15:30:19] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: sync [15:30:27] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be1003.eqiad.wmnet with OS bookworm [15:30:38] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9903833 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host moss-be1003.eqiad.wmnet with OS bookworm completed: - moss-be1003 (**PASS**)... 
[15:31:13] Gerges: I have no idea who maintains bridgebot, but my guess would be that it would help to link to your merge request on the phab task and/or point the maintainer at it [15:31:34] (03PS3) 10Elukey: WIP: alternative proposal for netbox dev refactor [puppet] - 10https://gerrit.wikimedia.org/r/1047081 [15:31:58] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 38 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:32:07] bd808: hi [15:32:10] RESOLVED: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:32:25] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review, 07User-notice: Mailman Downtime: Migrate mailman from lists1001 to lists1004 - https://phabricator.wikimedia.org/T367521#9903844 (10eoghan) [15:32:33] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2957/co" [puppet] - 10https://gerrit.wikimedia.org/r/1047081 (owner: 10Elukey) [15:32:49] https://gitlab.wikimedia.org/toolforge-repos/bridgebot/-/merge_requests/7 [15:33:10] (03PS1) 10Brennen Bearnes: gitlab-settings: use v1.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/1047110 (https://phabricator.wikimedia.org/T355097) [15:33:31] (03CR) 10Jelto: [C:03+1] gitlab-settings: use v1.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/1047110 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [15:33:47] bd808: Do you have merge privileges in the Bridgebot 
repository? [15:33:51] Gerges: bd808 might be the right person, but he's out this week. [15:33:56] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2958/console" [puppet] - 10https://gerrit.wikimedia.org/r/1043979 (owner: 10BCornwall) [15:34:00] (03CR) 10CI reject: [V:04-1] WIP: alternative proposal for netbox dev refactor [puppet] - 10https://gerrit.wikimedia.org/r/1047081 (owner: 10Elukey) [15:34:07] (03CR) 10Jelto: [C:03+2] gitlab-settings: use v1.4.0 [puppet] - 10https://gerrit.wikimedia.org/r/1047110 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [15:34:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P65161 and previous config saved to /var/cache/conftool/dbconfig/20240618-153430-marostegui.json [15:34:49] (03CR) 10EoghanGaffney: [C:03+2] lists: Change lists sync to use quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [15:35:07] (03PS3) 10EoghanGaffney: lists: Change lists sync to use quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706) [15:35:18] (03PS2) 10BCornwall: acme-chief: Preparatory PyYAML formatting [puppet] - 10https://gerrit.wikimedia.org/r/1043979 [15:35:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 25%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P65162 and previous config saved to /var/cache/conftool/dbconfig/20240618-153537-arnaudb.json [15:35:42] T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983 [15:35:45] (03PS7) 10Ayounsi: Netbox 4: JOBRESULT_RETENTION -> JOB_RETENTION [puppet] - 10https://gerrit.wikimedia.org/r/918353 (https://phabricator.wikimedia.org/T336275) 
[15:35:57] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp3066.*} and A:cp [15:36:00] @ [15:36:08] !log upgrade haproxy to v2.8.10 on cp3066 (T367756) [15:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:12] T367756: Upgrade hosts to haproxy 2.8.10 - https://phabricator.wikimedia.org/T367756 [15:36:46] (03PS4) 10Ayounsi: Prepare for netbox-dev [puppet] - 10https://gerrit.wikimedia.org/r/1047081 (https://phabricator.wikimedia.org/T336275) (owner: 10Elukey) [15:36:53] (03CR) 10Elukey: cli: modify get_distro_name to return the version id (031 comment) [software/debmonitor-client] - 10https://gerrit.wikimedia.org/r/1043780 (https://phabricator.wikimedia.org/T240193) (owner: 10Elukey) [15:36:57] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 25 probes of 792 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:37:14] (03CR) 10Cwhite: [C:03+2] grafana: Change synthetic performance test proxy endpoint. 
[puppet] - 10https://gerrit.wikimedia.org/r/1044292 (https://phabricator.wikimedia.org/T367488) (owner: 10Phedenskog) [15:37:28] (03Abandoned) 10Ayounsi: Prepare for netbox-dev [puppet] - 10https://gerrit.wikimedia.org/r/1037784 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [15:37:31] (03PS1) 10Arturo Borrero Gonzalez: toolforge: haproxy: use HTTP healthcheck for the k8s api-server [puppet] - 10https://gerrit.wikimedia.org/r/1047113 (https://phabricator.wikimedia.org/T367389) [15:37:48] (03CR) 10Vgutierrez: [C:03+1] "please merge this one with puppet disabled on acme-chief hosts and check that it's a NOOP at acme-chief level on acmechief-test instances" [puppet] - 10https://gerrit.wikimedia.org/r/1043979 (owner: 10BCornwall) [15:38:00] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp3066.*} and A:cp [15:39:25] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp5030.*} and A:cp [15:39:27] !log upgrade haproxy to v2.8.10 on cp5030,cp5032 (T367756) [15:39:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:38] (03PS1) 10Brennen Bearnes: gitlab-settings: update tag to 1.5.0 for configure-projects [puppet] - 10https://gerrit.wikimedia.org/r/1047114 (https://phabricator.wikimedia.org/T355097) [15:41:44] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp5030.*} and A:cp [15:42:01] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp5032.*} and A:cp [15:42:09] (03CR) 10Jelto: [C:03+1] gitlab-settings: update tag to 1.5.0 for configure-projects [puppet] - 10https://gerrit.wikimedia.org/r/1047114 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [15:42:10] !log hnowlan@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[15:42:39] (03CR) 10Jelto: [C:03+2] gitlab-settings: update tag to 1.5.0 for configure-projects [puppet] - 10https://gerrit.wikimedia.org/r/1047114 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [15:43:40] (03CR) 10Elukey: [C:03+1] "I love it, really nice!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1046734 (https://phabricator.wikimedia.org/T365372) (owner: 10Volans) [15:43:42] !log hnowlan@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [15:44:03] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9903888 (10Eevans) >>! In T362033#9903802, @VRiley-WMF wrote: > .... Is there a time you would like to proceed with this? I have no time preference; I can be available any time this week. [15:44:06] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp5032.*} and A:cp [15:45:09] (03CR) 10Volans: [C:03+2] redfish: simplify interface of Redfish classes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1046734 (https://phabricator.wikimedia.org/T365372) (owner: 10Volans) [15:45:38] !log hnowlan@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [15:46:18] !log hnowlan@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:47:08] 07Puppet, 06Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113#9903900 (10Dzahn) How about adding a MAILTO to the timer and mail a specific list / team / group? I think that ale... [15:47:13] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[15:47:36] (03CR) 10EoghanGaffney: [V:03+2 C:03+2] lists: Change lists sync to use quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [15:47:38] !log included conftool 3.0.0 into buster/bullseye/bookworm-wikimedia on apt.w.o for T365123 [15:47:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:43] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123 [15:48:10] jouncebot: nowandnext [15:48:10] For the next 0 hour(s) and 11 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1500) [15:48:10] In 0 hour(s) and 11 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1600) [15:48:43] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:49:25] !log robh@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ml-staging2003 [15:49:34] !log robh@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ml-staging2003 [15:49:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T364069)', diff saved to https://phabricator.wikimedia.org/P65163 and previous config saved to /var/cache/conftool/dbconfig/20240618-154938-marostegui.json [15:49:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [15:49:44] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [15:49:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [15:50:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T364069)', diff saved to https://phabricator.wikimedia.org/P65164 and previous config saved 
to /var/cache/conftool/dbconfig/20240618-155000-marostegui.json [15:50:26] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync [15:50:42] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:50:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 50%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P65165 and previous config saved to /var/cache/conftool/dbconfig/20240618-155042-arnaudb.json [15:50:51] T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983 [15:51:09] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:51:09] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: sync [15:52:19] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-video: apply [15:52:47] (03Merged) 10jenkins-bot: redfish: simplify interface of Redfish classes [software/spicerack] - 10https://gerrit.wikimedia.org/r/1046734 (https://phabricator.wikimedia.org/T365372) (owner: 10Volans) [15:52:55] (03CR) 10Clément Goubert: [C:03+2] statograph: Use k8s envoy metric for statuspage [puppet] - 10https://gerrit.wikimedia.org/r/1047115 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [15:53:10] (03PS2) 10Arturo Borrero Gonzalez: toolforge: haproxy: check the k8s api-server /healthz endpoint [puppet] - 10https://gerrit.wikimedia.org/r/1047113 (https://phabricator.wikimedia.org/T367389) [15:53:32] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s7 [15:53:36] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s2 [15:53:45] (03PS1) 10MVernon: cephadm: limit mgr daemons to _admin-labelled hosts [puppet] - 10https://gerrit.wikimedia.org/r/1047117 (https://phabricator.wikimedia.org/T279621) [15:54:00] (03PS2) 10EoghanGaffney: lists: Add 
symlink to /var/lib/mailman3 when using different root [puppet] - 10https://gerrit.wikimedia.org/r/1047094 (https://phabricator.wikimedia.org/T331706) [15:54:00] (03PS1) 10EoghanGaffney: lists: Update rsync module path for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047118 [15:54:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): hw troubleshooting: server fails to reboot for clouddb1018.eqiad.wmnet - https://phabricator.wikimedia.org/T367499#9903941 (10fnegri) Thanks @Jclark-ctr! The host is now repooled. [15:54:28] (03CR) 10CI reject: [V:04-1] lists: Update rsync module path for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [15:55:00] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-staging2003.codfw.wmnet with OS bookworm [15:55:08] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9903947 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ml-staging2003.codfw.wmnet with OS bookworm executed with errors... [15:55:35] (03PS2) 10EoghanGaffney: lists: Update rsync module path for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047118 [15:59:31] (03PS6) 10Jdlrobson: Enable dark mode on more pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378) [16:00:05] jhathaway and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1600) [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:02:24] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-video: apply [16:02:30] (03PS2) 10Btullis: Remove conda repository from reprepro configuration [puppet] - 10https://gerrit.wikimedia.org/r/1047085 (https://phabricator.wikimedia.org/T364550) [16:05:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 75%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P65166 and previous config saved to /var/cache/conftool/dbconfig/20240618-160548-arnaudb.json [16:05:53] T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983 [16:06:00] (03CR) 10BCornwall: [C:03+2] "Ack" [puppet] - 10https://gerrit.wikimedia.org/r/1043979 (owner: 10BCornwall) [16:11:17] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync [16:11:20] (03PS3) 10Btullis: Remove conda repository from reprepro configuration [puppet] - 10https://gerrit.wikimedia.org/r/1047085 (https://phabricator.wikimedia.org/T364550) [16:11:25] (03PS1) 10Clément Goubert: statograph: fix wiki response time query [puppet] - 10https://gerrit.wikimedia.org/r/1047121 [16:12:53] (03PS1) 10Bartosz Dziewoński: Revert "Show experimental login popup links on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047122 (https://phabricator.wikimedia.org/T367891) [16:13:10] (03PS2) 10Bartosz Dziewoński: Revert "Show experimental login popup links on the beta cluster" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047122 (https://phabricator.wikimedia.org/T367891) [16:14:10] (03PS2) 10Clément Goubert: statograph: fix wiki response time query [puppet] - 10https://gerrit.wikimedia.org/r/1047121 [16:14:52] RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 75%, RTA = 119.44 ms [16:15:24] (03PS3) 10Clément Goubert: statograph: fix wiki response time query [puppet] - 10https://gerrit.wikimedia.org/r/1047121 [16:16:04] (03PS1) 10Hnowlan: 
admin_ng: bump limits for shellbox-video [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047124 (https://phabricator.wikimedia.org/T357309) [16:16:18] PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100% [16:16:55] (03CR) 10Clément Goubert: [C:03+2] statograph: fix wiki response time query [puppet] - 10https://gerrit.wikimedia.org/r/1047121 (owner: 10Clément Goubert) [16:17:52] (03PS3) 10EoghanGaffney: lists: Update rsync module path for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047118 [16:18:23] (03CR) 10Eevans: [C:03+2] restbase: upgrade cluster to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/1047087 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [16:19:12] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: sync [16:19:49] (03PS1) 10DLynch: Deploy references edit check to phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047125 (https://phabricator.wikimedia.org/T361843) [16:20:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'es1040 (re)pooling @ 100%: post T365983 repool', diff saved to https://phabricator.wikimedia.org/P65167 and previous config saved to /var/cache/conftool/dbconfig/20240618-162053-arnaudb.json [16:21:06] T365983: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983 [16:21:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:22:58] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-codfw: Upgrade to Java 11 — T350567 - eevans@cumin1002 [16:23:02] T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567 [16:23:15] !log depooled / pooled mw2441.codfw.wmnet to smoke-test python3-conftool for T365123 
[16:23:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:20] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123 [16:23:47] !log resetting Wiki response time metric on wikimedia.statuspage.io following complete switch to k8s - T362323 [16:23:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:52] T362323: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323 [16:24:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047125 (https://phabricator.wikimedia.org/T361843) (owner: 10DLynch) [16:24:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378) (owner: 10Jdlrobson) [16:26:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:27:49] (03CR) 10Eevans: [C:03+1] data-gateway: remove initialDelaySeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046752 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [16:28:19] (03PS1) 10DLynch: Turn on Visual Editor collab beta feature on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047131 [16:28:48] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1047131 (owner: 10DLynch) [16:29:07] !log conftool on cumin2002 updated to 3.0.0 for T365123 [16:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:11] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123 [16:29:38] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:31:31] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on an-worker1093.eqiad.wmnet with reason: T367825 hw maint [16:31:36] T367825: hw troubleshooting: Multi-bit errors on DIMM_A2 for an-worker1093 - https://phabricator.wikimedia.org/T367825 [16:31:45] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-worker1093.eqiad.wmnet with reason: T367825 hw maint [16:32:07] (03PS1) 10EoghanGaffney: stewards: Allow lists.wm.o to access the stewards rsync server [puppet] - 10https://gerrit.wikimedia.org/r/1047135 [16:34:55] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1017.eqiad.wmnet,service=s1 [16:35:27] (03CR) 10Btullis: [C:03+2] Remove conda repository from reprepro configuration [puppet] - 10https://gerrit.wikimedia.org/r/1047085 (https://phabricator.wikimedia.org/T364550) (owner: 10Btullis) [16:39:19] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync [16:39:29] !log validated dbctl 3.0.0 on cumin2002 (noop edit to note: on parsercache spare pc2014) for T365123 [16:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:34] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123 [16:42:35] !log conftool on puppetmaster2001 updated to 3.0.0 for T365123 [16:42:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:20] !log validated requestctl 3.0.0 find-ip (new read-only subcommand) on puppetmaster2001 for T365123 [16:47:01] (03PS1) 10Clément Goubert: statograph: Use benthos query to save thanos [puppet] - 10https://gerrit.wikimedia.org/r/1047138 (https://phabricator.wikimedia.org/T362323) [16:50:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.presto.reboot-workers (exit_code=0) for Presto an-presto cluster: Reboot Presto nodes [16:51:10] FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:51:30] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:52:37] (03CR) 10Jelto: [C:03+1] "lgtm. 
One question in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [16:55:12] (03PS2) 10Clément Goubert: statograph: Use benthos query to save thanos [puppet] - 10https://gerrit.wikimedia.org/r/1047138 (https://phabricator.wikimedia.org/T367894) [16:56:12] (03CR) 10Kamila Součková: [C:03+1] "LGTM other than the CPU limit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047124 (https://phabricator.wikimedia.org/T357309) (owner: 10Hnowlan) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1700) [17:12:10] !log updated conftool to 3.0.0 on remaining buster hosts in codfw for T365123 [17:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:15] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123 [17:13:41] (03CR) 10Jdlrobson: [C:04-1] "Blocked until June 20th" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042431 (https://phabricator.wikimedia.org/T366378) (owner: 10Jdlrobson) [17:13:47] (03CR) 10CDanis: [C:03+2] statograph: Use benthos query to save thanos [puppet] - 10https://gerrit.wikimedia.org/r/1047138 (https://phabricator.wikimedia.org/T367894) (owner: 10Clément Goubert) [17:14:43] !log updated conftool to 3.0.0 on remaining bookworm hosts in codfw for T365123 [17:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:50] (03CR) 10Dzahn: [C:03+1] "this would have no affect on lists1001 and change the path on lists1004 to /srv/mailman3" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [17:16:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2003.codfw.wmnet with OS bookworm [17:16:23] !log updated conftool to 3.0.0 on remaining bullseye hosts in codfw for T365123 [17:16:25] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install 
ml-staging2003 - https://phabricator.wikimedia.org/T357415#9904356 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ml-staging2003.codfw.wmnet with OS bookworm [17:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:09] !log resetting Wiki response time metric on wikimedia.statuspage.io following complete switch to k8s - T362323 T367894 [17:21:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:15] T362323: Move 100% of external traffic to Kubernetes - https://phabricator.wikimedia.org/T362323 [17:21:16] T367894: update status page latency for mw-on-k8s - https://phabricator.wikimedia.org/T367894 [17:23:46] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:28:59] (03CR) 10Dreamrimmer: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047099 (https://phabricator.wikimedia.org/T367858) (owner: 10Superzerocool) [17:31:48] (03PS1) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047147 [17:34:01] (03PS2) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047147 [17:34:07] !log updated conftool to 3.0.0 on buster hosts in eqiad for T365123 [17:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:11] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123 [17:35:10] !log updated conftool to 3.0.0 on bookworm hosts in eqiad for T365123 [17:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:28] !log updated conftool to 3.0.0 on bullseye hosts in eqiad for T365123 [17:37:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:14] (03CR) 10Esanders: [C:03+1] Deploy references edit check to phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047125 (https://phabricator.wikimedia.org/T361843) (owner: 10DLynch) [17:40:11] (03PS1) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047148 [17:40:11] (03PS1) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047149 [17:40:11] (03PS1) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047150 [17:41:17] (03PS2) 10BCornwall: acme-chief: Add new certificates and domains [puppet] - 10https://gerrit.wikimedia.org/r/1047150 [17:41:33] (03CR) 10Dzahn: [C:03+1] lists: Update rsync module path for quickdatacopy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [17:42:35] (03CR) 10BCornwall: [V:03+1 C:03+2] hiera: Unify ulsfo trafficserver storage elements (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1046804 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall) [17:51:55] (03CR) 10Dzahn: "let's first fix this one that seems related:" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [17:57:13] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:57:59] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:58:09] (03CR) 10BCornwall: "Thanks for that, Taavi. Is that to say that only wikimediacloud.org and wikimedia.cloud being blacklisted is good enough?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1039849 (owner: 10Ncmonitor) [17:58:29] (03CR) 10Dzahn: lists: Change lists sync to use quickdatacopy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [17:58:45] (03Abandoned) 10BCornwall: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1039849 (owner: 10Ncmonitor) [17:59:38] (03CR) 10Dzahn: [C:03+1] lists: Update rsync module path for quickdatacopy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [18:00:05] jnuche and brennen: May I have your attention please! MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T1800) [18:00:42] (03CR) 10BBlack: [C:03+1] conftool-data: add ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1046675 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [18:01:05] o/ nothing for this window. [18:01:38] (03CR) 10BBlack: [C:03+1] dnsbox: announce ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1046685 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [18:03:10] (03CR) 10BCornwall: "These are all handled but I'm noticing that markmonitor is returning punycode as having ns[0-2].wikimedia.org..." 
[dns] - 10https://gerrit.wikimedia.org/r/1040335 (owner: 10Ncmonitor) [18:08:34] (03PS4) 10Dzahn: lists: Update rsync module path for quickdatacopy, fix invalid unit name [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [18:09:39] (03CR) 10Dzahn: lists: Update rsync module path for quickdatacopy, fix invalid unit name (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [18:11:59] (03PS1) 10Jdlrobson: Fix codex link styles overriding other link styles [skins/Vector] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1047155 (https://phabricator.wikimedia.org/T367844) [18:12:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [skins/Vector] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1047155 (https://phabricator.wikimedia.org/T367844) (owner: 10Jdlrobson) [18:12:49] (03CR) 10Dzahn: "with the additional change now it would mean a change on all servers.. I want to avoid that too... 
sigh" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [18:14:14] (03CR) 10Dzahn: "this being inside a " if $primary_host " it suprises me this has an affect on lists1001 and lists2001" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [18:16:30] (03CR) 10Muehlenhoff: "Or instead fix this in quickdatacopy by sanitising the name, have a look at what I added for in the firewall::service define, line 28 onwa" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [18:16:30] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host ml-staging2003.codfw.wmnet with OS bookworm [18:16:31] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-staging2003.codfw.wmnet with OS bookworm [18:17:05] !log updated conftool to 3.0.0 on hosts (cp,ncredir) in ulsfo for T365123 [18:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:10] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123 [18:17:46] (03CR) 10Dzahn: "ah, nevermind, the rsync::quickdatacopy resource should of exist on both (all) machines, but then internal logic inside it decides what to" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [18:19:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ml-staging2003.codfw.wmnet with OS bookworm [18:23:05] (03CR) 10Dzahn: [C:03+2] stewards: Allow lists.wm.o to access the stewards rsync server [puppet] - 10https://gerrit.wikimedia.org/r/1047135 (owner: 10EoghanGaffney) [18:23:14] (03PS2) 10EoghanGaffney: stewards: Allow lists.wm.o to access the stewards rsync server [puppet] - 10https://gerrit.wikimedia.org/r/1047135 [18:25:48] (03Abandoned) 10Dzahn: codesearch: add support for docker-ce on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1043901 (https://phabricator.wikimedia.org/T367479) (owner: 10Dzahn) [18:27:03] !log updated 
conftool to 3.0.0 on hosts (cp,ncredir) in magru for T365123 [18:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:08] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123 [18:27:08] (03CR) 10Dzahn: [V:03+2 C:03+2] stewards: Allow lists.wm.o to access the stewards rsync server [puppet] - 10https://gerrit.wikimedia.org/r/1047135 (owner: 10EoghanGaffney) [18:29:43] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-codfw: Upgrade to Java 11 — T350567 - eevans@cumin1002 [18:29:48] T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567 [18:31:10] (03PS1) 10Ahmon Dancy: mw-web: Add traindev environment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047158 [18:31:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [18:33:12] !log updated conftool to 3.0.0 on hosts (cp,ncredir) in drmrs for T365123 [18:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:16] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123 [18:34:47] FIRING: [2x] UdpMxIrcEchoThroughput: irc1002:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput [18:38:59] !log updated conftool to 3.0.0 on hosts (cp,ncredir) in eqsin for T365123 [18:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:03] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123 [18:40:17] (03CR) 10Dzahn: [V:03+2] 
"unit started manually on lists1004, works fine" [puppet] - 10https://gerrit.wikimedia.org/r/1047135 (owner: 10EoghanGaffney) [18:44:42] !log updated conftool to 3.0.0 on hosts (cp,ncredir) in esams for T365123 [18:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:47] T365123: Make dbctl check for depooled future masters - https://phabricator.wikimedia.org/T365123 [18:46:46] jinxer-wm: help [18:49:52] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:restbase-eqiad: Upgrade to Java 11 — T350567 - eevans@cumin1002 [18:49:58] T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567 [18:53:11] (03PS1) 10Dzahn: lists: fix invalid unit name for rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047160 (https://phabricator.wikimedia.org/T331706) [18:53:33] (03PS18) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [18:53:54] (03CR) 10Dzahn: "follow-up created: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1047160" [puppet] - 10https://gerrit.wikimedia.org/r/1047101 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [19:00:18] (03CR) 10Dzahn: [C:03+2] "disabling puppet on lists*, then deploying on at a time" [puppet] - 10https://gerrit.wikimedia.org/r/1047160 (https://phabricator.wikimedia.org/T331706) (owner: 10Dzahn) [19:01:34] (03PS19) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [19:03:48] (03PS1) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047161 (https://phabricator.wikimedia.org/T363001) [19:13:01] RECOVERY - Host elastic2088 is UP: PING WARNING - Packet loss = 33%, RTA = 30.35 ms [19:15:48] FIRING: [2x] ProbeDown: Service 
ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:17:45] (03Abandoned) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1047161 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [19:17:51] !log lists1001 - systemctl reset-failed - clean up systemd state due to units not found anymore after migration - disable puppet and then deploy gerrit:1047160 on lists to fix invalid unit name - T331706 [19:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:57] T331706: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706 [19:18:29] PROBLEM - Host elastic2088 is DOWN: PING CRITICAL - Packet loss = 100% [19:18:30] (03PS20) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [19:19:32] (03PS5) 10Hashar: Merge branch 'stable-3.10' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419) [19:26:54] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [19:26:55] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [19:29:40] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [19:29:41] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [19:30:14] (03CR) 10CI reject: [V:04-1] Merge branch 'stable-3.10' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419) (owner: 10Hashar) [19:30:14] (03PS21) 10Bking: 
dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [19:30:50] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [19:31:10] (03CR) 10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [19:32:46] (03PS22) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [19:33:33] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [19:33:54] (03CR) 10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [19:34:26] (03CR) 10Dzahn: [C:03+2] "on lists1001 - no change" [puppet] - 10https://gerrit.wikimedia.org/r/1047160 (https://phabricator.wikimedia.org/T331706) (owner: 10Dzahn) [19:36:32] (03PS5) 10Dzahn: lists: Update rsync module path for quickdatacopy, fix invalid unit name [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [19:36:39] (03PS6) 10Dzahn: lists: Update rsync module path for quickdatacopy, fix invalid unit name [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [19:40:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ml-staging2003.codfw.wmnet with OS bookworm [19:41:43] (03PS23) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [19:42:26] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [19:42:38] (03CR) 10CI reject: [V:04-1] 
dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [19:42:54] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [19:43:27] (03PS7) 10Dzahn: lists: Update rsync module path for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [19:44:48] (03CR) 10Dzahn: [C:03+2] lists: Update rsync module path for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [19:44:49] (03CR) 10Dzahn: [V:03+2 C:03+2] lists: Update rsync module path for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [19:48:40] (03PS6) 10Hashar: Merge branch 'stable-3.10' into wmf/stable-3.10 [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1043813 (https://phabricator.wikimedia.org/T367419) [19:55:38] (03PS24) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [19:55:41] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 197592184 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:56:31] (03CR) 10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [19:56:41] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 108688 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:57:13] (03PS25) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [19:58:19] (03CR) 
10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [19:59:14] (03PS26) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [19:59:56] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [20:00:01] (03CR) 10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240618T2000). [20:00:05] Superzerocool, kemayo, and jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:20] o/ [20:00:35] hi! [20:00:51] o/ (I can do jdlrobson's patches today) [20:01:15] jan_drewniak: would it be ok if i deploy everything in interest of time? [20:02:38] let's start [20:02:42] urbanecm: you can leave mine for last and I can self-deploy, I have toyofuku shadowing me on a deployment today :) [20:02:45] (03CR) 10Urbanecm: [C:03+2] cswiki: adding throttle rule, removing old throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047099 (https://phabricator.wikimedia.org/T367858) (owner: 10Superzerocool) [20:02:53] jan_drewniak: ah, okay. 
sounds good then [20:03:22] (03Merged) 10jenkins-bot: cswiki: adding throttle rule, removing old throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047099 (https://phabricator.wikimedia.org/T367858) (owner: 10Superzerocool) [20:03:28] (03CR) 10Urbanecm: [C:03+2] Deploy references edit check to phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047125 (https://phabricator.wikimedia.org/T361843) (owner: 10DLynch) [20:04:09] oh, we're releasing collab somewhere? cool! [20:04:20] (03Merged) 10jenkins-bot: Deploy references edit check to phase 1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047125 (https://phabricator.wikimedia.org/T361843) (owner: 10DLynch) [20:04:35] (03CR) 10Urbanecm: [C:03+2] Turn on Visual Editor collab beta feature on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047131 (owner: 10DLynch) [20:04:40] (03PS2) 10DLynch: Turn on Visual Editor collab beta feature on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047131 [20:04:55] Technically you can make it happen pretty much anywhere at the moment -- the thing that's gated away is the UI for actually *starting* a session. Once one is started, links to it should work regardless of your own feature-status. 
[20:04:57] (03CR) 10Urbanecm: [C:03+2] Turn on Visual Editor collab beta feature on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047131 (owner: 10DLynch) [20:05:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047131 (owner: 10DLynch) [20:05:36] (03Merged) 10jenkins-bot: Turn on Visual Editor collab beta feature on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1047131 (owner: 10DLynch) [20:05:39] Kemayo: but still, exposing the ui somewhere is very cool :) [20:05:45] (03CR) 10Dzahn: [V:03+2 C:03+2] "lists1001 - no change" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [20:06:07] It'll be good to get feedback on the UX / legal questions once people actually use it. And experience weird editing-conflicts that we've not managed to see ourselves yet. :D [20:06:10] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1047099|cswiki: adding throttle rule, removing old throttle rule (T367858)]], [[gerrit:1047125|Deploy references edit check to phase 1 wikis (T361843)]], [[gerrit:1047131|Turn on Visual Editor collab beta feature on officewiki]] [20:06:16] T367858: Lift IP cap on 2024-06-26 for Editathon Human Rights - cs.wikipedia - https://phabricator.wikimedia.org/T367858 [20:06:16] T361843: Make Edit Check (references) available to all newcomers at phase 1 Wikipedias - https://phabricator.wikimedia.org/T361843 [20:06:27] mutante: can i bribe you to puppet merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1045211 please? 
[20:07:01] (03PS2) 10Urbanecm: admin: urbanecm's home: Update .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/1045211 [20:07:56] (03CR) 10Majavah: [C:03+2] admin: urbanecm's home: Update .gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/1045211 (owner: 10Urbanecm) [20:08:08] thanks taavi [20:08:12] (03CR) 10Dzahn: [V:03+2 C:03+2] "lists2001 - /usr/local/sbin/sync-mailman-root-sync now pulls into /srv/mailman3/ and remote side offers /srv/mailman3 - manually started r" [puppet] - 10https://gerrit.wikimedia.org/r/1047118 (owner: 10EoghanGaffney) [20:08:50] (03PS27) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [20:09:15] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [20:09:17] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [20:09:31] (03CR) 10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [20:10:46] !log urbanecm@deploy1002 urbanecm, superzerocool, kemayo: Backport for [[gerrit:1047099|cswiki: adding throttle rule, removing old throttle rule (T367858)]], [[gerrit:1047125|Deploy references edit check to phase 1 wikis (T361843)]], [[gerrit:1047131|Turn on Visual Editor collab beta feature on officewiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:10:46] !log urbanecm@deploy1002 Sync cancelled. [20:11:04] Kemayo: can you test at mwdebug please? [20:11:11] Sure, just a second. [20:11:38] (both patches please) [20:12:46] (03PS28) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [20:13:36] why sync cancelled... 
[20:13:37] (03CR) 10CI reject: [V:04-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [20:13:42] i didn't cancel anything [20:13:49] restarting [20:14:06] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1047099|cswiki: adding throttle rule, removing old throttle rule (T367858)]], [[gerrit:1047125|Deploy references edit check to phase 1 wikis (T361843)]], [[gerrit:1047131|Turn on Visual Editor collab beta feature on officewiki]] [20:14:06] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply [20:14:15] T367858: Lift IP cap on 2024-06-26 for Editathon Human Rights - cs.wikipedia - https://phabricator.wikimedia.org/T367858 [20:14:16] T361843: Make Edit Check (references) available to all newcomers at phase 1 Wikipedias - https://phabricator.wikimedia.org/T361843 [20:14:50] 1047125 is working, but I can't persuade 1047131 to -- does mwdebug actually work on officewiki? [20:15:05] (I don't think I've ever done an officewiki-specific deployment before.) [20:15:37] Kemayo: it should work there [20:15:58] wikitech is the only exception (and i hope not for long) [20:17:58] i do see wgVisualEditorEnableCollabBeta is set to true at mwdebug [20:17:58] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9905039 (10Dzahn) After a little follow-up fix rsync::quickdatacopy is now in use and copies both from and to new path /srv/mailman3 (and /v... 
[20:18:38] !log urbanecm@deploy1002 kemayo, urbanecm, superzerocool: Backport for [[gerrit:1047099|cswiki: adding throttle rule, removing old throttle rule (T367858)]], [[gerrit:1047125|Deploy references edit check to phase 1 wikis (T361843)]], [[gerrit:1047131|Turn on Visual Editor collab beta feature on officewiki]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:19:08] Kemayo: if it is not breaking something visibly, we might try deploying and see what happens afterwards? unless you object. [20:19:27] Hey folks. Not sure if it's related to the ongoing deployment, but I was just told of a problem with EditCheck that is preventing edits, at least on testwiki. Still gathering some links and will file a task shortly, but I wanted to say it here first. [20:20:39] urbanecm: I'm fine with pushing it out and seeing if that helps [20:20:41] Daimona: my deployment did not reach anything non-debug [20:20:49] so it should not be related [20:20:51] but Kemayo would know more :) [20:21:17] There's some other stuff on the train that changed this week with edit check, so more details would be helpful. [20:21:28] (03CR) 10Dzahn: "I think we don't really need it anymore now. mailman_root is /srv/mailman3 on lists1004 and lists2001 and lists1001 is gone soon. 
unless t" [puppet] - 10https://gerrit.wikimedia.org/r/1047094 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [20:22:58] !log urbanecm@deploy1002 kemayo, urbanecm, superzerocool: Continuing with sync [20:24:13] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply [20:26:18] Task filed: T367920 [20:26:18] T367920: Cannot save edits in testwiki with VE: mw.editcheck.findAddedContentNeedingReference is not a function - https://phabricator.wikimedia.org/T367920 [20:29:15] I still haven't checked what wikis are affected and whether certain specific config is needed to reproduce, but for now I just wanted to file the task and get some more eyes on it. [20:29:57] I guess it might also be a deployment blocker, but again, still checking the impact. [20:30:29] Got it, it's a problem with the stuff on the train. I will see if I can write a very quick patch. [20:33:05] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1047099|cswiki: adding throttle rule, removing old throttle rule (T367858)]], [[gerrit:1047125|Deploy references edit check to phase 1 wikis (T361843)]], [[gerrit:1047131|Turn on Visual Editor collab beta feature on officewiki]] (duration: 18m 59s) [20:33:10] T367858: Lift IP cap on 2024-06-26 for Editathon Human Rights - cs.wikipedia - https://phabricator.wikimedia.org/T367858 [20:33:11] T361843: Make Edit Check (references) available to all newcomers at phase 1 Wikipedias - https://phabricator.wikimedia.org/T361843 [20:33:42] okay, patch finished syncing [20:33:53] and i think that settles the first group? [20:33:57] so jan_drewniak, i think you can start [20:34:56] urbanecm: thanks! 
I'll get to it :) [20:35:02] (03PS1) 10Hashar: Gerrit 3.10.x rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1047175 (https://phabricator.wikimedia.org/T367419) [20:36:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046790 (https://phabricator.wikimedia.org/T367463) (owner: 10Jdlrobson) [20:36:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1047155 (https://phabricator.wikimedia.org/T367844) (owner: 10Jdlrobson) [20:41:27] (03PS1) 10Dzahn: admin: add Audrey Penven to ldap_only (wmde/nda) [puppet] - 10https://gerrit.wikimedia.org/r/1047176 (https://phabricator.wikimedia.org/T367184) [20:44:06] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9905129 (10Dzahn) Thanks @KFrancis! Can you please add Audrey to the 'NDA and MOU' spreadsheet? 
[20:45:14] (03CR) 10Scott French: [C:03+2] data-gateway: remove initialDelaySeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046752 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [20:46:08] (03Merged) 10jenkins-bot: data-gateway: remove initialDelaySeconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/1046752 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [20:47:16] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/data-gateway: apply [20:47:19] 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for Kashmiri Wikimedians User Group - https://phabricator.wikimedia.org/T367640#9905136 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup https://lists.wikimedia.org/postorius/lists/wikimedia-ks.lists.wikimedia.org [20:47:22] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9905139 (10Dzahn) 05Stalled→03In progress [20:47:27] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [20:48:08] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9905140 (10KFrancis) Done, thanks! [20:49:14] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [20:49:32] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [20:49:39] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9905142 (10Dzahn) thanks Katie! @AudreyPenven_WMDE All is ready, we just still need an approval from one of the WMDE engineering managers (https://wikitech.wikimedia.org... 
[20:49:56] (03CR) 10Dzahn: "pending WMDE engineering manager approval" [puppet] - 10https://gerrit.wikimedia.org/r/1047176 (https://phabricator.wikimedia.org/T367184) (owner: 10Dzahn) [20:50:33] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [20:50:49] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [20:51:25] 06SRE, 10Wikimedia-Mailing-lists, 07Upstream: Unnecessary horizontal scrollbars - https://phabricator.wikimedia.org/T283028#9905147 (10Ladsgroup) There was a new version of mailman deployed today. I can't reproduce this anymore. Can you check @reedy? [20:51:25] FIRING: HelmReleaseBadStatus: Helm release sessionstore/staging on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:53:04] (03PS2) 10Hashar: Gerrit 3.10.x rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1047175 (https://phabricator.wikimedia.org/T367419) [20:53:35] * hashar sleeps [20:55:31] (03CR) 10Scott French: "Thank you both for the reviews! I'll be out tomorrow (Wednesday), but will aim to get this deployed on Thursday when I return." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1046753 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [20:59:53] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:restbase-eqiad: Upgrade to Java 11 — T350567 - eevans@cumin1002 [20:59:57] T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567 [21:02:30] (03Merged) 10jenkins-bot: Improve responsive images and avoid for inline [core] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1046790 (https://phabricator.wikimedia.org/T367463) (owner: 10Jdlrobson) [21:02:33] (03Merged) 10jenkins-bot: Fix codex link styles overriding other link styles [skins/Vector] (wmf/1.43.0-wmf.10) - 10https://gerrit.wikimedia.org/r/1047155 (https://phabricator.wikimedia.org/T367844) (owner: 10Jdlrobson) [21:03:07] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:1046790|Improve responsive images and avoid for inline (T367463)]], [[gerrit:1047155|Fix codex link styles overriding other link styles (T367844)]] [21:03:13] T367463: Tables with images inside them appear at minuscule size or disappear due to responsive image CSS - https://phabricator.wikimedia.org/T367463 [21:03:14] T367844: Various buttons on Vector 2022 acquired unexpected link styling when hovered - https://phabricator.wikimedia.org/T367844 [21:03:39] 06SRE, 10SRE-Access-Requests: Request to add mnz to analytics-research-admins - https://phabricator.wikimedia.org/T367757#9905192 (10Dzahn) Hi @MunizaA no problem, but we'll need a few more things from you for that. Could you please use the template linked from https://wikitech.wikimedia.org/wiki/SRE/Producti... 
[21:05:58] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9905205 (10Dzahn) [21:07:48] !log jdrewniak@deploy1002 jdrewniak, jdlrobson: Backport for [[gerrit:1046790|Improve responsive images and avoid for inline (T367463)]], [[gerrit:1047155|Fix codex link styles overriding other link styles (T367844)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:07:48] !log jdrewniak@deploy1002 Sync cancelled. [21:09:33] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:1046790|Improve responsive images and avoid for inline (T367463)]], [[gerrit:1047155|Fix codex link styles overriding other link styles (T367844)]] [21:09:39] T367463: Tables with images inside them appear at minuscule size or disappear due to responsive image CSS - https://phabricator.wikimedia.org/T367463 [21:09:39] T367844: Various buttons on Vector 2022 acquired unexpected link styling when hovered - https://phabricator.wikimedia.org/T367844 [21:12:31] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9905218 (10Dzahn) Hi @Kgraessle in addition to your manager please get any of the following people to approve of this request here on the ticket. ` a... [21:12:32] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for Kgraessle - https://phabricator.wikimedia.org/T367747#9905219 (10Dzahn) [21:12:51] Daimona: Okay, got what I think is a patch for it, just need to get some code review on it and we can unblock the train. 
https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1047180 [21:13:59] !log jdrewniak@deploy1002 jdlrobson, jdrewniak: Backport for [[gerrit:1046790|Improve responsive images and avoid for inline (T367463)]], [[gerrit:1047155|Fix codex link styles overriding other link styles (T367844)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:14:03] Thank you! I'd normally offer to take a look, but right now I'm struggling to keep my eyes open and I don't want to do damage. [21:14:43] 10ops-eqiad, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T367925 (10phaultfinder) 03NEW [21:16:11] !log jdrewniak@deploy1002 jdlrobson, jdrewniak: Continuing with sync [21:19:09] 06SRE, 10SRE-Access-Requests: Requesting access to dumps-roots and to clouddumps*.eqiad.wmnet for xcollazo - https://phabricator.wikimedia.org/T367571#9905250 (10Dzahn) Hi @xcollazo so clouddumps1001.eqiad.wmnet and clouddumps1002.eqiad.wmnet don't exist. But clouddumps1001.wikimedia.org and clouddumps1002.... [21:20:12] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9905262 (10Dzahn) [21:20:41] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudvirt-wdqs100[1,2,3] - https://phabricator.wikimedia.org/T367773#9905260 (10VRiley-WMF) a:03VRiley-WMF [21:21:14] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9905264 (10Dzahn) tagging with data-engineering per the new process to request approval from group approvers [21:21:20] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9905266 (10Dzahn) it's an SRE access request, unrelated to LDAP. 
adjusting tags [21:21:26] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:21:32] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Grant Access to analytics-privatedata-users for DMburugu - https://phabricator.wikimedia.org/T367872#9905267 (10Dzahn) [21:22:51] 06SRE, 10SRE-Access-Requests: Requesting access to dumps-roots and to clouddumps*.eqiad.wmnet for xcollazo - https://phabricator.wikimedia.org/T367571#9905272 (10BTullis) @xcollazo is already a member of analytics-admins: https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/data/data.ya... [21:24:25] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9905271 (10Dzahn) @WMDE-leszek Can we get approval here from WMDE management? [21:24:55] 06SRE, 10SRE-Access-Requests: Requesting access to dumps-roots and to clouddumps*.eqiad.wmnet for xcollazo - https://phabricator.wikimedia.org/T367571#9905280 (10BTullis) Mind you, membership of the `dumps-roots` group would give more privileges. Full root access: https://github.com/wikimedia/operations-puppet... 
[21:25:30] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudvirt-wdqs100[1,2,3] - https://phabricator.wikimedia.org/T367773#9905281 (10VRiley-WMF) [21:25:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:26:07] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:1046790|Improve responsive images and avoid for inline (T367463)]], [[gerrit:1047155|Fix codex link styles overriding other link styles (T367844)]] (duration: 16m 33s) [21:26:13] T367463: Tables with images inside them appear at minuscule size or disappear due to responsive image CSS - https://phabricator.wikimedia.org/T367463 [21:26:13] T367844: Various buttons on Vector 2022 acquired unexpected link styling when hovered - https://phabricator.wikimedia.org/T367844 [21:27:57] Hey all, looks like the backport finished, but it did end with the following error (not sure why) [21:28:00] backport failed: Command '['/usr/bin/scap', 'sync-world', '--pause-after-testserver-sync', '--notify-user=jdlrobson', 'Backport for [[gerrit:1046790|Improve responsive images and avoid for inline (T367463)]], [[gerrit:1047155|Fix codex link styles overriding other link styles (T367844)]]']' returned non-zero exit status 1. [21:29:25] 06SRE, 10SRE-Access-Requests: Requesting access to dumps-roots and to clouddumps*.eqiad.wmnet for xcollazo - https://phabricator.wikimedia.org/T367571#9905297 (10Dzahn) Ah, yes, confirmed. You already have clouddumps. And I also see the user on a host like dumpdata1004 or snapshot1010 where I assumed that's d... [21:29:45] jan_drewniak: hrm, was there any other backscroll in the output? 
[21:29:49] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:29:51] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:31:20] SRE, LDAP-Access-Requests: Update terms and timeline of access already granted for AndyRussG - https://phabricator.wikimedia.org/T367681#9905306 (Dzahn) The update to approvers for WMDE would be in T367914
[21:31:26] RESOLVED: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[21:32:43] ops-eqiad, SRE, cloud-services-team, DC-Ops, decommission-hardware: decommission cloudvirt-wdqs100[1,2,3] - https://phabricator.wikimedia.org/T367773#9905314 (VRiley-WMF) Open→Resolved
[21:34:43] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 52198 bytes in 3.078 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:34:43] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.998 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:35:07] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T367925#9905318 (VRiley-WMF) a:VRiley-WMF
[21:35:21] thcipriani: only a k8s host timeout :/
[21:35:36] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T367925#9905319 (VRiley-WMF) Adjusted power cable. Power supply is back on.
[21:35:45] ops-eqiad, SRE, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T367925#9905321 (VRiley-WMF) Open→Resolved
[21:37:41] urbanecm: I worked out what my beta feature issue was. I completely forgot about needing to add it to wgBetaFeaturesAllowList.
[21:38:16] Kemayo: ahh. Makes sense.
[21:38:59] SRE, Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9905322 (Ladsgroup) Open→Resolved https://lists.wikimedia.org/postorius/lists/project-wikimoitree.lists.wikimedia.org
[21:40:16] SRE, Wikimedia-Mailing-lists: MM3/postorius: takes too long to load - https://phabricator.wikimedia.org/T314247#9905328 (Dzahn) Mailman migrated to a new server and a new version just now. Did this get faster?
[21:44:18] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[21:50:31] (PS1) DLynch: Add Visual Editor collab beta feature to wgBetaFeaturesAllowList [mediawiki-config] - https://gerrit.wikimedia.org/r/1047182
[21:50:48] SRE, Wikimedia-Mailing-lists: MM3/postorius: takes too long to load - https://phabricator.wikimedia.org/T314247#9905359 (Reedy) →Duplicate dup:T353891
[21:52:52] SRE, Wikimedia-Mailing-lists, Performance Issue, Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#9905356 (Reedy)
[21:53:04] The joys of things that don't apply to local development config.
[21:53:37] SRE, Wikimedia-Mailing-lists, Performance Issue, Upstream: https://lists.wikimedia.org is often slow to load - https://phabricator.wikimedia.org/T353891#9905392 (Dzahn) >>! In T353891#9684341, @fnegri wrote: > It's very slow for me as well, I hadn't opened it in a while but it was barely usable b...
[21:54:25] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[21:55:54] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[21:56:37] (PS29) Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001)
[21:56:57] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[21:57:33] (CR) CI reject: [V:-1] dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[22:03:01] (PS1) Dzahn: mailman3: remove buster support [puppet] - https://gerrit.wikimedia.org/r/1047184 (https://phabricator.wikimedia.org/T331706)
[22:05:04] (PS1) JHathaway: postfix: always send local mail to smarthosts [puppet] - https://gerrit.wikimedia.org/r/1047185 (https://phabricator.wikimedia.org/T325406)
[22:05:31] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047185 (https://phabricator.wikimedia.org/T325406) (owner: JHathaway)
[22:07:03] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[22:09:40] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[22:11:34] (PS2) JHathaway: postfix: always send local mail to smarthosts [puppet] - https://gerrit.wikimedia.org/r/1047185 (https://phabricator.wikimedia.org/T325406)
[22:11:49] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047185 (https://phabricator.wikimedia.org/T325406) (owner: JHathaway)
[22:18:08] (CR) JHathaway: [C:+2] postfix: always send local mail to smarthosts [puppet] - https://gerrit.wikimedia.org/r/1047185 (https://phabricator.wikimedia.org/T325406) (owner: JHathaway)
[22:19:45] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[22:20:55] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[22:30:28] (PS1) DLynch: findAddedContentNeedingReference was removed accidentally [extensions/VisualEditor] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1047188 (https://phabricator.wikimedia.org/T367920)
[22:31:00] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[22:34:47] FIRING: [2x] UdpMxIrcEchoThroughput: irc1002:9221 has relayed less than 100 messages over past 5 minutes }} - https://wikitech.wikimedia.org/wiki/Irc.wikimedia.org - https://grafana.wikimedia.org/d/XyXn_CPMz/ircecho - https://alerts.wikimedia.org/?q=alertname%3DUdpMxIrcEchoThroughput
[22:34:49] (PS1) Bking: analytics: allow dse-k8s pod network to reach an-db1001 [puppet] - https://gerrit.wikimedia.org/r/1047189 (https://phabricator.wikimedia.org/T363001)
[22:35:09] (PS2) Bking: analytics: allow dse-k8s pod network to reach an-db1001 [puppet] - https://gerrit.wikimedia.org/r/1047189 (https://phabricator.wikimedia.org/T363001)
[22:35:45] (PS1) JHathaway: postfix: fix path to aliases [puppet] - https://gerrit.wikimedia.org/r/1047190 (https://phabricator.wikimedia.org/T325406)
[22:35:55] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047190 (https://phabricator.wikimedia.org/T325406) (owner: JHathaway)
[22:36:23] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1047189 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[22:37:16] (PS1) BCornwall: ncredir: Remove localized TLD redirects [puppet] - https://gerrit.wikimedia.org/r/1047191
[22:38:26] (CR) Jforrester: [C:+1] Add Visual Editor collab beta feature to wgBetaFeaturesAllowList [mediawiki-config] - https://gerrit.wikimedia.org/r/1047182 (owner: DLynch)
[22:39:00] (CR) JHathaway: [C:+2] postfix: fix path to aliases [puppet] - https://gerrit.wikimedia.org/r/1047190 (https://phabricator.wikimedia.org/T325406) (owner: JHathaway)
[22:39:16] I'm going to deploy a minor train backport for Wikifunctions, and a more serious one for VE.
[22:39:34] (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1002 using scap backport" [extensions/WikiLambda] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1047077 (https://phabricator.wikimedia.org/T367159) (owner: Jforrester)
[22:39:34] (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1047188 (https://phabricator.wikimedia.org/T367920) (owner: DLynch)
[22:40:57] (PS1) EoghanGaffney: lists: Change lists.wm.o A/AAAA records to CNAME and MX [dns] - https://gerrit.wikimedia.org/r/1047192
[22:41:14] (CR) Jforrester: [C:+2] "Officewiki-only" [mediawiki-config] - https://gerrit.wikimedia.org/r/1047182 (owner: DLynch)
[22:41:23] (CR) BCornwall: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/2963/co" [puppet] - https://gerrit.wikimedia.org/r/1047191 (owner: BCornwall)
[22:41:54] (Merged) jenkins-bot: Add Visual Editor collab beta feature to wgBetaFeaturesAllowList [mediawiki-config] - https://gerrit.wikimedia.org/r/1047182 (owner: DLynch)
[22:41:57] (CR) CI reject: [V:-1] lists: Change lists.wm.o A/AAAA records to CNAME and MX [dns] - https://gerrit.wikimedia.org/r/1047192 (owner: EoghanGaffney)
[22:45:22] (CR) Ryan Kemper: [C:+1] analytics: allow dse-k8s pod network to reach an-db1001 [puppet] - https://gerrit.wikimedia.org/r/1047189 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[22:45:48] FIRING: [2x] SystemdUnitFailed: mail-aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:45:49] (CR) Bking: [C:+2] analytics: allow dse-k8s pod network to reach an-db1001 [puppet] - https://gerrit.wikimedia.org/r/1047189 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[22:45:56] (Merged) jenkins-bot: Use isEnumType in selector and isCustomEnum for creating literals [extensions/WikiLambda] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1047077 (https://phabricator.wikimedia.org/T367159) (owner: Jforrester)
[22:46:29] Kemayo: (Hello here.)
[22:46:48] I always forget how long VE patches take to land.
[22:47:27] (CR) Btullis: [C:+1] "nit: It affects both an-db100[1-2] although 1002 is currently the replica." [puppet] - https://gerrit.wikimedia.org/r/1047189 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[22:47:33] James_F: Need me to test them on debug before they go into the train branch?
[22:48:00] Kemayo: No, I'm happy to test myself – but also happy to pause for you to approve, if you'd prefer.
[22:49:04] James_F: Someone else testing sounds good overall. I did test myself when I wrote the patch, but more eyes and all.
[22:49:08] !log bking@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow: apply
[22:49:18] +1
[22:49:36] !log bking@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow: apply
[22:51:09] (PS2) EoghanGaffney: lists: Change lists.wm.o A/AAAA records to CNAME and MX [dns] - https://gerrit.wikimedia.org/r/1047192
[22:51:56] (PS2) Jdlrobson: Cleanup: Remove wgNavigationTimingSurveyName [mediawiki-config] - https://gerrit.wikimedia.org/r/1043880 (https://phabricator.wikimedia.org/T367128)
[22:52:05] (CR) CI reject: [V:-1] lists: Change lists.wm.o A/AAAA records to CNAME and MX [dns] - https://gerrit.wikimedia.org/r/1047192 (owner: EoghanGaffney)
[22:57:15] (PS2) Jdlrobson: Enable dark mode on data table pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366373)
[23:03:54] (CR) EoghanGaffney: "From my perspective, we do need it as there were some parts of mailman-web that weren't respecting the different mailman root. We need to " [puppet] - https://gerrit.wikimedia.org/r/1047094 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney)
[23:04:40] (Merged) jenkins-bot: findAddedContentNeedingReference was removed accidentally [extensions/VisualEditor] (wmf/1.43.0-wmf.10) - https://gerrit.wikimedia.org/r/1047188 (https://phabricator.wikimedia.org/T367920) (owner: DLynch)
[23:04:58] Finally!
[23:05:16] Under half an hour! It's pretty good today!
[23:05:31] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1047077|Use isEnumType in selector and isCustomEnum for creating literals (T367159)]], [[gerrit:1047188|findAddedContentNeedingReference was removed accidentally (T367920)]]
[23:05:38] T367159: Unable to create converters using the UI as identity fields cannot be set - https://phabricator.wikimedia.org/T367159
[23:05:38] T367920: Cannot save edits in testwiki with VE: mw.editcheck.findAddedContentNeedingReference is not a function - https://phabricator.wikimedia.org/T367920
[23:08:17] (PS1) Scott French: drivers/etcd: only attempt to load existing configs [software/conftool] - https://gerrit.wikimedia.org/r/1047193 (https://phabricator.wikimedia.org/T367919)
[23:10:15] !log jforrester@deploy1002 jforrester, kemayo: Backport for [[gerrit:1047077|Use isEnumType in selector and isCustomEnum for creating literals (T367159)]], [[gerrit:1047188|findAddedContentNeedingReference was removed accidentally (T367920)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:12:48] Kemayo: And it works: https://test.wikipedia.org/w/index.php?title=Test&diff=prev&oldid=599462
[23:12:50] !log jforrester@deploy1002 jforrester, kemayo: Continuing with sync
[23:13:06] James_F: Excellent, thanks!
[23:13:24] Kemayo: Also https://office.wikimedia.org/wiki/Special:Preferences#mw-prefsection-betafeatures
[23:15:48] FIRING: [2x] ProbeDown: Service ml-cache2001-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:16:05] James_F: Looks good there on 1002. Thanks again!
[23:16:13] Success.
[23:16:22] Now the second half hour wait, this time to sync.
[23:16:41] 🎉
[23:22:48] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1047077|Use isEnumType in selector and isCustomEnum for creating literals (T367159)]], [[gerrit:1047188|findAddedContentNeedingReference was removed accidentally (T367920)]] (duration: 17m 16s)
[23:22:54] T367159: Unable to create converters using the UI as identity fields cannot be set - https://phabricator.wikimedia.org/T367159
[23:22:54] T367920: Cannot save edits in testwiki with VE: mw.editcheck.findAddedContentNeedingReference is not a function - https://phabricator.wikimedia.org/T367920
[23:22:54] And done.
[23:30:26] ops-codfw, SRE, DC-Ops, Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9905564 (Papaul) @Jhancock.wm @RobH some information on this server. **Information1** The server came with 2 network add-on cards: - 1st card connected to slot A1 is...
[23:33:37] RECOVERY - Host mw2321 is UP: PING WARNING - Packet loss = 77%, RTA = 30.33 ms
[23:35:33] ops-codfw, SRE, DC-Ops, Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9905575 (Papaul)
[23:36:54] SRE, Cassandra, Data-Persistence: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567#9905577 (Eevans)
[23:38:27] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1047199
[23:38:27] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1047199 (owner: TrainBranchBot)
[23:58:16] (PS1) Jforrester: mathoid: Upgrade image from 2023-11-03-103402 to 2024-06-18-233457 [deployment-charts] - https://gerrit.wikimedia.org/r/1047201 (https://phabricator.wikimedia.org/T350004)
[23:58:54] (CR) Jforrester: "I can deploy this on Thursday, if needed." [deployment-charts] - https://gerrit.wikimedia.org/r/1047201 (https://phabricator.wikimedia.org/T350004) (owner: Jforrester)