[00:00:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED
[00:01:31] SRE, Language-Technical Support, serviceops, Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9814750 (Vladis13) >>! In T275319#9814384, @cscott wrote: > This discussion risks going in circles. As I wrote previously in T27...
[00:01:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED
[00:02:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED
[00:03:10] PROBLEM - SSH on puppetserver1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:05:00] RECOVERY - SSH on puppetserver1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[00:16:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2002.mgmt.codfw.wmnet with reboot policy FORCED
[00:16:52] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest2002']
[00:17:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest2002']
[00:18:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host sretest2002.codfw.wmnet with OS bullseye
[00:18:22] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9814762 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sretest2002.codfw.w...
[00:21:21] ops-codfw, SRE, DC-Ops, Infrastructure-Foundations, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9814765 (Jhancock.wm) @cmooney I put the server in the wrong vlan. can you fix it for me. private1-a8 to private-a-codfw. thanks!
[00:36:46] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:04:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T364299)', diff saved to https://phabricator.wikimedia.org/P62728 and previous config saved to /var/cache/conftool/dbconfig/20240521-010423-marostegui.json
[01:04:31] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[01:07:41] (PS1) TrainBranchBot: Branch commit for wmf/1.43.0-wmf.6 [core] (wmf/1.43.0-wmf.6) - https://gerrit.wikimedia.org/r/1033393 (https://phabricator.wikimedia.org/T361400)
[01:07:43] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.43.0-wmf.6 [core] (wmf/1.43.0-wmf.6) - https://gerrit.wikimedia.org/r/1033393 (https://phabricator.wikimedia.org/T361400) (owner: TrainBranchBot)
[01:19:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P62729 and previous config saved to /var/cache/conftool/dbconfig/20240521-011931-marostegui.json
[01:29:24] (Merged) jenkins-bot: Branch commit for wmf/1.43.0-wmf.6 [core] (wmf/1.43.0-wmf.6) - https://gerrit.wikimedia.org/r/1033393 (https://phabricator.wikimedia.org/T361400) (owner: TrainBranchBot)
[01:34:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P62730 and previous config saved to
/var/cache/conftool/dbconfig/20240521-013441-marostegui.json
[01:49:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T364299)', diff saved to https://phabricator.wikimedia.org/P62731 and previous config saved to /var/cache/conftool/dbconfig/20240521-014949-marostegui.json
[01:49:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[01:49:56] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[01:50:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[01:50:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T364299)', diff saved to https://phabricator.wikimedia.org/P62732 and previous config saved to /var/cache/conftool/dbconfig/20240521-015014-marostegui.json
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240521T0200)
[02:01:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T352010)', diff saved to https://phabricator.wikimedia.org/P62733 and previous config saved to /var/cache/conftool/dbconfig/20240521-020126-ladsgroup.json
[02:01:32] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[02:06:46] FIRING: HelmReleaseBadStatus: Helm release datasets-config-next/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datasets-config-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[02:16:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff
saved to https://phabricator.wikimedia.org/P62734 and previous config saved to /var/cache/conftool/dbconfig/20240521-021634-ladsgroup.json
[02:31:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P62735 and previous config saved to /var/cache/conftool/dbconfig/20240521-023144-ladsgroup.json
[02:36:46] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:46:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T352010)', diff saved to https://phabricator.wikimedia.org/P62736 and previous config saved to /var/cache/conftool/dbconfig/20240521-024652-ladsgroup.json
[02:46:55] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[02:46:57] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[02:47:08] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[02:47:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T352010)', diff saved to https://phabricator.wikimedia.org/P62737 and previous config saved to /var/cache/conftool/dbconfig/20240521-024715-ladsgroup.json
[03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240521T0300)
[03:01:40] (PS1) TrainBranchBot: testwikis wikis to 1.43.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1034206 (https://phabricator.wikimedia.org/T361400)
[03:01:42] (CR) TrainBranchBot: [C:+2]
testwikis wikis to 1.43.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1034206 (https://phabricator.wikimedia.org/T361400) (owner: TrainBranchBot)
[03:01:46] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:24] (Merged) jenkins-bot: testwikis wikis to 1.43.0-wmf.6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1034206 (https://phabricator.wikimedia.org/T361400) (owner: TrainBranchBot)
[03:02:54] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.43.0-wmf.6 refs T361400
[03:02:59] T361400: 1.43.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T361400
[03:10:26] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv6: Connect - Telxius, AS12956/IPv4: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:10:32] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:10:54] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:11:04] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:16:46] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:44:20] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 1/1 UP : OSPFv3: 1/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:44:34] RECOVERY - BGP
status on cr1-magru is OK: BGP OK - up: 11, down: 2, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[03:44:54] RECOVERY - BFD status on cr2-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:45:36] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:00:06] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240521T0400)
[04:01:46] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.43.0-wmf.6 refs T361400 (duration: 58m 51s)
[04:01:52] T361400: 1.43.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T361400
[04:05:29] !log mwpresync@deploy1002 Pruned MediaWiki: 1.43.0-wmf.3 (duration: 05m 28s)
[04:05:34] (PS1) KartikMistry: Update cxserver to 2024-05-20-182409-production [deployment-charts] - https://gerrit.wikimedia.org/r/1034211 (https://phabricator.wikimedia.org/T354666)
[04:13:08] (PS2) KartikMistry: Fix the mobile experience for a second group of Wikipedias where CX is in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1032793 (https://phabricator.wikimedia.org/T361597)
[04:36:46] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:50:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T352010)', diff saved to https://phabricator.wikimedia.org/P62738 and previous config saved to /var/cache/conftool/dbconfig/20240521-045037-ladsgroup.json
[04:50:46] T352010: Gradually drop old pagelinks columns -
https://phabricator.wikimedia.org/T352010
[04:56:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 28 hosts with reason: Schema change
[04:56:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 28 hosts with reason: Schema change
[05:00:31] !log Deploy schema change on s7 (metawiki) codfw dbmaint T365352
[05:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:39] T365352: Stop referencing rev_id as signed int in revtag table to counter revision id overflow in wikidatawiki - https://phabricator.wikimedia.org/T365352
[05:03:47] (PS1) Marostegui: db2102: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1034213
[05:04:28] (CR) Marostegui: [C:+2] db2102: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1034213 (owner: Marostegui)
[05:05:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2102.codfw.wmnet with OS bookworm
[05:05:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P62739 and previous config saved to /var/cache/conftool/dbconfig/20240521-050546-ladsgroup.json
[05:11:46] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:20:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2102.codfw.wmnet with reason: host reimage
[05:20:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P62740 and previous config saved to /var/cache/conftool/dbconfig/20240521-052054-ladsgroup.json
[05:22:18] (PS1) Marostegui: Revert "db2102: Disable notifications" [puppet] -
https://gerrit.wikimedia.org/r/1034169
[05:22:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2102.codfw.wmnet with reason: host reimage
[05:23:44] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:24:00] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:24:04] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:24:30] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv4: Connect - Telxius, AS12956/IPv6: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:31:46] RESOLVED: JobUnavailable: Reduced availability for job mysql-test in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:35:16] !log Deploy schema change on s7 (metawiki) eqiad dbmaint T365352
[05:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:23] T365352: Stop referencing rev_id as signed int in revtag table to counter revision id overflow in wikidatawiki - https://phabricator.wikimedia.org/T365352
[05:36:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T352010)', diff saved to https://phabricator.wikimedia.org/P62741 and previous config saved to /var/cache/conftool/dbconfig/20240521-053602-ladsgroup.json
[05:36:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[05:36:16] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:36:20] !log ladsgroup@cumin1002
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1213.eqiad.wmnet with reason: Maintenance
[05:36:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1213 (T352010)', diff saved to https://phabricator.wikimedia.org/P62742 and previous config saved to /var/cache/conftool/dbconfig/20240521-053627-ladsgroup.json
[05:37:12] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:37:44] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:38:32] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:38:51] (CR) Marostegui: [C:+2] Revert "db2102: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1034169 (owner: Marostegui)
[05:39:02] RECOVERY - BFD status on cr2-eqiad is OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:40:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2102.codfw.wmnet with OS bookworm
[05:42:13] (PS1) Marostegui: db2102: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1034215
[05:43:19] SRE, CAS-SSO, Infrastructure-Foundations, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Correct IDP login page Privacy Policy - https://phabricator.wikimedia.org/T350129#9815067 (Pppery)
[05:44:21] (CR) Marostegui: [C:+2] db2102: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1034215 (owner: Marostegui)
[05:55:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1182 T361543', diff saved to https://phabricator.wikimedia.org/P62743 and previous config saved to /var/cache/conftool/dbconfig/20240521-055501-root.json
[05:55:06]
T361543: Upgrade s2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T361543
[05:55:57] (PS1) Marostegui: db1182: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1034216
[05:56:20] (CR) Marostegui: [C:+2] db1182: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1034216 (owner: Marostegui)
[05:56:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1182.eqiad.wmnet with OS bookworm
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240521T0600)
[06:00:05] kormat, marostegui, Amir1, and arnaudb: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240521T0600).
[06:06:46] FIRING: HelmReleaseBadStatus: Helm release datasets-config-next/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datasets-config-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[06:10:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1182.eqiad.wmnet with reason: host reimage
[06:13:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1182.eqiad.wmnet with reason: host reimage
[06:15:13] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2136.codfw.wmnet
[06:20:10] (PS1) Muehlenhoff: Switch db2136 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1034365 (https://phabricator.wikimedia.org/T349619)
[06:20:36] (PS1) Marostegui: Revert "db1182: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1034170
[06:22:45] SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, DBA,
Patch-For-Review, and 2 others: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9815102 (Marostegui) One more test in db1191: ` cumin2024@db1191.eqiad.wmnet...
[06:22:56] SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, DBA, Patch-For-Review, and 2 others: Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9815103 (Marostegui)
[06:26:02] ops-codfw, DC-Ops: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T365423 (phaultfinder) NEW
[06:26:57] (CR) Muehlenhoff: [C:+2] Switch db2136 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1034365 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[06:29:29] (PS1) Muehlenhoff: Remove obsolete Druid cert [puppet] - https://gerrit.wikimedia.org/r/1034366
[06:31:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62744 and previous config saved to /var/cache/conftool/dbconfig/20240521-063109-root.json
[06:31:23] (CR) Marostegui: [C:+2] Revert "db1182: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1034170 (owner: Marostegui)
[06:31:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2136.codfw.wmnet
[06:33:12] (PS1) Marostegui: db1182: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1034367
[06:33:33] (CR) Marostegui: [C:+2] db1182: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1034367 (owner: Marostegui)
[06:33:55] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2137.codfw.wmnet
[06:34:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1182.eqiad.wmnet with OS bookworm
[06:35:44] (PS1) Muehlenhoff: Switch db2137
to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1034368 (https://phabricator.wikimedia.org/T349619)
[06:36:18] !log installing nghttp2 security updates
[06:36:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:04] (CR) Muehlenhoff: [C:+2] Switch db2137 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1034368 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[06:42:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2137.codfw.wmnet
[06:44:21] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 398203
[06:44:47] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398203
[06:46:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62745 and previous config saved to /var/cache/conftool/dbconfig/20240521-064615-root.json
[06:46:45] (PS1) Marostegui: db1208: Add owner comments [puppet] - https://gerrit.wikimedia.org/r/1034369
[06:47:08] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2140.codfw.wmnet
[06:48:26] (PS1) Muehlenhoff: Switch db2140 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1034370 (https://phabricator.wikimedia.org/T349619)
[06:48:47] ops-eqiad, SRE, DC-Ops, Infrastructure-Foundations, netops: partial power outage for lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365289#9815149 (ayounsi)
[06:50:10] (CR) Marostegui: [C:+2] db1208: Add owner comments [puppet] - https://gerrit.wikimedia.org/r/1034369 (owner: Marostegui)
[06:51:53] (CR) Slyngshede: [C:+2] P:idm Use account login page for monitoring.
[puppet] - https://gerrit.wikimedia.org/r/1032789 (owner: Slyngshede)
[06:52:04] !log installing postgresql-11 security updates
[06:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:29] (CR) Muehlenhoff: [C:+2] Switch db2140 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1034370 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[06:53:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1237 T358642', diff saved to https://phabricator.wikimedia.org/P62746 and previous config saved to /var/cache/conftool/dbconfig/20240521-065318-marostegui.json
[06:53:24] T358642: Upgrade x1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358642
[06:53:58] (PS1) Marostegui: db1237: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1034371
[06:54:24] (CR) Marostegui: [C:+2] db1237: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1034371 (owner: Marostegui)
[06:54:44] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1237.eqiad.wmnet with OS bookworm
[06:56:51] ops-codfw, SRE, DC-Ops: Duplicate IP on mgmt network - https://phabricator.wikimedia.org/T365423#9815169 (phaultfinder)
[07:00:05] Amir1 and Urbanecm: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240521T0700).
[07:00:05] matthiasmullie, kart_, and bawolff: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:25] \o/
[07:00:49] * kart_ is here
[07:00:50] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 8075
[07:00:58] matthiasmullie: Please go ahead
[07:00:59] RECOVERY - BGP status on cr2-esams is OK: BGP OK - up: 25, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:01:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62747 and previous config saved to /var/cache/conftool/dbconfig/20240521-070121-root.json
[07:01:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2140.codfw.wmnet
[07:03:34] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2147.codfw.wmnet
[07:04:57] (PS1) Muehlenhoff: Switch db2147 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1034372 (https://phabricator.wikimedia.org/T349619)
[07:05:47] (CR) Muehlenhoff: [C:+2] Switch db2147 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1034372 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:08:05] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 8075
[07:09:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1237.eqiad.wmnet with reason: host reimage
[07:09:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2147.codfw.wmnet
[07:09:57] matthiasmullie: around?
[07:11:59] (CR) Slyngshede: [C:+2] P:ganeti Prometheus monitoring of ganeti noded services. [puppet] - https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[07:12:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1237.eqiad.wmnet with reason: host reimage
[07:12:43] Let me go ahead with my config patch.
[07:13:30] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1032793 (https://phabricator.wikimedia.org/T361597) (owner: KartikMistry)
[07:13:45] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 11170
[07:14:01] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 11170
[07:14:09] (Merged) jenkins-bot: Fix the mobile experience for a second group of Wikipedias where CX is in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/1032793 (https://phabricator.wikimedia.org/T361597) (owner: KartikMistry)
[07:14:49] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2172.codfw.wmnet
[07:15:06] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1032793|Fix the mobile experience for a second group of Wikipedias where CX is in beta (T361597)]]
[07:15:11] T361597: Fix the mobile experience for a second group of Wikipedias where Content Translation is in beta - https://phabricator.wikimedia.org/T361597
[07:16:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62748 and previous config saved to /var/cache/conftool/dbconfig/20240521-071627-root.json
[07:16:46] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:17:56] !log kartik@deploy1002 kartik: Backport for [[gerrit:1032793|Fix the mobile experience for a second group of Wikipedias where CX is in beta (T361597)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:19:47] o/
[07:19:53] (sorry I'm late!)
[07:20:58] kart_ & @bawolff, can you ping me when you're done with your deployments?
[07:21:47] !log kartik@deploy1002 kartik: Continuing with sync
[07:21:56] matthiasmullie: sure
[07:22:01] matthiasmullie: I haven't started yet (Also I'm not a deployer, so I'm hoping that either Amir1 or Urbanecm will help me deploy my patch)
[07:22:31] i can't deploy today, unfortunately :/
[07:23:57] I also have to go to lunch outside after my deployment..
[07:24:07] no worries, if someone else can, that would be great, but its also not an urgent patch so no big deal if not
[07:24:12] RECOVERY - Router interfaces on cr2-magru is OK: OK: host 195.200.68.129, interfaces up: 47, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:25:47] (PS1) Marostegui: Revert "db1237: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1034171
[07:26:42] bawolff: I can help you deploy that patch if needed, although perhaps Amir1 would be in a better position given he was already involved in that ticket earlier on?
[07:27:14] PROBLEM - Router interfaces on cr2-magru is CRITICAL: CRITICAL: host 195.200.68.129, interfaces up: 48, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:27:22] That would be great if you could deploy it
[07:28:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P62749 and previous config saved to /var/cache/conftool/dbconfig/20240521-072817-root.json
[07:28:25] (CR) Marostegui: [C:+2] Revert "db1237: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1034171 (owner: Marostegui)
[07:28:51] bawolff Sure. Do you mind if I do mine first, though?
Those are quite urgent
[07:28:59] sure, go ahead
[07:29:15] RECOVERY - Router interfaces on cr2-magru is OK: OK: host 195.200.68.129, interfaces up: 49, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:30:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1237.eqiad.wmnet with OS bookworm
[07:30:19] (PS1) Marostegui: db1237: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1034420
[07:30:59] (CR) Marostegui: [C:+2] db1237: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1034420 (owner: Marostegui)
[07:31:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62750 and previous config saved to /var/cache/conftool/dbconfig/20240521-073133-root.json
[07:32:05] (PS1) Muehlenhoff: Switch db2172 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1034423 (https://phabricator.wikimedia.org/T349619)
[07:33:56] RESOLVED: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:34:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T352010)', diff saved to https://phabricator.wikimedia.org/P62751 and previous config saved to /var/cache/conftool/dbconfig/20240521-073407-ladsgroup.json
[07:34:12] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[07:34:50] (CR) Muehlenhoff: [C:+2] Switch db2172 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1034423 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[07:35:25] !log kartik@deploy1002 Finished scap: Backport for
[[gerrit:1032793|Fix the mobile experience for a second group of Wikipedias where CX is in beta (T361597)]] (duration: 20m 18s) [07:35:29] T361597: Fix the mobile experience for a second group of Wikipedias where Content Translation is in beta - https://phabricator.wikimedia.org/T361597 [07:36:11] matthiasmullie: bawolff: I'm done. [07:36:40] @kart_ thanks, I'm moving forward [07:36:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy1002 using scap backport" [extensions/UploadWizard] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032823 (https://phabricator.wikimedia.org/T365107) (owner: 10Matthias Mullie) [07:37:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1221', diff saved to https://phabricator.wikimedia.org/P62752 and previous config saved to /var/cache/conftool/dbconfig/20240521-073727-marostegui.json [07:38:13] (03PS1) 10Marostegui: db1221: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1034426 [07:38:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2172.codfw.wmnet [07:39:21] (03CR) 10Marostegui: [C:03+2] db1221: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1034426 (owner: 10Marostegui) [07:39:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1221.eqiad.wmnet with OS bookworm [07:40:01] !log installing python 3.7 security updates [07:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:43] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2179.codfw.wmnet [07:40:51] (03PS56) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [07:41:41] (03PS1) 10Muehlenhoff: Switch db2179 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034427 (https://phabricator.wikimedia.org/T349619) [07:42:31] (03PS5) 10Brouberol: [WIP] Deploy the 
ceph-csi-rbd chart to dse-k8s with default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028938 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [07:43:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P62753 and previous config saved to /var/cache/conftool/dbconfig/20240521-074323-root.json [07:44:19] (03CR) 10Hashar: [C:04-1] "I'd keep it, it notifies `#wikimedia-releng` and our emails. The httpbb tests are also affected by Apache 2 and we can use the more detail" [puppet] - 10https://gerrit.wikimedia.org/r/1032526 (owner: 10Dzahn) [07:44:30] (03CR) 10Muehlenhoff: [C:03+2] Switch db2179 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034427 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:45:06] (03CR) 10Matthias Mullie: [C:03+2] Remove complicated synchronization of caption/description inputs [extensions/UploadWizard] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032824 (https://phabricator.wikimedia.org/T365119) (owner: 10Matthias Mullie) [07:46:40] (03Merged) 10jenkins-bot: Fix automatic numbering of copied titles [extensions/UploadWizard] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032823 (https://phabricator.wikimedia.org/T365107) (owner: 10Matthias Mullie) [07:46:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62754 and previous config saved to /var/cache/conftool/dbconfig/20240521-074639-root.json [07:47:10] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:1032823|Fix automatic numbering of copied titles (T365107)]] [07:47:14] T365107: UploadWizard gives the same automatic number when uploading medias with same name - https://phabricator.wikimedia.org/T365107 [07:49:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to 
https://phabricator.wikimedia.org/P62755 and previous config saved to /var/cache/conftool/dbconfig/20240521-074914-ladsgroup.json [07:49:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2179.codfw.wmnet [07:49:28] !log disable puppet on all mediawiki hardware hosts - T345740 [07:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:32] T345740: Sunset onhost memcached on mediawiki servers and puppet - https://phabricator.wikimedia.org/T345740 [07:49:50] !log mlitn@deploy1002 mlitn: Backport for [[gerrit:1032823|Fix automatic numbering of copied titles (T365107)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:50:31] !log mlitn@deploy1002 mlitn: Continuing with sync [07:51:08] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on es2022 - https://phabricator.wikimedia.org/T365213#9815367 (10ABran-WMF) @Jhancock.wm this would be OK as long as the disks have the same speed [07:51:31] !log installing nginx security updates [07:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:16] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2206.codfw.wmnet [07:52:48] (03CR) 10Effie Mouzeli: [C:03+2] memcached/mcrouter: remove onhost memcached [puppet] - 10https://gerrit.wikimedia.org/r/1020191 (https://phabricator.wikimedia.org/T345740) (owner: 10Effie Mouzeli) [07:53:24] (03Merged) 10jenkins-bot: Remove complicated synchronization of caption/description inputs [extensions/UploadWizard] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032824 (https://phabricator.wikimedia.org/T365119) (owner: 10Matthias Mullie) [07:54:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1221.eqiad.wmnet with reason: host reimage [07:54:40] (03PS1) 10Muehlenhoff: Switch db2206 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034428 (https://phabricator.wikimedia.org/T349619) [07:56:39] 
!log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1221.eqiad.wmnet with reason: host reimage [07:58:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P62756 and previous config saved to /var/cache/conftool/dbconfig/20240521-075830-root.json [08:00:04] andre and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240521T0800) [08:00:10] I'm NOT promoting group0 wikis to 1.43.0-wmf.6 as we are blocked on merging+backporting the CA patch in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1034142 [08:00:11] (03CR) 10Muehlenhoff: [C:03+2] Switch db2206 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034428 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:01:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62757 and previous config saved to /var/cache/conftool/dbconfig/20240521-080145-root.json [08:03:05] andre: will backport that shortly. [08:03:12] tgr: Thanks! 
[08:04:13] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:1032823|Fix automatic numbering of copied titles (T365107)]] (duration: 17m 02s) [08:04:19] T365107: UploadWizard gives the same automatic number when uploading medias with same name - https://phabricator.wikimedia.org/T365107 [08:04:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213', diff saved to https://phabricator.wikimedia.org/P62758 and previous config saved to /var/cache/conftool/dbconfig/20240521-080422-ladsgroup.json [08:04:47] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:1032824|Remove complicated synchronization of caption/description inputs (T365119)]] [08:04:51] T365119: UploadWizard doesn't allow to copy descriptions to other medias - https://phabricator.wikimedia.org/T365119 [08:05:19] (03PS1) 10Marostegui: Revert "db1221: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1034172 [08:05:57] (03CR) 10Brouberol: [C:03+1] an-druid: Switch the Zookeeper firewall settings to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1032776 (owner: 10Muehlenhoff) [08:06:19] (03CR) 10Brouberol: [C:03+1] an-conf: Switch the Zookeeper firewall settings to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1032778 (owner: 10Muehlenhoff) [08:07:25] !log mlitn@deploy1002 mlitn: Backport for [[gerrit:1032824|Remove complicated synchronization of caption/description inputs (T365119)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:09:07] !log mlitn@deploy1002 mlitn: Continuing with sync [08:09:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2206.codfw.wmnet [08:10:03] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9815439 (10cmooney) @Jhancock.wm @Papaul I'd been using the server in b7 for testing already, but I 
should be able to move over to... [08:11:09] (03PS1) 10Gergő Tisza: Temporarily restore $wgCentralAuthDatabase [extensions/CentralAuth] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034173 (https://phabricator.wikimedia.org/T348486) [08:13:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P62759 and previous config saved to /var/cache/conftool/dbconfig/20240521-081336-root.json [08:14:59] !log enable puppet on mediawiki codfw servers [08:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:30] (03PS2) 10Brian Wolff: Allow async (job queue based) chunked upload on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032888 (https://phabricator.wikimedia.org/T364644) [08:16:47] (03CR) 10Marostegui: [C:03+2] Revert "db1221: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1034172 (owner: 10Marostegui) [08:17:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1221 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62760 and previous config saved to /var/cache/conftool/dbconfig/20240521-081706-root.json [08:18:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1221.eqiad.wmnet with OS bookworm [08:18:39] (03PS2) 10JMeybohm: Remove role etcd::v3::kubernetes::staging [puppet] - 10https://gerrit.wikimedia.org/r/1034193 (https://phabricator.wikimedia.org/T363307) (owner: 10RLazarus) [08:18:48] (03CR) 10CI reject: [V:04-1] Temporarily restore $wgCentralAuthDatabase [extensions/CentralAuth] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034173 (https://phabricator.wikimedia.org/T348486) (owner: 10Gergő Tisza) [08:19:03] @bawolff still around? 
mine are just about done, so I can almost move forward with your config patch [08:19:13] matthiasmullie: yep, im still here [08:19:20] (03CR) 10JMeybohm: [C:03+1] "Missed that one, thanks. Hijacked this CR to completely remove the puppet role" [puppet] - 10https://gerrit.wikimedia.org/r/1034193 (https://phabricator.wikimedia.org/T363307) (owner: 10RLazarus) [08:19:30] matthiasmullie: please ping me when done, need to deploy a fix for a train blocker. [08:19:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1213 (T352010)', diff saved to https://phabricator.wikimedia.org/P62761 and previous config saved to /var/cache/conftool/dbconfig/20240521-081930-ladsgroup.json [08:19:33] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [08:19:35] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [08:19:47] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1216.eqiad.wmnet with reason: Maintenance [08:19:47] (and also one a for a CI break, apparently.) [08:20:27] tgr|away: will do! [08:20:31] (03PS1) 10Gergő Tisza: Mock Session::getUser in ::testUserScriptsDisabled [extensions/CentralAuth] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034174 (https://phabricator.wikimedia.org/T365403) [08:22:27] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:1032824|Remove complicated synchronization of caption/description inputs (T365119)]] (duration: 17m 40s) [08:22:32] T365119: UploadWizard doesn't allow to copy descriptions to other medias - https://phabricator.wikimedia.org/T365119 [08:22:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032888 (https://phabricator.wikimedia.org/T364644) (owner: 10Brian Wolff) [08:22:35] (03CR) 10Muehlenhoff: "Looks good, one nit inline." 
[puppet] - 10https://gerrit.wikimedia.org/r/1034193 (https://phabricator.wikimedia.org/T363307) (owner: 10RLazarus) [08:23:12] (03CR) 10Gergő Tisza: [C:03+2] Mock Session::getUser in ::testUserScriptsDisabled [extensions/CentralAuth] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034174 (https://phabricator.wikimedia.org/T365403) (owner: 10Gergő Tisza) [08:23:16] (03Merged) 10jenkins-bot: Allow async (job queue based) chunked upload on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032888 (https://phabricator.wikimedia.org/T364644) (owner: 10Brian Wolff) [08:23:46] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:1032888|Allow async (job queue based) chunked upload on all wikis (T364644)]] [08:23:50] T364644: Set $wgEnableAsyncUploads = true on all wikis - https://phabricator.wikimedia.org/T364644 [08:25:00] (03PS3) 10JMeybohm: Remove role etcd::v3::kubernetes::staging [puppet] - 10https://gerrit.wikimedia.org/r/1034193 (https://phabricator.wikimedia.org/T363307) (owner: 10RLazarus) [08:25:37] (03CR) 10JMeybohm: [C:03+1] Remove role etcd::v3::kubernetes::staging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1034193 (https://phabricator.wikimedia.org/T363307) (owner: 10RLazarus) [08:26:26] !log mlitn@deploy1002 mlitn and bawolff: Backport for [[gerrit:1032888|Allow async (job queue based) chunked upload on all wikis (T364644)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:26:35] @bawolff changes are on mwdebug - please test [08:27:32] matthiasmullie: I tested as good as i can. 
You can really only tell the difference if uploading a multi-gb file [08:27:44] ok, moving forward [08:27:50] !log mlitn@deploy1002 mlitn and bawolff: Continuing with sync [08:28:08] PROBLEM - Check whether ferm is active by checking the default input chain on parse1013 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:28:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1237 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P62762 and previous config saved to /var/cache/conftool/dbconfig/20240521-082842-root.json [08:29:31] Is anyone backporting right now, or plans to do so? [08:29:41] matthiasmullie / tgr / etc ^ [08:30:09] @andre last patch is syncing [08:30:36] and looks like @tgr|away had 2? [08:30:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1167 for a schema change', diff saved to https://phabricator.wikimedia.org/P62763 and previous config saved to /var/cache/conftool/dbconfig/20240521-083053-root.json [08:31:49] matthiasmullie / tgr: Please let us know when you are done with backporting so we can deploy wmf.6 to group0 afterwards. Thanks a lot! 
[08:32:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1221 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62764 and previous config saved to /var/cache/conftool/dbconfig/20240521-083212-root.json [08:32:16] (03Merged) 10jenkins-bot: Mock Session::getUser in ::testUserScriptsDisabled [extensions/CentralAuth] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034174 (https://phabricator.wikimedia.org/T365403) (owner: 10Gergő Tisza) [08:32:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Long schema change [08:32:34] (03CR) 10Btullis: [WIP] Deploy the ceph-csi-rbd chart to dse-k8s with default values (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028938 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [08:32:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Long schema change [08:32:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Long schema change [08:33:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Long schema change [08:34:38] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [08:34:39] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97) [08:34:46] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [08:35:17] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1034193 (https://phabricator.wikimedia.org/T363307) (owner: 10RLazarus) [08:35:41] (03CR) 10Gergő Tisza: [C:03+2] Temporarily restore $wgCentralAuthDatabase [extensions/CentralAuth] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034173 
(https://phabricator.wikimedia.org/T348486) (owner: 10Gergő Tisza) [08:35:50] !log Deploy schema change on s8 eqiad, this will cause a few hours of replication lag in s8 clouddb replicas T364299 [08:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:55] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [08:36:46] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:37:13] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns for sretest2002 - cmooney@cumin1002" [08:37:38] !log enable puppet on all mw* baremetal hosts [08:37:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:00] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2210.codfw.wmnet [08:38:24] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add dns for sretest2002 - cmooney@cumin1002" [08:38:24] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:40:09] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache sretest2002.wikimedia.org on all recursors [08:40:13] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest2002.wikimedia.org on all recursors [08:41:00] (03PS1) 10Fabfur: benthos:cache: catch missing host header [puppet] - 10https://gerrit.wikimedia.org/r/1034431 (https://phabricator.wikimedia.org/T365441) [08:41:19] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:1032888|Allow async (job queue based) chunked upload on all wikis (T364644)]] (duration: 17m 32s) [08:41:23] @bawolff deployment is complete [08:41:26] 
!log UTC morning backports done [08:41:26] T364644: Set $wgEnableAsyncUploads = true on all wikis - https://phabricator.wikimedia.org/T364644 [08:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:32] matthiasmullie: Thanks. Appreciate it :) [08:41:36] @tgr|away I'm done, the floor is yours - please ping @andre when you're done [08:42:01] (03CR) 10JMeybohm: [C:03+2] Remove role etcd::v3::kubernetes::staging [puppet] - 10https://gerrit.wikimedia.org/r/1034193 (https://phabricator.wikimedia.org/T363307) (owner: 10RLazarus) [08:42:13] (03PS1) 10Muehlenhoff: Switch db2210 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034432 (https://phabricator.wikimedia.org/T349619) [08:43:23] (03CR) 10Muehlenhoff: [C:03+2] Switch db2210 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034432 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:43:23] (03Merged) 10jenkins-bot: Temporarily restore $wgCentralAuthDatabase [extensions/CentralAuth] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034173 (https://phabricator.wikimedia.org/T348486) (owner: 10Gergő Tisza) [08:43:51] !log installing ghostscript security updates [08:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:45] (03CR) 10JMeybohm: [C:04-1] pki: add temporary profile for prometheus + k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1034048 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [08:47:06] tgr patch already got merged [08:47:12] but I guess is pending deployment [08:47:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1221 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62765 and previous config saved to /var/cache/conftool/dbconfig/20240521-084718-root.json [08:47:30] !log tgr@deploy1002 Started scap: Backport for [[gerrit:1034173|Temporarily restore $wgCentralAuthDatabase (T348486)]] [08:47:35] T348486: Migrate 
CentralAuth to use a virtual database domain - https://phabricator.wikimedia.org/T348486 [08:48:44] !log installing edk2 security updates [08:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:47] tgr|away: I am wondering whether your patch might already have been deployed by mathias [08:49:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2210.codfw.wmnet [08:49:12] given it was already merged [08:49:15] well we will see [08:50:28] !log tgr@deploy1002 tgr: Backport for [[gerrit:1034173|Temporarily restore $wgCentralAuthDatabase (T348486)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:50:32] (03PS1) 10GergesShamon: arwiki: Disable Extension:ContentTranslation for non-autoreview users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034433 [08:51:29] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db2219.codfw.wmnet [08:51:43] (03CR) 10JMeybohm: [C:03+1] prometheus: use 'prometheus' profile for k8s certs [puppet] - 10https://gerrit.wikimedia.org/r/1034050 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [08:51:48] !log tgr@deploy1002 tgr: Continuing with sync [08:51:56] (03CR) 10Klausman: [C:03+1] Skip ROCm packages for ml-staging2001 [puppet] - 10https://gerrit.wikimedia.org/r/1032765 (https://phabricator.wikimedia.org/T363191) (owner: 10Elukey) [08:52:23] (03PS1) 10Muehlenhoff: Switch db2219 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034434 (https://phabricator.wikimedia.org/T349619) [08:52:27] (03PS2) 10GergesShamon: arwiki: Disable Extension:ContentTranslation for non-autoreview users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034433 (https://phabricator.wikimedia.org/T255022) [08:55:40] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudnet1005.eqiad.wmnet with OS bookworm [08:55:56] (03CR) 10Muehlenhoff: [C:03+2] Switch db2219 to Puppet 7 [puppet] - 
10https://gerrit.wikimedia.org/r/1034434 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:56:16] (03PS7) 10Majavah: site: Move cloudnet1005 to insetup_noferm to prep for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1032390 (https://phabricator.wikimedia.org/T364459) [08:56:16] (03PS8) 10Majavah: site: Move cloudnet1005 to OVS agent [puppet] - 10https://gerrit.wikimedia.org/r/1032391 (https://phabricator.wikimedia.org/T364459) [08:56:17] (03PS10) 10Majavah: site: Move cloudnet2006-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029498 (https://phabricator.wikimedia.org/T358761) [08:57:04] (03CR) 10Majavah: [C:03+2] site: Move cloudnet1005 to insetup_noferm to prep for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1032390 (https://phabricator.wikimedia.org/T364459) (owner: 10Majavah) [08:57:13] (03PS1) 10Marostegui: phabricator_instance.my.cnf.erb: Do not auto-start replication [puppet] - 10https://gerrit.wikimedia.org/r/1034436 [08:58:05] (03CR) 10JMeybohm: Add proxy_host setting to the S3 cache. 
(031 comment) [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1032482 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [08:58:08] RECOVERY - Check whether ferm is active by checking the default input chain on parse1013 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:00:20] (03CR) 10Marostegui: [C:03+2] phabricator_instance.my.cnf.erb: Do not auto-start replication [puppet] - 10https://gerrit.wikimedia.org/r/1034436 (owner: 10Marostegui) [09:01:35] (03PS1) 10Btullis: Keep the /srv volume on an-launcher1002 when reimaging [puppet] - 10https://gerrit.wikimedia.org/r/1034437 (https://phabricator.wikimedia.org/T332580) [09:02:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db2219.codfw.wmnet [09:02:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1221 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62766 and previous config saved to /var/cache/conftool/dbconfig/20240521-090224-root.json [09:04:39] (03CR) 10Brouberol: [C:03+1] "`" [puppet] - 10https://gerrit.wikimedia.org/r/1034437 (https://phabricator.wikimedia.org/T332580) (owner: 10Btullis) [09:05:16] !log tgr@deploy1002 Finished scap: Backport for [[gerrit:1034173|Temporarily restore $wgCentralAuthDatabase (T348486)]] (duration: 17m 45s) [09:05:21] T348486: Migrate CentralAuth to use a virtual database domain - https://phabricator.wikimedia.org/T348486 [09:05:27] andre: ^ [09:05:30] (03CR) 10Stevemunene: [C:03+1] Keep the /srv volume on an-launcher1002 when reimaging [puppet] - 10https://gerrit.wikimedia.org/r/1034437 (https://phabricator.wikimedia.org/T332580) (owner: 10Btullis) [09:05:32] sorry for the holdup! [09:06:15] tgr|away: No problem! Thanks for the backports! 
[09:06:17] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1160.eqiad.wmnet [09:06:32] [TRAIN] We'll probably start promoting group0 wikis to 1.43.0-wmf.6 in about 15min [09:09:27] !log UTC morning deploys done [09:09:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:04] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage [09:11:28] (03PS1) 10Marostegui: mariadb: Do not automatically start slave [puppet] - 10https://gerrit.wikimedia.org/r/1034439 [09:11:46] (03PS1) 10Muehlenhoff: Switch db1160 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034440 (https://phabricator.wikimedia.org/T349619) [09:13:07] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet1005.eqiad.wmnet with reason: host reimage [09:15:53] (03CR) 10Marostegui: [C:03+2] mariadb: Do not automatically start slave [puppet] - 10https://gerrit.wikimedia.org/r/1034439 (owner: 10Marostegui) [09:16:56] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-launcher1002.eqiad.wmnet with OS bullseye [09:17:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1221 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62767 and previous config saved to /var/cache/conftool/dbconfig/20240521-091732-root.json [09:18:53] (03CR) 10Joal: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1032715 (https://phabricator.wikimedia.org/T365223) (owner: 10Aqu) [09:20:21] 06SRE, 10LDAP-Access-Requests: Grant Access to nda for Ricki Jay - https://phabricator.wikimedia.org/T365138#9815711 (10RickiJay-WMDE) 05Resolved→03Open @Dzhan I'm still getting permissions errors trying to log into Grafana: {F54043901} [09:21:43] (03PS17) 10Effie Mouzeli: memcached: add memcached_user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) [09:21:55] (03CR) 10CI reject: [V:04-1] memcached: add memcached_user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) (owner: 10Effie Mouzeli) [09:22:41] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9815721 (10Ladsgroup) Honestly, I think this should be declined as this is a [[https://en.wikipedia.org/wiki/XY_problem|x/y proble... 
[09:24:41] (03PS1) 10TrainBranchBot: group0 wikis to 1.43.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034441 (https://phabricator.wikimedia.org/T361400) [09:24:43] (03CR) 10TrainBranchBot: [C:03+2] group0 wikis to 1.43.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034441 (https://phabricator.wikimedia.org/T361400) (owner: 10TrainBranchBot) [09:25:05] [TRAIN] Starting to promote group0 wikis to 1.43.0-wmf.6 now [09:25:14] (03CR) 10Muehlenhoff: [C:03+2] Switch db1160 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034440 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:25:19] (03PS18) 10Effie Mouzeli: memcached: add memcached_user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) [09:25:25] (03Merged) 10jenkins-bot: group0 wikis to 1.43.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034441 (https://phabricator.wikimedia.org/T361400) (owner: 10TrainBranchBot) [09:25:51] (03PS19) 10Effie Mouzeli: memcached: add memcached_user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) [09:28:26] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-launcher1002.eqiad.wmnet with reason: host reimage [09:29:15] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet1005.eqiad.wmnet with OS bookworm [09:31:15] (03PS2) 10Fabfur: benthos:cache: catch missing host header and delete meta field [puppet] - 10https://gerrit.wikimedia.org/r/1034431 (https://phabricator.wikimedia.org/T365441) [09:31:27] (03CR) 10Effie Mouzeli: [C:03+2] memcached: add memcached_user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) (owner: 10Effie Mouzeli) [09:31:57] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-launcher1002.eqiad.wmnet with reason: host reimage 
[09:32:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1221 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62768 and previous config saved to /var/cache/conftool/dbconfig/20240521-093238-root.json [09:33:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1160.eqiad.wmnet [09:33:31] !log decommissioning 6 appservers in advance of reimaging to k8s control nodes [09:33:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:04] !log hnowlan@cumin1002 START - Cookbook sre.hosts.decommission for hosts mw[2331,2361,2391].codfw.wmnet,mw[1372,1429,1436].eqiad.wmnet [09:36:19] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1190.eqiad.wmnet [09:37:58] (03PS1) 10Hnowlan: conftool: remove appservers before renaming [puppet] - 10https://gerrit.wikimedia.org/r/1034442 (https://phabricator.wikimedia.org/T353464) [09:38:35] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1034046 (https://phabricator.wikimedia.org/T284145) (owner: 10LSobanski) [09:38:37] (03CR) 10Alexandros Kosiaris: [C:03+1] conftool: remove appservers before renaming [puppet] - 10https://gerrit.wikimedia.org/r/1034442 (https://phabricator.wikimedia.org/T353464) (owner: 10Hnowlan) [09:39:03] (03CR) 10JMeybohm: [C:03+1] conftool: remove appservers before renaming [puppet] - 10https://gerrit.wikimedia.org/r/1034442 (https://phabricator.wikimedia.org/T353464) (owner: 10Hnowlan) [09:39:22] (03CR) 10Hnowlan: [C:03+2] conftool: remove appservers before renaming [puppet] - 10https://gerrit.wikimedia.org/r/1034442 (https://phabricator.wikimedia.org/T353464) (owner: 10Hnowlan) [09:39:39] (03PS1) 10Muehlenhoff: Switch db1190 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034443 (https://phabricator.wikimedia.org/T349619) [09:41:21] !log aklapper@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.43.0-wmf.6 refs 
T361400 [09:41:25] T361400: 1.43.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T361400 [09:42:38] (03PS3) 10Ayounsi: sre.hosts.rename: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 [09:43:06] (03CR) 10Ayounsi: "Thanks, addressing the comments, to be tested." [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 (owner: 10Ayounsi) [09:43:12] (03PS1) 10JMeybohm: Add wikikube-ctrl2001 to server SRV record for etcd [dns] - 10https://gerrit.wikimedia.org/r/1034444 (https://phabricator.wikimedia.org/T353464) [09:43:14] (03PS1) 10JMeybohm: Add wikikube-ctrl2002 to server SRV record for etcd [dns] - 10https://gerrit.wikimedia.org/r/1034445 (https://phabricator.wikimedia.org/T353464) [09:43:16] (03PS1) 10JMeybohm: Add wikikube-ctrl2003 to server SRV record for etcd [dns] - 10https://gerrit.wikimedia.org/r/1034446 (https://phabricator.wikimedia.org/T353464) [09:43:17] (03PS1) 10JMeybohm: Remove kubetcd200[4-6] from etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1034447 (https://phabricator.wikimedia.org/T353464) [09:44:10] (03CR) 10CI reject: [V:04-1] Remove kubetcd200[4-6] from etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1034447 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [09:44:11] (03CR) 10CI reject: [V:04-1] Add wikikube-ctrl2002 to server SRV record for etcd [dns] - 10https://gerrit.wikimedia.org/r/1034445 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [09:44:12] (03CR) 10CI reject: [V:04-1] Add wikikube-ctrl2001 to server SRV record for etcd [dns] - 10https://gerrit.wikimedia.org/r/1034444 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [09:44:17] (03CR) 10CI reject: [V:04-1] Add wikikube-ctrl2003 to server SRV record for etcd [dns] - 10https://gerrit.wikimedia.org/r/1034446 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [09:46:24] (03CR) 10Muehlenhoff: [C:03+2] Switch db1190 to Puppet 7 [puppet] - 
10https://gerrit.wikimedia.org/r/1034443 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [09:46:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9815821 (10akosiaris) I 've managed to fix some issues I had with my partman recipe and after some unsuccessful reimages I have succeeded in the following: -... [09:47:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1221 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62769 and previous config saved to /var/cache/conftool/dbconfig/20240521-094744-root.json [09:49:53] (03CR) 10Gmodena: [C:03+1] "Discussed async. LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1034431 (https://phabricator.wikimedia.org/T365441) (owner: 10Fabfur) [09:49:59] (03PS10) 10Effie Mouzeli: memcached: run as user memcache on mc-gp2003 [puppet] - 10https://gerrit.wikimedia.org/r/1032495 (https://phabricator.wikimedia.org/T273950) [09:50:29] (03CR) 10Volans: "As discussed in the meet" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [09:50:48] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9815839 (10akosiaris) As an update, I had to reimage these servers as I had messed up the original recipe. 
[09:50:54] (03CR) 10Jelto: [C:03+2] allow all images from docker-registry.tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/976355 (https://phabricator.wikimedia.org/T334512) (owner: 10Brennen Bearnes) [09:52:13] (03CR) 10Effie Mouzeli: [C:03+2] memcached: run as user memcache on mc-gp2003 [puppet] - 10https://gerrit.wikimedia.org/r/1032495 (https://phabricator.wikimedia.org/T273950) (owner: 10Effie Mouzeli) [09:53:15] FIRING: AppserversUnreachable: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [09:53:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1190.eqiad.wmnet [09:53:25] jouncebot: nowandnext [09:53:25] For the next 0 hour(s) and 6 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240521T0800) [09:53:25] In 0 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240521T1000) [09:53:56] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9815860 (10Jelto) [09:55:04] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [09:55:21] (03PS1) 10JMeybohm: Add new role: kubernetes::master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1034448 (https://phabricator.wikimedia.org/T353464) [09:55:23] (03PS1) 10JMeybohm: Add wikikube-ctrl200[1-3] as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1034449 (https://phabricator.wikimedia.org/T353464) [09:56:09] (03CR) 10Fabfur: [C:03+2] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1034431 (https://phabricator.wikimedia.org/T365441) (owner: 10Fabfur) [09:56:27] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1199.eqiad.wmnet [09:56:44] (03CR) 10JMeybohm: "This should include the change to site.pp, adding the new role to wikikube-ctrl2001" [puppet] - 10https://gerrit.wikimedia.org/r/1034449 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [09:57:21] !log installing mariadb-10.3 security updates (libs/tools as packaged in Debian, unrelated to wmf-db) [09:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:36] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2331,2361,2391].codfw.wmnet,mw[1372,1429,1436].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin1002" [09:58:20] (03PS1) 10Muehlenhoff: Switch db1199 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034450 (https://phabricator.wikimedia.org/T349619) [09:58:47] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: mw[2331,2361,2391].codfw.wmnet,mw[1372,1429,1436].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - hnowlan@cumin1002" [09:58:47] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:58:47] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts mw[2331,2361,2391].codfw.wmnet,mw[1372,1429,1436].eqiad.wmnet [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240521T1000) [10:00:24] andre: hi, are you done? Can I deploy a change? 
[10:00:40] (03CR) 10Muehlenhoff: [C:03+2] Switch db1199 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034450 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:00:47] Amir1: We might roll back because of https://phabricator.wikimedia.org/T365451 so please don't yet [10:01:12] noted, please let me know when I can deploy stuff [10:01:33] [TRAIN] Rolling back on group0 now [10:04:25] FIRING: SystemdUnitFailed: wmf_auto_restart_prometheus-memcached-exporter.service on mw2264:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:04:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1199.eqiad.wmnet [10:06:46] FIRING: HelmReleaseBadStatus: Helm release datasets-config-next/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=datasets-config-next - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:09:25] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-memcached-exporter.service on mw1427:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:18:10] (03PS1) 10Effie Mouzeli: mc2055: switch to memcache user [puppet] - 10https://gerrit.wikimedia.org/r/1034452 [10:18:47] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1238.eqiad.wmnet [10:19:25] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-memcached-exporter.service on mw1358:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:20:06] (03CR) 10Effie Mouzeli: [C:03+2] mc2055: switch to memcache user [puppet] - 10https://gerrit.wikimedia.org/r/1034452 (owner: 10Effie Mouzeli) [10:20:40] 10SRE-tools, 10Spicerack: Spicerack: allow cookbooks to abort execution from __init__ - https://phabricator.wikimedia.org/T365454 (10Volans) 03NEW p:05Triage→03Medium [10:20:49] !restart memcached on mc2055 [10:20:58] !log restart memcached on mc2055 [10:21:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:19] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Spicerack: don't IRC log start/stop of cookbook - https://phabricator.wikimedia.org/T324655#9815952 (10Volans) Some use case could be covered with this approach: T365454 [10:21:36] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance [10:21:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance [10:23:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [10:24:02] !log aklapper@deploy1002 rebuilt and synchronized wikiversions files: Revert "group0 wikis to 1.43.0-wmf.5" [10:24:53] * hashar andre earned "ROLLED BACK THE TRAIN! TCHOU TCHOU!" 
achievement [10:27:36] 10ops-codfw, 06DC-Ops: msw1-codfw links are connected to wrong ports - https://phabricator.wikimedia.org/T365455 (10cmooney) 03NEW p:05Triage→03Medium [10:30:31] (03PS1) 10Muehlenhoff: Switch db1238 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034453 (https://phabricator.wikimedia.org/T349619) [10:31:30] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [10:31:39] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw [10:31:44] (03PS1) 10Hashar: Revert "group0 wikis to 1.43.0-wmf.6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034454 (https://phabricator.wikimedia.org/T361400) [10:31:45] (03CR) 10Hashar: [C:03+2] Revert "group0 wikis to 1.43.0-wmf.6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034454 (https://phabricator.wikimedia.org/T361400) (owner: 10Hashar) [10:32:29] (03Merged) 10jenkins-bot: Revert "group0 wikis to 1.43.0-wmf.6" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034454 (https://phabricator.wikimedia.org/T361400) (owner: 10Hashar) [10:32:55] (03CR) 10JMeybohm: sre.hosts.rename: initial commit (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 (owner: 10Ayounsi) [10:33:47] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:34:56] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [10:36:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [10:37:10] !log joal@deploy1002 Started deploy [analytics/refinery@4d42877]: Deploy of Refinery after reimage of an-launcher1002 [analytics/refinery@4d42877e] [10:38:11] !log joal@deploy1002 Finished deploy [analytics/refinery@4d42877]: Deploy of Refinery after reimage of an-launcher1002 [analytics/refinery@4d42877e] (duration: 01m 01s) [10:38:14] Amir1: We finished rollback of group0 to wmf.5 now, you can go ahead and deploy a change [10:38:37] thank you! 
[10:38:40] Amir1: Please ping when you're finished :) [10:39:20] sure, thanks [10:41:16] * hashar lunches [10:41:17] (03CR) 10Muehlenhoff: [C:03+2] Switch db1238 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034453 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:41:24] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:41:28] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [10:41:32] (03CR) 10Volans: [C:03+1] "LGTM, thanks a lot for this followup patch!" [software/conftool] - 10https://gerrit.wikimedia.org/r/1034163 (https://phabricator.wikimedia.org/T365123) (owner: 10Scott French) [10:43:44] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [10:46:12] !log hnowlan@cumin1002 START - Cookbook sre.dns.netbox [10:46:16] (03PS1) 10Fabfur: haproxy:cache: discard requests w/o Host header [puppet] - 10https://gerrit.wikimedia.org/r/1034459 (https://phabricator.wikimedia.org/T365456) [10:49:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1238.eqiad.wmnet [10:49:38] !log hnowlan@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add renamed k8s ctrl nodes - hnowlan@cumin1002" [10:49:43] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2534/co" [puppet] - 10https://gerrit.wikimedia.org/r/1034459 (https://phabricator.wikimedia.org/T365456) (owner: 10Fabfur) [10:50:18] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad [10:50:29] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add renamed k8s ctrl nodes - hnowlan@cumin1002" [10:50:29] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.dns.netbox 
(exit_code=0) [10:50:31] (03CR) 10JMeybohm: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1034447 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [10:51:08] (03PS1) 10Fabfur: benthos:cache: removed trailing spaces [puppet] - 10https://gerrit.wikimedia.org/r/1034460 (https://phabricator.wikimedia.org/T358109) [10:51:27] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1033394 [10:51:29] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl2001 [10:51:33] andre: I have to go to a meeting now, feel free to take over [10:51:46] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-ctrl2001 [10:52:14] Amir1: Thanks, still waiting for one blocker backport here before trying to run the train again [10:54:25] FIRING: [5x] SystemdUnitFailed: wmf_auto_restart_prometheus-memcached-exporter.service on mw1358:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:55:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [10:55:19] (03CR) 10Fabfur: [C:03+2] benthos:cache: removed trailing spaces [puppet] - 10https://gerrit.wikimedia.org/r/1034460 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [10:56:06] (03CR) 10JMeybohm: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1034446 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [10:56:13] (03CR) 10JMeybohm: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1034445 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [10:56:18] (03CR) 10JMeybohm: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1034444 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) 
[10:57:13] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl2002 [10:57:42] !log hnowlan@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host wikikube-ctrl2002 [10:57:48] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1001 [10:58:42] (03CR) 10Effie Mouzeli: [C:03+1] Add new role: kubernetes::master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1034448 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [10:58:57] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1001 [10:59:13] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1002 [10:59:57] (03CR) 10JMeybohm: [C:03+2] Add new role: kubernetes::master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1034448 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [11:00:29] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1002 [11:00:35] (03CR) 10Muehlenhoff: [C:03+2] Pass Druid middle manager ports as port range [puppet] - 10https://gerrit.wikimedia.org/r/1032712 (owner: 10Muehlenhoff) [11:00:37] !log hnowlan@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1003 [11:02:07] !log hnowlan@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1003 [11:04:22] (03CR) 10Volans: "Nice addition! Few comments inline." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 (owner: 10Ayounsi) [11:04:37] (03CR) 10Majavah: [C:03+2] site: Move cloudnet1005 to OVS agent [puppet] - 10https://gerrit.wikimedia.org/r/1032391 (https://phabricator.wikimedia.org/T364459) (owner: 10Majavah) [11:04:53] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete wmflabs dummy certs [labs/private] - 10https://gerrit.wikimedia.org/r/1032713 (owner: 10Muehlenhoff) [11:05:12] (03CR) 10Effie Mouzeli: [C:03+1] Add wikikube-ctrl200[1-3] as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1034449 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [11:06:14] (03CR) 10Muehlenhoff: [C:03+2] Undeploy openldap prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1031875 (owner: 10Muehlenhoff) [11:06:16] (03CR) 10Effie Mouzeli: [C:03+1] Remove kubetcd200[4-6] from etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/1034447 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [11:07:19] (03PS2) 10Muehlenhoff: an-test-druid: Switch to use nftables instead of iptables [puppet] - 10https://gerrit.wikimedia.org/r/1032632 [11:09:25] FIRING: [9x] SystemdUnitFailed: wmf_auto_restart_prometheus-memcached-exporter.service on mw1358:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:10:03] !log taavi@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudnet1005 [11:10:27] (03CR) 10Effie Mouzeli: [C:03+1] Add wikikube-ctrl2003 to server SRV record for etcd [dns] - 10https://gerrit.wikimedia.org/r/1034446 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [11:10:44] !log taavi@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudnet1005 [11:10:44] (03CR) 10Effie Mouzeli: [C:03+1] Add wikikube-ctrl2001 to server SRV record for etcd [dns] - 
10https://gerrit.wikimedia.org/r/1034444 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [11:11:52] (03CR) 10Btullis: [C:03+2] Run Gobblin later to let time for Canary events [puppet] - 10https://gerrit.wikimedia.org/r/1032715 (https://phabricator.wikimedia.org/T365223) (owner: 10Aqu) [19:37:52] (03PS4) 10Scott French: services: add data-gateway service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) [19:38:20] (03PS3) 10Dzahn: add additional Amazon domainkey value to learn.wiki domain [dns] - 10https://gerrit.wikimedia.org/r/1034565 (https://phabricator.wikimedia.org/T365435) [19:38:40] (03PS1) 10BCornwall: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1034566 [19:39:14] (03Abandoned) 10BCornwall: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1034566 (owner: 10BCornwall) [19:39:33] (03CR) 10Scott French: [C:04-1] "-1'ing this while we sort out cassandra environment for staging and possibly decoupling from aqs-http-gateway." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [19:41:46] RECOVERY - Router interfaces on cr3-eqsin is OK: OK: host 103.102.166.131, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:45:59] (03PS1) 10BCornwall: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1034567 [19:46:56] (03Abandoned) 10BCornwall: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1034567 (owner: 10BCornwall) [19:47:22] !log reedy@deploy1002 Synchronized dblists-index.php: T365467 (duration: 15m 00s) [19:47:26] T365467: Move translate enabling to a dblist - https://phabricator.wikimedia.org/T365467 [19:48:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62798 and previous config saved to /var/cache/conftool/dbconfig/20240521-194856-root.json [19:51:43] (03PS1) 10BCornwall: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1034570 [19:51:46] (03PS1) 10BCornwall: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1034571 [19:52:05] (03CR) 10CI reject: [V:04-1] Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1034571 (owner: 10BCornwall) [19:54:05] (03PS6) 10Dzahn: base: add a firewall alias for the default docker network [puppet] - 10https://gerrit.wikimedia.org/r/1017367 [19:54:32] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, and 2 others: Degraded RAID on cloudcephosd1031 - https://phabricator.wikimedia.org/T364060#9818758 (10taavi) [19:58:11] I can do the backport window; I want to see the new scap feature in practice. 
[20:00:03] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001'] [20:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240521T2000). [20:00:04] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:17] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wikikube-ctrl1001'] [20:00:21] Jdlrobson: OK to proceed with them all in one go? [20:00:35] Sorry, not all, the two config ones together? [20:00:41] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001'] [20:00:49] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wikikube-ctrl1001'] [20:01:07] I'll do the wmf.5 ones together, at least. 
[20:01:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [extensions/MobileFrontend] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1034168 (owner: 10Jdlrobson) [20:01:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [skins/MinervaNeue] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032808 (https://phabricator.wikimedia.org/T109277) (owner: 10Jdlrobson) [20:01:50] (03PS2) 10Jdlrobson: Cleanup night mode exclude list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034545 (https://phabricator.wikimedia.org/T365084) [20:02:00] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001'] [20:02:17] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wikikube-ctrl1001'] [20:02:59] !log reedy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: T365467 (duration: 14m 56s) [20:03:03] T365467: Move translate enabling to a dblist - https://phabricator.wikimedia.org/T365467 [20:03:26] Reedy: Ouch. 14 mins?! [20:03:33] glhf [20:03:52] 2000+ k8s [20:04:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62799 and previous config saved to /var/cache/conftool/dbconfig/20240521-200402-root.json [20:04:12] * James_F sighs. [20:04:27] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9818815 (10Fuzzy) I've compiled a table detailing some of the largest legislative texts found on Hebrew Wikisource. Before drawing... [20:04:36] I'm here [20:04:57] @James_F yeh these can all go together except the last config change for Watchlist which is blocked on the 2 backports to feature branches [20:05:19] Hey Jdlrobson. 
Deploys are apparently slower today than ever, so… Let's see how it goes. [20:06:51] Hmm, maybe I'll abort this scap and do the first config one solo, given CI taking time too. [20:07:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034545 (https://phabricator.wikimedia.org/T365084) (owner: 10Jdlrobson) [20:07:29] Rather than waiting another 20 minutes to *start* scap. [20:07:41] (03Merged) 10jenkins-bot: Cleanup night mode exclude list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034545 (https://phabricator.wikimedia.org/T365084) (owner: 10Jdlrobson) [20:08:10] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1034545|Cleanup night mode exclude list (T365084)]] [20:08:14] T365084: Night mode exclude list doesn't appear to be working with various pages (including Special:AbuseLog or diff pages) - https://phabricator.wikimedia.org/T365084 [20:10:53] !log jforrester@deploy1002 jforrester and jdlrobson: Backport for [[gerrit:1034545|Cleanup night mode exclude list (T365084)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:11:07] Jdlrobson: Look OK? [20:12:57] James_F: give me a few more mins [20:13:05] long list :) [20:13:47] James_F: yep i think this is good to sync. [20:13:57] !log jforrester@deploy1002 jforrester and jdlrobson: Continuing with sync [20:14:01] Thanks! [20:19:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1167 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62800 and previous config saved to /var/cache/conftool/dbconfig/20240521-201910-root.json [20:20:05] FWIW, we just hit 50%. 
[20:20:27] (03PS3) 10Jdlrobson: Enable desktop watchlist HTML on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032833 (https://phabricator.wikimedia.org/T109277) [20:20:32] (03PS2) 10Jforrester: Don't define wmgUseListings, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029506 [20:21:57] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance [20:22:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2208.codfw.wmnet with reason: Maintenance [20:22:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2208 (T352010)', diff saved to https://phabricator.wikimedia.org/P62801 and previous config saved to /var/cache/conftool/dbconfig/20240521-202218-ladsgroup.json [20:22:22] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:23:56] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:24:33] (03Merged) 10jenkins-bot: Decouple MFUseDesktopSpecialWatchlistPage from EditWatchlist page [extensions/MobileFrontend] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1034168 (owner: 10Jdlrobson) [20:24:35] (03Merged) 10jenkins-bot: Drop responsive behaviour [skins/MinervaNeue] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032808 (https://phabricator.wikimedia.org/T109277) (owner: 10Jdlrobson) [20:24:42] Huh, I'm slightly astonished we have as many as 91 hosts when we're serving 85% of all traffic from k8s. [20:27:04] Commons was on metal until yesterday, so I think there is a backlog of metal hosts to convert to k8s exec nodes now. 
[20:27:37] (03PS1) 10TChin: datasets-config: Add volume for configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034581 (https://phabricator.wikimedia.org/T357434) [20:27:42] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1034545|Cleanup night mode exclude list (T365084)]] (duration: 19m 32s) [20:27:46] T365084: Night mode exclude list doesn't appear to be working with various pages (including Special:AbuseLog or diff pages) - https://phabricator.wikimedia.org/T365084 [20:27:49] Yeah, though I'm not sure how much MW traffic Commons actually gets (unlike upload.…). Anyway, 2004 k8s takes long enough. :-) [20:27:53] OK, time for the mega-one. [20:30:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032833 (https://phabricator.wikimedia.org/T109277) (owner: 10Jdlrobson) [20:30:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029506 (owner: 10Jforrester) [20:31:00] (03Merged) 10jenkins-bot: Enable desktop watchlist HTML on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032833 (https://phabricator.wikimedia.org/T109277) (owner: 10Jdlrobson) [20:31:52] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001'] [20:31:59] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wikikube-ctrl1001'] [20:32:06] (03PS3) 10Jforrester: Don't define wmgUseListings, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029506 [20:32:10] (03CR) 10Jforrester: [C:03+2] Don't define wmgUseListings, no longer read [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029506 (owner: 10Jforrester) [20:32:15] * James_F shakes fist at self. 
[20:32:46] (Merged) jenkins-bot: Don't define wmgUseListings, no longer read [mediawiki-config] - https://gerrit.wikimedia.org/r/1029506 (owner: Jforrester)
[20:32:47] (Abandoned) BCornwall: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1034571 (owner: BCornwall)
[20:32:55] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001']
[20:32:58] (Abandoned) BCornwall: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1034570 (owner: BCornwall)
[20:33:07] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wikikube-ctrl1001']
[20:33:26] !log jforrester@deploy1002 Started scap: Backport for [[gerrit:1032808|Drop responsive behaviour (T109277)]], [[gerrit:1034168|Decouple MFUseDesktopSpecialWatchlistPage from EditWatchlist page]], [[gerrit:1032833|Enable desktop watchlist HTML on mobile (T109277)]], [[gerrit:1029506|Don't define wmgUseListings, no longer read]]
[20:33:29] T109277: [EPIC]: Use core watchlist code for mobile experience - https://phabricator.wikimedia.org/T109277
[20:33:41] (PS2) Jforrester: Remove 'changetags' from default's user group, grant to +sysop and +bot [mediawiki-config] - https://gerrit.wikimedia.org/r/1013975 (https://phabricator.wikimedia.org/T355639)
[20:34:48] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001.mgmt.eqiad.wmnet']
[20:34:55] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wikikube-ctrl1001.mgmt.eqiad.wmnet']
[20:36:09] !log jforrester@deploy1002 jforrester and jdlrobson: Backport for [[gerrit:1032808|Drop responsive behaviour (T109277)]], [[gerrit:1034168|Decouple MFUseDesktopSpecialWatchlistPage from EditWatchlist page]], [[gerrit:1032833|Enable desktop watchlist HTML on mobile (T109277)]], [[gerrit:1029506|Don't define wmgUseListings, no longer read]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:36:46] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:37:16] Jdlrobson: OK to proceed?
[20:38:23] James_F: ..yep LGTM
[20:38:26] !log jforrester@deploy1002 jforrester and jdlrobson: Continuing with sync
[20:38:28] Yay.
[20:48:10] James_F: I'll get back to you on wikifunctions shortly. Just clearing out my brain stack :)
[20:48:17] Jdlrobson: No rush!
[20:48:42] (PS1) JHathaway: postfix: dkim sign subdomains of wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1034583 (https://phabricator.wikimedia.org/T365395)
[20:49:23] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1034583 (https://phabricator.wikimedia.org/T365395) (owner: JHathaway)
[20:49:59] (PS1) Jdlrobson: Always use desktop watchlist HTML on mobile [mediawiki-config] - https://gerrit.wikimedia.org/r/1034584 (https://phabricator.wikimedia.org/T109277)
[20:50:03] :)
[20:51:43] !log jforrester@deploy1002 Finished scap: Backport for [[gerrit:1032808|Drop responsive behaviour (T109277)]], [[gerrit:1034168|Decouple MFUseDesktopSpecialWatchlistPage from EditWatchlist page]], [[gerrit:1032833|Enable desktop watchlist HTML on mobile (T109277)]], [[gerrit:1029506|Don't define wmgUseListings, no longer read]] (duration: 18m 17s)
[20:51:47] T109277: [EPIC]: Use core watchlist code for mobile experience - https://phabricator.wikimedia.org/T109277
[20:51:48] Finally.
[20:51:56] OK, window closed.
[20:52:21] (CR) JHathaway: [C:+2] postfix: dkim sign subdomains of wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1034583 (https://phabricator.wikimedia.org/T365395) (owner: JHathaway)
[20:53:06] urandom: shall I merge your patch?
[20:53:20] in puppet, cassandra-dev: enable Commons Impact Metrics grants (81ddcb4e46)
[20:53:48] ops-eqiad, DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T365531 (phaultfinder) NEW
[20:55:25] RESOLVED: [2x] SystemdUnitFailed: prometheus-blazegraph-exporter-wdqs-categories.service on wdqs2023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:56:43] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001.mgmt.eqiad.wmnet']
[20:58:03] jhathaway: ha, just came here to ask the same. yes!
[20:59:01] urandom: great, done!
[21:06:18] (PS1) Eevans: cassandra-dev: add aqsloader role [puppet] - https://gerrit.wikimedia.org/r/1034589 (https://phabricator.wikimedia.org/T362697)
[21:06:28] (CR) Stoyofuku-wmf: [C:+1] "This looks great, thank you for your patience as I figured it out!" [mediawiki-config] - https://gerrit.wikimedia.org/r/1034480 (https://phabricator.wikimedia.org/T364347) (owner: Mabualruz)
[21:07:15] (CR) Eevans: [C:+2] cassandra-dev: add aqsloader role [puppet] - https://gerrit.wikimedia.org/r/1034589 (https://phabricator.wikimedia.org/T362697) (owner: Eevans)
[21:07:54] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['wikikube-ctrl1001.mgmt.eqiad.wmnet']
[21:08:30] !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1001.mgmt.eqiad.wmnet']
[21:08:55] SRE, Language-Technical Support, serviceops, Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9819103 (cscott) @Fuzzy you may be interested in {T254522}, which will eventually replace the limits in the legacy parser. Appr...
[21:09:10] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-ctrl1001.mgmt.eqiad.wmnet']
[21:13:25] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:13:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T352010)', diff saved to https://phabricator.wikimedia.org/P62802 and previous config saved to /var/cache/conftool/dbconfig/20240521-211335-ladsgroup.json
[21:13:39] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[21:20:17] SRE, collaboration-services, Infrastructure-Foundations, Mail, Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9819125 (jhathaway)
[21:25:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:26:00] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:26:26] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:28:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P62803 and previous config saved to /var/cache/conftool/dbconfig/20240521-212842-ladsgroup.json
[21:28:50] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51924 bytes in 0.205 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:29:18] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:29:24] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:32:56] (PS1) JHathaway: gerrit: Move outbound email to mx-out{1001,2001}.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1034593 (https://phabricator.wikimedia.org/T365395)
[21:33:02] SRE, Language-Technical Support, serviceops, Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9819156 (neriah) Agree with @Fuzzy's suggestion above. >>! In T275319#9813641, @Fuzzy wrote: > However, I suggested a straightfo...
[21:41:01] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate sessionstore.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:43:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P62804 and previous config saved to /var/cache/conftool/dbconfig/20240521-214352-ladsgroup.json
[21:44:57] FIRING: [2x] CertAlmostExpired: Certificate for service sessionstore:8081 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore:8081 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[21:45:22] (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1034593 (https://phabricator.wikimedia.org/T365395) (owner: JHathaway)
[21:59:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T352010)', diff saved to https://phabricator.wikimedia.org/P62805 and previous config saved to /var/cache/conftool/dbconfig/20240521-215900-ladsgroup.json
[21:59:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[21:59:05] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[21:59:16] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2171.codfw.wmnet with reason: Maintenance
[21:59:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2171 (T352010)', diff saved to https://phabricator.wikimedia.org/P62806 and previous config saved to /var/cache/conftool/dbconfig/20240521-215924-ladsgroup.json
[22:14:10] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[22:15:06] !log zabe@mwmaint1002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=eswiki --logwiki=metawiki '17420g' 'Ras I' # T365533
[22:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:15:13] T365533: Unblock stuck global rename of 17420g → Ras, Aurelio de Sandoval → Aurelio Sandoval, and QFTP2024 → Organic2024 - https://phabricator.wikimedia.org/T365533
[22:16:12] !log zabe@mwmaint1002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=ptwiki --logwiki=metawiki 'Aurelio de Sandoval' 'Aurelio Sandoval' # T365533
[22:16:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:10] !log zabe@mwmaint1002:~$ mwscript extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=ukwiki --logwiki=metawiki 'QFTP2024' 'Organic2024' # T365533
[22:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:26] SRE, Language-Technical Support, serviceops, Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9819230 (Fuzzy) >>! In T275319#9819103, @cscott wrote: > [...] Appropriate metrics are not easy to find, because ideally they mu...
[22:21:56] (PS1) Zabe: Stop writing to af_user(_text)/afh_user(_text) on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1034595 (https://phabricator.wikimedia.org/T337920)
[22:22:37] jouncebot: nowandnext
[22:22:37] No deployments scheduled for the next 7 hour(s) and 37 minute(s)
[22:22:37] In 7 hour(s) and 37 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240522T0600)
[22:23:18] (CR) Zabe: [C:+2] Stop writing to af_user(_text)/afh_user(_text) on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1034595 (https://phabricator.wikimedia.org/T337920) (owner: Zabe)
[22:23:54] (Merged) jenkins-bot: Stop writing to af_user(_text)/afh_user(_text) on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1034595 (https://phabricator.wikimedia.org/T337920) (owner: Zabe)
[22:24:33] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1034595|Stop writing to af_user(_text)/afh_user(_text) on test wikis (T337920)]]
[22:24:37] T337920: Stop writing to af_user(_text)/afh_user(_text) - https://phabricator.wikimedia.org/T337920
[22:27:13] !log zabe@deploy1002 zabe: Backport for [[gerrit:1034595|Stop writing to af_user(_text)/afh_user(_text) on test wikis (T337920)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:27:42] !log zabe@deploy1002 zabe: Continuing with sync
[22:29:36] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 226.61 ms
[22:38:36] (PS8) Zabe: Use encrypted Argon2 Hashes to store user passwords [mediawiki-config] - https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647)
[22:40:56] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1034595|Stop writing to af_user(_text)/afh_user(_text) on test wikis (T337920)]] (duration: 16m 23s)
[22:41:00] T337920: Stop writing to af_user(_text)/afh_user(_text) - https://phabricator.wikimedia.org/T337920
[22:55:37] (CR) Zabe: [C:+2] Use encrypted Argon2 Hashes to store user passwords [mediawiki-config] - https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) (owner: Zabe)
[22:56:16] (Merged) jenkins-bot: Use encrypted Argon2 Hashes to store user passwords [mediawiki-config] - https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) (owner: Zabe)
[22:56:51] !log zabe@deploy1002 Started scap: Backport for [[gerrit:1029183|Use encrypted Argon2 Hashes to store user passwords (T150647 T216682)]]
[22:56:57] T150647: Deploy EncryptedPassword to Wikimedia Sites - https://phabricator.wikimedia.org/T150647
[22:56:59] T216682: Switch WMF production to Argon2 password hashes - https://phabricator.wikimedia.org/T216682
[22:59:06] (PS5) Scott French: services: add data-gateway service [deployment-charts] - https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921)
[22:59:31] !log zabe@deploy1002 zabe: Backport for [[gerrit:1029183|Use encrypted Argon2 Hashes to store user passwords (T150647 T216682)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[23:10:24] !log zabe@deploy1002 zabe: Continuing with sync
[23:13:38] (PS1) Eevans: cassandra-dev: make client encryption optional [puppet] - https://gerrit.wikimedia.org/r/1034598 (https://phabricator.wikimedia.org/T362697)
[23:15:16] (CR) Eevans: [C:+2] cassandra-dev: make client encryption optional [puppet] - https://gerrit.wikimedia.org/r/1034598 (https://phabricator.wikimedia.org/T362697) (owner: Eevans)
[23:19:00] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for sg912 - https://phabricator.wikimedia.org/T365118#9819350 (Dzahn) a:KOfori→Dzahn
[23:19:18] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for sg912 - https://phabricator.wikimedia.org/T365118#9819351 (Dzahn)
[23:19:42] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: Apply change that makes encryption optional - eevans@cumin1002
[23:22:09] (PS2) Jdlrobson: Always use desktop watchlist HTML on mobile [mediawiki-config] - https://gerrit.wikimedia.org/r/1034584 (https://phabricator.wikimedia.org/T109277)
[23:22:23] (PS3) Scott French: DNM: services: add commons-impact-analytics service helmfile configs [deployment-charts] - https://gerrit.wikimedia.org/r/1023957 (https://phabricator.wikimedia.org/T361835)
[23:22:23] (PS3) Scott French: DNM: rest-gateway: route commons-analytics via rest-gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1023958 (https://phabricator.wikimedia.org/T361835)
[23:22:55] (PS1) Dzahn: admin: add user sg912 to cassandra-staging-devs [puppet] - https://gerrit.wikimedia.org/r/1034599 (https://phabricator.wikimedia.org/T365118)
[23:23:43] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:1029183|Use encrypted Argon2 Hashes to store user passwords (T150647 T216682)]] (duration: 26m 51s)
[23:23:49] T150647: Deploy EncryptedPassword to Wikimedia Sites - https://phabricator.wikimedia.org/T150647
[23:23:50] T216682: Switch WMF production to Argon2 password hashes - https://phabricator.wikimedia.org/T216682
[23:24:29] (CR) Jdlrobson: deploy(Popups): Make use of conditional user defaults (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1034480 (https://phabricator.wikimedia.org/T364347) (owner: Mabualruz)
[23:38:11] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1033401
[23:38:11] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1033401 (owner: TrainBranchBot)
[23:39:51] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: Apply change that makes encryption optional - eevans@cumin1002
[23:58:08] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1033401 (owner: TrainBranchBot)