[00:01:43] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncmonitor1001.eqiad.wmnet with reason: host reimage
[00:03:14] SRE-tools, Ganeti, Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (Dzahn) I did another wmf-reimage cookbook run on this host and the installation finished, including the grub install. I can't explain why it wouldn't work...
[00:03:51] (SystemdUnitFailed) resolved: dump_cloud_ip_ranges.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:51] SRE-tools, Ganeti, Infrastructure-Foundations: ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify") - https://phabricator.wikimedia.org/T357449 (Dzahn)
[00:04:21] SRE-tools, Ganeti, Infrastructure-Foundations: ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify") - https://phabricator.wikimedia.org/T357449 (Dzahn) Open→In progress
[00:04:36] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncmonitor1001.eqiad.wmnet with reason: host reimage
[00:11:06] (PS1) Dzahn: convert ncmonitor role to puppet7 [puppet] - https://gerrit.wikimedia.org/r/1003124 (https://phabricator.wikimedia.org/T349619)
[00:11:32] (CR) Dzahn: "mu" [puppet] - https://gerrit.wikimedia.org/r/1003124 (https://phabricator.wikimedia.org/T349619) (owner: Dzahn)
[00:14:12] (PS2) Dzahn: convert ncmonitor role to puppet7 [puppet] - https://gerrit.wikimedia.org/r/1003124 (https://phabricator.wikimedia.org/T349619)
[00:16:48] SRE-tools, Ganeti, Infrastructure-Foundations: ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify") - https://phabricator.wikimedia.org/T357449 (Dzahn) The host
should be usable now: ` [ncmonitor1001:~] $ uptime 00:15:21 up 1 min, 1 user, load average: 0.15, 0.04,...
[00:18:01] SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad: 1 VM request for ncmonitor - https://phabricator.wikimedia.org/T356710 (Dzahn) ` 23:55 <+logmsgbot> !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host ncmonitor1001.eqiad.wmnet with OS bookworm 00:01 <+logmsgbot> !log dza...
[00:19:42] SRE, ops-codfw, DC-Ops, Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350 (Jhancock.wm)
[00:20:05] (CR) Dzahn: [C: +2] phabricator: re-activate public dump job [puppet] - https://gerrit.wikimedia.org/r/1003070 (https://phabricator.wikimedia.org/T355502) (owner: Dzahn)
[00:36:24] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:39:10] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1003031
[00:39:14] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1003031 (owner: TrainBranchBot)
[00:43:36] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:44:44] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:01:13] (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1003031 (owner: TrainBranchBot)
[01:03:02] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:05:18] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:05:30] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:06:38] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:13:16] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:16:42] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:50:50] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:51:58] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:07:48] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[02:09:37] (CR) Ssingh: "Thanks for the patch. Will defer to Brett on this if he wants this to be Puppet 7 for all hosts or just ncredir1001." [puppet] - https://gerrit.wikimedia.org/r/1003124 (https://phabricator.wikimedia.org/T349619) (owner: Dzahn)
[02:25:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T352010)', diff saved to https://phabricator.wikimedia.org/P56735 and previous config saved to /var/cache/conftool/dbconfig/20240214-022544-ladsgroup.json
[02:25:50] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[02:38:49] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:40:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P56736 and previous config saved to /var/cache/conftool/dbconfig/20240214-024050-ladsgroup.json
[02:55:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P56737 and previous config saved to /var/cache/conftool/dbconfig/20240214-025557-ladsgroup.json
[03:11:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T352010)', diff saved to https://phabricator.wikimedia.org/P56738 and previous config saved to /var/cache/conftool/dbconfig/20240214-031103-ladsgroup.json
[03:11:06] !log ladsgroup@cumin1002
START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[03:11:08] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[03:11:19] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[03:11:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T352010)', diff saved to https://phabricator.wikimedia.org/P56739 and previous config saved to /var/cache/conftool/dbconfig/20240214-031125-ladsgroup.json
[03:13:49] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:51:45] (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:01:46] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 22 hours (Thu 15 Feb 2024 02:11:55 AM GMT +0000) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:05:16] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:21:45] (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:51:38] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 21 hours (Thu 15 Feb 2024 02:11:55 AM GMT +0000) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:52:48] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:27:36] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 20 hours (Thu 15 Feb 2024 02:11:55 AM GMT +0000) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:28:50] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:55:26] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 20 hours (Thu 15 Feb 2024 02:11:55 AM GMT +0000) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:56:36] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:08:04] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:18:02] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 0 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:19:12] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 60 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:21:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:22:47] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[06:22:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1035.eqiad.wmnet with OS bullseye
[06:22:54] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host restbase1035.eqiad.wmnet with OS bullseye completed: - restbase1035 (**PASS**) - D...
[06:23:27] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (Jclark-ctr)
[06:23:52] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (Jclark-ctr) Open→Resolved
[06:28:27] SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Jclark-ctr) @Andrew following up to see if this has been put back into service?
[06:39:03] SRE, ops-eqiad: Degraded RAID on an-worker1173 - https://phabricator.wikimedia.org/T357460 (Jclark-ctr) a:Jclark-ctr Submitted Request for replacement 4tb hdd
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T0700)
[07:01:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:02:12] PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 0 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:03:20] RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 60 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:07:09] (CR) Slyngshede: [C: +2] C:external_cloud_vendors add owner to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/1002985 (owner: Slyngshede)
[07:13:38] (PS1) Slyngshede: P:puppet::client_bucket remove monitoring.
[puppet] - https://gerrit.wikimedia.org/r/1003269 (https://phabricator.wikimedia.org/T350694)
[07:14:26] (Abandoned) Slyngshede: P:puppet::client_bucket Start moving monitoring to Prometheus [puppet] - https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[07:26:59] (PS1) Slyngshede: D:service::uwsgi Parse the Icinga disable flag to uwsgi:app. [puppet] - https://gerrit.wikimedia.org/r/1003359 (https://phabricator.wikimedia.org/T350694)
[07:30:51] (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1366/co" [puppet] - https://gerrit.wikimedia.org/r/1003359 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[07:32:02] (CR) Slyngshede: D:service::uwsgi Parse the Icinga disable flag to uwsgi:app. [puppet] - https://gerrit.wikimedia.org/r/1003359 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[07:32:42] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 18 hours (Thu 15 Feb 2024 02:11:55 AM GMT +0000) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:33:25] (PS2) Slyngshede: D:service::uwsgi Parse the Icinga disable flag to uwsgi:app. [puppet] - https://gerrit.wikimedia.org/r/1003359 (https://phabricator.wikimedia.org/T350694)
[07:33:50] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:34:59] (PS1) STran: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1003360
[07:35:24] (PS2) Kosta Harlan: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1003360 (https://phabricator.wikimedia.org/T356736) (owner: STran)
[07:36:17] (CR) Kosta Harlan: [C: +2] ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1003360 (https://phabricator.wikimedia.org/T356736) (owner: STran)
[07:37:10] (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1003360 (https://phabricator.wikimedia.org/T356736) (owner: STran)
[07:47:04] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[07:48:04] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[07:48:48] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[07:48:52] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[07:49:10] !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[07:50:25] !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[07:50:54] !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[07:51:50] !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[07:55:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T352010)', diff saved to https://phabricator.wikimedia.org/P56740 and previous config saved to /var/cache/conftool/dbconfig/20240214-075545-ladsgroup.json
[07:55:50] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[08:00:05] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:06:07] (CR) Arnaudb: [C: +2] mariadb: disable systematic wiping of /srv on db2194 [puppet] - https://gerrit.wikimedia.org/r/1002410 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[08:09:33] (PS1) Majavah: P:spicerack: support wmcs cookbook repo in test-cookbook [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069)
[08:10:42] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 18 hours (Thu 15 Feb 2024 02:11:55 AM GMT +0000) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:10:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P56741 and previous config saved to /var/cache/conftool/dbconfig/20240214-081051-ladsgroup.json
[08:11:32] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddumps1001.wikimedia.org
[08:11:50] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:12:50] !log restart apache2 on lists1001 to remove traces of old, soon-to-expire TLS certificate
[08:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:33] (CR) CI reject: [V: -1] P:spicerack: support wmcs cookbook repo in test-cookbook [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: Majavah)
[08:14:17] (PS2) Majavah: P:spicerack: support wmcs cookbook repo in test-cookbook [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069)
[08:19:50] (PS1) Majavah: hieradata: Failover all dumps traffic to clouddumps1001 [puppet] - https://gerrit.wikimedia.org/r/1003363 (https://phabricator.wikimedia.org/T321313)
[08:20:28] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddumps1001.wikimedia.org
[08:20:42] (PS1) Majavah: Revert "Failover dumps to clouddumps1002" [dns] - https://gerrit.wikimedia.org/r/1003364 (https://phabricator.wikimedia.org/T321313)
[08:25:11] (PS1) Ayounsi: Add BGP sessions to KPN in esams [homer/public] - https://gerrit.wikimedia.org/r/1003366 (https://phabricator.wikimedia.org/T322630)
[08:25:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P56742 and previous config saved to /var/cache/conftool/dbconfig/20240214-082558-ladsgroup.json
[08:27:14] (PS1) Ayounsi: Add KPN in the list of critical BGP peers [puppet] - https://gerrit.wikimedia.org/r/1003367 (https://phabricator.wikimedia.org/T322630)
[08:30:22] sre-alert-triage, Data-Platform-SRE: Alert in need of triage: Updater process (instance wdqs1022) - https://phabricator.wikimedia.org/T357496 (LSobanski)
[08:30:45] sre-alert-triage, Data-Platform-SRE: Alert in need of triage: Updater process (instance wdqs1022) -
https://phabricator.wikimedia.org/T357496 (LSobanski) There are also alerts for wdqs1023 and wdqs1024.
[08:31:03] (CR) Ayounsi: [C: +2] Add BGP sessions to KPN in esams [homer/public] - https://gerrit.wikimedia.org/r/1003366 (https://phabricator.wikimedia.org/T322630) (owner: Ayounsi)
[08:31:06] SRE, SRE-Access-Requests: Updating access key - rkhan - https://phabricator.wikimedia.org/T357483 (Peachey88)
[08:31:27] (PS1) Alexandros Kosiaris: eventstreams: Bump mesh.configuration to 1.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1003368 (https://phabricator.wikimedia.org/T355686)
[08:31:29] (PS1) Alexandros Kosiaris: cxserver: Bump mesh.configuration to 1.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1003369 (https://phabricator.wikimedia.org/T355686)
[08:31:37] (Merged) jenkins-bot: Add BGP sessions to KPN in esams [homer/public] - https://gerrit.wikimedia.org/r/1003366 (https://phabricator.wikimedia.org/T322630) (owner: Ayounsi)
[08:31:44] (Restored) Hashar: python-build: set date of source files in the wheels [docker-images/production-images] - https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[08:33:46] SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[08:35:27] SRE, SRE-swift-storage, ops-codfw, DBA, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (cmooney) Open→Resolved a:cmooney Closing - thanks all for the help!
[08:39:38] (CR) Slyngshede: [C: +2] P:kerberos::kadminserver absent Icinga check [puppet] - https://gerrit.wikimedia.org/r/995181 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:40:10] (CR) Muehlenhoff: convert ncmonitor role to puppet7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1003124 (https://phabricator.wikimedia.org/T349619) (owner: Dzahn)
[08:41:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T352010)', diff saved to https://phabricator.wikimedia.org/P56743 and previous config saved to /var/cache/conftool/dbconfig/20240214-084104-ladsgroup.json
[08:41:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance
[08:41:13] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[08:41:23] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance
[08:41:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:41:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:41:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T352010)', diff saved to https://phabricator.wikimedia.org/P56744 and previous config saved to /var/cache/conftool/dbconfig/20240214-084146-ladsgroup.json
[08:45:35] (CR) Muehlenhoff: puppetserver: Also install the tool to update netboot images on puppet servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1003004 (https://phabricator.wikimedia.org/T341056) (owner: Muehlenhoff)
[08:45:37] (PS1) Slyngshede: C:puppetmaster::monitoring Absent Icinga monitoring [puppet] -
https://gerrit.wikimedia.org/r/1003370 (https://phabricator.wikimedia.org/T350694)
[08:45:51] (CR) Volans: "The code looks good to me but I have just one main doubt." [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: Majavah)
[08:50:16] (PS3) Hashar: python-build: default to run as nobody from /deploy [docker-images/production-images] - https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T342346)
[08:50:18] (PS2) Hashar: python-build: add make and virtualenv [docker-images/production-images] - https://gerrit.wikimedia.org/r/1003002 (https://phabricator.wikimedia.org/T342346)
[08:50:20] (PS6) Hashar: python-build: set date of source files in the wheels [docker-images/production-images] - https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T259611)
[08:50:26] (PS1) Ayounsi: don't require a cable ID on planned cables [software/netbox-extras] - https://gerrit.wikimedia.org/r/1003371 (https://phabricator.wikimedia.org/T357259)
[08:50:29] (PS1) Slyngshede: P:ganeti: Remove Icinga memory monitoring. [puppet] - https://gerrit.wikimedia.org/r/1003372 (https://phabricator.wikimedia.org/T350694)
[08:50:54] (PS2) Muehlenhoff: puppetserver: Also install the tool to update netboot images on puppet servers [puppet] - https://gerrit.wikimedia.org/r/1003004 (https://phabricator.wikimedia.org/T341056)
[08:51:40] (PS2) Hashar: python-build: ensure frozen-requirements is exhaustive [docker-images/production-images] - https://gerrit.wikimedia.org/r/941443 (https://phabricator.wikimedia.org/T342346)
[08:59:52] (CR) Filippo Giunchedi: [C: +2] alertmanager: re-notify for SystemdUnitFailed after 24h [puppet] - https://gerrit.wikimedia.org/r/1003009 (https://phabricator.wikimedia.org/T357333) (owner: Filippo Giunchedi)
[09:00:25] (CR) Filippo Giunchedi: [C: +1] P:ganeti: Remove Icinga memory monitoring.
[puppet] - https://gerrit.wikimedia.org/r/1003372 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:00:46] (CR) Filippo Giunchedi: [C: +1] C:puppetmaster::monitoring Absent Icinga monitoring [puppet] - https://gerrit.wikimedia.org/r/1003370 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:01:22] (PS1) Slyngshede: P:ganeti: Absent checks for generic Ganeti services. [puppet] - https://gerrit.wikimedia.org/r/1003374 (https://phabricator.wikimedia.org/T350694)
[09:02:18] (CR) Muehlenhoff: [C: +2] puppetserver: Also install the tool to update netboot images on puppet servers [puppet] - https://gerrit.wikimedia.org/r/1003004 (https://phabricator.wikimedia.org/T341056) (owner: Muehlenhoff)
[09:03:55] (CR) Filippo Giunchedi: [C: +1] P:puppet::client_bucket remove monitoring. [puppet] - https://gerrit.wikimedia.org/r/1003269 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:05:09] (CR) Hashar: "I have restored a couple changes I have made in July 2023 and stacked them on top:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[09:05:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
[09:05:26] SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (ops-monitoring-bot) Draining ganeti2023.codfw.wmnet of running VMs
[09:06:49] (CR) Filippo Giunchedi: "LGTM, see inline" [alerts] - https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:08:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:08:07] !log
jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
[09:08:32] (CR) Filippo Giunchedi: [C: +1] nagios_common/planet: remove check_lastmod check, script and config [puppet] - https://gerrit.wikimedia.org/r/1003084 (https://phabricator.wikimedia.org/T353298) (owner: Dzahn)
[09:08:48] (CR) Filippo Giunchedi: [C: +1] "Very nice!" [puppet] - https://gerrit.wikimedia.org/r/1003098 (owner: Dzahn)
[09:11:04] (CR) Filippo Giunchedi: [C: +1] D:service::uwsgi Parse the Icinga disable flag to uwsgi:app. [puppet] - https://gerrit.wikimedia.org/r/1003359 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:12:05] (PS3) Slyngshede: Monitoring of PKI infrastructure certs. [alerts] - https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694)
[09:13:27] (CR) Slyngshede: Monitoring of PKI infrastructure certs. (1 comment) [alerts] - https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:14:18] (CR) Slyngshede: [C: +2] P:puppet::client_bucket remove monitoring. [puppet] - https://gerrit.wikimedia.org/r/1003269 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:17:42] (CR) Filippo Giunchedi: [C: +1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:20:11] (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1003031 (owner: TrainBranchBot)
[09:22:06] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): volatile: We need to configure the volatile endpoint on puppetserveres - https://phabricator.wikimedia.org/T341056 (MoritzMuehlenhoff) >>! In T341056#9540458, @jhathaway wrote: > @Muehlenhoff I thin...
[09:28:12] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1003033
[09:28:14] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1003033 (owner: TrainBranchBot)
[09:29:42] (PS1) Muehlenhoff: update-netboot-image: Update instructions for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003375 (https://phabricator.wikimedia.org/T341056)
[09:30:43] (CR) Majavah: [C: +2] Revert "Failover dumps to clouddumps1002" [dns] - https://gerrit.wikimedia.org/r/1003364 (https://phabricator.wikimedia.org/T321313) (owner: Majavah)
[09:31:16] (CR) Majavah: [C: +2] hieradata: Failover all dumps traffic to clouddumps1001 [puppet] - https://gerrit.wikimedia.org/r/1003363 (https://phabricator.wikimedia.org/T321313) (owner: Majavah)
[09:37:27] (PS3) Majavah: P:spicerack: support wmcs cookbook repo in test-cookbook [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069)
[09:38:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
[09:38:14] SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (ops-monitoring-bot) Draining ganeti2024.codfw.wmnet of running VMs
[09:38:23] (CR) Majavah: "Good point. PS3 includes a config file with the repositories checked out on the host."
[puppet] - 10https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: 10Majavah) [09:38:39] (03CR) 10CI reject: [V: 04-1] P:spicerack: support wmcs cookbook repo in test-cookbook [puppet] - 10https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: 10Majavah) [09:39:25] (03PS4) 10Majavah: P:spicerack: support wmcs cookbook repo in test-cookbook [puppet] - 10https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) [09:40:25] PROBLEM - Host sretest2005 is DOWN: CRITICAL - Host Unreachable (10.192.24.3) [09:40:54] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1367/co" [puppet] - 10https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: 10Majavah) [09:41:40] (03CR) 10Muehlenhoff: [C: 03+2] update-netboot-image: Update instructions for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003375 (https://phabricator.wikimedia.org/T341056) (owner: 10Muehlenhoff) [09:42:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet [09:43:47] (03PS1) 10Alexandros Kosiaris: rec-api: Bump mesh.configuration to 1.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003376 (https://phabricator.wikimedia.org/T355686) [09:43:50] (03PS1) 10Alexandros Kosiaris: wikifunctions: Add mesh.configuration in package.json [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003377 (https://phabricator.wikimedia.org/T355686) [09:46:27] (03CR) 10Alexandros Kosiaris: [C: 03+2] eventstreams: Bump mesh.configuration to 1.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003368 (https://phabricator.wikimedia.org/T355686) (owner: 10Alexandros Kosiaris) [09:46:51] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:47:54] (03CR) 10Jelto: [C: 03+2] Release 0.7 prometheus-etherpad-exporter [debs/prometheus-etherpad-exporter] - 10https://gerrit.wikimedia.org/r/1003007 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [09:48:19] (03Merged) 10jenkins-bot: eventstreams: Bump mesh.configuration to 1.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003368 (https://phabricator.wikimedia.org/T355686) (owner: 10Alexandros Kosiaris) [09:48:34] (ProbeDown) firing: (2) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:49:12] !log imported openssl11 1.1.1w-0+deb11u1+wmf1 to component/haproxy26 T352744 [09:49:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:17] T352744: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 [09:51:16] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1003033 (owner: 10TrainBranchBot) [09:52:36] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1003124 (https://phabricator.wikimedia.org/T349619) (owner: 10Dzahn) [09:52:59] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1002922 (https://phabricator.wikimedia.org/T354959) (owner: 10Muehlenhoff) [09:53:33] (JobUnavailable) firing: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:55:44] !log installing Linux 5.10.209 on Bullseye hosts [09:55:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:57] RECOVERY - Host sretest2005 is UP: PING OK - Packet loss = 0%, RTA = 30.94 ms [10:01:30] PROBLEM - Host sretest2005 is DOWN: CRITICAL - Host Unreachable (10.192.24.3) [10:02:07] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [10:02:52] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1003371 (https://phabricator.wikimedia.org/T357259) (owner: 10Ayounsi) [10:06:32] RECOVERY - Host sretest2005 is UP: PING OK - Packet loss = 0%, RTA = 32.19 ms [10:08:33] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:08:52] (03CR) 10Volans: "minor nit and we're good to go :)" [puppet] - 10https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: 10Majavah) [10:12:49] (03PS5) 10Majavah: P:spicerack: support wmcs cookbook repo in test-cookbook [puppet] - 10https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) [10:13:08] (03CR) 10Majavah: P:spicerack: support wmcs cookbook repo in test-cookbook (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: 10Majavah) [10:13:12] (03PS1) 10Jelto: miscweb: reduce resource requests for apache and envoy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003380 (https://phabricator.wikimedia.org/T357413) [10:15:46] (03CR) 10Volans: [C: 03+1] "LGTM! Thanks for the patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: 10Majavah) [10:16:59] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (10MoritzMuehlenhoff) [10:18:33] !log powercycle titan1001 [10:18:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host puppetserver2003.codfw.wmnet with OS bookworm [10:19:57] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host puppetserver2003.codfw.wmnet with... [10:20:46] (03PS2) 10Jelto: miscweb: reduce resource requests for apache and envoy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003380 (https://phabricator.wikimedia.org/T357413) [10:21:18] PROBLEM - Host titan1001 is DOWN: PING CRITICAL - Packet loss = 100% [10:22:48] RECOVERY - Host titan1001 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [10:22:48] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [10:23:33] (JobUnavailable) resolved: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:23:51] (ProbeDown) resolved: (2) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:24:39] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (10klausman) [10:28:40] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply [10:28:52] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [10:31:02] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host puppetserver2003.codfw.wmnet with OS bookworm [10:32:47] (03CR) 10Slyngshede: [C: 03+2] Monitoring of PKI infrastructure certs. [alerts] - 10https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:33:18] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [10:33:48] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [10:34:21] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (10cmooney) [10:34:36] (03Merged) 10jenkins-bot: Monitoring of PKI infrastructure certs. 
[alerts] - 10https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:34:38] (03CR) 10Slyngshede: [C: 03+2] C:puppetmaster::monitoring Absent Icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1003370 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:37:35] !log Deploying new PKI checks to alertmanager [10:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:39] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [10:37:41] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [10:37:50] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [10:38:04] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [10:38:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T352010)', diff saved to https://phabricator.wikimedia.org/P56745 and previous config saved to /var/cache/conftool/dbconfig/20240214-103810-ladsgroup.json [10:38:19] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:40:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T352010)', diff saved to https://phabricator.wikimedia.org/P56746 and previous config saved to /var/cache/conftool/dbconfig/20240214-104024-ladsgroup.json [10:41:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host puppetserver2003.codfw.wmnet with OS bookworm [10:41:16] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host puppetserver2003.codfw.wmnet with... [10:45:51] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2005.codfw.wmnet [10:46:16] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2005.codfw.wmnet [10:48:18] !log import prometheus-etherpad-exporter 0.7 to bookworm-wikimedia on apt hosts - T316421 [10:48:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:28] T316421: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 [10:48:30] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [10:48:59] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [10:50:27] 10SRE, 10observability, 10Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 (10fgiunchedi) Happened again albeit on titan1001 only, where query-frontend and store both using cpu and memory, and the host becoming unr... 
[10:52:15] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [10:53:57] (03PS1) 10Slyngshede: P:kerberos::kdc absent Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1003382 (https://phabricator.wikimedia.org/T350694) [10:55:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P56747 and previous config saved to /var/cache/conftool/dbconfig/20240214-105530-ladsgroup.json [10:56:53] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [10:57:46] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [10:57:47] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [10:58:21] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [10:58:22] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [10:58:27] !log aokoth@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM vrts1002.eqiad.wmnet [10:58:34] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [10:58:38] (03PS1) 10Slyngshede: P:kerberos::replication absent Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/1003383 (https://phabricator.wikimedia.org/T350694) [10:59:38] (03PS1) 10Hnowlan: admin: update ssh key for rkhan [puppet] - 10https://gerrit.wikimedia.org/r/1003384 (https://phabricator.wikimedia.org/T357483) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1100) [11:00:57] (03CR) 10Clément Goubert: "docker-pkg only builds the latest change in the changelog file, if it's not already built, so if all that stack is merged roughly at the s" [docker-images/production-images] - 
10https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [11:03:00] (03PS2) 10Alexandros Kosiaris: rec-api: Bump mesh.configuration to 1.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003376 (https://phabricator.wikimedia.org/T355686) [11:03:02] (03PS2) 10Alexandros Kosiaris: cxserver: Bump mesh.configuration to 1.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003369 (https://phabricator.wikimedia.org/T355686) [11:03:04] (03PS2) 10Alexandros Kosiaris: wikifunctions: Add mesh.configuration in package.json [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003377 (https://phabricator.wikimedia.org/T355686) [11:03:40] (03PS1) 10Muehlenhoff: Use 2disk Partman recipe for puppetserver2003 [puppet] - 10https://gerrit.wikimedia.org/r/1003385 (https://phabricator.wikimedia.org/T356991) [11:04:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] rec-api: Bump mesh.configuration to 1.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003376 (https://phabricator.wikimedia.org/T355686) (owner: 10Alexandros Kosiaris) [11:05:31] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1003385 (https://phabricator.wikimedia.org/T356991) (owner: 10Muehlenhoff) [11:06:02] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host puppetserver2003.codfw.wmnet with OS bookworm [11:06:07] (03Merged) 10jenkins-bot: rec-api: Bump mesh.configuration to 1.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003376 (https://phabricator.wikimedia.org/T355686) (owner: 10Alexandros Kosiaris) [11:09:12] (03CR) 10Slyngshede: [C: 03+2] P:ganeti: Remove Icinga memory monitoring. 
[puppet] - 10https://gerrit.wikimedia.org/r/1003372 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:10:01] (03CR) 10Muehlenhoff: [C: 03+2] Use 2disk Partman recipe for puppetserver2003 [puppet] - 10https://gerrit.wikimedia.org/r/1003385 (https://phabricator.wikimedia.org/T356991) (owner: 10Muehlenhoff) [11:10:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P56748 and previous config saved to /var/cache/conftool/dbconfig/20240214-111037-ladsgroup.json [11:12:56] (03PS5) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [11:14:06] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply [11:14:29] (03CR) 10CI reject: [V: 04-1] sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [11:14:33] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply [11:14:34] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: apply [11:14:53] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: apply [11:14:54] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: apply [11:15:05] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [11:17:48] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10Traffic, 10serviceops: Move 50% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357507 (10Clement_Goubert) [11:19:09] PROBLEM - spamassassin on vrts1002 is CRITICAL: PROCS CRITICAL: 0 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin 
[11:20:12] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10Traffic, 10serviceops: Move 50% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357507 (10Clement_Goubert) p:05Triage→03High [11:20:56] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10Traffic, 10serviceops: Move 60% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357508 (10Clement_Goubert) [11:21:31] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10Scap, 10serviceops: Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402 (10Clement_Goubert) [11:21:39] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10Traffic, 10serviceops: Move 60% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357508 (10Clement_Goubert) [11:23:43] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10Scap, 10serviceops: Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402 (10Clement_Goubert) [11:23:49] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [11:25:30] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10Traffic, 10serviceops: Move 60% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357508 (10Clement_Goubert) 05Open→03Stalled p:05Triage→03High We'll stay at 50% or under because we want scap to check canaries before... 
[11:25:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T352010)', diff saved to https://phabricator.wikimedia.org/P56749 and previous config saved to /var/cache/conftool/dbconfig/20240214-112543-ladsgroup.json [11:25:46] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [11:25:49] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:25:59] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [11:26:06] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T352010)', diff saved to https://phabricator.wikimedia.org/P56750 and previous config saved to /var/cache/conftool/dbconfig/20240214-112606-ladsgroup.json [11:26:51] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [11:28:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T352010)', diff saved to https://phabricator.wikimedia.org/P56751 and previous config saved to /var/cache/conftool/dbconfig/20240214-112818-ladsgroup.json [11:28:47] (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 45% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003393 (https://phabricator.wikimedia.org/T357507) [11:30:16] (03PS1) 10Clément Goubert: trafficserver: move 45% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1003394 (https://phabricator.wikimedia.org/T357507) [11:33:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host puppetserver2003.codfw.wmnet with OS bookworm [11:33:19] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - 
https://phabricator.wikimedia.org/T356991 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host puppetserver2003.codfw.wmnet with... [11:37:08] (03PS6) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [11:37:26] RECOVERY - Disk space on vrts1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=vrts1002&var-datasource=eqiad+prometheus/ops [11:40:07] !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be1045.eqiad.wmnet [11:40:16] (03CR) 10Clément Goubert: [C: 03+1] php: add env[MCROUTER_SERVER] variable [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:40:27] (03CR) 10Clément Goubert: [C: 03+1] mw-debug: set MCROUTER_SERVER variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:43:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P56752 and previous config saved to /var/cache/conftool/dbconfig/20240214-114325-ladsgroup.json [11:45:38] (03CR) 10David Caro: "I did not hit enter... 
sorry" [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah) [11:46:13] jouncebot: nowandnext [11:46:13] For the next 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1100) [11:46:13] In 2 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1400) [11:47:01] I think I’ll give T355685 another go (cc akosiaris, hashar) [11:47:01] T355685: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685 [11:50:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-be1045.eqiad.wmnet [11:50:45] (03PS1) 10Lucas Werkmeister (WMDE): Reapply "termbox: update to 2024-01-22-163619-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003400 (https://phabricator.wikimedia.org/T331403) [11:51:57] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver2003.codfw.wmnet with reason: host reimage [11:53:33] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "I’m guessing that a self-merge is okay here because it’s only restoring a previously reviewed change, after the cause of the revert was co" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003400 (https://phabricator.wikimedia.org/T331403) (owner: 10Lucas Werkmeister (WMDE)) [11:54:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver2003.codfw.wmnet with reason: host reimage [11:55:15] (03Merged) 10jenkins-bot: Reapply "termbox: update to 2024-01-22-163619-production" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003400 (https://phabricator.wikimedia.org/T331403) (owner: 10Lucas Werkmeister (WMDE)) [11:57:18] (03PS1) 10Muehlenhoff: Make puppetserver2003 a Puppet server [puppet] - 
10https://gerrit.wikimedia.org/r/1003402 (https://phabricator.wikimedia.org/T356991) [11:57:48] !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be1045 [11:58:02] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply [11:58:20] !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply [11:58:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P56753 and previous config saved to /var/cache/conftool/dbconfig/20240214-115831-ladsgroup.json [11:58:55] (03PS1) 10Muehlenhoff: Advertise puppetserver2003 as active Puppet 7 server [dns] - 10https://gerrit.wikimedia.org/r/1003403 (https://phabricator.wikimedia.org/T356991) [11:59:01] test command from https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service#Deployment works for staging, at least [11:59:05] so I’ll go ahead with eqiad and codfw [11:59:25] !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/termbox: apply [12:00:15] !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply [12:02:06] !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/termbox: apply [12:02:44] !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/termbox: apply [12:03:32] termbox SSR is working as far as I can tell \o/ [12:03:37] * Lucas_WMDE done deploying [12:06:22] (03PS7) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [12:07:25] PROBLEM - Disk space on vrts1002 is CRITICAL: DISK CRITICAL - free space: /srv/otrs-data 5468 MB (0% inode=61%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=vrts1002&var-datasource=eqiad+prometheus/ops [12:09:27] (03CR) 10Slyngshede: [C: 03+2] D:service::uwsgi Parse the Icinga disable flag to uwsgi:app. [puppet] - 10https://gerrit.wikimedia.org/r/1003359 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:11:16] (03CR) 10Majavah: [C: 03+2] "Will do, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: 10Majavah) [12:11:39] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [12:13:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T352010)', diff saved to https://phabricator.wikimedia.org/P56754 and previous config saved to /var/cache/conftool/dbconfig/20240214-121337-ladsgroup.json [12:13:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [12:13:45] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:13:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [12:14:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T352010)', diff saved to https://phabricator.wikimedia.org/P56755 and previous config saved to /var/cache/conftool/dbconfig/20240214-121401-ladsgroup.json [12:16:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T352010)', diff saved to https://phabricator.wikimedia.org/P56756 and previous config saved to /var/cache/conftool/dbconfig/20240214-121614-ladsgroup.json [12:16:22] (03CR) 10JMeybohm: [C: 03+1] miscweb: reduce resource requests for apache and envoy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003380 
(https://phabricator.wikimedia.org/T357413) (owner: 10Jelto) [12:17:07] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Extend "test-cookbook" to support wmcs-cookbooks - https://phabricator.wikimedia.org/T345069 (10taavi) 05Open→03Resolved a:03taavi [12:21:48] (03PS1) 10Slyngshede: P:puppetdb::microservice absent uwsgi Icinga check. [puppet] - 10https://gerrit.wikimedia.org/r/1003405 (https://phabricator.wikimedia.org/T350694) [12:23:46] (03PS1) 10Slyngshede: P:puppetboard absent Icinga checks for PuppetBoard. [puppet] - 10https://gerrit.wikimedia.org/r/1003406 (https://phabricator.wikimedia.org/T350694) [12:31:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P56757 and previous config saved to /var/cache/conftool/dbconfig/20240214-123120-ladsgroup.json [12:31:51] (03PS1) 10Hnowlan: kubernetes: make 5 codfw appservers kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/1003407 (https://phabricator.wikimedia.org/T351074) [12:32:29] (03PS8) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [12:32:59] (03PS2) 10Effie Mouzeli: services_proxy: set retry for ipoid listener [puppet] - 10https://gerrit.wikimedia.org/r/1003001 (https://phabricator.wikimedia.org/T356766) [12:33:48] (03CR) 10CI reject: [V: 04-1] sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [12:35:36] (03CR) 10Alexandros Kosiaris: [C: 03+1] services_proxy: set keepalive for ipoid listener [puppet] - 10https://gerrit.wikimedia.org/r/1003000 (https://phabricator.wikimedia.org/T356766) (owner: 10Effie Mouzeli) [12:35:58] (03PS2) 10Effie Mouzeli: mediawiki: rename cache.mcrouter.deployment 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1000039 [12:36:11] (03CR) 10Effie Mouzeli: [C: 03+2] services_proxy: set keepalive for ipoid listener [puppet] - 10https://gerrit.wikimedia.org/r/1003000 (https://phabricator.wikimedia.org/T356766) (owner: 10Effie Mouzeli) [12:36:15] (03CR) 10CI reject: [V: 04-1] mediawiki: rename cache.mcrouter.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000039 (owner: 10Effie Mouzeli) [12:36:31] (03PS1) 10Ladsgroup: Enable echo conditional defaults for loginwiki since 2013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003408 (https://phabricator.wikimedia.org/T357072) [12:39:00] (03PS9) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [12:39:21] (03PS4) 10Effie Mouzeli: mediawiki: Bump sextant module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/999780 (owner: 10Alexandros Kosiaris) [12:39:40] (03CR) 10CI reject: [V: 04-1] mediawiki: Bump sextant module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/999780 (owner: 10Alexandros Kosiaris) [12:39:57] (03PS3) 10Effie Mouzeli: mediawiki: rename cache.mcrouter.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000039 [12:40:09] (03CR) 10CI reject: [V: 04-1] mediawiki: rename cache.mcrouter.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000039 (owner: 10Effie Mouzeli) [12:40:11] (03CR) 10Ladsgroup: "We shouldn't need to do any clean ups there and it's already 6GB, clean up should be easy." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1003408 (https://phabricator.wikimedia.org/T357072) (owner: Ladsgroup) [12:40:59] (CR) Clément Goubert: [C: +1] kubernetes: make 5 codfw appservers kubernetes workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1003407 (https://phabricator.wikimedia.org/T351074) (owner: Hnowlan) [12:41:16] (CR) Effie Mouzeli: [C: +2] cache.mcrouter: minor fixes [deployment-charts] - https://gerrit.wikimedia.org/r/1000032 (owner: Effie Mouzeli) [12:42:14] (Merged) jenkins-bot: cache.mcrouter: minor fixes [deployment-charts] - https://gerrit.wikimedia.org/r/1000032 (owner: Effie Mouzeli) [12:45:14] (PS5) Effie Mouzeli: mediawiki: Bump sextant module versions [deployment-charts] - https://gerrit.wikimedia.org/r/999780 (owner: Alexandros Kosiaris) [12:46:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P56758 and previous config saved to /var/cache/conftool/dbconfig/20240214-124627-ladsgroup.json [12:49:33] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be1045 [12:51:52] SRE, Domains, Fundraising-Backlog, serviceops: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (AKanji-WMF) Thank you @RLazarus ! The fundraising team would like the redirect to be active for two years - considering the life cycle of the podcast and surrounding marketing c... [12:52:31] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [12:54:16] (PS1) Slyngshede: P:debmonitor::server Add CDN endpoint check. 
[puppet] - https://gerrit.wikimedia.org/r/1003409 (https://phabricator.wikimedia.org/T350694) [12:57:06] (PS2) Hnowlan: kubernetes: make 5 codfw appservers kubernetes workers [puppet] - https://gerrit.wikimedia.org/r/1003407 (https://phabricator.wikimedia.org/T351074) [12:57:47] Lucas_WMDE: +1 about Termbox, then I don't know what that services is doing exactly but I guess you are in the best position to deploy and monitor it :) [12:59:40] (CR) Jforrester: "This looks great, thanks! Do we also need to apply it (or similar) to charts/function-evaluator/Chart.yaml ? It has a bunch of mesh.* entr" [deployment-charts] - https://gerrit.wikimedia.org/r/1003377 (https://phabricator.wikimedia.org/T355686) (owner: Alexandros Kosiaris) [13:01:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T352010)', diff saved to https://phabricator.wikimedia.org/P56760 and previous config saved to /var/cache/conftool/dbconfig/20240214-130134-ladsgroup.json [13:01:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [13:01:51] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [13:01:54] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:01:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T352010)', diff saved to https://phabricator.wikimedia.org/P56761 and previous config saved to /var/cache/conftool/dbconfig/20240214-130157-ladsgroup.json [13:01:59] (CR) Clément Goubert: [C: +1] kubernetes: make 5 codfw appservers kubernetes workers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1003407 (https://phabricator.wikimedia.org/T351074) (owner: Hnowlan) [13:02:14] (CR) Hnowlan: [C: +2] kubernetes: make 5 codfw appservers kubernetes workers (1 comment) [puppet] - 
https://gerrit.wikimedia.org/r/1003407 (https://phabricator.wikimedia.org/T351074) (owner: Hnowlan) [13:04:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T352010)', diff saved to https://phabricator.wikimedia.org/P56762 and previous config saved to /var/cache/conftool/dbconfig/20240214-130410-ladsgroup.json [13:05:52] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:07:50] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2005.codfw.wmnet with reason: host reimage [13:10:43] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2005.codfw.wmnet with reason: host reimage [13:12:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T352010)', diff saved to https://phabricator.wikimedia.org/P56763 and previous config saved to /var/cache/conftool/dbconfig/20240214-131231-ladsgroup.json [13:12:37] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:13:14] (PS1) Slyngshede: P:netbox monitoring not required on systemd timer. [puppet] - https://gerrit.wikimedia.org/r/1003413 (https://phabricator.wikimedia.org/T350694) [13:16:15] (PS2) Slyngshede: P:netbox monitoring not required on systemd timer. 
[puppet] - https://gerrit.wikimedia.org/r/1003413 (https://phabricator.wikimedia.org/T350694) [13:17:15] (PS9) KartikMistry: Update cxserver to 2023-12-04-083437-production [deployment-charts] - https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) [13:19:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P56764 and previous config saved to /var/cache/conftool/dbconfig/20240214-131916-ladsgroup.json [13:23:08] PROBLEM - ensure kvm processes are running on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:24:07] !log brouberol@cumin1002 START - Cookbook sre.hosts.reimage for host eventlog1003.eqiad.wmnet with OS bullseye [13:24:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002" [13:24:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetserver2003.codfw.wmnet with OS bookworm [13:24:24] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host puppetserver200... 
[13:24:33] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2005.codfw.wmnet with OS bookworm [13:24:40] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host apifeatureusage2001.codfw.wmnet [13:26:28] !log T357007 Profiling current master version of CampaignEvents:GenerateInvitationList with excimer in mwmaint2002 [13:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:32] T357007: Generate Invitation Lists for Event Organizers - https://phabricator.wikimedia.org/T357007 [13:26:53] (PS1) Muehlenhoff: Switch apifeatureusage2001 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003415 (https://phabricator.wikimedia.org/T349619) [13:27:07] RECOVERY - ensure kvm processes are running on cloudvirt1036 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [13:27:22] (CR) Jelto: [C: +2] miscweb: reduce resource requests for apache and envoy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1003380 (https://phabricator.wikimedia.org/T357413) (owner: Jelto) [13:27:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P56765 and previous config saved to /var/cache/conftool/dbconfig/20240214-132737-ladsgroup.json [13:28:27] (Merged) jenkins-bot: miscweb: reduce resource requests for apache and envoy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1003380 (https://phabricator.wikimedia.org/T357413) (owner: Jelto) [13:34:17] (PS1) Ayounsi: Add support for routed Ganeti in D-I early_command.sh [puppet] - https://gerrit.wikimedia.org/r/1003416 (https://phabricator.wikimedia.org/T300152) [13:34:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P56766 and previous config 
saved to /var/cache/conftool/dbconfig/20240214-133422-ladsgroup.json [13:36:33] !log brouberol@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on eventlog1003.eqiad.wmnet with reason: host reimage [13:37:38] (CR) Muehlenhoff: [C: +2] Switch apifeatureusage2001 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003415 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [13:37:43] (CR) Majavah: [C: +2] openstack: overhaul the floating IP updater [puppet] - https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: Majavah) [13:38:06] taavi: ok to merge your patch along? [13:38:08] yes please [13:38:40] (CR) Majavah: [C: +2] openstack: overhaul the floating IP updater (1 comment) [puppet] - https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: Majavah) [13:39:01] done, merged [13:39:18] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on eventlog1003.eqiad.wmnet with reason: host reimage [13:39:40] (PS4) Effie Mouzeli: mediawiki: rename cache.mcrouter.deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1000039 [13:40:38] thanks! 
[13:42:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host apifeatureusage2001.codfw.wmnet [13:42:20] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:42:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P56767 and previous config saved to /var/cache/conftool/dbconfig/20240214-134244-ladsgroup.json [13:48:29] (PS1) Muehlenhoff: Switch wikireplicas::dedicated::analytics_multiinstance to Puppet 7 on role level [puppet] - https://gerrit.wikimedia.org/r/1003418 (https://phabricator.wikimedia.org/T349619) [13:49:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T352010)', diff saved to https://phabricator.wikimedia.org/P56768 and previous config saved to /var/cache/conftool/dbconfig/20240214-134929-ladsgroup.json [13:49:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [13:49:34] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:49:40] (CR) CI reject: [V: -1] Switch wikireplicas::dedicated::analytics_multiinstance to Puppet 7 on role level [puppet] - https://gerrit.wikimedia.org/r/1003418 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [13:49:46] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [13:49:48] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:49:53] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [13:50:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 
'Depooling db1212 (T352010)', diff saved to https://phabricator.wikimedia.org/P56769 and previous config saved to /var/cache/conftool/dbconfig/20240214-134959-ladsgroup.json [13:52:44] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:54:53] PROBLEM - statsv process on webperf2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv [13:55:53] RECOVERY - statsv process on webperf2003 is OK: PROCS OK: 2 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv [13:57:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T352010)', diff saved to https://phabricator.wikimedia.org/P56770 and previous config saved to /var/cache/conftool/dbconfig/20240214-135750-ladsgroup.json [13:57:53] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance [13:57:56] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:58:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance [13:58:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T352010)', diff saved to https://phabricator.wikimedia.org/P56771 and previous config saved to /var/cache/conftool/dbconfig/20240214-135813-ladsgroup.json [13:59:14] !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host eventlog1003.eqiad.wmnet with OS bullseye [13:59:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T352010)', diff saved to https://phabricator.wikimedia.org/P56772 and previous config saved to /var/cache/conftool/dbconfig/20240214-135953-ladsgroup.json [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to do the UTC afternoon 
backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1400). [14:00:05] Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:17] o/ [14:00:21] o/ [14:01:08] * TheresNoTime can deploy [14:02:21] o/ [14:02:40] Daimona: just doing the beta-only patch first [14:03:28] yup sure [14:03:37] not sure why logmsgbot isn't logging here... [14:03:57] !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [14:05:16] Daimona: now starting the prod one [14:05:45] !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:05:52] Thanks! Looks like we missed the beta update job by a few seconds. I could perhaps retrigger it manually to test in beta first [14:06:01] Lucas_WMDE: logmsgbot is still meant to be logging deployment here right..? [14:06:12] I think so yeah? [14:06:12] !log samtar@deploy2002 Started scap: Backport for [[gerrit:991352|prod: Stop setting $wgCampaignEventsEnableParticipantQuestions (T347608)]] [14:06:15] there it goes [14:06:16] T347608: Remove feature flag for Participant Questions - https://phabricator.wikimedia.org/T347608 [14:06:22] and yeah good idea Daimona [14:06:42] TheresNoTime: was it still waiting for the patch to be merged? 
I don’t think it logs until the actual scap sync starts [14:06:58] (though I would’ve expected messages about the +2 and “merged” from another bot) [14:07:04] (wikibugs ig) [14:07:11] ah another bot does that [14:07:19] wikibugs: status [14:07:29] (idk if it has a status command, worth a shot though) [14:08:33] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:08:58] wikibugs’ toolforge jobs are running, at least… [14:09:51] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1034.eqiad.wmnet [14:10:23] !log samtar@deploy2002 samtar and daimona: Backport for [[gerrit:991352|prod: Stop setting $wgCampaignEventsEnableParticipantQuestions (T347608)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:10:25] in #wikimedia-dev wikibugs hasn’t written anything new since 15:06 CET [14:10:28] Daimona: prod patch is live on mwdebug (you can test it after you've checked beta if you want?) [14:10:30] beta has been updated, so testing there now @HouseOfM [14:11:39] well, there goes wikibugs [14:11:56] just did a `toolforge jobs load libera/k8s-jobs.yaml` [14:12:09] seems to be back in -dev at least [14:12:39] Beta looking fine, now testing prod. 
HouseOfM: I think you can also test in prod directly [14:13:02] thx [14:14:12] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [14:14:33] I made this test event: https://test.wikipedia.org/wiki/Event:T347608 [14:14:34] T347608: Remove feature flag for Participant Questions - https://phabricator.wikimedia.org/T347608 [14:14:55] You can try registering there and see if you get the questions [14:15:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P56773 and previous config saved to /var/cache/conftool/dbconfig/20240214-141459-ladsgroup.json [14:15:06] !log Draining and cordoning kubernetes2019.codfw.wmnet kubernetes2018.codfw.wmnet mw2420.codfw.wmnet mw2421.codfw.wmnet mw2406.codfw.wmnet mw2422.codfw.wmnet mw2423.codfw.wmnet for T355864 [14:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:11] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [14:15:50] LGTM [14:15:58] happy to sync? :) [14:17:34] @Daimona, it's not redirecting after answering the questions. is that expected? [14:17:53] wdym? [14:18:30] (CR) Hashar: "Yup then my trouble is whether we want to build each commit or if it is fine to only build an image which would include all the changes ;)" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T342346) (owner: Hashar) [14:19:17] When registering after answering the questions, it gives a positive message, but stays on the questions form [14:20:28] (CR) Herron: [C: +1] Profiler: Silence "RedisException: Connection timed out" [mediawiki-config] - https://gerrit.wikimedia.org/r/1003083 (https://phabricator.wikimedia.org/T348756) (owner: Krinkle) [14:22:06] nm, ignore that. 
It's fine [14:22:12] go ahead [14:22:27] ack [14:22:32] !log samtar@deploy2002 samtar and daimona: Continuing with sync [14:23:19] (PS1) Muehlenhoff: Switch restbase1034 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003427 (https://phabricator.wikimedia.org/T349619) [14:24:32] (CR) Clément Goubert: "I don't see a problem with building only the end result if you've tested local builds (including a sample of dependent images) work with y" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T342346) (owner: Hashar) [14:25:37] Sorry folks, power went out :D [14:25:54] :D that patch is currently syncing [14:25:58] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2311.codfw.wmnet with OS bullseye [14:26:01] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2379.codfw.wmnet with OS bullseye [14:26:03] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2380.codfw.wmnet with OS bullseye [14:26:05] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2383.codfw.wmnet with OS bullseye [14:26:07] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2335.codfw.wmnet with OS bullseye [14:26:16] SRE, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357446 (Jhancock.wm) @MatthewVernon this port errored three times yesterday. There's no active alert on it right now but I think I want to replace the DAC anyway. Is it safe to do so? 
[14:27:07] (CR) Muehlenhoff: [C: +2] Switch restbase1034 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003427 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [14:27:09] (PS1) Muehlenhoff: Switch restbase1035 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003428 (https://phabricator.wikimedia.org/T349619) [14:27:11] (PS1) Muehlenhoff: Switch restbase1036 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003429 (https://phabricator.wikimedia.org/T349619) [14:27:13] (PS1) Muehlenhoff: Switch restbase1037 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003430 (https://phabricator.wikimedia.org/T349619) [14:27:15] (PS1) Muehlenhoff: Switch restbase1038 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003431 (https://phabricator.wikimedia.org/T349619) [14:27:17] (PS1) Muehlenhoff: Switch restbase1039 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003432 (https://phabricator.wikimedia.org/T349619) [14:27:19] (PS1) Muehlenhoff: Switch restbase1040 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003433 (https://phabricator.wikimedia.org/T349619) [14:27:21] (PS1) Muehlenhoff: Switch restbase1041 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003434 (https://phabricator.wikimedia.org/T349619) [14:27:23] (PS1) Muehlenhoff: Switch restbase1042 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003435 (https://phabricator.wikimedia.org/T349619) [14:27:36] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2005.codfw.wmnet with reason: host reimage [14:29:29] (PS1) Brouberol: eventlogging: tweak PYTHONPATH to allow eventlogging to import _mysql.so [puppet] - https://gerrit.wikimedia.org/r/1003438 (https://phabricator.wikimedia.org/T349289) [14:29:49] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:991352|prod: Stop setting $wgCampaignEventsEnableParticipantQuestions (T347608)]] (duration: 23m 37s) 
[14:29:53] T347608: Remove feature flag for Participant Questions - https://phabricator.wikimedia.org/T347608 [14:30:07] Daimona and HouseOfM: live on prod, can you double-check? [14:30:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P56774 and previous config saved to /var/cache/conftool/dbconfig/20240214-143006-ladsgroup.json [14:30:56] yup [14:31:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1034.eqiad.wmnet [14:31:23] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:31:41] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2005.codfw.wmnet with reason: host reimage [14:31:50] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1035.eqiad.wmnet [14:32:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:33:12] SRE, SRE-swift-storage, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357446 (MatthewVernon) Yes, please go ahead whenever is convenient (if you can let me know when done I can check the node is still happy). 
[14:33:26] SRE, SRE-swift-storage, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357446 (MatthewVernon) [14:33:29] !log close UTC afternoon backport window [14:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:56] LGTM [14:34:03] :D [14:34:51] !log Depooling mw2402|mw2403|mw2404|mw2405|mw2407|mw2408|mw2409|mw2401|mw2410|mw2411|parse2001|parse2002|parse2003 for T355864 [14:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:55] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [14:35:19] !log cgoubert@cumin2002 conftool action : set/pooled=inactive; selector: name=(mw2402|mw2403|mw2404|mw2405|mw2407|mw2408|mw2409|mw2401|mw2410|mw2411|parse2001|parse2002|parse2003).* [14:36:00] topranks: ^good in two minutes [14:36:50] claime: ok great, we're not starting till 16:00 utc so lots of time [14:36:51] thanks! [14:37:02] HouseOfM: anything on your side? [14:38:34] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1035.eqiad.wmnet [14:38:54] no, all good [14:39:23] Then I think we're done! Thanks TheresNoTime! [14:39:31] np! [14:39:43] ty! 
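Annotation on the 14:35:19 conftool entry: the selector `name=(mw2402|...|parse2003).*` reads like a regular expression over host names. The sketch below is only an illustration of that reading, assuming the pattern has to match the whole name; it is not conftool's own code, and the function name `selected` and the sample FQDNs are made up for the example.

```python
import re

# Selector value copied from the 14:35:19 log entry above.
SELECTOR = (
    r"(mw2402|mw2403|mw2404|mw2405|mw2407|mw2408|mw2409|mw2401|mw2410|mw2411"
    r"|parse2001|parse2002|parse2003).*"
)


def selected(hostnames, pattern=SELECTOR):
    """Return the hostnames the pattern matches in full.

    Assumption: the selector is a regex anchored at both ends.
    """
    rx = re.compile(pattern)
    return [name for name in hostnames if rx.fullmatch(name)]


# Illustrative hostnames; mw2406 was drained earlier but is not in this selector.
hosts = ["mw2402.codfw.wmnet", "mw2406.codfw.wmnet", "parse2001.codfw.wmnet"]
print(selected(hosts))
```

Under that assumption the trailing `.*` is what lets the short names in the alternation cover full domain names.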
[14:40:22] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1037.eqiad.wmnet [14:41:58] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2335.codfw.wmnet with reason: host reimage [14:42:01] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2311.codfw.wmnet with reason: host reimage [14:42:58] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2383.codfw.wmnet with OS bullseye [14:43:04] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2379.codfw.wmnet with OS bullseye [14:43:17] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2380.codfw.wmnet with OS bullseye [14:43:34] SRE, Wikimedia-Etherpad, collaboration-services, Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (Jelto) >>! In T316421#9538988, @Dzahn wrote: > test instance etherpad-bookworm.devtools now has etherpad-lite 1.9.7-2 installed by puppet and `... [14:44:21] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2380.codfw.wmnet with OS bullseye [14:44:26] !log Restarted rsyslog on A:wikikube-master [14:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:43] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2335.codfw.wmnet with reason: host reimage [14:44:47] phabricator hanging for anyone else? 
[14:44:50] yes [14:44:55] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2379.codfw.wmnet with OS bullseye [14:45:03] aand it's back [14:45:10] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2383.codfw.wmnet with OS bullseye [14:45:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T352010)', diff saved to https://phabricator.wikimedia.org/P56776 and previous config saved to /var/cache/conftool/dbconfig/20240214-144514-ladsgroup.json [14:45:17] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [14:45:22] snoozing on the job [14:45:24] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:45:26] bit slow, but seems back yes [14:45:31] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [14:45:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T352010)', diff saved to https://phabricator.wikimedia.org/P56777 and previous config saved to /var/cache/conftool/dbconfig/20240214-144537-ladsgroup.json [14:45:51] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2005.codfw.wmnet with OS bookworm [14:45:58] (ProbeDown) firing: (3) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:46:09] checking [14:46:12] hm [14:46:21] I see 5xx and big drop in requests [14:46:43] riccardo acked it [14:47:06] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2311.codfw.wmnet with reason: host reimage [14:47:16] I think I had already done it :D [14:48:23] 
SRE, ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T357445 (Jhancock.wm) @cmooney hey I checked these in netbox. none of the ports listed are active. Can you run homer for this when you have a chance? thanks! [14:50:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1037.eqiad.wmnet [14:50:58] (ProbeDown) resolved: (3) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:51:18] (NELHigh) firing: (2) Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:51:29] acked [14:51:41] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1038.eqiad.wmnet [14:51:51] SRE, SRE-swift-storage, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357446 (Jhancock.wm) it's been replaced. 
[14:52:06] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2383.codfw.wmnet with OS bullseye [14:52:11] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2379.codfw.wmnet with OS bullseye [14:52:14] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2380.codfw.wmnet with OS bullseye [14:56:18] (NELHigh) resolved: (2) Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [14:56:56] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [14:57:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1038.eqiad.wmnet [14:57:30] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2379.codfw.wmnet with OS bullseye [14:57:56] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1039.eqiad.wmnet [14:58:08] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [14:58:14] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [14:58:20] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [14:58:37] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:47] SRE, Domains, Fundraising-Backlog, serviceops: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (AKanji-WMF) Tagging you @Dwisehaupt 
as Jeff is out this AM. [15:00:04] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1500) [15:00:58] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ncmonitor1001.eqiad.wmnet with OS bookworm [15:03:12] SRE, SRE-swift-storage, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357446 (MatthewVernon) Great, thanks, I can confirm that swift is happy with that node. [15:03:55] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2335.codfw.wmnet with OS bullseye [15:05:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1039.eqiad.wmnet [15:06:04] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2379.codfw.wmnet with OS bullseye [15:06:18] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2379.codfw.wmnet with OS bullseye [15:07:35] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1040.eqiad.wmnet [15:07:40] PROBLEM - ensure kvm processes are running on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:07:50] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2311.codfw.wmnet with OS bullseye [15:09:11] SRE, observability, serviceops: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (lmata) Open→Resolved a:lmata I will boldly resolve this. I discussed this with the team, and we agreed the strategy here is to renew/purchase dates to be m... 
[15:09:26] 10SRE, 10SRE-swift-storage, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357446 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:11:36] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2379.codfw.wmnet with OS bullseye [15:13:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1040.eqiad.wmnet [15:14:05] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2380.codfw.wmnet with OS bullseye [15:15:09] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1041.eqiad.wmnet [15:21:32] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2380.codfw.wmnet with OS bullseye [15:21:45] RECOVERY - ensure kvm processes are running on cloudvirt1041 is OK: PROCS OK: 2 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [15:22:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1041.eqiad.wmnet [15:24:07] 10ops-codfw, 10serviceops: Multiple hosts in codfw fail to PXE boot upon reimage - https://phabricator.wikimedia.org/T357539 (10hnowlan) [15:26:58] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10hnowlan) >>! In T355333#9538260, @Jhancock.wm wrote: > idk if this would help, but can we run the provisioning script with the --no-dhcp and --no-user tags. to catch any bios... 
[15:30:15] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1042.eqiad.wmnet [15:30:42] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644 (10aborrero) [15:31:44] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, 10User-aborrero: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) [15:31:54] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, 10User-aborrero: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) 05Stalled→03Open In a 2024-02-14 network sync meeting we decided to continue moving older cloudvirts into the new single NI... [15:35:32] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10Jhancock.wm) sre.hosts.provision --no-dhcp --no-user [15:36:37] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10Volans) >>! In T355333#9542804, @Jhancock.wm wrote: > sre.hosts.provision --no-dhcp --no-user Also `--no-switch` in this case I'd say. [15:37:38] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): volatile: We need to configure the volatile endpoint on puppetserveres - https://phabricator.wikimedia.org/T341056 (10jhathaway) thanks for the additional context @Muehlenhoff! [15:37:52] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10akosiaris) >>! In T316421#9542499, @Jelto wrote: > We probably don't want two etherpad services running in parallel. We most definitely don't... 
[15:37:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1042.eqiad.wmnet [15:44:22] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2005.codfw.wmnet [15:44:53] !log hnowlan@cumin2002 START - Cookbook sre.hosts.provision for host mw2282.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:45:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2121.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:45:15] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [15:45:23] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2121.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:45:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2132.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:45:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2132.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:45:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2145.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:46:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2145.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:46:07] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2104.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:46:21] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 1:00:00 on db2104.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:46:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2153.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:46:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2153.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:46:37] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2154.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:46:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2154.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:46:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2175.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:47:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2175.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:47:09] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2176.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:47:35] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2176.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw [15:47:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T355864 - Depool db2121 db2132 db2145 db2104 db2153 db2154 db2175 db2176', diff saved to https://phabricator.wikimedia.org/P56778 and previous config saved to 
/var/cache/conftool/dbconfig/20240214-154753-arnaudb.json [15:50:53] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a5-codfw.mgmt with reason: prepping for server uplink migration codfw rack a5 [15:51:10] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a5-codfw.mgmt with reason: prepping for server uplink migration codfw rack a5 [15:51:30] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9a43620e-deca-432c-aa1f-5d6e939b51bc) set by cmooney@cumin... [15:53:05] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617 [15:53:10] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [15:53:16] !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2282.mgmt.codfw.wmnet with reboot policy GRACEFUL [15:53:45] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [15:54:43] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye [15:55:06] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617 [15:59:07] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617 [15:59:12] T355617: Migrate cloudelastic from 
public to private IPs - https://phabricator.wikimedia.org/T355617 [15:59:38] !log disable puppet fleet-wide to allow for disruption to puppetmaster/puppetserver during network maint T355864 [15:59:41] 10SRE, 10MW-on-K8s, 10Scap, 10serviceops, 10Release-Engineering-Team (Now this 🫠): Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402 (10thcipriani) [15:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:43] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [16:04:58] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617 [16:05:21] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [16:06:43] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 38 hosts with reason: Migrating servers in codfw rack A5 to lsw1-a5-codfw [16:07:17] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 38 hosts with reason: Migrating servers in codfw rack A5 to lsw1-a5-codfw [16:07:27] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2005.codfw.wmnet with reason: host reimage [16:07:30] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on sretest2005.codfw.wmnet with reason: host reimage [16:07:46] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ec1ab967-b8f5-4bfd-914e-e76afe369468) set by cmooney@cumin... 
[16:07:53] !log Moving server uplinks from old switch to new codfw rack A5 T355864 [16:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:03] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [16:11:07] 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify") - https://phabricator.wikimedia.org/T357449 (10BCornwall) 05In progress→03Resolved a:03BCornwall I'm still not sure where the problem lies and am concerned that this... [16:14:57] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (10cmooney) All links moved and all devices pinging ok again. [16:15:07] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10hnowlan) This command also fails - but interestingly the host itself appears to have lost network connectivity. `ethtool` reports that the link is up but I can't connect in o... 
[16:16:09] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (10ABran-WMF) awesome, will start repooling, thanks @cmooney [16:16:14] !log Uncordoning kubernetes2019.codfw.wmnet kubernetes2018.codfw.wmnet mw2420.codfw.wmnet mw2421.codfw.wmnet mw2406.codfw.wmnet mw2422.codfw.wmnet mw2423.codfw.wmnet for T355864 [16:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:19] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [16:16:50] !log Repooling mw2402|mw2403|mw2404|mw2405|mw2407|mw2408|mw2409|mw2401|mw2410|mw2411|parse2001|parse2002|parse2003 for T355864 [16:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10Eevans) 05Resolved→03Open @Jclark-ctr did restbase1036 get imaged? I don't see any comments from the cookbook... 
[16:17:07] !log cgoubert@cumin2002 conftool action : set/pooled=yes; selector: name=(mw2402|mw2403|mw2404|mw2405|mw2407|mw2408|mw2409|mw2401|mw2410|mw2411|parse2001|parse2002|parse2003).* [16:18:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 25%: T355864 - Post migration repool of db2121', diff saved to https://phabricator.wikimedia.org/P56779 and previous config saved to /var/cache/conftool/dbconfig/20240214-161824-arnaudb.json [16:19:25] 10SRE, 10Release-Engineering-Team (Backlog): mwdebug2001 and mwdebug2002 "/" almost full - https://phabricator.wikimedia.org/T219989 (10thcipriani) [16:19:40] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2005.codfw.wmnet [16:20:28] !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye [16:22:34] RECOVERY - spamassassin on vrts1002 is OK: PROCS OK: 2 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin [16:25:34] PROBLEM - spamassassin on vrts1002 is CRITICAL: PROCS CRITICAL: 0 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin [16:33:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 50%: T355864 - Post migration repool of db2121', diff saved to https://phabricator.wikimedia.org/P56780 and previous config saved to /var/cache/conftool/dbconfig/20240214-163330-arnaudb.json [16:33:38] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [16:34:46] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 165 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:36:00] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - 
https://phabricator.wikimedia.org/T355333 (10cmooney) >>! In T355333#9543167, @hnowlan wrote: > This command also fails - but interestingly the host itself appears to have lost network connectivity. `ethtool` reports th... [16:37:41] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2005.codfw.wmnet with OS bookworm [16:37:48] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10cmooney) The reimage problem may be the firmware issue - link not coming up during the debian installer. @hnowlan if you want to try the reimage again I can take a look at t... [16:39:14] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10Jhancock.wm) port shows activity on the server, but the network side is showing as down. Reseating either cable does nothing. but reseating the SFP makes it come back up Pos... [16:39:46] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 44 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:39:59] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10Jhancock.wm) @cmooney it was me, I was reseating the cable [16:41:32] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10cmooney) >>! In T355333#9543231, @Jhancock.wm wrote: > port shows activity on the server, but the network side is showing as down. Reseating either cable does nothing. but re... 
[16:42:05] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10vm-requests: eqiad: 1 VM request for ncmonitor - https://phabricator.wikimedia.org/T356710 (10BCornwall) 05In progress→03Resolved Thanks! [16:47:28] RECOVERY - Disk space on vrts1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=vrts1002&var-datasource=eqiad+prometheus/ops [16:48:16] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye [16:48:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 75%: T355864 - Post migration repool of db2121', diff saved to https://phabricator.wikimedia.org/P56781 and previous config saved to /var/cache/conftool/dbconfig/20240214-164834-arnaudb.json [16:48:39] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [16:49:49] (03PS1) 10JHathaway: puppetserver: don't fail setting up rsync_module [puppet] - 10https://gerrit.wikimedia.org/r/1003486 [16:52:14] !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye [16:52:17] (03CR) 10Ssingh: [C: 03+1] Remove deprecated X-Webkit-CSP-Report-Only response header [puppet] - 10https://gerrit.wikimedia.org/r/1003109 (https://phabricator.wikimedia.org/T357479) (owner: 10TheDJ) [16:53:38] (03PS2) 10JHathaway: puppetserver: don't fail setting up rsync_module [puppet] - 10https://gerrit.wikimedia.org/r/1003486 (https://phabricator.wikimedia.org/T356991) [16:53:53] (03CR) 10Fabfur: [V: 03+1 C: 03+2] Remove deprecated X-Webkit-CSP-Report-Only response header [puppet] - 10https://gerrit.wikimedia.org/r/1003109 (https://phabricator.wikimedia.org/T357479) (owner: 10TheDJ) [16:54:13] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10hnowlan) Reimaging fails still after 
these changes fwiw - however, a reboot has broken network connectivity again?! The host is up and rebooted in the management interface, b... [16:55:16] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1003486 (https://phabricator.wikimedia.org/T356991) (owner: 10JHathaway) [16:56:37] !log disabled puppet on A:cp-upload to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003109 selectively (T357479) [16:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:42] T357479: Stop sending X-Webkit-CSP and X-Webkit-CSP-Report-Only headers - https://phabricator.wikimedia.org/T357479 [16:59:27] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10Jhancock.wm) replaced the SFP this time. came up. server reboot is causing the port to go down, possibly [17:03:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 100%: T355864 - Post migration repool of db2121', diff saved to https://phabricator.wikimedia.org/P56782 and previous config saved to /var/cache/conftool/dbconfig/20240214-170339-arnaudb.json [17:03:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 25%: T355864 - Post migration repool of db2145', diff saved to https://phabricator.wikimedia.org/P56783 and previous config saved to /var/cache/conftool/dbconfig/20240214-170345-arnaudb.json [17:03:56] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [17:05:07] (03CR) 10Volans: [C: 03+1] "This file is starting to get a bit out of control in size, but I guess is out of scope. 
The change looks ok to me but I'd like some more " [puppet] - 10https://gerrit.wikimedia.org/r/1003464 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [17:10:32] !log enabled puppet on A:cp-upload to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003109 selectively (T357479) [17:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:47] T357479: Stop sending X-Webkit-CSP and X-Webkit-CSP-Report-Only headers - https://phabricator.wikimedia.org/T357479 [17:13:16] !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye [17:13:58] (03CR) 10Fabfur: [V: 03+1 C: 03+2] "Thanks @TheDJ for this patch, has been applied successfully to our servers!" [puppet] - 10https://gerrit.wikimedia.org/r/1003109 (https://phabricator.wikimedia.org/T357479) (owner: 10TheDJ) [17:16:01] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10cmooney) >>! In T355333#9543255, @hnowlan wrote: > Reimaging fails still after these changes fwiw - however, a reboot has broken network connectivity again?! The host is up a... 
[17:18:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 50%: T355864 - Post migration repool of db2145', diff saved to https://phabricator.wikimedia.org/P56784 and previous config saved to /var/cache/conftool/dbconfig/20240214-171850-arnaudb.json [17:18:53] (03CR) 10Daniel Kinzler: Configure parser cache filters for parsoid-pcache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) (owner: 10Daniel Kinzler) [17:19:03] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [17:19:15] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10Dzahn) >>! In T316421#9542499, @Jelto wrote: > and `prometheus-etherpad-exporter` `0.7` as well. The `etherpad-lite` package also installed `no... [17:20:30] (03CR) 10Dzahn: [C: 04-1] "https://phabricator.wikimedia.org/T316421#9542499" [puppet] - 10https://gerrit.wikimedia.org/r/999973 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn) [17:20:37] (03CR) 10Dzahn: [C: 04-1] "https://phabricator.wikimedia.org/T316421#9542499" [puppet] - 10https://gerrit.wikimedia.org/r/1003073 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn) [17:23:02] (03PS1) 10Ayounsi: makevm: pass the v6 IP to GntInstance.add [cookbooks] - 10https://gerrit.wikimedia.org/r/1003490 (https://phabricator.wikimedia.org/T300152) [17:27:11] (03PS1) 10Ayounsi: Ganeti: pass the v4 and v6 IPs to the VM as fw_cfg [software/spicerack] - 10https://gerrit.wikimedia.org/r/1003491 (https://phabricator.wikimedia.org/T300152) [17:27:36] (03PS1) 10Dzahn: etherpad: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/1003492 [17:29:21] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2282.codfw.wmnet with reason: host reimage [17:31:37] (03CR) 10Ladsgroup: 
"It's mostly because it's basically the second or third biggest user_properties table in the whole infra, so we can just clean that up." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003408 (https://phabricator.wikimedia.org/T357072) (owner: 10Ladsgroup) [17:31:57] jouncebot: nowandnext [17:31:58] No deployments scheduled for the next 0 hour(s) and 28 minute(s) [17:31:58] In 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1800) [17:32:05] !log fnegri@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet [17:32:14] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2282.codfw.wmnet with reason: host reimage [17:32:23] (03PS1) 10Dzahn: etherpad: add $service_ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/1003493 (https://phabricator.wikimedia.org/T316421) [17:32:33] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10hnowlan) I tried another reimage and it currently proceeding successfully - maybe replacing the SFP did the job? This is all a bit inexplicable. 
[17:32:41] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1001.eqiad.wmnet [17:32:57] (03CR) 10Dzahn: "he" [puppet] - 10https://gerrit.wikimedia.org/r/1003493 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn) [17:33:04] (03CR) 10Ladsgroup: [C: 03+2] Enable echo conditional defaults for loginwiki since 2013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003408 (https://phabricator.wikimedia.org/T357072) (owner: 10Ladsgroup) [17:33:53] (03Merged) 10jenkins-bot: Enable echo conditional defaults for loginwiki since 2013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003408 (https://phabricator.wikimedia.org/T357072) (owner: 10Ladsgroup) [17:33:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 75%: T355864 - Post migration repool of db2145', diff saved to https://phabricator.wikimedia.org/P56785 and previous config saved to /var/cache/conftool/dbconfig/20240214-173355-arnaudb.json [17:34:01] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [17:34:36] (03CR) 10CI reject: [V: 04-1] Ganeti: pass the v4 and v6 IPs to the VM as fw_cfg [software/spicerack] - 10https://gerrit.wikimedia.org/r/1003491 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [17:35:02] (03PS39) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [17:36:05] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:1003408|Enable echo conditional defaults for loginwiki since 2013 (T357072)]] [17:36:11] T357072: Echo: Drop droppable rows from user_properties - https://phabricator.wikimedia.org/T357072 [17:38:42] (03PS4) 10Effie Mouzeli: mw-mcrouter: add helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) [17:38:47] (03CR) 10Effie Mouzeli: mw-mcrouter: add helmfile (031 comment) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [17:38:54] (03CR) 10CI reject: [V: 04-1] mw-mcrouter: add helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [17:39:02] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1003408|Enable echo conditional defaults for loginwiki since 2013 (T357072)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:39:25] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirtlocal1001.eqiad.wmnet [17:39:49] (PuppetZeroResources) firing: Puppet has failed generate resources on puppetserver2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [17:41:02] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [17:41:33] 10SRE, 10ops-codfw, 10serviceops: Multiple hosts in codfw fail to PXE boot upon reimage - https://phabricator.wikimedia.org/T357539 (10cmooney) All of these are connected to lsw1-a3-codfw (new L3 switch) and they may be the first we've tried to reimage connected to new switch. Investigating if it is related... 
[17:44:14] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617 [17:44:20] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [17:48:14] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:1003408|Enable echo conditional defaults for loginwiki since 2013 (T357072)]] (duration: 12m 08s) [17:48:18] T357072: Echo: Drop droppable rows from user_properties - https://phabricator.wikimedia.org/T357072 [17:48:33] (03CR) 10Volans: "Apart the tests looks ok" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1003491 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [17:48:48] (03PS1) 10Dzahn: phabricator,etherpad: fix some puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/1003496 [17:49:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 100%: T355864 - Post migration repool of db2145', diff saved to https://phabricator.wikimedia.org/P56786 and previous config saved to /var/cache/conftool/dbconfig/20240214-174900-arnaudb.json [17:49:05] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [17:49:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2104 (re)pooling @ 25%: T355864 - Post migration repool of db2104', diff saved to https://phabricator.wikimedia.org/P56787 and previous config saved to /var/cache/conftool/dbconfig/20240214-174906-arnaudb.json [17:49:50] (03CR) 10Volans: "LGTM, to be merged after the spicerack release with the related patch." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1003490 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [17:50:38] (03PS2) 10Dzahn: phabricator,etherpad: fix some puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/1003496 [17:50:38] 10SRE, 10ops-codfw, 10serviceops: Multiple hosts in codfw fail to PXE boot upon reimage - https://phabricator.wikimedia.org/T357539 (10cmooney) I think what's happening is the new switch is not configured to insert the port information for DHCP requests over the legacy row-wide vlan. Best way forward is to... [17:51:25] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10Dzahn) >>! In T316421#9542499, @Jelto wrote: >There is no puppet flag to enable or disable the process. https://gerrit.wikimedia.org/r/c/opera... [17:54:38] (03PS2) 10Dzahn: etherpad: add $service_ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/1003493 (https://phabricator.wikimedia.org/T316421) [17:56:38] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2282.codfw.wmnet with OS bullseye [17:58:33] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1002.eqiad.wmnet [17:59:01] (03PS3) 10Dzahn: site: add etherpad role to etherpad1004 [puppet] - 10https://gerrit.wikimedia.org/r/999973 (https://phabricator.wikimedia.org/T316421) [17:59:29] (03PS4) 10Dzahn: site: add etherpad role to etherpad1004 [puppet] - 10https://gerrit.wikimedia.org/r/999973 (https://phabricator.wikimedia.org/T316421) [17:59:43] !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on mw2282.codfw.wmnet with reason: Testing if reimage is stable T355333 [17:59:47] !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw2282.codfw.wmnet with reason: Testing if reimage is stable T355333 [17:59:48] T355333: Possible firmware 
issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1800) [18:01:28] (03PS4) 10Hashar: python-build: default to run as nobody from /deploy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T342346) [18:01:30] (03PS3) 10Hashar: python-build: add make and virtualenv [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1003002 (https://phabricator.wikimedia.org/T342346) [18:01:32] (03PS7) 10Hashar: python-build: set date of source files in the wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T259611) [18:01:34] (03PS3) 10Hashar: python-build: ensure frozen-requirements is exhaustive [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/941443 (https://phabricator.wikimedia.org/T342346) [18:01:36] (03PS1) 10Hashar: Rebuild python-build images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1003497 [18:01:38] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617 [18:01:55] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [18:02:34] (03PS2) 10Dzahn: site: apply etherpad role on both eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/1003073 (https://phabricator.wikimedia.org/T316421) [18:02:51] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [18:03:37] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617 [18:04:11] !log arnaudb@cumin1002 dbctl commit 
(dc=all): 'db2104 (re)pooling @ 50%: T355864 - Post migration repool of db2104', diff saved to https://phabricator.wikimedia.org/P56788 and previous config saved to /var/cache/conftool/dbconfig/20240214-180411-arnaudb.json [18:04:28] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [18:05:12] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirtlocal1002.eqiad.wmnet [18:05:41] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for mw2379 - cmooney@cumin1002" [18:06:32] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for mw2379 - cmooney@cumin1002" [18:06:32] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:06:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T352010)', diff saved to https://phabricator.wikimedia.org/P56789 and previous config saved to /var/cache/conftool/dbconfig/20240214-180647-ladsgroup.json [18:07:13] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:08:33] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [18:09:20] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache mw2379.codfw.wmnet on all recursors [18:09:24] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mw2379.codfw.wmnet on all recursors [18:11:14] !log running `homer 'cr*codfw*' commit 'T351074'` to pick up mw2282's bgp change [18:11:19] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:20] T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 [18:11:25] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10hnowlan) 05Open→03Resolved a:03hnowlan Reimage was successful, networking survived a reboot. All done! [18:12:07] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host mw2379.codfw.wmnet with OS bullseye [18:13:44] (03PS5) 10Effie Mouzeli: mw-mcrouter: add helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) [18:14:24] !log hnowlan@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=mw2282.codfw.wmnet [18:17:29] (03PS1) 10Ladsgroup: exim: Avoid considering wikimedia domains as local in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1003498 (https://phabricator.wikimedia.org/T343925) [18:18:24] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1003486 (https://phabricator.wikimedia.org/T356991) (owner: 10JHathaway) [18:18:24] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1003.eqiad.wmnet [18:19:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2104 (re)pooling @ 75%: T355864 - Post migration repool of db2104', diff saved to https://phabricator.wikimedia.org/P56790 and previous config saved to /var/cache/conftool/dbconfig/20240214-181916-arnaudb.json [18:19:21] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [18:21:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P56791 and previous config saved to /var/cache/conftool/dbconfig/20240214-182154-ladsgroup.json [18:24:54] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirtlocal1003.eqiad.wmnet [18:27:29] (03PS1) 10Hnowlan: mw-jobrunner: bump replicas for cirrusSearchLinksUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003499 (https://phabricator.wikimedia.org/T349796) [18:31:40] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2379.codfw.wmnet with reason: host reimage [18:34:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2104 (re)pooling @ 100%: T355864 - Post migration repool of db2104', diff saved to https://phabricator.wikimedia.org/P56792 and previous config saved to /var/cache/conftool/dbconfig/20240214-183421-arnaudb.json [18:34:26] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [18:34:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 25%: T355864 - Post migration repool of db2153', diff saved to https://phabricator.wikimedia.org/P56793 and previous config saved to 
/var/cache/conftool/dbconfig/20240214-183426-arnaudb.json [18:34:36] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2379.codfw.wmnet with reason: host reimage [18:34:51] 10SRE, 10Domains, 10Fundraising-Backlog, 10serviceops: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (10Dwisehaupt) @RLazarus Does this really need an apache config patch or just an update to the redirect rules in `hieradata/common/mediawiki.yaml`? [18:37:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P56794 and previous config saved to /var/cache/conftool/dbconfig/20240214-183700-ladsgroup.json [18:37:07] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617 [18:37:12] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [18:39:32] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [18:43:13] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10Jhancock.wm) I will mark the SFP I pulled as bad. See if I can test it on a new server. 
[18:46:53] (03PS1) 10Ebernhardson: cirrus updater: Introduce backfill releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003502 (https://phabricator.wikimedia.org/T356303) [18:47:46] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for codfw mw servers - cmooney@cumin1002" [18:47:51] (03CR) 10CI reject: [V: 04-1] cirrus updater: Introduce backfill releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003502 (https://phabricator.wikimedia.org/T356303) (owner: 10Ebernhardson) [18:48:39] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for codfw mw servers - cmooney@cumin1002" [18:48:39] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:49:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 50%: T355864 - Post migration repool of db2153', diff saved to https://phabricator.wikimedia.org/P56795 and previous config saved to /var/cache/conftool/dbconfig/20240214-184931-arnaudb.json [18:49:36] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [18:51:57] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache mw2380.codfw.wmnet on all recursors [18:52:00] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mw2380.codfw.wmnet on all recursors [18:52:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T352010)', diff saved to https://phabricator.wikimedia.org/P56796 and previous config saved to /var/cache/conftool/dbconfig/20240214-185207-ladsgroup.json [18:52:09] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance [18:52:12] T352010: Gradually drop old pagelinks columns - 
https://phabricator.wikimedia.org/T352010 [18:52:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance [18:52:15] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache mw2383.codfw.wmnet on all recursors [18:52:18] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mw2383.codfw.wmnet on all recursors [18:52:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T352010)', diff saved to https://phabricator.wikimedia.org/P56797 and previous config saved to /var/cache/conftool/dbconfig/20240214-185218-ladsgroup.json [18:53:30] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host mw2380.codfw.wmnet with OS bullseye [18:54:13] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host mw2383.codfw.wmnet with OS bullseye [18:57:07] (03PS1) 10Jdlrobson: New communities will not share scripts going forward [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003505 (https://phabricator.wikimedia.org/T331679) [18:57:09] (03PS2) 10Ebernhardson: cirrus updater: Introduce backfill releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003502 (https://phabricator.wikimedia.org/T356303) [18:58:03] (03CR) 10CI reject: [V: 04-1] cirrus updater: Introduce backfill releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003502 (https://phabricator.wikimedia.org/T356303) (owner: 10Ebernhardson) [18:58:23] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2379.codfw.wmnet with OS bullseye [19:00:04] jeena and brennen: gettimeofday() says it's time for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1900) [19:00:04] jeena and brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1900). [19:00:54] o/ [19:04:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 75%: T355864 - Post migration repool of db2153', diff saved to https://phabricator.wikimedia.org/P56798 and previous config saved to /var/cache/conftool/dbconfig/20240214-190436-arnaudb.json [19:04:53] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [19:08:06] 10SRE, 10procurement, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (10andrea.denisse) [19:08:52] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2380.codfw.wmnet with reason: host reimage [19:09:02] 10SRE, 10procurement, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (10andrea.denisse) [19:09:29] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2383.codfw.wmnet with reason: host reimage [19:10:13] (03CR) 10JHathaway: [C: 03+2] puppetserver: don't fail setting up rsync_module [puppet] - 10https://gerrit.wikimedia.org/r/1003486 (https://phabricator.wikimedia.org/T356991) (owner: 10JHathaway) [19:10:35] (03PS3) 10Ebernhardson: cirrus updater: Introduce backfill releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003502 (https://phabricator.wikimedia.org/T356303) [19:11:23] (03PS1) 10Dzahn: add alert for planet content updates (last modified) [alerts] - 10https://gerrit.wikimedia.org/r/1003507 (https://phabricator.wikimedia.org/T353298) [19:11:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2380.codfw.wmnet with reason: host reimage [19:12:03] 10SRE, 10Domains, 10Fundraising-Backlog, 10serviceops: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (10RLazarus) Sorry yeah, I was using the term 
broadly. The goal is to edit the Apache config, but that hieradata file is how you'd do it. :) [19:12:52] (03CR) 10CI reject: [V: 04-1] add alert for planet content updates (last modified) [alerts] - 10https://gerrit.wikimedia.org/r/1003507 (https://phabricator.wikimedia.org/T353298) (owner: 10Dzahn) [19:13:43] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2383.codfw.wmnet with reason: host reimage [19:14:25] !log train 1.42.0-wmf.18 (T354436): logs chill, no current blockers, rolling to group1. [19:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:30] T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436 [19:15:02] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003508 (https://phabricator.wikimedia.org/T354436) [19:15:04] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003508 (https://phabricator.wikimedia.org/T354436) (owner: 10TrainBranchBot) [19:15:44] ACKNOWLEDGEMENT - BFD status on lsw1-a4-codfw.mgmt is CRITICAL: Down: 2 Cathal Mooney BFD is configured towards ganeti2034 but not configured on host. - The acknowledgement expires at: 2024-02-29 19:15:18. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:15:49] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003508 (https://phabricator.wikimedia.org/T354436) (owner: 10TrainBranchBot) [19:16:58] ACKNOWLEDGEMENT - BFD status on lsw1-b7-codfw.mgmt is CRITICAL: Down: 2 Cathal Mooney BFD is down to ganeti2023 as it's not configured host side. - The acknowledgement expires at: 2024-02-29 19:16:25. 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:17:18] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Introduce backfill releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003502 (https://phabricator.wikimedia.org/T356303) (owner: 10Ebernhardson) [19:18:11] (03Merged) 10jenkins-bot: cirrus updater: Introduce backfill releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003502 (https://phabricator.wikimedia.org/T356303) (owner: 10Ebernhardson) [19:19:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 100%: T355864 - Post migration repool of db2153', diff saved to https://phabricator.wikimedia.org/P56799 and previous config saved to /var/cache/conftool/dbconfig/20240214-191941-arnaudb.json [19:19:47] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [19:19:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 25%: T355864 - Post migration repool of db2154', diff saved to https://phabricator.wikimedia.org/P56800 and previous config saved to /var/cache/conftool/dbconfig/20240214-191946-arnaudb.json [19:23:00] (03PS1) 10Eevans: Bring restbase & aqs targets up to current [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1003509 (https://phabricator.wikimedia.org/T353550) [19:23:47] (03CR) 10Eevans: [C: 03+2] Bring restbase & aqs targets up to current [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1003509 (https://phabricator.wikimedia.org/T353550) (owner: 10Eevans) [19:23:50] (03CR) 10Eevans: [V: 03+2 C: 03+2] Bring restbase & aqs targets up to current [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1003509 (https://phabricator.wikimedia.org/T353550) (owner: 10Eevans) [19:24:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on puppetserver2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [19:26:06] 10SRE, 10procurement, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (10andrea.denisse) [19:26:24] 10SRE, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (10andrea.denisse) [19:26:37] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.18 refs T354436 [19:26:42] T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436 [19:27:31] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@5c2dd00]: Deploying to updated target list — T353550 [19:27:37] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [19:28:12] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@5c2dd00]: Deploying to updated target list — T353550 (duration: 00m 41s) [19:30:57] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@5c2dd00]: Deploying to updated target list — T353550 [19:31:17] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@5c2dd00]: Deploying to updated target list — T353550 (duration: 00m 20s) [19:32:50] 10SRE, 10Fundraising-Backlog, 10Wikimedia-Apache-configuration, 10serviceops: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (10RLazarus) [19:33:03] (03PS1) 10Cathal Mooney: Remove BFD from routed ganeti peerings on router side [homer/public] - 10https://gerrit.wikimedia.org/r/1003511 (https://phabricator.wikimedia.org/T300152) [19:33:34] (SystemdUnitFailed) firing: (2) sync-puppet-ca.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[19:33:43] (03PS1) 10Eevans: Fix canary name typo [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1003512 (https://phabricator.wikimedia.org/T353550) [19:34:01] (03CR) 10Eevans: [V: 03+2 C: 03+2] Fix canary name typo [software/logstash-logback-encoder] - 10https://gerrit.wikimedia.org/r/1003512 (https://phabricator.wikimedia.org/T353550) (owner: 10Eevans) [19:34:13] !log brennen@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.18 refs T354436 (duration: 07m 35s) [19:34:17] T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436 [19:34:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 50%: T355864 - Post migration repool of db2154', diff saved to https://phabricator.wikimedia.org/P56801 and previous config saved to /var/cache/conftool/dbconfig/20240214-193451-arnaudb.json [19:34:55] (SystemdUnitFailed) firing: ferm.service on mw2282:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:34:57] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [19:35:38] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1036.mgmt.eqiad.wmnet with reboot policy FORCED [19:35:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10Jclark-ctr) @Eevans sorry about missing that. 
kicking of image now [19:35:53] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2380.codfw.wmnet with OS bullseye [19:35:56] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 [19:36:05] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [19:36:13] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 16s) [19:36:27] 10SRE, 10ops-codfw, 10serviceops: Multiple hosts in codfw fail to PXE boot upon reimage - https://phabricator.wikimedia.org/T357539 (10cmooney) 05Open→03Resolved a:03cmooney Yeah the issue here was the hosts being connected to the new switches, but still configured for the legacy vlan. That's fine, bu... [19:37:21] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2383.codfw.wmnet with OS bullseye [19:38:06] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 [19:39:23] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 01m 17s) [19:41:23] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 [19:41:31] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [19:42:08] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 45s) [19:42:42] (03PS2) 10JHathaway: exim: Avoid considering wikimedia domains as local in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1003498 (https://phabricator.wikimedia.org/T343925) (owner: 10Ladsgroup) [19:42:48] (03CR) 10JHathaway: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1003498 (https://phabricator.wikimedia.org/T343925) (owner: 10Ladsgroup) [19:43:09] (03PS1) 10Andrea Denisse: alert: Failover Icinga and Alertmanager to alert2001 [puppet] - 10https://gerrit.wikimedia.org/r/1003513 (https://phabricator.wikimedia.org/T333615) [19:43:14] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 [19:43:29] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 14s) [19:43:34] (SystemdUnitFailed) firing: (2) sync-puppet-ca.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:43:41] (03PS1) 10Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003514 [19:43:46] (03PS1) 10Dwisehaupt: Add wikihole redirect for donatewiki [puppet] - 10https://gerrit.wikimedia.org/r/1003515 (https://phabricator.wikimedia.org/T357436) [19:44:52] (03CR) 10CI reject: [V: 04-1] cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003514 (owner: 10Ebernhardson) [19:46:08] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 [19:46:42] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 34s) [19:46:46] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [19:47:31] (03CR) 10Bking: [C: 03+2] cloudelastic: Begin private IP migration for cloudelastic1007 [puppet] - 10https://gerrit.wikimedia.org/r/999088 (https://phabricator.wikimedia.org/T355617) (owner: 
10Bking) [19:48:06] (03CR) 10Bking: [C: 03+2] "Excellent, thanks for the review." [puppet] - 10https://gerrit.wikimedia.org/r/999088 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:49:05] (03PS1) 10Andrea Denisse: alert: Resolve alerts DNS queries to alert2001 [dns] - 10https://gerrit.wikimedia.org/r/1003516 (https://phabricator.wikimedia.org/T333615) [19:49:11] (03CR) 10JHathaway: [C: 03+1] exim: Avoid considering wikimedia domains as local in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1003498 (https://phabricator.wikimedia.org/T343925) (owner: 10Ladsgroup) [19:49:21] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1003498 (https://phabricator.wikimedia.org/T343925) (owner: 10Ladsgroup) [19:49:33] (03PS2) 10Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003514 [19:49:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 75%: T355864 - Post migration repool of db2154', diff saved to https://phabricator.wikimedia.org/P56802 and previous config saved to /var/cache/conftool/dbconfig/20240214-194956-arnaudb.json [19:50:04] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [19:50:19] (03CR) 10CI reject: [V: 04-1] cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003514 (owner: 10Ebernhardson) [19:50:24] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 [19:50:29] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 05s) [19:50:44] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudelastic1007.wikimedia.org [19:50:54] !log eevans@deploy2002 Started deploy 
[cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 [19:50:58] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 04s) [19:51:01] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (10andrea.denisse) [19:51:07] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 [19:51:10] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 03s) [19:51:41] (03PS3) 10Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003514 [19:52:31] (03CR) 10CI reject: [V: 04-1] cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003514 (owner: 10Ebernhardson) [19:52:56] PROBLEM - Check whether ferm is active by checking the default input chain on mw2282 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [19:53:05] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 [19:53:10] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [19:53:13] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 07s) [19:53:34] (SystemdUnitFailed) firing: (2) sync-puppet-ca.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:53:46] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 [19:53:52] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 06s) [19:53:59] !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 [19:54:05] !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 05s) [19:57:12] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (10andrea.denisse) [19:58:33] (03PS2) 10Jdlrobson: New communities will not share scripts going forward [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003505 (https://phabricator.wikimedia.org/T331679) [19:58:34] (SystemdUnitFailed) firing: (3) puppetserver.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:58:41] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:58:55] (03PS3) 10Jdlrobson: New communities will not share scripts going forward [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003505 (https://phabricator.wikimedia.org/T331679) [19:59:13] (03PS4) 10Jdlrobson: New communities will not share scripts going forward [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003505 (https://phabricator.wikimedia.org/T331679) [19:59:14] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching P{P:cassandra%rack = "a"} and A:restbase and A:eqiad: Restart to pickup logging jars — T353550 - eevans@cumin1002 [19:59:19] T353550: Cassandra (logstash) 
logging broken - https://phabricator.wikimedia.org/T353550 [20:01:39] (03CR) 10RLazarus: "Thanks Dallas! Adding Scott from my team to review and merge." [puppet] - 10https://gerrit.wikimedia.org/r/1003515 (https://phabricator.wikimedia.org/T357436) (owner: 10Dwisehaupt) [20:02:08] (03PS4) 10Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003514 [20:02:52] (03CR) 10CI reject: [V: 04-1] cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003514 (owner: 10Ebernhardson) [20:03:34] (SystemdUnitFailed) firing: (3) puppetserver.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:48] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1007.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [20:04:56] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1007.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002" [20:04:57] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:04:58] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudelastic1007.wikimedia.org [20:05:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 100%: T355864 - Post migration repool of db2154', diff saved to https://phabricator.wikimedia.org/P56803 and previous config saved to /var/cache/conftool/dbconfig/20240214-200501-arnaudb.json [20:05:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling 
@ 25%: T355864 - Post migration repool of db2175', diff saved to https://phabricator.wikimedia.org/P56804 and previous config saved to /var/cache/conftool/dbconfig/20240214-200507-arnaudb.json [20:06:11] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [20:07:17] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) @dcaro here is an update from dell I found a couple of online articles: What Does Uncorrectable Se... [20:07:28] (03PS5) 10Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003514 [20:08:15] (03CR) 10CI reject: [V: 04-1] cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003514 (owner: 10Ebernhardson) [20:09:38] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) https://www.minitool.com/lib/uncorrectable-sector-count.html https://community.wd.com/t/how-to-interp... [20:09:56] (03PS6) 10Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003514 [20:12:09] !log bking@cumin2002 START - Cookbook sre.dns.netbox [20:12:22] (03CR) 10Scott French: [C: 04-1] "Thanks, Dallas! Two quick comments." 
[puppet] - 10https://gerrit.wikimedia.org/r/1003515 (https://phabricator.wikimedia.org/T357436) (owner: 10Dwisehaupt) [20:12:39] (03PS7) 10BCornwall: fifo-log-demux: Decouple service from nginx/ats [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) [20:13:34] (SystemdUnitFailed) resolved: (3) puppetserver.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:13:37] (03CR) 10BCornwall: fifo-log-demux: Decouple service from nginx/ats (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall) [20:14:43] (03PS7) 10Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003514 [20:16:04] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1007 to private IPs - bking@cumin2002" [20:16:57] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1007 to private IPs - bking@cumin2002" [20:16:57] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:17:37] (03PS2) 10Dwisehaupt: Add wikihole redirect for donatewiki [puppet] - 10https://gerrit.wikimedia.org/r/1003515 (https://phabricator.wikimedia.org/T357436) [20:18:12] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate 
[20:20:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 50%: T355864 - Post migration repool of db2175', diff saved to https://phabricator.wikimedia.org/P56805 and previous config saved to /var/cache/conftool/dbconfig/20240214-202012-arnaudb.json [20:20:14] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [20:20:21] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1007 [20:20:27] (03CR) 10Dwisehaupt: "Updated to resolve the issues." [puppet] - 10https://gerrit.wikimedia.org/r/1003515 (https://phabricator.wikimedia.org/T357436) (owner: 10Dwisehaupt) [20:21:39] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1007 [20:22:24] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye [20:22:36] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:22:59] (03PS8) 10Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003514 [20:23:12] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:27:36] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:28:12] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003514 (owner: 10Ebernhardson) [20:29:02] (03Merged) 
10jenkins-bot: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003514 (owner: 10Ebernhardson) [20:31:34] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:31:43] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:34:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase1036.mgmt.eqiad.wmnet with reboot policy FORCED [20:35:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 75%: T355864 - Post migration repool of db2175', diff saved to https://phabricator.wikimedia.org/P56806 and previous config saved to /var/cache/conftool/dbconfig/20240214-203517-arnaudb.json [20:35:19] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [20:36:13] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (10andrea.denisse) [20:36:48] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1007.eqiad.wmnet with reason: host reimage [20:36:51] (03PS1) 10BCornwall: ncmonitor: Remove useless apt-get require [puppet] - 10https://gerrit.wikimedia.org/r/1003524 [20:36:55] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [20:37:05] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:38:06] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1003439 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [20:39:28] (03PS1) 10Scott French: httpbb: add donate.wikimedia.org redirect tests [puppet] - 10https://gerrit.wikimedia.org/r/1003525 (https://phabricator.wikimedia.org/T357436) [20:39:30] (KubernetesCalicoDown) firing: mw2379.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2379.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [20:39:46] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1007.eqiad.wmnet with reason: host reimage [20:41:33] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1036.eqiad.wmnet with OS bullseye [20:41:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host restbase1036.eqiad.wmnet with OS bullseye [20:42:24] (03PS6) 10Bking: cloudelastic: Complete cloudelastic1007's migration [puppet] - 10https://gerrit.wikimedia.org/r/999091 (https://phabricator.wikimedia.org/T355617) [20:42:30] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/999091 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [20:44:27] (03PS1) 10Eevans: cassandra: install git-fat to satisfy scap requirement [puppet] - 10https://gerrit.wikimedia.org/r/1003526 (https://phabricator.wikimedia.org/T353550) [20:46:28] (03CR) 10Hashar: python-build: ensure frozen-requirements is exhaustive (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/941443 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar) [20:46:49] (03PS1) 10Andrea Denisse: alert: Ensure the alert2001 
host is reimaged with Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T333615) [20:47:02] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:48:34] (ProbeDown) firing: (2) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:49:38] (03PS2) 10Andrea Denisse: alert: Ensure the alert2001 host is reimaged with Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T333615) [20:50:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 100%: T355864 - Post migration repool of db2175', diff saved to https://phabricator.wikimedia.org/P56807 and previous config saved to /var/cache/conftool/dbconfig/20240214-205021-arnaudb.json [20:50:27] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [20:50:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 25%: T355864 - Post migration repool of db2176', diff saved to https://phabricator.wikimedia.org/P56808 and previous config saved to /var/cache/conftool/dbconfig/20240214-205027-arnaudb.json [20:51:09] !log bking@puppetmaster1001 manually updating facts data for PCC T355617 [20:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:13] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [20:51:38] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (10Andrew) It's back in service but only as of today. 
[20:52:26] (03PS1) 10Andrea Denisse: alert: Ensure the alert1001 host is reimaged with Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003531 (https://phabricator.wikimedia.org/T333615) [20:52:28] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching P{P:cassandra%rack = "a"} and A:restbase and A:eqiad: Restart to pickup logging jars — T353550 - eevans@cumin1002 [20:52:33] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [20:53:34] (JobUnavailable) firing: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:54:49] (03PS3) 10Andrea Denisse: alert: Ensure the alert2001 host is reimaged with Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T333615) [20:55:14] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:55:32] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:56:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:56:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51452 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [20:56:24] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002" [20:56:29] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1036.eqiad.wmnet with reason: host 
reimage [20:56:36] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (10andrea.denisse) [20:56:53] !log bking@pcc-db1001.puppet-diffs.eqiad1.wikimedia.cloud updating puppet facts for PCC [20:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:57:03] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching P{P:cassandra%rack = "b"} and A:restbase and A:eqiad: Restart to pickup logging jars — T353550 - eevans@cumin1002 [20:57:21] (03CR) 10Scott French: "Current plan:" [puppet] - 10https://gerrit.wikimedia.org/r/1003525 (https://phabricator.wikimedia.org/T357436) (owner: 10Scott French) [20:57:40] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (10andrea.denisse) [20:58:22] (03CR) 10Ssingh: [C: 03+1] ncmonitor: Remove useless apt-get require [puppet] - 10https://gerrit.wikimedia.org/r/1003524 (owner: 10BCornwall) [20:59:23] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1036.eqiad.wmnet with reason: host reimage [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T2100). [21:00:05] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:31] (03PS1) 10Andrea Denisse: Revert "grafana: Ensure the grafana2001 hosts uses Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/1003469 [21:01:05] (03CR) 10Herron: [C: 03+1] thanos: run Thanos components in a systemd slice [puppet] - 10https://gerrit.wikimedia.org/r/1003439 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [21:01:24] o/ ptrdrny [21:01:28] o/ present [21:02:37] hi Jdlrobson: i can deploy for you [21:02:43] 1 sec [21:03:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003505 (https://phabricator.wikimedia.org/T331679) (owner: 10Jdlrobson) [21:04:16] not sure if there's something i need to do when a new dblist is added [21:04:35] (03Merged) 10jenkins-bot: New communities will not share scripts going forward [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003505 (https://phabricator.wikimedia.org/T331679) (owner: 10Jdlrobson) [21:05:02] !log cjming@deploy2002 Started scap: Backport for [[gerrit:1003505|New communities will not share scripts going forward (T331679)]] [21:05:08] T331679: Communities can disable sharing of site/user scripts between Vector and Vector 2022 skins - https://phabricator.wikimedia.org/T331679 [21:05:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 50%: T355864 - Post migration repool of db2176', diff saved to https://phabricator.wikimedia.org/P56810 and previous config saved to /var/cache/conftool/dbconfig/20240214-210531-arnaudb.json [21:05:46] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [21:08:00] !log cjming@deploy2002 cjming and jdlrobson: Backport for [[gerrit:1003505|New communities will not share scripts going forward (T331679)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:08:05] Jdlrobson: are you able to test? 
[21:09:16] (03PS1) 10Bking: cloudelastic: remove unneeded master eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1003535 (https://phabricator.wikimedia.org/T355617) [21:09:56] (03PS2) 10Bking: cloudelastic: remove unneeded master eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1003535 (https://phabricator.wikimedia.org/T355617) [21:10:00] yep [21:11:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/999091 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:12:24] need a bit more time on this one [21:12:34] np - take your time [21:13:23] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1003535 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:14:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T352010)', diff saved to https://phabricator.wikimedia.org/P56811 and previous config saved to /var/cache/conftool/dbconfig/20240214-211413-ladsgroup.json [21:14:28] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [21:14:33] (KubernetesCalicoDown) firing: (3) mw2379.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:15:41] still looking... [21:18:43] cjming: something is misbehaving but I'm not sure why. It's setup correctly [21:18:54] I am wondering if I forgot to register the dblist somewhere. 
[21:20:18] (03PS1) 10Jdlrobson: Register dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003542 [21:20:20] cjming: i did it again..^ [21:20:27] I need this one as well for the patch to work [21:20:38] (CI should really be detecting this) [21:20:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 75%: T355864 - Post migration repool of db2176', diff saved to https://phabricator.wikimedia.org/P56812 and previous config saved to /var/cache/conftool/dbconfig/20240214-212038-arnaudb.json [21:20:47] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [21:21:06] (03CR) 10CI reject: [V: 04-1] Register dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003542 (owner: 10Jdlrobson) [21:21:35] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/999091 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:22:03] Jdlrobson: ok - should we revert and you can roll a new patch with the update? or merge and rebase the new one? [21:22:49] What would be your preference? [21:22:49] er ... rebase, then merge your follow up patch? [21:22:53] yep [21:23:20] i'm fine with syncing if you don't think it'll cause an issue and i can backport the follow up one right away [21:23:32] hmm CI is being funky [21:24:03] (03PS2) 10Jdlrobson: Register dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003542 [21:24:11] ^ cjming ok that should do it [21:24:39] so should i sync the current one? [21:25:22] cjming: no we want to wait [21:25:26] they need to go out together [21:25:57] ok - i'm not going to sync, and i'll scap backport them together [21:26:08] !log cjming@deploy2002 Sync cancelled. [21:26:08] thanks! 
[21:26:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003542 (owner: 10Jdlrobson) [21:27:52] (03Merged) 10jenkins-bot: Register dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003542 (owner: 10Jdlrobson) [21:28:14] !log cjming@deploy2002 Started scap: Backport for [[gerrit:1003505|New communities will not share scripts going forward (T331679)]], [[gerrit:1003542|Register dblist]] [21:28:18] T331679: Communities can disable sharing of site/user scripts between Vector and Vector 2022 skins - https://phabricator.wikimedia.org/T331679 [21:29:21] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P56813 and previous config saved to /var/cache/conftool/dbconfig/20240214-212920-ladsgroup.json [21:29:41] !log cjming@deploy2002 cjming and jdlrobson: Backport for [[gerrit:1003505|New communities will not share scripts going forward (T331679)]], [[gerrit:1003542|Register dblist]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:30:06] Jdlrobson: wanna try retesting? [21:30:23] cjming: yes pleaze [21:30:56] cjming: hurrah! please sync! [21:30:59] now it's working :) [21:31:02] yay! [21:31:07] !log cjming@deploy2002 cjming and jdlrobson: Continuing with sync [21:34:03] (03PS2) 10C. 
Scott Ananian: Turn on Parsoid read views by default on wikitech Talk pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999061 (https://phabricator.wikimedia.org/T355374) [21:35:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 100%: T355864 - Post migration repool of db2176', diff saved to https://phabricator.wikimedia.org/P56814 and previous config saved to /var/cache/conftool/dbconfig/20240214-213544-arnaudb.json [21:35:54] T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 [21:36:41] !log eevans@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching P{P:cassandra%rack = "b"} and A:restbase and A:eqiad: Restart to pickup logging jars — T353550 - eevans@cumin1002 [21:36:48] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [21:37:02] (03CR) 10Subramanya Sastry: [C: 03+1] Turn on Parsoid read views by default on wikitech Talk pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999061 (https://phabricator.wikimedia.org/T355374) (owner: 10C. Scott Ananian) [21:37:30] who is doing the backport today?  i'm late to the party, but could i get a config change in? [21:37:50] hi cscott -- sure what's the patch number? [21:38:21] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:1003505|New communities will not share scripts going forward (T331679)]], [[gerrit:1003542|Register dblist]] (duration: 10m 06s) [21:38:26] T331679: Communities can disable sharing of site/user scripts between Vector and Vector 2022 skins - https://phabricator.wikimedia.org/T331679 [21:38:26] Jdlrobson: should be live! [21:39:02] cjming: https://gerrit.wikimedia.org/r/999061 i just added it to the calendar [21:39:07] cjming: thanks a bunch! [21:39:12] ANd thanks for the admin was just about to do that [21:40:05] Jdlrobson: yw! 
glad it worked out [21:40:25] cscott: good timing - i'll do it now [21:40:32] there's no canary for wikitech (i learned during last week's backport) so i can't do much during the canary phase other than check that the other sites haven't been affected; it needs to deploy fully before i can check that the config change was effective on wikitech. [21:40:46] but i'll get set up to do those checks [21:40:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999061 (https://phabricator.wikimedia.org/T355374) (owner: 10C. Scott Ananian) [21:41:21] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1032.eqiad.wmnet: Restart to pickup logging jars — T353550 - eevans@cumin1002 [21:41:31] (03Merged) 10jenkins-bot: Turn on Parsoid read views by default on wikitech Talk pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999061 (https://phabricator.wikimedia.org/T355374) (owner: 10C. 
Scott Ananian) [21:41:31] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:41:40] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:41:48] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:41:54] !log cjming@deploy2002 Started scap: Backport for [[gerrit:999061|Turn on Parsoid read views by default on wikitech Talk pages (T355374)]] [21:41:54] cscott: sounds good [21:41:58] T355374: Use Parsoid for DiscussionTools on wikitech - https://phabricator.wikimedia.org/T355374 [21:41:59] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:43:13] (03PS1) 10Ebernhardson: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003547 [21:43:23] cscott: re wikitech pre-sync testing, T237773 is the blocker (or at least *a* blocker). I have some hope that will be resolved via T292707 before the heat death of the universe. [21:43:23] T237773: Move Wikitech onto the production MW cluster - https://phabricator.wikimedia.org/T237773 [21:43:24] T292707: Migrate Wikitech to Kubernetes - https://phabricator.wikimedia.org/T292707 [21:43:27] !log cjming@deploy2002 cscott and cjming: Backport for [[gerrit:999061|Turn on Parsoid read views by default on wikitech Talk pages (T355374)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:43:41] cscott: shall i sync? 
[21:43:55] * subbu is excited about the milestone [21:44:15] cjming i'm just going to sanity check that the configuration on enwiki canary hasn't changed hang on [21:44:25] sure thing [21:44:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P56815 and previous config saved to /var/cache/conftool/dbconfig/20240214-214427-ladsgroup.json [21:45:25] cjming ok, confirmed that i haven't broken enwiki at least, go ahead with the sync [21:45:37] alrighty [21:45:42] !log cjming@deploy2002 cscott and cjming: Continuing with sync [21:49:45] (03CR) 10Ebernhardson: [C: 03+2] cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003547 (owner: 10Ebernhardson) [21:50:50] (03Merged) 10jenkins-bot: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003547 (owner: 10Ebernhardson) [21:51:17] whoops, my connection dropped.  cjming, ping me when sync is done? [21:51:38] cscott: sure thing - almost there [21:51:59] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1032.eqiad.wmnet: Restart to pickup logging jars — T353550 - eevans@cumin1002 [21:52:04] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550 [21:52:38] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:999061|Turn on Parsoid read views by default on wikitech Talk pages (T355374)]] (duration: 10m 44s) [21:52:43] T355374: Use Parsoid for DiscussionTools on wikitech - https://phabricator.wikimedia.org/T355374 [21:53:29] cscott: should be live! [21:53:30] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:53:39] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:54:08] Hi, do we have 5 minutes or so to deploy one config patch? 
[21:54:19] lol - sure
[21:54:26] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/997274
[21:54:45] (PS5) Zoranzoki21: throttle.php: Add throttle rule for editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/997274 (https://phabricator.wikimedia.org/T356654)
[21:55:56] (CR) Scott French: [C: +1] "Thanks, Dallas. This looks good to me. I'll +2 and merge tomorrow during the deployment window." [puppet] - https://gerrit.wikimedia.org/r/1003515 (https://phabricator.wikimedia.org/T357436) (owner: Dwisehaupt)
[21:56:32] Kizule: will do yours and then close windo
[21:56:35] *window
[21:56:47] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[21:56:48] cjming: Sounds good, because that one doesn't need mwdebug.
[21:56:56] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:57:10] (CR) RLazarus: [C: +1] httpbb: add donate.wikimedia.org redirect tests [puppet] - https://gerrit.wikimedia.org/r/1003525 (https://phabricator.wikimedia.org/T357436) (owner: Scott French)
[21:57:25] Kizule: so i'll just sync when it's ready
[21:57:28] (PS7) Ryan Kemper: cloudelastic: Complete cloudelastic1007's migration [puppet] - https://gerrit.wikimedia.org/r/999091 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[21:57:30] (PS6) Zoranzoki21: throttle.php: Add throttle rule for editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/997274 (https://phabricator.wikimedia.org/T356654)
[21:57:35] (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/997274 (https://phabricator.wikimedia.org/T356654) (owner: Zoranzoki21)
[21:57:37] cjming: Okay
[21:58:21] (Merged) jenkins-bot: throttle.php: Add throttle rule for editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/997274 (https://phabricator.wikimedia.org/T356654) (owner: Zoranzoki21)
[21:58:47] !log cjming@deploy2002 Started scap: Backport for [[gerrit:997274|throttle.php: Add throttle rule for editathon (T356654)]]
[21:58:52] T356654: Request to remove account creation limit during edit-a-thon at WIT - https://phabricator.wikimedia.org/T356654
[21:59:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T352010)', diff saved to https://phabricator.wikimedia.org/P56816 and previous config saved to /var/cache/conftool/dbconfig/20240214-215934-ladsgroup.json
[21:59:36] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[21:59:40] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[21:59:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[22:00:04] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T2200)
[22:00:17] !log cjming@deploy2002 zoranzoki21 and cjming: Backport for [[gerrit:997274|throttle.php: Add throttle rule for editathon (T356654)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:00:20] !log cjming@deploy2002 zoranzoki21 and cjming: Continuing with sync
[22:00:40] cjming: I've added it to a calendar, so we can keep up with the procedure
[22:01:04] (CR) Bking: "recheck" [puppet] - https://gerrit.wikimedia.org/r/999091 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[22:02:15] Kizule: thanks - should be live here shortly
[22:04:33] (CR) Bking: [C: +2] cloudelastic: Complete cloudelastic1007's migration [puppet] - https://gerrit.wikimedia.org/r/999091 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[22:07:18] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:997274|throttle.php: Add throttle rule for editathon (T356654)]] (duration: 08m 31s)
[22:07:24] T356654: Request to remove account creation limit during edit-a-thon at WIT - https://phabricator.wikimedia.org/T356654
[22:07:38] Kizule: and it's live
[22:07:55] Thanks cjming, I guess you can close the window now. :)
[22:08:10] ya :)
[22:08:15] !log end of UTC late backport window
[22:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:33] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:09:11] could someone remind me of the difference between labswiki and wikitech as a key in InitialiseSettings.php ?
[22:10:09] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002"
[22:10:10] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1007.eqiad.wmnet with OS bullseye
[22:13:16] !log restarting Cassandra: restbase/codfw, row b — T353550
[22:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:23] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550
[22:13:29] (PS1) C. Scott Ananian: Correctly turn on Parsoid read views by default on wikitech Talk pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1003551
[22:15:39] (CR) Subramanya Sastry: [C: +1] Correctly turn on Parsoid read views by default on wikitech Talk pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1003551 (owner: C. Scott Ananian)
[22:15:43] PROBLEM - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is CRITICAL: connect to address 10.192.16.153 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[22:16:43] RECOVERY - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is OK: TCP OK - 0.030 second response time on 10.192.16.153 port 9042 https://phabricator.wikimedia.org/T93886
[22:17:12] (PS2) C. Scott Ananian: Turn on Parsoid read views by default on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/999062 (https://phabricator.wikimedia.org/T355566)
[22:18:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:19:02] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic
[22:19:13] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic
[22:20:11] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1005*,cloudelastic1006* for IP migration - bking@cumin2002 - T355617
[22:20:14] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1005*,cloudelastic1006* for IP migration - bking@cumin2002 - T355617
[22:20:15] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[22:20:27] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:22:46] (PS3) Bking: cloudelastic: remove unneeded master eligibles [puppet] - https://gerrit.wikimedia.org/r/1003535 (https://phabricator.wikimedia.org/T355617)
[22:23:15] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:23:33] (SystemdUnitFailed) firing: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:23:34] (ProbeDown) resolved: (2) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:23:38] (JobUnavailable) resolved: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:26:55] (CR) Ryan Kemper: [C: +1] cloudelastic: remove unneeded master eligibles [puppet] - https://gerrit.wikimedia.org/r/1003535 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[22:27:28] (CR) Bking: [C: +2] cloudelastic: remove unneeded master eligibles [puppet] - https://gerrit.wikimedia.org/r/1003535 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[22:33:58] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[22:34:03] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[22:39:30] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching P{P:cassandra%rack = "c"} and A:restbase and A:codfw: Restart to pickup logging jars — T353550 - eevans@cumin1002
[22:39:35] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550
[22:40:19] (PS1) Bking: cloudelastic: Begin private IP migration for cloudelastic1006 [puppet] - https://gerrit.wikimedia.org/r/1003557 (https://phabricator.wikimedia.org/T355617)
[22:41:39] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1003557 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[22:47:34] (PS1) Bking: cloudelastic: Complete cloudelastic1006's migration [puppet] - https://gerrit.wikimedia.org/r/1003558 (https://phabricator.wikimedia.org/T355617)
[22:48:03] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[22:48:07] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[22:48:34] (SystemdUnitFailed) resolved: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:49:25] !log bking@cumin2002 conftool action : set/weight=10; selector: name=cloudelastic1007.eqiad.wmnet
[22:49:37] !log bking@cumin2002 conftool action : set/weight=10; selector: name=cloudelastic1008.eqiad.wmnet
[22:50:05] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cloudelastic1007.eqiad.wmnet
[22:50:57] !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cloudelastic1008.eqiad.wmnet
[22:51:09] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[22:54:55] (PS1) Bking: cloudelastic: Begin private IP migration for cloudelastic1005 [puppet] - https://gerrit.wikimedia.org/r/1003561 (https://phabricator.wikimedia.org/T355617)
[22:57:43] (PS1) Bking: cloudelastic: Complete cloudelastic1005's migration [puppet] - https://gerrit.wikimedia.org/r/1003563 (https://phabricator.wikimedia.org/T355617)
[22:58:52] (CR) Ryan Kemper: [C: +1] cloudelastic: Complete cloudelastic1006's migration [puppet] - https://gerrit.wikimedia.org/r/1003558 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[22:59:05] (CR) Ryan Kemper: [C: +1] cloudelastic: Begin private IP migration for cloudelastic1006 [puppet] - https://gerrit.wikimedia.org/r/1003557 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[22:59:21] (CR) Ryan Kemper: [C: +1] cloudelastic: Begin private IP migration for cloudelastic1005 [puppet] - https://gerrit.wikimedia.org/r/1003561 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[22:59:30] (CR) Ryan Kemper: [C: +1] cloudelastic: Complete cloudelastic1005's migration [puppet] - https://gerrit.wikimedia.org/r/1003563 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[23:04:23] SRE, Fundraising-Backlog, Wikimedia-Apache-configuration, fundraising-tech-ops, and 2 others: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (Dwisehaupt)
[23:04:31] (CR) Scott French: "For completeness, the somewhat surprising redirect behavior in [0] is likely due to the `RewriteRule ^/wiki /w/index.php [L]` at [1] (comb" [puppet] - https://gerrit.wikimedia.org/r/1003525 (https://phabricator.wikimedia.org/T357436) (owner: Scott French)
[23:11:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T352010)', diff saved to https://phabricator.wikimedia.org/P56817 and previous config saved to /var/cache/conftool/dbconfig/20240214-231144-ladsgroup.json
[23:11:59] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[23:14:03] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[23:14:08] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[23:26:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P56818 and previous config saved to /var/cache/conftool/dbconfig/20240214-232651-ladsgroup.json
[23:32:27] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching P{P:cassandra%rack = "c"} and A:restbase and A:codfw: Restart to pickup logging jars — T353550 - eevans@cumin1002
[23:32:34] T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550
[23:34:55] (SystemdUnitFailed) firing: ferm.service on mw2282:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:41:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P56819 and previous config saved to /var/cache/conftool/dbconfig/20240214-234157-ladsgroup.json
[23:57:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T352010)', diff saved to https://phabricator.wikimedia.org/P56820 and previous config saved to /var/cache/conftool/dbconfig/20240214-235703-ladsgroup.json
[23:57:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[23:57:13] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[23:57:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[23:57:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T352010)', diff saved to https://phabricator.wikimedia.org/P56821 and previous config saved to /var/cache/conftool/dbconfig/20240214-235725-ladsgroup.json