[00:01:43] <logmsgbot>	 !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncmonitor1001.eqiad.wmnet with reason: host reimage
[00:03:14] <wikibugs>	 SRE-tools, Ganeti, Infrastructure-Foundations: Ganeti VM fails to reboot after "gnt-instance modify" - https://phabricator.wikimedia.org/T357449 (Dzahn) I did another wmf-reimage cookbook run on this host and the installation finished, including the grub install. I can't explain why it wouldn't work...
[00:03:51] <jinxer-wm>	 (SystemdUnitFailed) resolved: dump_cloud_ip_ranges.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:51] <wikibugs>	 SRE-tools, Ganeti, Infrastructure-Foundations: ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify") - https://phabricator.wikimedia.org/T357449 (Dzahn)
[00:04:21] <wikibugs>	 SRE-tools, Ganeti, Infrastructure-Foundations: ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify") - https://phabricator.wikimedia.org/T357449 (Dzahn) Open→In progress
[00:04:36] <logmsgbot>	 !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncmonitor1001.eqiad.wmnet with reason: host reimage
[00:11:06] <wikibugs>	 (PS1) Dzahn: convert ncmonitor role to puppet7 [puppet] - https://gerrit.wikimedia.org/r/1003124 (https://phabricator.wikimedia.org/T349619)
[00:11:32] <wikibugs>	 (CR) Dzahn: "mu" [puppet] - https://gerrit.wikimedia.org/r/1003124 (https://phabricator.wikimedia.org/T349619) (owner: Dzahn)
[00:14:12] <wikibugs>	 (PS2) Dzahn: convert ncmonitor role to puppet7 [puppet] - https://gerrit.wikimedia.org/r/1003124 (https://phabricator.wikimedia.org/T349619)
[00:16:48] <wikibugs>	 SRE-tools, Ganeti, Infrastructure-Foundations: ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify") - https://phabricator.wikimedia.org/T357449 (Dzahn) The host should be usable now:  ` [ncmonitor1001:~] $ uptime  00:15:21 up 1 min,  1 user,  load average: 0.15, 0.04,...
[00:18:01] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad: 1 VM request for ncmonitor - https://phabricator.wikimedia.org/T356710 (Dzahn) ` 23:55 <+logmsgbot> !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host ncmonitor1001.eqiad.wmnet with OS bookworm 00:01 <+logmsgbot> !log dza...
[00:19:42] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350 (Jhancock.wm)
[00:20:05] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: re-activate public dump job [puppet] - https://gerrit.wikimedia.org/r/1003070 (https://phabricator.wikimedia.org/T355502) (owner: Dzahn)
[00:36:24] <icinga-wm_>	 PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - ASunknown/IPv4: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:39:10] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1003031
[00:39:14] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1003031 (owner: TrainBranchBot)
[00:43:36] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:44:44] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:01:13] <wikibugs>	 (CR) CI reject: [V: -1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1003031 (owner: TrainBranchBot)
[01:03:02] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:05:18] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:05:30] <icinga-wm_>	 PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 1 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:06:38] <icinga-wm_>	 RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 61 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:13:16] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:16:42] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:50:50] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 1 day(s) (Thu 15 Feb 2024 02:11:55 AM GMT +0000). https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:51:58] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[02:07:48] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[02:09:37] <wikibugs>	 (CR) Ssingh: "Thanks for the patch. Will defer to Brett on this if he wants this to be Puppet 7 for all hosts or just ncredir1001." [puppet] - https://gerrit.wikimedia.org/r/1003124 (https://phabricator.wikimedia.org/T349619) (owner: Dzahn)
[02:25:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T352010)', diff saved to https://phabricator.wikimedia.org/P56735 and previous config saved to /var/cache/conftool/dbconfig/20240214-022544-ladsgroup.json
[02:25:50] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[02:38:49] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:40:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P56736 and previous config saved to /var/cache/conftool/dbconfig/20240214-024050-ladsgroup.json
[02:55:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P56737 and previous config saved to /var/cache/conftool/dbconfig/20240214-025557-ladsgroup.json
[03:11:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T352010)', diff saved to https://phabricator.wikimedia.org/P56738 and previous config saved to /var/cache/conftool/dbconfig/20240214-031103-ladsgroup.json
[03:11:06] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[03:11:08] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[03:11:19] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance
[03:11:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T352010)', diff saved to https://phabricator.wikimedia.org/P56739 and previous config saved to /var/cache/conftool/dbconfig/20240214-031125-ladsgroup.json
[03:13:49] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:51:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:01:46] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 22 hours (Thu 15 Feb 2024 02:11:55 AM GMT +0000) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:05:16] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:21:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[04:51:38] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 21 hours (Thu 15 Feb 2024 02:11:55 AM GMT +0000) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:52:48] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:27:36] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 20 hours (Thu 15 Feb 2024 02:11:55 AM GMT +0000) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:28:50] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:55:26] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 20 hours (Thu 15 Feb 2024 02:11:55 AM GMT +0000) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:56:36] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:08:04] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:18:02] <icinga-wm_>	 PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 0 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:19:12] <icinga-wm_>	 RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 60 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:21:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:22:47] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002"
[06:22:49] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host restbase1035.eqiad.wmnet with OS bullseye
[06:22:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host restbase1035.eqiad.wmnet with OS bullseye completed: - restbase1035 (**PASS**)   - D...
[06:23:27] <wikibugs>	 SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (Jclark-ctr)
[06:23:52] <wikibugs>	 SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (Jclark-ctr) Open→Resolved
[06:28:27] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Jclark-ctr) @Andrew  following up to see if this has been put back into service?
[06:39:03] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on an-worker1173 - https://phabricator.wikimedia.org/T357460 (Jclark-ctr) a:Jclark-ctr Submitted Request for replacement 4tb hdd
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T0700)
[07:01:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:02:12] <icinga-wm_>	 PROBLEM - HTTPS on lists1001 is CRITICAL: SSL CRITICAL - Certificate lists.wikimedia.org valid until 2024-02-15 02:11:55 +0000 (expires in 0 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:03:20] <icinga-wm_>	 RECOVERY - HTTPS on lists1001 is OK: SSL OK - Certificate lists.wikimedia.org valid until 2024-04-15 02:06:19 +0000 (expires in 60 days) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:07:09] <wikibugs>	 (CR) Slyngshede: [C: +2] C:external_cloud_vendors add owner to systemd timer. [puppet] - https://gerrit.wikimedia.org/r/1002985 (owner: Slyngshede)
[07:13:38] <wikibugs>	 (PS1) Slyngshede: P:puppet::client_bucket remove monitoring. [puppet] - https://gerrit.wikimedia.org/r/1003269 (https://phabricator.wikimedia.org/T350694)
[07:14:26] <wikibugs>	 (Abandoned) Slyngshede: P:puppet::client_bucket Start moving monitoring to Prometheus [puppet] - https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[07:26:59] <wikibugs>	 (PS1) Slyngshede: D:service::uwsgi Parse the Icinga disable flag to uwsgi:app. [puppet] - https://gerrit.wikimedia.org/r/1003359 (https://phabricator.wikimedia.org/T350694)
[07:30:51] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1366/co" [puppet] - https://gerrit.wikimedia.org/r/1003359 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[07:32:02] <wikibugs>	 (CR) Slyngshede: D:service::uwsgi Parse the Icinga disable flag to uwsgi:app. [puppet] - https://gerrit.wikimedia.org/r/1003359 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[07:32:42] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 18 hours (Thu 15 Feb 2024 02:11:55 AM GMT +0000) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:33:25] <wikibugs>	 (PS2) Slyngshede: D:service::uwsgi Parse the Icinga disable flag to uwsgi:app. [puppet] - https://gerrit.wikimedia.org/r/1003359 (https://phabricator.wikimedia.org/T350694)
[07:33:50] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:34:59] <wikibugs>	 (PS1) STran: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1003360
[07:35:24] <wikibugs>	 (PS2) Kosta Harlan: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1003360 (https://phabricator.wikimedia.org/T356736) (owner: STran)
[07:36:17] <wikibugs>	 (CR) Kosta Harlan: [C: +2] ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1003360 (https://phabricator.wikimedia.org/T356736) (owner: STran)
[07:37:10] <wikibugs>	 (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1003360 (https://phabricator.wikimedia.org/T356736) (owner: STran)
[07:47:04] <logmsgbot>	 !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[07:48:04] <logmsgbot>	 !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[07:48:48] <logmsgbot>	 !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[07:48:52] <logmsgbot>	 !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[07:49:10] <logmsgbot>	 !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[07:50:25] <logmsgbot>	 !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[07:50:54] <logmsgbot>	 !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[07:51:50] <logmsgbot>	 !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[07:55:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T352010)', diff saved to https://phabricator.wikimedia.org/P56740 and previous config saved to /var/cache/conftool/dbconfig/20240214-075545-ladsgroup.json
[07:55:50] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:06:07] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: disable systematic wiping of /srv on db2194 [puppet] - https://gerrit.wikimedia.org/r/1002410 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[08:09:33] <wikibugs>	 (PS1) Majavah: P:spicerack: support wmcs cookbook repo in test-cookbook [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069)
[08:10:42] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Certificate lists.wikimedia.org expires in 18 hours (Thu 15 Feb 2024 02:11:55 AM GMT +0000) https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:10:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P56741 and previous config saved to /var/cache/conftool/dbconfig/20240214-081051-ladsgroup.json
[08:11:32] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddumps1001.wikimedia.org
[08:11:50] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:12:50] <taavi>	 !log restart apache2 on lists1001 to remove traces of old, soon-to-expire TLS certificate
[08:12:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:33] <wikibugs>	 (CR) CI reject: [V: -1] P:spicerack: support wmcs cookbook repo in test-cookbook [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: Majavah)
[08:14:17] <wikibugs>	 (PS2) Majavah: P:spicerack: support wmcs cookbook repo in test-cookbook [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069)
[08:19:50] <wikibugs>	 (PS1) Majavah: hieradata: Failover all dumps traffic to clouddumps1001 [puppet] - https://gerrit.wikimedia.org/r/1003363 (https://phabricator.wikimedia.org/T321313)
[08:20:28] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddumps1001.wikimedia.org
[08:20:42] <wikibugs>	 (PS1) Majavah: Revert "Failover dumps to clouddumps1002" [dns] - https://gerrit.wikimedia.org/r/1003364 (https://phabricator.wikimedia.org/T321313)
[08:25:11] <wikibugs>	 (PS1) Ayounsi: Add BGP sessions to KPN in esams [homer/public] - https://gerrit.wikimedia.org/r/1003366 (https://phabricator.wikimedia.org/T322630)
[08:25:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P56742 and previous config saved to /var/cache/conftool/dbconfig/20240214-082558-ladsgroup.json
[08:27:14] <wikibugs>	 (PS1) Ayounsi: Add KPN in the list of critical BGP peers [puppet] - https://gerrit.wikimedia.org/r/1003367 (https://phabricator.wikimedia.org/T322630)
[08:30:22] <wikibugs>	 sre-alert-triage, Data-Platform-SRE: Alert in need of triage: Updater process (instance wdqs1022) - https://phabricator.wikimedia.org/T357496 (LSobanski)
[08:30:45] <wikibugs>	 sre-alert-triage, Data-Platform-SRE: Alert in need of triage: Updater process (instance wdqs1022) - https://phabricator.wikimedia.org/T357496 (LSobanski) There are also alerts for wdqs1023 and wdqs1024.
[08:31:03] <wikibugs>	 (CR) Ayounsi: [C: +2] Add BGP sessions to KPN in esams [homer/public] - https://gerrit.wikimedia.org/r/1003366 (https://phabricator.wikimedia.org/T322630) (owner: Ayounsi)
[08:31:06] <wikibugs>	 SRE, SRE-Access-Requests: Updating access key - rkhan - https://phabricator.wikimedia.org/T357483 (Peachey88)
[08:31:27] <wikibugs>	 (PS1) Alexandros Kosiaris: eventstreams: Bump mesh.configuration to 1.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1003368 (https://phabricator.wikimedia.org/T355686)
[08:31:29] <wikibugs>	 (PS1) Alexandros Kosiaris: cxserver: Bump mesh.configuration to 1.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1003369 (https://phabricator.wikimedia.org/T355686)
[08:31:37] <wikibugs>	 (Merged) jenkins-bot: Add BGP sessions to KPN in esams [homer/public] - https://gerrit.wikimedia.org/r/1003366 (https://phabricator.wikimedia.org/T322630) (owner: Ayounsi)
[08:31:44] <wikibugs>	 (Restored) Hashar: python-build: set date of source files in the wheels [docker-images/production-images] - https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[08:33:46] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[08:35:27] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DBA, and 2 others: Migrate servers in codfw rack A4 from asw-a4-codfw to lsw1-a4-codfw - https://phabricator.wikimedia.org/T355863 (cmooney) Open→Resolved a:cmooney Closing - thanks all for the help!
[08:39:38] <wikibugs>	 (CR) Slyngshede: [C: +2] P:kerberos::kadminserver absent Icinga check [puppet] - https://gerrit.wikimedia.org/r/995181 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[08:40:10] <wikibugs>	 (CR) Muehlenhoff: convert ncmonitor role to puppet7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1003124 (https://phabricator.wikimedia.org/T349619) (owner: Dzahn)
[08:41:05] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T352010)', diff saved to https://phabricator.wikimedia.org/P56743 and previous config saved to /var/cache/conftool/dbconfig/20240214-084104-ladsgroup.json
[08:41:09] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance
[08:41:13] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[08:41:23] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance
[08:41:25] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:41:41] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:41:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T352010)', diff saved to https://phabricator.wikimedia.org/P56744 and previous config saved to /var/cache/conftool/dbconfig/20240214-084146-ladsgroup.json
[08:45:35] <wikibugs>	 (CR) Muehlenhoff: puppetserver: Also install the tool to update netboot images on puppet servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1003004 (https://phabricator.wikimedia.org/T341056) (owner: Muehlenhoff)
[08:45:37] <wikibugs>	 (PS1) Slyngshede: C:puppetmaster::monitoring Absent Icinga monitoring [puppet] - https://gerrit.wikimedia.org/r/1003370 (https://phabricator.wikimedia.org/T350694)
[08:45:51] <wikibugs>	 (CR) Volans: "The code looks good to me but I have just one main doubt." [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: Majavah)
[08:50:16] <wikibugs>	 (PS3) Hashar: python-build: default to run as nobody from /deploy [docker-images/production-images] - https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T342346)
[08:50:18] <wikibugs>	 (PS2) Hashar: python-build: add make and virtualenv [docker-images/production-images] - https://gerrit.wikimedia.org/r/1003002 (https://phabricator.wikimedia.org/T342346)
[08:50:20] <wikibugs>	 (PS6) Hashar: python-build: set date of source files in the wheels [docker-images/production-images] - https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T259611)
[08:50:26] <wikibugs>	 (PS1) Ayounsi: don't require a cable ID on planned cables [software/netbox-extras] - https://gerrit.wikimedia.org/r/1003371 (https://phabricator.wikimedia.org/T357259)
[08:50:29] <wikibugs>	 (PS1) Slyngshede: P:ganeti: Remove Icinga memory monitoring. [puppet] - https://gerrit.wikimedia.org/r/1003372 (https://phabricator.wikimedia.org/T350694)
[08:50:54] <wikibugs>	 (PS2) Muehlenhoff: puppetserver: Also install the tool to update netboot images on puppet servers [puppet] - https://gerrit.wikimedia.org/r/1003004 (https://phabricator.wikimedia.org/T341056)
[08:51:40] <wikibugs>	 (PS2) Hashar: python-build: ensure frozen-requirements is exhaustive [docker-images/production-images] - https://gerrit.wikimedia.org/r/941443 (https://phabricator.wikimedia.org/T342346)
[08:59:52] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] alertmanager: re-notify for SystemdUnitFailed after 24h [puppet] - https://gerrit.wikimedia.org/r/1003009 (https://phabricator.wikimedia.org/T357333) (owner: Filippo Giunchedi)
[09:00:25] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] P:ganeti: Remove Icinga memory monitoring. [puppet] - https://gerrit.wikimedia.org/r/1003372 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:00:46] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] C:puppetmaster::monitoring Absent Icinga monitoring [puppet] - https://gerrit.wikimedia.org/r/1003370 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:01:22] <wikibugs>	 (PS1) Slyngshede: P:ganeti: Absent checks for generic Ganeti services. [puppet] - https://gerrit.wikimedia.org/r/1003374 (https://phabricator.wikimedia.org/T350694)
[09:02:18] <wikibugs>	 (CR) Muehlenhoff: [C: +2] puppetserver: Also install the tool to update netboot images on puppet servers [puppet] - https://gerrit.wikimedia.org/r/1003004 (https://phabricator.wikimedia.org/T341056) (owner: Muehlenhoff)
[09:03:55] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] P:puppet::client_bucket remove monitoring. [puppet] - https://gerrit.wikimedia.org/r/1003269 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:05:09] <wikibugs>	 (CR) Hashar: "I have restored a couple changes I have made in July 2023 and stacked them on top:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[09:05:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
[09:05:26] <wikibugs>	 SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (ops-monitoring-bot) Draining ganeti2023.codfw.wmnet of running VMs
[09:06:49] <wikibugs>	 (CR) Filippo Giunchedi: "LGTM, see inline" [alerts] - https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:08:07] <icinga-wm_>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:08:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
[09:08:32] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] nagios_common/planet: remove check_lastmod check, script and config [puppet] - https://gerrit.wikimedia.org/r/1003084 (https://phabricator.wikimedia.org/T353298) (owner: Dzahn)
[09:08:48] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Very nice!" [puppet] - https://gerrit.wikimedia.org/r/1003098 (owner: Dzahn)
[09:11:04] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] D:service::uwsgi Parse the Icinga disable flag to uwsgi:app. [puppet] - https://gerrit.wikimedia.org/r/1003359 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:12:05] <wikibugs>	 (PS3) Slyngshede: Monitoring of PKI infrastructure certs. [alerts] - https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694)
[09:13:27] <wikibugs>	 (CR) Slyngshede: Monitoring of PKI infrastructure certs. (1 comment) [alerts] - https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:14:18] <wikibugs>	 (CR) Slyngshede: [C: +2] P:puppet::client_bucket remove monitoring. [puppet] - https://gerrit.wikimedia.org/r/1003269 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:17:42] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [alerts] - https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[09:20:11] <wikibugs>	 (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1003031 (owner: TrainBranchBot)
[09:22:06] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): volatile: We need to configure the volatile endpoint on puppetserveres - https://phabricator.wikimedia.org/T341056 (MoritzMuehlenhoff) >>! In T341056#9540458, @jhathaway wrote: > @Muehlenhoff I thin...
[09:28:12] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1003033
[09:28:14] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1003033 (owner: TrainBranchBot)
[09:29:42] <wikibugs>	 (PS1) Muehlenhoff: update-netboot-image: Update instructions for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003375 (https://phabricator.wikimedia.org/T341056)
[09:30:43] <wikibugs>	 (CR) Majavah: [C: +2] Revert "Failover dumps to clouddumps1002" [dns] - https://gerrit.wikimedia.org/r/1003364 (https://phabricator.wikimedia.org/T321313) (owner: Majavah)
[09:31:16] <wikibugs>	 (CR) Majavah: [C: +2] hieradata: Failover all dumps traffic to clouddumps1001 [puppet] - https://gerrit.wikimedia.org/r/1003363 (https://phabricator.wikimedia.org/T321313) (owner: Majavah)
[09:37:27] <wikibugs>	 (PS3) Majavah: P:spicerack: support wmcs cookbook repo in test-cookbook [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069)
[09:38:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
[09:38:14] <wikibugs>	 SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (ops-monitoring-bot) Draining ganeti2024.codfw.wmnet of running VMs
[09:38:23] <wikibugs>	 (CR) Majavah: "Good point. PS3 includes a config file with the repositories checked out on the host." [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: Majavah)
[09:38:39] <wikibugs>	 (CR) CI reject: [V: -1] P:spicerack: support wmcs cookbook repo in test-cookbook [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: Majavah)
[09:39:25] <wikibugs>	 (PS4) Majavah: P:spicerack: support wmcs cookbook repo in test-cookbook [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069)
[09:40:25] <icinga-wm_>	 PROBLEM - Host sretest2005 is DOWN: CRITICAL - Host Unreachable (10.192.24.3)
[09:40:54] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1367/co" [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: Majavah)
[09:41:40] <wikibugs>	 (CR) Muehlenhoff: [C: +2] update-netboot-image: Update instructions for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003375 (https://phabricator.wikimedia.org/T341056) (owner: Muehlenhoff)
[09:42:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet
[09:43:47] <wikibugs>	 (PS1) Alexandros Kosiaris: rec-api: Bump mesh.configuration to 1.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1003376 (https://phabricator.wikimedia.org/T355686)
[09:43:50] <wikibugs>	 (PS1) Alexandros Kosiaris: wikifunctions: Add mesh.configuration in package.json [deployment-charts] - https://gerrit.wikimedia.org/r/1003377 (https://phabricator.wikimedia.org/T355686)
[09:46:27] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] eventstreams: Bump mesh.configuration to 1.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1003368 (https://phabricator.wikimedia.org/T355686) (owner: Alexandros Kosiaris)
[09:46:51] <icinga-wm_>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:47:54] <wikibugs>	 (CR) Jelto: [C: +2] Release 0.7 prometheus-etherpad-exporter [debs/prometheus-etherpad-exporter] - https://gerrit.wikimedia.org/r/1003007 (https://phabricator.wikimedia.org/T316421) (owner: Jelto)
[09:48:19] <wikibugs>	 (Merged) jenkins-bot: eventstreams: Bump mesh.configuration to 1.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1003368 (https://phabricator.wikimedia.org/T355686) (owner: Alexandros Kosiaris)
[09:48:34] <jinxer-wm>	 (ProbeDown) firing: (2) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:49:12] <moritzm>	 !log imported openssl11 1.1.1w-0+deb11u1+wmf1 to component/haproxy26 T352744
[09:49:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:17] <stashbot>	 T352744: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744
[09:51:16] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1003033 (owner: TrainBranchBot)
[09:52:36] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1003124 (https://phabricator.wikimedia.org/T349619) (owner: Dzahn)
[09:52:59] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Remove obsolete Hiera entries [puppet] - https://gerrit.wikimedia.org/r/1002922 (https://phabricator.wikimedia.org/T354959) (owner: Muehlenhoff)
[09:53:33] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:55:44] <moritzm>	 !log installing Linux 5.10.209 on Bullseye hosts
[09:55:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:57] <icinga-wm_>	 RECOVERY - Host sretest2005 is UP: PING OK - Packet loss = 0%, RTA = 30.94 ms
[10:01:30] <icinga-wm_>	 PROBLEM - Host sretest2005 is DOWN: CRITICAL - Host Unreachable (10.192.24.3)
[10:02:07] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm
[10:02:52] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1003371 (https://phabricator.wikimedia.org/T357259) (owner: Ayounsi)
[10:06:32] <icinga-wm_>	 RECOVERY - Host sretest2005 is UP: PING OK - Packet loss = 0%, RTA = 32.19 ms
[10:08:33] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:08:52] <wikibugs>	 (CR) Volans: "minor nit and we're good to go :)" [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: Majavah)
[10:12:49] <wikibugs>	 (PS5) Majavah: P:spicerack: support wmcs cookbook repo in test-cookbook [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069)
[10:13:08] <wikibugs>	 (CR) Majavah: P:spicerack: support wmcs cookbook repo in test-cookbook (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: Majavah)
[10:13:12] <wikibugs>	 (PS1) Jelto: miscweb: reduce resource requests for apache and envoy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1003380 (https://phabricator.wikimedia.org/T357413)
[10:15:46] <wikibugs>	 (CR) Volans: [C: +1] "LGTM! Thanks for the patch." [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: Majavah)
[10:16:59] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (MoritzMuehlenhoff)
[10:18:33] <godog>	 !log powercycle titan1001
[10:18:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host puppetserver2003.codfw.wmnet with OS bookworm
[10:19:57] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host puppetserver2003.codfw.wmnet with...
[10:20:46] <wikibugs>	 (PS2) Jelto: miscweb: reduce resource requests for apache and envoy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/1003380 (https://phabricator.wikimedia.org/T357413)
[10:21:18] <icinga-wm_>	 PROBLEM - Host titan1001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:22:48] <icinga-wm_>	 RECOVERY - Host titan1001 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[10:22:48] <icinga-wm_>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:23:33] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:23:51] <jinxer-wm>	 (ProbeDown) resolved: (2) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:24:39] <wikibugs>	 SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (klausman)
[10:28:40] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams: apply
[10:28:52] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
[10:31:02] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host puppetserver2003.codfw.wmnet with OS bookworm
[10:32:47] <wikibugs>	 (CR) Slyngshede: [C: +2] Monitoring of PKI infrastructure certs. [alerts] - https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[10:33:18] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
[10:33:48] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
[10:34:21] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 (cmooney)
[10:34:36] <wikibugs>	 (Merged) jenkins-bot: Monitoring of PKI infrastructure certs. [alerts] - https://gerrit.wikimedia.org/r/999802 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[10:34:38] <wikibugs>	 (CR) Slyngshede: [C: +2] C:puppetmaster::monitoring Absent Icinga monitoring [puppet] - https://gerrit.wikimedia.org/r/1003370 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[10:37:35] <slyngs>	 !log Deploying new PKI checks to alertmanager
[10:37:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:37:39] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[10:37:41] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[10:37:50] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[10:38:04] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[10:38:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T352010)', diff saved to https://phabricator.wikimedia.org/P56745 and previous config saved to /var/cache/conftool/dbconfig/20240214-103810-ladsgroup.json
[10:38:19] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[10:40:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T352010)', diff saved to https://phabricator.wikimedia.org/P56746 and previous config saved to /var/cache/conftool/dbconfig/20240214-104024-ladsgroup.json
[10:41:08] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host puppetserver2003.codfw.wmnet with OS bookworm
[10:41:16] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host puppetserver2003.codfw.wmnet with...
[10:45:51] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2005.codfw.wmnet
[10:46:16] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2005.codfw.wmnet
[10:48:18] <jelto>	 !log import prometheus-etherpad-exporter 0.7 to bookworm-wikimedia on apt hosts - T316421
[10:48:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:28] <stashbot>	 T316421: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421
[10:48:30] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams: apply
[10:48:59] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
[10:50:27] <wikibugs>	 SRE, observability, Sustainability (Incident Followup): thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788 (fgiunchedi) Happened again albeit on titan1001 only, where query-frontend and store both using cpu and memory, and the host becoming unr...
[10:52:15] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm
[10:53:57] <wikibugs>	 (PS1) Slyngshede: P:kerberos::kdc absent Icinga check [puppet] - https://gerrit.wikimedia.org/r/1003382 (https://phabricator.wikimedia.org/T350694)
[10:55:31] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P56747 and previous config saved to /var/cache/conftool/dbconfig/20240214-105530-ladsgroup.json
[10:56:53] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[10:57:46] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[10:57:47] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply
[10:58:21] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
[10:58:22] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[10:58:27] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM vrts1002.eqiad.wmnet
[10:58:34] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[10:58:38] <wikibugs>	 (PS1) Slyngshede: P:kerberos::replication absent Icinga check [puppet] - https://gerrit.wikimedia.org/r/1003383 (https://phabricator.wikimedia.org/T350694)
[10:59:38] <wikibugs>	 (PS1) Hnowlan: admin: update ssh key for rkhan [puppet] - https://gerrit.wikimedia.org/r/1003384 (https://phabricator.wikimedia.org/T357483)
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1100)
[11:00:57] <wikibugs>	 (CR) Clément Goubert: "docker-pkg only builds the latest change in the changelog file, if it's not already built, so if all that stack is merged roughly at the s" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[11:03:00] <wikibugs>	 (PS2) Alexandros Kosiaris: rec-api: Bump mesh.configuration to 1.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1003376 (https://phabricator.wikimedia.org/T355686)
[11:03:02] <wikibugs>	 (PS2) Alexandros Kosiaris: cxserver: Bump mesh.configuration to 1.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1003369 (https://phabricator.wikimedia.org/T355686)
[11:03:04] <wikibugs>	 (PS2) Alexandros Kosiaris: wikifunctions: Add mesh.configuration in package.json [deployment-charts] - https://gerrit.wikimedia.org/r/1003377 (https://phabricator.wikimedia.org/T355686)
[11:03:40] <wikibugs>	 (PS1) Muehlenhoff: Use 2disk Partman recipe for puppetserver2003 [puppet] - https://gerrit.wikimedia.org/r/1003385 (https://phabricator.wikimedia.org/T356991)
[11:04:46] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] rec-api: Bump mesh.configuration to 1.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1003376 (https://phabricator.wikimedia.org/T355686) (owner: Alexandros Kosiaris)
[11:05:31] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1003385 (https://phabricator.wikimedia.org/T356991) (owner: Muehlenhoff)
[11:06:02] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host puppetserver2003.codfw.wmnet with OS bookworm
[11:06:07] <wikibugs>	 (Merged) jenkins-bot: rec-api: Bump mesh.configuration to 1.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1003376 (https://phabricator.wikimedia.org/T355686) (owner: Alexandros Kosiaris)
[11:09:12] <wikibugs>	 (CR) Slyngshede: [C: +2] P:ganeti: Remove Icinga memory monitoring. [puppet] - https://gerrit.wikimedia.org/r/1003372 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[11:10:01] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Use 2disk Partman recipe for puppetserver2003 [puppet] - https://gerrit.wikimedia.org/r/1003385 (https://phabricator.wikimedia.org/T356991) (owner: Muehlenhoff)
[11:10:37] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P56748 and previous config saved to /var/cache/conftool/dbconfig/20240214-111037-ladsgroup.json
[11:12:56] <wikibugs>	 (PS5) ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396)
[11:14:06] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply
[11:14:29] <wikibugs>	 (CR) CI reject: [V: -1] sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn)
[11:14:33] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply
[11:14:34] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: apply
[11:14:53] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: apply
[11:14:54] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: apply
[11:15:05] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply
[11:17:48] <wikibugs>	 SRE, MW-on-K8s, Release-Engineering-Team, Traffic, serviceops: Move 50% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357507 (Clement_Goubert)
[11:19:09] <icinga-wm_>	 PROBLEM - spamassassin on vrts1002 is CRITICAL: PROCS CRITICAL: 0 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin
[11:20:12] <wikibugs>	 SRE, MW-on-K8s, Release-Engineering-Team, Traffic, serviceops: Move 50% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357507 (Clement_Goubert) p:Triage→High
[11:20:56] <wikibugs>	 SRE, MW-on-K8s, Release-Engineering-Team, Traffic, serviceops: Move 60% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357508 (Clement_Goubert)
[11:21:31] <wikibugs>	 SRE, MW-on-K8s, Release-Engineering-Team, Scap, serviceops: Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402 (Clement_Goubert)
[11:21:39] <wikibugs>	 SRE, MW-on-K8s, Release-Engineering-Team, Traffic, serviceops: Move 60% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357508 (Clement_Goubert)
[11:23:43] <wikibugs>	 SRE, MW-on-K8s, Release-Engineering-Team, Scap, serviceops: Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402 (Clement_Goubert)
[11:23:49] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[11:25:30] <wikibugs>	 SRE, MW-on-K8s, Release-Engineering-Team, Traffic, serviceops: Move 60% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357508 (Clement_Goubert) Open→Stalled p:Triage→High We'll stay at 50% or under because we want scap to check canaries before...
[11:25:44] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T352010)', diff saved to https://phabricator.wikimedia.org/P56749 and previous config saved to /var/cache/conftool/dbconfig/20240214-112543-ladsgroup.json
[11:25:46] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[11:25:49] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[11:25:59] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[11:26:06] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T352010)', diff saved to https://phabricator.wikimedia.org/P56750 and previous config saved to /var/cache/conftool/dbconfig/20240214-112606-ladsgroup.json
[11:26:51] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Clement_Goubert)
[11:28:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T352010)', diff saved to https://phabricator.wikimedia.org/P56751 and previous config saved to /var/cache/conftool/dbconfig/20240214-112818-ladsgroup.json
[11:28:47] <wikibugs>	 (PS1) Clément Goubert: mw-web, mw-api-ext: Raise replicas for 45% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1003393 (https://phabricator.wikimedia.org/T357507)
[11:30:16] <wikibugs>	 (PS1) Clément Goubert: trafficserver: move 45% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/1003394 (https://phabricator.wikimedia.org/T357507)
[11:33:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host puppetserver2003.codfw.wmnet with OS bookworm
[11:33:19] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host puppetserver2003.codfw.wmnet with...
[11:37:08] <wikibugs>	 (PS6) ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396)
[11:37:26] <icinga-wm_>	 RECOVERY - Disk space on vrts1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=vrts1002&var-datasource=eqiad+prometheus/ops
[11:40:07] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ms-be1045.eqiad.wmnet
[11:40:16] <wikibugs>	 (CR) Clément Goubert: [C: +1] php: add env[MCROUTER_SERVER] variable [docker-images/production-images] - https://gerrit.wikimedia.org/r/994764 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[11:40:27] <wikibugs>	 (CR) Clément Goubert: [C: +1] mw-debug: set MCROUTER_SERVER variable [deployment-charts] - https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[11:43:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P56752 and previous config saved to /var/cache/conftool/dbconfig/20240214-114325-ladsgroup.json
[11:45:38] <wikibugs>	 (CR) David Caro: "I did not hit enter... sorry" [puppet] - https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: Majavah)
[11:46:13] <Lucas_WMDE>	 jouncebot: nowandnext
[11:46:13] <jouncebot>	 For the next 0 hour(s) and 13 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1100)
[11:46:13] <jouncebot>	 In 2 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1400)
[11:47:01] <Lucas_WMDE>	 I think I’ll give T355685 another go (cc akosiaris, hashar)
[11:47:01] <stashbot>	 T355685: Migrate Termbox SSR from Node 16 to 18 - https://phabricator.wikimedia.org/T355685
[11:50:30] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ms-be1045.eqiad.wmnet
[11:50:45] <wikibugs>	 (PS1) Lucas Werkmeister (WMDE): Reapply "termbox: update to 2024-01-22-163619-production" [deployment-charts] - https://gerrit.wikimedia.org/r/1003400 (https://phabricator.wikimedia.org/T331403)
[11:51:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on puppetserver2003.codfw.wmnet with reason: host reimage
[11:53:33] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +2] "I’m guessing that a self-merge is okay here because it’s only restoring a previously reviewed change, after the cause of the revert was co" [deployment-charts] - https://gerrit.wikimedia.org/r/1003400 (https://phabricator.wikimedia.org/T331403) (owner: Lucas Werkmeister (WMDE))
[11:54:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on puppetserver2003.codfw.wmnet with reason: host reimage
[11:55:15] <wikibugs>	 (Merged) jenkins-bot: Reapply "termbox: update to 2024-01-22-163619-production" [deployment-charts] - https://gerrit.wikimedia.org/r/1003400 (https://phabricator.wikimedia.org/T331403) (owner: Lucas Werkmeister (WMDE))
[11:57:18] <wikibugs>	 (PS1) Muehlenhoff: Make puppetserver2003 a Puppet server [puppet] - https://gerrit.wikimedia.org/r/1003402 (https://phabricator.wikimedia.org/T356991)
[11:57:48] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.convert-disks for host ms-be1045
[11:58:02] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] START helmfile.d/services/termbox: apply
[11:58:20] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [staging] DONE helmfile.d/services/termbox: apply
[11:58:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P56753 and previous config saved to /var/cache/conftool/dbconfig/20240214-115831-ladsgroup.json
[11:58:55] <wikibugs>	 (PS1) Muehlenhoff: Advertise puppetserver2003 as active Puppet 7 server [dns] - https://gerrit.wikimedia.org/r/1003403 (https://phabricator.wikimedia.org/T356991)
[11:59:01] <Lucas_WMDE>	 test command from https://wikitech.wikimedia.org/wiki/WMDE/Wikidata/SSR_Service#Deployment works for staging, at least
[11:59:05] <Lucas_WMDE>	 so I’ll go ahead with eqiad and codfw
[11:59:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] START helmfile.d/services/termbox: apply
[12:00:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [eqiad] DONE helmfile.d/services/termbox: apply
[12:02:06] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] START helmfile.d/services/termbox: apply
[12:02:44] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 helmfile [codfw] DONE helmfile.d/services/termbox: apply
[12:03:32] <Lucas_WMDE>	 termbox SSR is working as far as I can tell \o/
[12:03:37] * Lucas_WMDE done deploying
[12:06:22] <wikibugs>	 (PS7) ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396)
[12:07:25] <icinga-wm_>	 PROBLEM - Disk space on vrts1002 is CRITICAL: DISK CRITICAL - free space: /srv/otrs-data 5468 MB (0% inode=61%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=vrts1002&var-datasource=eqiad+prometheus/ops
[12:09:27] <wikibugs>	 (CR) Slyngshede: [C: +2] D:service::uwsgi Parse the Icinga disable flag to uwsgi:app. [puppet] - https://gerrit.wikimedia.org/r/1003359 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[12:11:16] <wikibugs>	 (CR) Majavah: [C: +2] "Will do, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1003361 (https://phabricator.wikimedia.org/T345069) (owner: Majavah)
[12:11:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[12:13:39] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T352010)', diff saved to https://phabricator.wikimedia.org/P56754 and previous config saved to /var/cache/conftool/dbconfig/20240214-121337-ladsgroup.json
[12:13:41] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[12:13:45] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[12:13:55] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[12:14:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T352010)', diff saved to https://phabricator.wikimedia.org/P56755 and previous config saved to /var/cache/conftool/dbconfig/20240214-121401-ladsgroup.json
[12:16:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T352010)', diff saved to https://phabricator.wikimedia.org/P56756 and previous config saved to /var/cache/conftool/dbconfig/20240214-121614-ladsgroup.json
[12:16:22] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] miscweb: reduce resource requests for apache and envoy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003380 (https://phabricator.wikimedia.org/T357413) (owner: 10Jelto)
[12:17:07] <wikibugs>	 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review: Extend "test-cookbook" to support wmcs-cookbooks - https://phabricator.wikimedia.org/T345069 (10taavi) 05Open→03Resolved a:03taavi
[12:21:48] <wikibugs>	 (03PS1) 10Slyngshede: P:puppetdb::microservice absent uwsgi Icinga check. [puppet] - 10https://gerrit.wikimedia.org/r/1003405 (https://phabricator.wikimedia.org/T350694)
[12:23:46] <wikibugs>	 (03PS1) 10Slyngshede: P:puppetboard absent Icinga checks for PuppetBoard. [puppet] - 10https://gerrit.wikimedia.org/r/1003406 (https://phabricator.wikimedia.org/T350694)
[12:31:21] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P56757 and previous config saved to /var/cache/conftool/dbconfig/20240214-123120-ladsgroup.json
[12:31:51] <wikibugs>	 (03PS1) 10Hnowlan: kubernetes: make 5 codfw appservers kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/1003407 (https://phabricator.wikimedia.org/T351074)
[12:32:29] <wikibugs>	 (03PS8) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396)
[12:32:59] <wikibugs>	 (03PS2) 10Effie Mouzeli: services_proxy: set retry for ipoid listener [puppet] - 10https://gerrit.wikimedia.org/r/1003001 (https://phabricator.wikimedia.org/T356766)
[12:33:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn)
[12:35:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] services_proxy: set keepalive for ipoid listener [puppet] - 10https://gerrit.wikimedia.org/r/1003000 (https://phabricator.wikimedia.org/T356766) (owner: 10Effie Mouzeli)
[12:35:58] <wikibugs>	 (03PS2) 10Effie Mouzeli: mediawiki: rename cache.mcrouter.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000039
[12:36:11] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] services_proxy: set keepalive for ipoid listener [puppet] - 10https://gerrit.wikimedia.org/r/1003000 (https://phabricator.wikimedia.org/T356766) (owner: 10Effie Mouzeli)
[12:36:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mediawiki: rename cache.mcrouter.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000039 (owner: 10Effie Mouzeli)
[12:36:31] <wikibugs>	 (03PS1) 10Ladsgroup: Enable echo conditional defaults for loginwiki since 2013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003408 (https://phabricator.wikimedia.org/T357072)
[12:39:00] <wikibugs>	 (03PS9) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396)
[12:39:21] <wikibugs>	 (03PS4) 10Effie Mouzeli: mediawiki: Bump sextant module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/999780 (owner: 10Alexandros Kosiaris)
[12:39:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mediawiki: Bump sextant module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/999780 (owner: 10Alexandros Kosiaris)
[12:39:57] <wikibugs>	 (03PS3) 10Effie Mouzeli: mediawiki: rename cache.mcrouter.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000039
[12:40:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mediawiki: rename cache.mcrouter.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000039 (owner: 10Effie Mouzeli)
[12:40:11] <wikibugs>	 (03CR) 10Ladsgroup: "We shouldn't need to do any clean ups there and it's already 6GB, clean up should be easy." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003408 (https://phabricator.wikimedia.org/T357072) (owner: 10Ladsgroup)
[12:40:59] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] kubernetes: make 5 codfw appservers kubernetes workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003407 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan)
[12:41:16] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] cache.mcrouter: minor fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000032 (owner: 10Effie Mouzeli)
[12:42:14] <wikibugs>	 (03Merged) 10jenkins-bot: cache.mcrouter: minor fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000032 (owner: 10Effie Mouzeli)
[12:45:14] <wikibugs>	 (03PS5) 10Effie Mouzeli: mediawiki: Bump sextant module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/999780 (owner: 10Alexandros Kosiaris)
[12:46:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P56758 and previous config saved to /var/cache/conftool/dbconfig/20240214-124627-ladsgroup.json
[12:49:33] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.swift.convert-disks (exit_code=99) for host ms-be1045
[12:51:52] <wikibugs>	 10SRE, 10Domains, 10Fundraising-Backlog, 10serviceops: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (10AKanji-WMF) Thank you @RLazarus ! The fundraising team would like the redirect to be active for two years - considering the life cycle of the podcast and surrounding marketing c...
[12:52:31] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm
[12:54:16] <wikibugs>	 (03PS1) 10Slyngshede: P:debmonitor::server Add CDN endpoint check. [puppet] - 10https://gerrit.wikimedia.org/r/1003409 (https://phabricator.wikimedia.org/T350694)
[12:57:06] <wikibugs>	 (03PS2) 10Hnowlan: kubernetes: make 5 codfw appservers kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/1003407 (https://phabricator.wikimedia.org/T351074)
[12:57:47] <hashar>	 Lucas_WMDE: +1 about Termbox, then I don't know what that service is doing exactly but I guess you are in the best position to deploy and monitor it :)
[12:59:40] <wikibugs>	 (03CR) 10Jforrester: "This looks great, thanks! Do we also need to apply it (or similar) to charts/function-evaluator/Chart.yaml ? It has a bunch of mesh.* entr" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003377 (https://phabricator.wikimedia.org/T355686) (owner: 10Alexandros Kosiaris)
[13:01:35] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T352010)', diff saved to https://phabricator.wikimedia.org/P56760 and previous config saved to /var/cache/conftool/dbconfig/20240214-130134-ladsgroup.json
[13:01:37] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[13:01:51] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[13:01:54] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[13:01:57] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T352010)', diff saved to https://phabricator.wikimedia.org/P56761 and previous config saved to /var/cache/conftool/dbconfig/20240214-130157-ladsgroup.json
[13:01:59] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] kubernetes: make 5 codfw appservers kubernetes workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003407 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan)
[13:02:14] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] kubernetes: make 5 codfw appservers kubernetes workers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003407 (https://phabricator.wikimedia.org/T351074) (owner: 10Hnowlan)
[13:04:11] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T352010)', diff saved to https://phabricator.wikimedia.org/P56762 and previous config saved to /var/cache/conftool/dbconfig/20240214-130410-ladsgroup.json
[13:05:52] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[13:07:50] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2005.codfw.wmnet with reason: host reimage
[13:10:43] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2005.codfw.wmnet with reason: host reimage
[13:12:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T352010)', diff saved to https://phabricator.wikimedia.org/P56763 and previous config saved to /var/cache/conftool/dbconfig/20240214-131231-ladsgroup.json
[13:12:37] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[13:13:14] <wikibugs>	 (03PS1) 10Slyngshede: P:netbox monitoring not required on systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/1003413 (https://phabricator.wikimedia.org/T350694)
[13:16:15] <wikibugs>	 (03PS2) 10Slyngshede: P:netbox monitoring not required on systemd timer. [puppet] - 10https://gerrit.wikimedia.org/r/1003413 (https://phabricator.wikimedia.org/T350694)
[13:17:15] <wikibugs>	 (03PS9) 10KartikMistry: Update cxserver to 2023-12-04-083437-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982)
[13:19:17] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P56764 and previous config saved to /var/cache/conftool/dbconfig/20240214-131916-ladsgroup.json
[13:23:08] <icinga-wm_>	 PROBLEM - ensure kvm processes are running on cloudvirt1036 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:24:07] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.hosts.reimage for host eventlog1003.eqiad.wmnet with OS bullseye
[13:24:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jmm@cumin2002"
[13:24:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host puppetserver2003.codfw.wmnet with OS bookworm
[13:24:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Repurpose puppetmaster2003 as puppetserver2003 - https://phabricator.wikimedia.org/T356991 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host puppetserver200...
[13:24:33] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2005.codfw.wmnet with OS bookworm
[13:24:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host apifeatureusage2001.codfw.wmnet
[13:26:28] <Daimona>	 !log T357007 Profiling current master version of CampaignEvents:GenerateInvitationList with excimer in mwmaint2002
[13:26:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:26:32] <stashbot>	 T357007: Generate Invitation Lists for Event Organizers - https://phabricator.wikimedia.org/T357007
[13:26:53] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch apifeatureusage2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003415 (https://phabricator.wikimedia.org/T349619)
[13:27:07] <icinga-wm_>	 RECOVERY - ensure kvm processes are running on cloudvirt1036 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:27:22] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] miscweb: reduce resource requests for apache and envoy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003380 (https://phabricator.wikimedia.org/T357413) (owner: 10Jelto)
[13:27:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P56765 and previous config saved to /var/cache/conftool/dbconfig/20240214-132737-ladsgroup.json
[13:28:27] <wikibugs>	 (03Merged) 10jenkins-bot: miscweb: reduce resource requests for apache and envoy in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003380 (https://phabricator.wikimedia.org/T357413) (owner: 10Jelto)
[13:34:17] <wikibugs>	 (03PS1) 10Ayounsi: Add support for routed Ganeti in D-I early_command.sh [puppet] - 10https://gerrit.wikimedia.org/r/1003416 (https://phabricator.wikimedia.org/T300152)
[13:34:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P56766 and previous config saved to /var/cache/conftool/dbconfig/20240214-133422-ladsgroup.json
[13:36:33] <logmsgbot>	 !log brouberol@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on eventlog1003.eqiad.wmnet with reason: host reimage
[13:37:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch apifeatureusage2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003415 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[13:37:43] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] openstack: overhaul the floating IP updater [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah)
[13:38:06] <moritzm>	 taavi: ok to merge your patch along?
[13:38:08] <taavi>	 yes please
[13:38:40] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] openstack: overhaul the floating IP updater (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/998401 (https://phabricator.wikimedia.org/T341338) (owner: 10Majavah)
[13:39:01] <moritzm>	 done, merged
[13:39:18] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on eventlog1003.eqiad.wmnet with reason: host reimage
[13:39:40] <wikibugs>	 (03PS4) 10Effie Mouzeli: mediawiki: rename cache.mcrouter.deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1000039
[13:40:38] <taavi>	 thanks!
[13:42:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host apifeatureusage2001.codfw.wmnet
[13:42:20] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[13:42:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P56767 and previous config saved to /var/cache/conftool/dbconfig/20240214-134244-ladsgroup.json
[13:48:29] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch wikireplicas::dedicated::analytics_multiinstance to Puppet 7 on role level [puppet] - 10https://gerrit.wikimedia.org/r/1003418 (https://phabricator.wikimedia.org/T349619)
[13:49:30] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T352010)', diff saved to https://phabricator.wikimedia.org/P56768 and previous config saved to /var/cache/conftool/dbconfig/20240214-134929-ladsgroup.json
[13:49:32] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[13:49:34] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[13:49:40] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Switch wikireplicas::dedicated::analytics_multiinstance to Puppet 7 on role level [puppet] - 10https://gerrit.wikimedia.org/r/1003418 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[13:49:46] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[13:49:48] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[13:49:53] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[13:50:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T352010)', diff saved to https://phabricator.wikimedia.org/P56769 and previous config saved to /var/cache/conftool/dbconfig/20240214-134959-ladsgroup.json
[13:52:44] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[13:54:53] <icinga-wm_>	 PROBLEM - statsv process on webperf2003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv
[13:55:53] <icinga-wm_>	 RECOVERY - statsv process on webperf2003 is OK: PROCS OK: 2 processes with command name python3, args statsv https://wikitech.wikimedia.org/wiki/Graphite%23statsv
[13:57:51] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T352010)', diff saved to https://phabricator.wikimedia.org/P56770 and previous config saved to /var/cache/conftool/dbconfig/20240214-135750-ladsgroup.json
[13:57:53] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[13:57:56] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[13:58:07] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[13:58:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T352010)', diff saved to https://phabricator.wikimedia.org/P56771 and previous config saved to /var/cache/conftool/dbconfig/20240214-135813-ladsgroup.json
[13:59:14] <logmsgbot>	 !log brouberol@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host eventlog1003.eqiad.wmnet with OS bullseye
[13:59:53] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T352010)', diff saved to https://phabricator.wikimedia.org/P56772 and previous config saved to /var/cache/conftool/dbconfig/20240214-135953-ladsgroup.json
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1400).
[14:00:05] <jouncebot>	 Daimona: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:17] <Daimona>	 o/
[14:00:21] <HouseOfM>	 o/
[14:01:08] * TheresNoTime can deploy
[14:02:21] <Lucas_WMDE>	 o/
[14:02:40] <TheresNoTime>	 Daimona: just doing the beta-only patch first
[14:03:28] <Daimona>	 yup sure
[14:03:37] <TheresNoTime>	 not sure why logmsgbot isn't logging here...
[14:03:57] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[14:05:16] <TheresNoTime>	 Daimona: now starting the prod one
[14:05:45] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[14:05:52] <Daimona>	 Thanks! Looks like we missed the beta update job by a few seconds. I could perhaps retrigger it manually to test in beta first
[14:06:01] <TheresNoTime>	 Lucas_WMDE: logmsgbot is still meant to be logging deployment here right..?
[14:06:12] <Lucas_WMDE>	 I think so yeah?
[14:06:12] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:991352|prod: Stop setting $wgCampaignEventsEnableParticipantQuestions (T347608)]]
[14:06:15] <Lucas_WMDE>	 there it goes
[14:06:16] <stashbot>	 T347608: Remove feature flag for Participant Questions - https://phabricator.wikimedia.org/T347608
[14:06:22] <TheresNoTime>	 and yeah good idea Daimona 
[14:06:42] <Lucas_WMDE>	 TheresNoTime: was it still waiting for the patch to be merged? I don’t think it logs until the actual scap sync starts
[14:06:58] <Lucas_WMDE>	 (though I would’ve expected messages about the +2 and “merged” from another bot)
[14:07:04] <Lucas_WMDE>	 (wikibugs ig)
[14:07:11] <TheresNoTime>	 ah another bot does that
[14:07:19] <Lucas_WMDE>	 wikibugs: status
[14:07:29] <Lucas_WMDE>	 (idk if it has a status command, worth a shot though)
[14:08:33] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:08:58] <Lucas_WMDE>	 wikibugs’ toolforge jobs are running, at least…
[14:09:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1034.eqiad.wmnet
[14:10:23] <logmsgbot>	 !log samtar@deploy2002 samtar and daimona: Backport for [[gerrit:991352|prod: Stop setting $wgCampaignEventsEnableParticipantQuestions (T347608)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:10:25] <Lucas_WMDE>	 in #wikimedia-dev wikibugs hasn’t written anything new since 15:06 CET
[14:10:28] <TheresNoTime>	 Daimona: prod patch is live on mwdebug (you can test it after you've checked beta if you want?)
[14:10:30] <Daimona>	 beta has been updated, so testing there now @HouseOfM
[14:11:39] <Lucas_WMDE>	 well, there goes wikibugs
[14:11:56] <TheresNoTime>	 just did a `toolforge jobs load libera/k8s-jobs.yaml` 
[14:12:09] <Lucas_WMDE>	 seems to be back in -dev at least
[14:12:39] <Daimona>	 Beta looking fine, now testing prod. HouseOfM: I think you can also test in prod directly
[14:13:02] <HouseOfM>	 thx
[14:14:12] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm
[14:14:33] <Daimona>	 I made this test event: https://test.wikipedia.org/wiki/Event:T347608
[14:14:34] <stashbot>	 T347608: Remove feature flag for Participant Questions - https://phabricator.wikimedia.org/T347608
[14:14:55] <Daimona>	 You can try registering there and see if you get the questions
[14:15:00] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P56773 and previous config saved to /var/cache/conftool/dbconfig/20240214-141459-ladsgroup.json
[14:15:06] <claime>	 !log Draining and cordoning kubernetes2019.codfw.wmnet kubernetes2018.codfw.wmnet mw2420.codfw.wmnet mw2421.codfw.wmnet mw2406.codfw.wmnet mw2422.codfw.wmnet mw2423.codfw.wmnet for T355864
[14:15:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:11] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[14:15:50] <HouseOfM>	 LGTM
[14:15:58] <TheresNoTime>	 happy to sync? :)
[14:17:34] <HouseOfM>	 @Daimona, it's not redirecting after answering the questions. is that expected?
[14:17:53] <Daimona>	 wdym?
[14:18:30] <wikibugs>	 (03CR) 10Hashar: "Yup then my trouble is whether we want to build each commit or if it is fine to only build an image which would include all the changes ;)" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar)
[14:19:17] <HouseOfM>	 When registering after answering the questions, it gives a positive message, but stays on the questions form
[14:20:28] <wikibugs>	 (03CR) 10Herron: [C: 03+1] Profiler: Silence "RedisException: Connection timed out" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003083 (https://phabricator.wikimedia.org/T348756) (owner: 10Krinkle)
[14:22:06] <HouseOfM>	 nm, ignore that. It's fine
[14:22:12] <HouseOfM>	 go ahead
[14:22:27] <TheresNoTime>	 ack
[14:22:32] <logmsgbot>	 !log samtar@deploy2002 samtar and daimona: Continuing with sync
[14:23:19] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch restbase1034 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003427 (https://phabricator.wikimedia.org/T349619)
[14:24:32] <wikibugs>	 (03CR) 10Clément Goubert: "I don't see a problem with building only the end result if you've tested local builds (including a sample of dependent images) work with y" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T342346) (owner: 10Hashar)
[14:25:37] <Daimona>	 Sorry folks, power went out :D
[14:25:54] <TheresNoTime>	 :D that patch is currently syncing
[14:25:58] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2311.codfw.wmnet with OS bullseye
[14:26:01] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2379.codfw.wmnet with OS bullseye
[14:26:03] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2380.codfw.wmnet with OS bullseye
[14:26:05] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2383.codfw.wmnet with OS bullseye
[14:26:07] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2335.codfw.wmnet with OS bullseye
[14:26:16] <wikibugs>	 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357446 (10Jhancock.wm) @MatthewVernon this port errored three times yesterday. There's no active alert on it right now but I think I want to replace the DAC anyway. Is it safe to do so?
[14:27:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch restbase1034 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003427 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[14:27:09] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch restbase1035 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003428 (https://phabricator.wikimedia.org/T349619)
[14:27:11] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch restbase1036 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003429 (https://phabricator.wikimedia.org/T349619)
[14:27:13] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch restbase1037 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003430 (https://phabricator.wikimedia.org/T349619)
[14:27:15] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch restbase1038 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003431 (https://phabricator.wikimedia.org/T349619)
[14:27:17] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch restbase1039 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003432 (https://phabricator.wikimedia.org/T349619)
[14:27:19] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch restbase1040 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003433 (https://phabricator.wikimedia.org/T349619)
[14:27:21] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch restbase1041 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003434 (https://phabricator.wikimedia.org/T349619)
[14:27:23] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch restbase1042 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1003435 (https://phabricator.wikimedia.org/T349619)
[14:27:36] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2005.codfw.wmnet with reason: host reimage
[14:29:29] <wikibugs>	 (03PS1) 10Brouberol: eventlogging: tweak PYTHONPATH to allow eventlogging to import _mysql.so [puppet] - 10https://gerrit.wikimedia.org/r/1003438 (https://phabricator.wikimedia.org/T349289)
[14:29:49] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:991352|prod: Stop setting $wgCampaignEventsEnableParticipantQuestions (T347608)]] (duration: 23m 37s)
[14:29:53] <stashbot>	 T347608: Remove feature flag for Participant Questions - https://phabricator.wikimedia.org/T347608
[14:30:07] <TheresNoTime>	 Daimona and HouseOfM: live on prod, can you double-check?
[14:30:09] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P56774 and previous config saved to /var/cache/conftool/dbconfig/20240214-143006-ladsgroup.json
[14:30:56] <Daimona>	 yup
[14:31:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1034.eqiad.wmnet
[14:31:23] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[14:31:41] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2005.codfw.wmnet with reason: host reimage
[14:31:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1035.eqiad.wmnet
[14:32:47] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:33:12] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357446 (10MatthewVernon) Yes, please go ahead whenever is convenient (if you can let me know when done I can check the node is still happy).
[14:33:26] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357446 (10MatthewVernon)
[14:33:29] <TheresNoTime>	 !log close UTC afternoon backport window
[14:33:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:56] <Daimona>	 LGTM
[14:34:03] <TheresNoTime>	 :D
[14:34:51] <claime>	 !log Depooling mw2402|mw2403|mw2404|mw2405|mw2407|mw2408|mw2409|mw2401|mw2410|mw2411|parse2001|parse2002|parse2003 for T355864
[14:34:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:34:55] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[14:35:19] <logmsgbot>	 !log cgoubert@cumin2002 conftool action : set/pooled=inactive; selector: name=(mw2402|mw2403|mw2404|mw2405|mw2407|mw2408|mw2409|mw2401|mw2410|mw2411|parse2001|parse2002|parse2003).*
[14:36:00] <claime>	 topranks: ^good in two minutes
[14:36:50] <topranks>	 claime: ok great, we're not starting till 16:00 utc so lots of time
[14:36:51] <topranks>	 thanks!
[14:37:02] <Daimona>	 HouseOfM: anything on your side?
[14:38:34] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:38:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1035.eqiad.wmnet
[14:38:54] <HouseOfM>	 no, all good
[14:39:23] <Daimona>	 Then I think we're done! Thanks TheresNoTime!
[14:39:31] <TheresNoTime>	 np!
[14:39:43] <HouseOfM>	 ty!
[14:40:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1037.eqiad.wmnet
[14:41:58] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2335.codfw.wmnet with reason: host reimage
[14:42:01] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2311.codfw.wmnet with reason: host reimage
[14:42:58] <logmsgbot>	 !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2383.codfw.wmnet with OS bullseye
[14:43:04] <logmsgbot>	 !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2379.codfw.wmnet with OS bullseye
[14:43:17] <logmsgbot>	 !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2380.codfw.wmnet with OS bullseye
[14:43:34] <wikibugs>	 SRE, Wikimedia-Etherpad, collaboration-services, Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (Jelto) >>! In T316421#9538988, @Dzahn wrote: > test instance etherpad-bookworm.devtools now has etherpad-lite 1.9.7-2 installed by puppet  and `...
[14:44:21] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2380.codfw.wmnet with OS bullseye
[14:44:26] <claime>	 !log Restarted rsyslog on A:wikikube-master
[14:44:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:44:43] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2335.codfw.wmnet with reason: host reimage
[14:44:47] <Lucas_WMDE>	 phabricator hanging for anyone else?
[14:44:50] <taavi>	 yes
[14:44:55] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2379.codfw.wmnet with OS bullseye
[14:45:03] <taavi>	 aand it's back
[14:45:10] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2383.codfw.wmnet with OS bullseye
[14:45:16] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T352010)', diff saved to https://phabricator.wikimedia.org/P56776 and previous config saved to /var/cache/conftool/dbconfig/20240214-144514-ladsgroup.json
[14:45:17] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[14:45:22] <godog>	 snoozing on the job
[14:45:24] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[14:45:26] <Lucas_WMDE>	 bit slow, but seems back yes
[14:45:31] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[14:45:38] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T352010)', diff saved to https://phabricator.wikimedia.org/P56777 and previous config saved to /var/cache/conftool/dbconfig/20240214-144537-ladsgroup.json
[14:45:51] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2005.codfw.wmnet with OS bookworm
[14:45:58] <jinxer-wm>	 (ProbeDown) firing: (3) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:46:09] <jynus>	 checking
[14:46:12] <TheresNoTime>	 hm
[14:46:21] <volans>	 I see 5xx and big drop in requests
[14:46:43] <jynus>	 riccardo acked it
[14:47:06] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2311.codfw.wmnet with reason: host reimage
[14:47:16] <volans>	 I think I had already done it :D
[14:48:23] <wikibugs>	 SRE, ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T357445 (Jhancock.wm) @cmooney hey I checked these in netbox. none of the ports listed are active. Can you run homer for this when you have a chance? thanks!
[14:50:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1037.eqiad.wmnet
[14:50:58] <jinxer-wm>	 (ProbeDown) resolved: (3) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:51:18] <jinxer-wm>	 (NELHigh) firing: (2) Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[14:51:29] <volans>	 acked
[14:51:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1038.eqiad.wmnet
[14:51:51] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357446 (Jhancock.wm) it's been replaced.
[14:52:06] <logmsgbot>	 !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2383.codfw.wmnet with OS bullseye
[14:52:11] <logmsgbot>	 !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2379.codfw.wmnet with OS bullseye
[14:52:14] <logmsgbot>	 !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2380.codfw.wmnet with OS bullseye
[14:56:18] <jinxer-wm>	 (NELHigh) resolved: (2) Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[14:56:56] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[14:57:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1038.eqiad.wmnet
[14:57:30] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2379.codfw.wmnet with OS bullseye
[14:57:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1039.eqiad.wmnet
[14:58:08] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[14:58:14] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[14:58:20] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[14:58:37] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:47] <wikibugs>	 SRE, Domains, Fundraising-Backlog, serviceops: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (AKanji-WMF) Tagging you @Dwisehaupt as Jeff is out this AM.
[15:00:04] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1500)
[15:00:58] <logmsgbot>	 !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ncmonitor1001.eqiad.wmnet with OS bookworm
[15:03:12] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357446 (MatthewVernon) Great, thanks, I can confirm that swift is happy with that node.
[15:03:55] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2335.codfw.wmnet with OS bullseye
[15:05:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1039.eqiad.wmnet
[15:06:04] <logmsgbot>	 !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2379.codfw.wmnet with OS bullseye
[15:06:18] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2379.codfw.wmnet with OS bullseye
[15:07:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1040.eqiad.wmnet
[15:07:40] <icinga-wm_>	 PROBLEM - ensure kvm processes are running on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:07:50] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2311.codfw.wmnet with OS bullseye
[15:09:11] <wikibugs>	 SRE, observability, serviceops: Cert renewal for {appserver,api}.svc.{eqiad,codfw}.wmnet - https://phabricator.wikimedia.org/T304237 (lmata) Open→Resolved a:lmata I will boldly resolve this. I discussed this with the team, and we agreed the strategy here is to renew/purchase dates to be m...
[15:09:26] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357446 (Jhancock.wm) Open→Resolved a:Jhancock.wm
[15:11:36] <logmsgbot>	 !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2379.codfw.wmnet with OS bullseye
[15:13:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1040.eqiad.wmnet
[15:14:05] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2380.codfw.wmnet with OS bullseye
[15:15:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1041.eqiad.wmnet
[15:21:32] <logmsgbot>	 !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host mw2380.codfw.wmnet with OS bullseye
[15:21:45] <icinga-wm_>	 RECOVERY - ensure kvm processes are running on cloudvirt1041 is OK: PROCS OK: 2 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[15:22:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1041.eqiad.wmnet
[15:24:07] <wikibugs>	 ops-codfw, serviceops: Multiple hosts in codfw fail to PXE boot upon reimage - https://phabricator.wikimedia.org/T357539 (hnowlan)
[15:26:58] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (hnowlan) >>! In T355333#9538260, @Jhancock.wm wrote: > idk if this would help, but can we run the provisioning script with the --no-dhcp and --no-user tags. to catch any bios...
[15:30:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host restbase1042.eqiad.wmnet
[15:30:42] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644 (aborrero)
[15:31:44] <wikibugs>	 SRE, Infrastructure-Foundations, cloud-services-team, netops, User-aborrero: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (aborrero)
[15:31:54] <wikibugs>	 SRE, Infrastructure-Foundations, cloud-services-team, netops, User-aborrero: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184 (aborrero) Stalled→Open In a 2024-02-14 network sync meeting we decided to continue moving older cloudvirts into the new single NI...
[15:35:32] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (Jhancock.wm) sre.hosts.provision <hostname> --no-dhcp --no-user
[15:36:37] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (Volans) >>! In T355333#9542804, @Jhancock.wm wrote: > sre.hosts.provision <hostname> --no-dhcp --no-user  Also `--no-switch` in this case I'd say.
[15:37:38] <wikibugs>	 SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): volatile: We need to configure the volatile endpoint on puppetserveres - https://phabricator.wikimedia.org/T341056 (jhathaway) thanks for the additional context @Muehlenhoff!
[15:37:52] <wikibugs>	 SRE, Wikimedia-Etherpad, collaboration-services, Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (akosiaris) >>! In T316421#9542499, @Jelto wrote:  > We probably don't want two etherpad services running in parallel.   We most definitely don't...
[15:37:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host restbase1042.eqiad.wmnet
[15:44:22] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2005.codfw.wmnet
[15:44:53] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.provision for host mw2282.mgmt.codfw.wmnet with reboot policy GRACEFUL
[15:45:10] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2121.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:45:15] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[15:45:23] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2121.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:45:25] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2132.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:45:38] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2132.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:45:40] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2145.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:46:05] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2145.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:46:07] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2104.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:46:21] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2104.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:46:22] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2153.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:46:36] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2153.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:46:37] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2154.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:46:52] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2154.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:46:53] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2175.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:47:06] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2175.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:47:09] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2176.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:47:35] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2176.codfw.wmnet with reason: T355864 - Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw
[15:47:53] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T355864 - Depool db2121 db2132 db2145 db2104 db2153 db2154 db2175 db2176', diff saved to https://phabricator.wikimedia.org/P56778 and previous config saved to /var/cache/conftool/dbconfig/20240214-154753-arnaudb.json
[15:50:53] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a5-codfw.mgmt with reason: prepping for server uplink migration codfw rack a5
[15:51:10] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a5-codfw.mgmt with reason: prepping for server uplink migration codfw rack a5
[15:51:30] <wikibugs>	 SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9a43620e-deca-432c-aa1f-5d6e939b51bc) set by cmooney@cumin...
[15:53:05] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[15:53:10] <stashbot>	 T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[15:53:16] <logmsgbot>	 !log hnowlan@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host mw2282.mgmt.codfw.wmnet with reboot policy GRACEFUL
[15:53:45] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm
[15:54:43] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye
[15:55:06] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[15:59:07] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[15:59:12] <stashbot>	 T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[15:59:38] <topranks>	 !log disable puppet fleet-wide to allow for disruption to puppetmaster/puppetserver during network maint T355864
[15:59:41] <wikibugs>	 SRE, MW-on-K8s, Scap, serviceops, Release-Engineering-Team (Now this 🫠): Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402 (thcipriani)
[15:59:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:59:43] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[16:04:58] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[16:05:21] <stashbot>	 T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[16:06:43] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 38 hosts with reason: Migrating servers in codfw rack A5 to lsw1-a5-codfw
[16:07:17] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 38 hosts with reason: Migrating servers in codfw rack A5 to lsw1-a5-codfw
[16:07:27] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2005.codfw.wmnet with reason: host reimage
[16:07:30] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on sretest2005.codfw.wmnet with reason: host reimage
[16:07:46] <wikibugs>	 SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ec1ab967-b8f5-4bfd-914e-e76afe369468) set by cmooney@cumin...
[16:07:53] <topranks>	 !log Moving server uplinks from old switch to new codfw rack A5 T355864
[16:07:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:08:03] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[16:11:07] <wikibugs>	 SRE-tools, Ganeti, Infrastructure-Foundations: ncmonitor1001 install issues (Ganeti VM fails to reboot after "gnt-instance modify") - https://phabricator.wikimedia.org/T357449 (BCornwall) In progress→Resolved a:BCornwall I'm still not sure where the problem lies and am concerned that this...
[16:14:57] <wikibugs>	 SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (cmooney) All links moved and all devices pinging ok again.
[16:15:07] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (hnowlan) This command also fails - but interestingly the host itself appears to have lost network connectivity. `ethtool` reports that the link is up but I can't connect in o...
[16:16:09] <wikibugs>	 SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (ABran-WMF) awesome, will start repooling, thanks @cmooney
[16:16:14] <claime>	 !log Uncordoning kubernetes2019.codfw.wmnet kubernetes2018.codfw.wmnet mw2420.codfw.wmnet mw2421.codfw.wmnet mw2406.codfw.wmnet mw2422.codfw.wmnet mw2423.codfw.wmnet for T355864
[16:16:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:16:19] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[16:16:50] <claime>	 !log Repooling mw2402|mw2403|mw2404|mw2405|mw2407|mw2408|mw2409|mw2401|mw2410|mw2411|parse2001|parse2002|parse2003 for T355864
[16:16:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:17:02] <wikibugs>	 SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (Eevans) Resolved→Open @Jclark-ctr did restbase1036 get imaged?  I don't see any comments from the cookbook...
[16:17:07] <logmsgbot>	 !log cgoubert@cumin2002 conftool action : set/pooled=yes; selector: name=(mw2402|mw2403|mw2404|mw2405|mw2407|mw2408|mw2409|mw2401|mw2410|mw2411|parse2001|parse2002|parse2003).*
[16:18:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 25%: T355864 - Post migration repool of db2121', diff saved to https://phabricator.wikimedia.org/P56779 and previous config saved to /var/cache/conftool/dbconfig/20240214-161824-arnaudb.json
[16:19:25] <wikibugs>	 SRE, Release-Engineering-Team (Backlog): mwdebug2001 and mwdebug2002 "/" almost full - https://phabricator.wikimedia.org/T219989 (thcipriani)
[16:19:40] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2005.codfw.wmnet
[16:20:28] <logmsgbot>	 !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye
[16:22:34] <icinga-wm_>	 RECOVERY - spamassassin on vrts1002 is OK: PROCS OK: 2 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin
[16:25:34] <icinga-wm_>	 PROBLEM - spamassassin on vrts1002 is CRITICAL: PROCS CRITICAL: 0 processes with args spamd https://wikitech.wikimedia.org/wiki/Mail%23SpamAssassin
[16:33:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 50%: T355864 - Post migration repool of db2121', diff saved to https://phabricator.wikimedia.org/P56780 and previous config saved to /var/cache/conftool/dbconfig/20240214-163330-arnaudb.json
[16:33:38] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[16:34:46] <icinga-wm_>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 165 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:36:00] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (cmooney) >>! In T355333#9543167, @hnowlan wrote: > This command also fails - but interestingly the host itself appears to have lost network connectivity. `ethtool` reports th...
[16:37:41] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2005.codfw.wmnet with OS bookworm
[16:37:48] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (cmooney) The reimage problem may be the firmware issue - link not coming up during the debian installer.  @hnowlan if you want to try the reimage again I can take a look at t...
[16:39:14] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (Jhancock.wm) port shows activity on the server, but the network side is showing as down. Reseating either cable does nothing. but reseating the SFP makes it come back up  Pos...
[16:39:46] <icinga-wm_>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 44 probes of 733 (alerts on 90) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[16:39:59] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (Jhancock.wm) @cmooney it was me, I was reseating the cable
[16:41:32] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (cmooney) >>! In T355333#9543231, @Jhancock.wm wrote: > port shows activity on the server, but the network side is showing as down. Reseating either cable does nothing. but re...
[16:42:05] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad: 1 VM request for ncmonitor - https://phabricator.wikimedia.org/T356710 (BCornwall) In progress→Resolved Thanks!
[16:47:28] <icinga-wm_>	 RECOVERY - Disk space on vrts1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=vrts1002&var-datasource=eqiad+prometheus/ops
[16:48:16] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye
[16:48:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 75%: T355864 - Post migration repool of db2121', diff saved to https://phabricator.wikimedia.org/P56781 and previous config saved to /var/cache/conftool/dbconfig/20240214-164834-arnaudb.json
[16:48:39] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[16:49:49] <wikibugs>	 (PS1) JHathaway: puppetserver: don't fail setting up rsync_module [puppet] - https://gerrit.wikimedia.org/r/1003486
[16:52:14] <logmsgbot>	 !log hnowlan@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host mw2282.codfw.wmnet with OS bullseye
[16:52:17] <wikibugs>	 (CR) Ssingh: [C: +1] Remove deprecated X-Webkit-CSP-Report-Only response header [puppet] - https://gerrit.wikimedia.org/r/1003109 (https://phabricator.wikimedia.org/T357479) (owner: TheDJ)
[16:53:38] <wikibugs>	 (PS2) JHathaway: puppetserver: don't fail setting up rsync_module [puppet] - https://gerrit.wikimedia.org/r/1003486 (https://phabricator.wikimedia.org/T356991)
[16:53:53] <wikibugs>	 (CR) Fabfur: [V: +1 C: +2] Remove deprecated X-Webkit-CSP-Report-Only response header [puppet] - https://gerrit.wikimedia.org/r/1003109 (https://phabricator.wikimedia.org/T357479) (owner: TheDJ)
[16:54:13] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (hnowlan) Reimaging fails still after these changes fwiw - however, a reboot has broken network connectivity again?! The host is up and rebooted in the management interface, b...
[16:55:16] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1003486 (https://phabricator.wikimedia.org/T356991) (owner: JHathaway)
[16:56:37] <fabfur>	 !log disabled puppet on A:cp-upload to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003109 selectively (T357479)
[16:56:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:42] <stashbot>	 T357479: Stop sending X-Webkit-CSP and X-Webkit-CSP-Report-Only headers - https://phabricator.wikimedia.org/T357479
[16:59:27] <wikibugs>	 SRE, ops-codfw, DC-Ops, serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (Jhancock.wm) replaced the SFP this time. came up. server reboot is causing the port to go down, possibly
[17:03:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2121 (re)pooling @ 100%: T355864 - Post migration repool of db2121', diff saved to https://phabricator.wikimedia.org/P56782 and previous config saved to /var/cache/conftool/dbconfig/20240214-170339-arnaudb.json
[17:03:46] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 25%: T355864 - Post migration repool of db2145', diff saved to https://phabricator.wikimedia.org/P56783 and previous config saved to /var/cache/conftool/dbconfig/20240214-170345-arnaudb.json
[17:03:56] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[17:05:07] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "This file is starting to get a  bit out of control in size, but I guess is out of scope. The change looks ok to me but I'd like some more " [puppet] - 10https://gerrit.wikimedia.org/r/1003464 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[17:10:32] <fabfur>	 !log enabled puppet on A:cp-upload to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003109 selectively (T357479)
[17:10:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:10:47] <stashbot>	 T357479: Stop sending X-Webkit-CSP and X-Webkit-CSP-Report-Only headers - https://phabricator.wikimedia.org/T357479
[17:13:16] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.reimage for host mw2282.codfw.wmnet with OS bullseye
[17:13:58] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1 C: 03+2] "Thanks @TheDJ for this patch, has been applied successfully to our servers!" [puppet] - 10https://gerrit.wikimedia.org/r/1003109 (https://phabricator.wikimedia.org/T357479) (owner: 10TheDJ)
[17:16:01] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10cmooney) >>! In T355333#9543255, @hnowlan wrote: > Reimaging fails still after these changes fwiw - however, a reboot has broken network connectivity again?! The host is up a...
[17:18:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 50%: T355864 - Post migration repool of db2145', diff saved to https://phabricator.wikimedia.org/P56784 and previous config saved to /var/cache/conftool/dbconfig/20240214-171850-arnaudb.json
[17:18:53] <wikibugs>	 (03CR) 10Daniel Kinzler: Configure parser cache filters for parsoid-pcache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) (owner: 10Daniel Kinzler)
[17:19:03] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[17:19:15] <wikibugs>	 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10Dzahn) >>! In T316421#9542499, @Jelto wrote:  > and `prometheus-etherpad-exporter` `0.7` as well. The `etherpad-lite` package also installed `no...
[17:20:30] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "https://phabricator.wikimedia.org/T316421#9542499" [puppet] - 10https://gerrit.wikimedia.org/r/999973 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn)
[17:20:37] <wikibugs>	 (03CR) 10Dzahn: [C: 04-1] "https://phabricator.wikimedia.org/T316421#9542499" [puppet] - 10https://gerrit.wikimedia.org/r/1003073 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn)
[17:23:02] <wikibugs>	 (03PS1) 10Ayounsi: makevm: pass the v6 IP to GntInstance.add [cookbooks] - 10https://gerrit.wikimedia.org/r/1003490 (https://phabricator.wikimedia.org/T300152)
[17:27:11] <wikibugs>	 (03PS1) 10Ayounsi: Ganeti: pass the v4 and v6 IPs to the VM as fw_cfg [software/spicerack] - 10https://gerrit.wikimedia.org/r/1003491 (https://phabricator.wikimedia.org/T300152)
[17:27:36] <wikibugs>	 (03PS1) 10Dzahn: etherpad: add data types to parameters [puppet] - 10https://gerrit.wikimedia.org/r/1003492
[17:29:21] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2282.codfw.wmnet with reason: host reimage
[17:31:37] <wikibugs>	 (03CR) 10Ladsgroup: "It's mostly because it's basically the second or third biggest user_properties table in the whole infra, so we can just clean that up." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003408 (https://phabricator.wikimedia.org/T357072) (owner: 10Ladsgroup)
[17:31:57] <Amir1>	 jouncebot: nowandnext
[17:31:58] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 28 minute(s)
[17:31:58] <jouncebot>	 In 0 hour(s) and 28 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1800)
[17:32:05] <logmsgbot>	 !log fnegri@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host cloudvirtlocal1001.eqiad.wmnet
[17:32:14] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2282.codfw.wmnet with reason: host reimage
[17:32:23] <wikibugs>	 (03PS1) 10Dzahn: etherpad: add $service_ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/1003493 (https://phabricator.wikimedia.org/T316421)
[17:32:33] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10hnowlan) I tried another reimage and it is currently proceeding successfully - maybe replacing the SFP did the job? This is all a bit inexplicable.
[17:32:41] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1001.eqiad.wmnet
[17:32:57] <wikibugs>	 (03CR) 10Dzahn: "he" [puppet] - 10https://gerrit.wikimedia.org/r/1003493 (https://phabricator.wikimedia.org/T316421) (owner: 10Dzahn)
[17:33:04] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Enable echo conditional defaults for loginwiki since 2013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003408 (https://phabricator.wikimedia.org/T357072) (owner: 10Ladsgroup)
[17:33:53] <wikibugs>	 (03Merged) 10jenkins-bot: Enable echo conditional defaults for loginwiki since 2013 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003408 (https://phabricator.wikimedia.org/T357072) (owner: 10Ladsgroup)
[17:33:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 75%: T355864 - Post migration repool of db2145', diff saved to https://phabricator.wikimedia.org/P56785 and previous config saved to /var/cache/conftool/dbconfig/20240214-173355-arnaudb.json
[17:34:01] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[17:34:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Ganeti: pass the v4 and v6 IPs to the VM as fw_cfg [software/spicerack] - 10https://gerrit.wikimedia.org/r/1003491 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[17:35:02] <wikibugs>	 (03PS39) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[17:36:05] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:1003408|Enable echo conditional defaults for loginwiki since 2013 (T357072)]]
[17:36:11] <stashbot>	 T357072: Echo: Drop droppable rows from user_properties - https://phabricator.wikimedia.org/T357072
[17:38:42] <wikibugs>	 (03PS4) 10Effie Mouzeli: mw-mcrouter: add helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690)
[17:38:47] <wikibugs>	 (03CR) 10Effie Mouzeli: mw-mcrouter: add helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[17:38:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mw-mcrouter: add helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[17:39:02] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1003408|Enable echo conditional defaults for loginwiki since 2013 (T357072)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:39:25] <logmsgbot>	 !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirtlocal1001.eqiad.wmnet
[17:39:49] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on puppetserver2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[17:41:02] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[17:41:33] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Multiple hosts in codfw fail to PXE boot upon reimage - https://phabricator.wikimedia.org/T357539 (10cmooney) All of these are connected to lsw1-a3-codfw (new L3 switch) and they may be the first we've tried to reimage connected to new switch.  Investigating if it is related...
[17:44:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[17:44:20] <stashbot>	 T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[17:48:14] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:1003408|Enable echo conditional defaults for loginwiki since 2013 (T357072)]] (duration: 12m 08s)
[17:48:18] <stashbot>	 T357072: Echo: Drop droppable rows from user_properties - https://phabricator.wikimedia.org/T357072
[17:48:33] <wikibugs>	 (03CR) 10Volans: "Apart from the tests, looks ok" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1003491 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[17:48:48] <wikibugs>	 (03PS1) 10Dzahn: phabricator,etherpad: fix some puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/1003496
[17:49:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 100%: T355864 - Post migration repool of db2145', diff saved to https://phabricator.wikimedia.org/P56786 and previous config saved to /var/cache/conftool/dbconfig/20240214-174900-arnaudb.json
[17:49:05] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[17:49:06] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2104 (re)pooling @ 25%: T355864 - Post migration repool of db2104', diff saved to https://phabricator.wikimedia.org/P56787 and previous config saved to /var/cache/conftool/dbconfig/20240214-174906-arnaudb.json
[17:49:50] <wikibugs>	 (03CR) 10Volans: "LGTM, to be merged after the spicerack release with the related patch." [cookbooks] - 10https://gerrit.wikimedia.org/r/1003490 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[17:50:38] <wikibugs>	 (03PS2) 10Dzahn: phabricator,etherpad: fix some puppet-lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/1003496
[17:50:38] <wikibugs>	 10SRE, 10ops-codfw, 10serviceops: Multiple hosts in codfw fail to PXE boot upon reimage - https://phabricator.wikimedia.org/T357539 (10cmooney) I think what's happening is the new switch is not configured to insert the port information for DHCP requests over the legacy row-wide vlan.  Best way forward is to...
[17:51:25] <wikibugs>	 10SRE, 10Wikimedia-Etherpad, 10collaboration-services, 10Patch-For-Review: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421 (10Dzahn) >>! In T316421#9542499, @Jelto wrote: >There is no puppet flag to enable or disable the process.   https://gerrit.wikimedia.org/r/c/opera...
[17:54:38] <wikibugs>	 (03PS2) 10Dzahn: etherpad: add $service_ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/1003493 (https://phabricator.wikimedia.org/T316421)
[17:56:38] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2282.codfw.wmnet with OS bullseye
[17:58:33] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1002.eqiad.wmnet
[17:59:01] <wikibugs>	 (03PS3) 10Dzahn: site: add etherpad role to etherpad1004 [puppet] - 10https://gerrit.wikimedia.org/r/999973 (https://phabricator.wikimedia.org/T316421)
[17:59:29] <wikibugs>	 (03PS4) 10Dzahn: site: add etherpad role to etherpad1004 [puppet] - 10https://gerrit.wikimedia.org/r/999973 (https://phabricator.wikimedia.org/T316421)
[17:59:43] <logmsgbot>	 !log hnowlan@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on mw2282.codfw.wmnet with reason: Testing if reimage is stable T355333
[17:59:47] <logmsgbot>	 !log hnowlan@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mw2282.codfw.wmnet with reason: Testing if reimage is stable T355333
[17:59:48] <stashbot>	 T355333: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1800)
[18:01:28] <wikibugs>	 (03PS4) 10Hashar: python-build: default to run as nobody from /deploy [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1002997 (https://phabricator.wikimedia.org/T342346)
[18:01:30] <wikibugs>	 (03PS3) 10Hashar: python-build: add make and virtualenv [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1003002 (https://phabricator.wikimedia.org/T342346)
[18:01:32] <wikibugs>	 (03PS7) 10Hashar: python-build: set date of source files in the wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/940157 (https://phabricator.wikimedia.org/T259611)
[18:01:34] <wikibugs>	 (03PS3) 10Hashar: python-build: ensure frozen-requirements is exhaustive [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/941443 (https://phabricator.wikimedia.org/T342346)
[18:01:36] <wikibugs>	 (03PS1) 10Hashar: Rebuild python-build images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1003497
[18:01:38] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[18:01:55] <stashbot>	 T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[18:02:34] <wikibugs>	 (03PS2) 10Dzahn: site: apply etherpad role on both eqiad and codfw [puppet] - 10https://gerrit.wikimedia.org/r/1003073 (https://phabricator.wikimedia.org/T316421)
[18:02:51] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[18:03:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[18:04:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2104 (re)pooling @ 50%: T355864 - Post migration repool of db2104', diff saved to https://phabricator.wikimedia.org/P56788 and previous config saved to /var/cache/conftool/dbconfig/20240214-180411-arnaudb.json
[18:04:28] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[18:05:12] <logmsgbot>	 !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirtlocal1002.eqiad.wmnet
[18:05:41] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for mw2379 - cmooney@cumin1002"
[18:06:32] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add new entries for mw2379 - cmooney@cumin1002"
[18:06:32] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:06:48] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T352010)', diff saved to https://phabricator.wikimedia.org/P56789 and previous config saved to /var/cache/conftool/dbconfig/20240214-180647-ladsgroup.json
[18:07:13] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[18:08:33] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[18:09:20] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache mw2379.codfw.wmnet on all recursors
[18:09:24] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mw2379.codfw.wmnet on all recursors
[18:11:14] <hnowlan>	 !log running `homer 'cr*codfw*' commit 'T351074'` to pick up mw2282's bgp change
[18:11:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:11:20] <stashbot>	 T351074: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074
[18:11:25] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10hnowlan) 05Open→03Resolved a:03hnowlan Reimage was successful, networking survived a reboot. All done!
[18:12:07] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host mw2379.codfw.wmnet with OS bullseye
[18:13:44] <wikibugs>	 (03PS5) 10Effie Mouzeli: mw-mcrouter: add helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690)
[18:14:24] <logmsgbot>	 !log hnowlan@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=mw2282.codfw.wmnet
[18:17:29] <wikibugs>	 (03PS1) 10Ladsgroup: exim: Avoid considering wikimedia domains as local in WMCS [puppet] - 10https://gerrit.wikimedia.org/r/1003498 (https://phabricator.wikimedia.org/T343925)
[18:18:24] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/1003486 (https://phabricator.wikimedia.org/T356991) (owner: 10JHathaway)
[18:18:24] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudvirtlocal1003.eqiad.wmnet
[18:19:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2104 (re)pooling @ 75%: T355864 - Post migration repool of db2104', diff saved to https://phabricator.wikimedia.org/P56790 and previous config saved to /var/cache/conftool/dbconfig/20240214-181916-arnaudb.json
[18:19:21] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[18:21:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P56791 and previous config saved to /var/cache/conftool/dbconfig/20240214-182154-ladsgroup.json
[18:24:54] <logmsgbot>	 !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudvirtlocal1003.eqiad.wmnet
[18:27:29] <wikibugs>	 (03PS1) 10Hnowlan: mw-jobrunner: bump replicas for cirrusSearchLinksUpdate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003499 (https://phabricator.wikimedia.org/T349796)
[18:31:40] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2379.codfw.wmnet with reason: host reimage
[18:34:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2104 (re)pooling @ 100%: T355864 - Post migration repool of db2104', diff saved to https://phabricator.wikimedia.org/P56792 and previous config saved to /var/cache/conftool/dbconfig/20240214-183421-arnaudb.json
[18:34:26] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[18:34:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 25%: T355864 - Post migration repool of db2153', diff saved to https://phabricator.wikimedia.org/P56793 and previous config saved to /var/cache/conftool/dbconfig/20240214-183426-arnaudb.json
[18:34:36] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2379.codfw.wmnet with reason: host reimage
[18:34:51] <wikibugs>	 10SRE, 10Domains, 10Fundraising-Backlog, 10serviceops: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (10Dwisehaupt) @RLazarus Does this really need an apache config patch or just an update to the redirect rules in `hieradata/common/mediawiki.yaml`?
[18:37:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P56794 and previous config saved to /var/cache/conftool/dbconfig/20240214-183700-ladsgroup.json
[18:37:07] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[18:37:12] <stashbot>	 T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[18:39:32] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[18:43:13] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (10Jhancock.wm) I will mark the SFP I pulled as bad. See if I can test it on a new server.
[18:46:53] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus updater: Introduce backfill releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003502 (https://phabricator.wikimedia.org/T356303)
[18:47:46] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for codfw mw servers - cmooney@cumin1002"
[18:47:51] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cirrus updater: Introduce backfill releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003502 (https://phabricator.wikimedia.org/T356303) (owner: 10Ebernhardson)
[18:48:39] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for codfw mw servers - cmooney@cumin1002"
[18:48:39] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:49:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 50%: T355864 - Post migration repool of db2153', diff saved to https://phabricator.wikimedia.org/P56795 and previous config saved to /var/cache/conftool/dbconfig/20240214-184931-arnaudb.json
[18:49:36] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[18:51:57] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache mw2380.codfw.wmnet on all recursors
[18:52:00] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mw2380.codfw.wmnet on all recursors
[18:52:07] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T352010)', diff saved to https://phabricator.wikimedia.org/P56796 and previous config saved to /var/cache/conftool/dbconfig/20240214-185207-ladsgroup.json
[18:52:09] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[18:52:12] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[18:52:12] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[18:52:15] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache mw2383.codfw.wmnet on all recursors
[18:52:18] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) mw2383.codfw.wmnet on all recursors
[18:52:19] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1242 (T352010)', diff saved to https://phabricator.wikimedia.org/P56797 and previous config saved to /var/cache/conftool/dbconfig/20240214-185218-ladsgroup.json
[18:53:30] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host mw2380.codfw.wmnet with OS bullseye
[18:54:13] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host mw2383.codfw.wmnet with OS bullseye
[18:57:07] <wikibugs>	 (03PS1) 10Jdlrobson: New communities will not share scripts going forward [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003505 (https://phabricator.wikimedia.org/T331679)
[18:57:09] <wikibugs>	 (03PS2) 10Ebernhardson: cirrus updater: Introduce backfill releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003502 (https://phabricator.wikimedia.org/T356303)
[18:58:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cirrus updater: Introduce backfill releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003502 (https://phabricator.wikimedia.org/T356303) (owner: 10Ebernhardson)
[18:58:23] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2379.codfw.wmnet with OS bullseye
[19:00:04] <jouncebot>	 jeena and brennen: gettimeofday() says it's time for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1900)
[19:00:04] <jouncebot>	 jeena and brennen: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T1900).
[19:00:54] <brennen>	 o/
[19:04:37] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 75%: T355864 - Post migration repool of db2153', diff saved to https://phabricator.wikimedia.org/P56798 and previous config saved to /var/cache/conftool/dbconfig/20240214-190436-arnaudb.json
[19:04:53] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[19:08:06] <wikibugs>	 10SRE, 10procurement, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (10andrea.denisse)
[19:08:52] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2380.codfw.wmnet with reason: host reimage
[19:09:02] <wikibugs>	 10SRE, 10procurement, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (10andrea.denisse)
[19:09:29] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2383.codfw.wmnet with reason: host reimage
[19:10:13] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] puppetserver: don't fail setting up rsync_module [puppet] - 10https://gerrit.wikimedia.org/r/1003486 (https://phabricator.wikimedia.org/T356991) (owner: 10JHathaway)
[19:10:35] <wikibugs>	 (03PS3) 10Ebernhardson: cirrus updater: Introduce backfill releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003502 (https://phabricator.wikimedia.org/T356303)
[19:11:23] <wikibugs>	 (03PS1) 10Dzahn: add alert for planet content updates (last modified) [alerts] - 10https://gerrit.wikimedia.org/r/1003507 (https://phabricator.wikimedia.org/T353298)
[19:11:33] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2380.codfw.wmnet with reason: host reimage
[19:12:03] <wikibugs>	 10SRE, 10Domains, 10Fundraising-Backlog, 10serviceops: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (10RLazarus) Sorry yeah, I was using the term broadly. The goal is to edit the Apache config, but that hieradata file is how you'd do it. :)
[19:12:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] add alert for planet content updates (last modified) [alerts] - 10https://gerrit.wikimedia.org/r/1003507 (https://phabricator.wikimedia.org/T353298) (owner: 10Dzahn)
[19:13:43] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2383.codfw.wmnet with reason: host reimage
[19:14:25] <brennen>	 !log train 1.42.0-wmf.18 (T354436): logs chill, no current blockers, rolling to group1.
[19:14:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:14:30] <stashbot>	 T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436
[19:15:02] <wikibugs>	 (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003508 (https://phabricator.wikimedia.org/T354436)
[19:15:04] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003508 (https://phabricator.wikimedia.org/T354436) (owner: 10TrainBranchBot)
[19:15:44] <icinga-wm_>	 ACKNOWLEDGEMENT - BFD status on lsw1-a4-codfw.mgmt is CRITICAL: Down: 2 Cathal Mooney BFD is configured towards ganeti2034 but not configured on host. - The acknowledgement expires at: 2024-02-29 19:15:18. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:15:49] <wikibugs>	 (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003508 (https://phabricator.wikimedia.org/T354436) (owner: 10TrainBranchBot)
[19:16:58] <icinga-wm_>	 ACKNOWLEDGEMENT - BFD status on lsw1-b7-codfw.mgmt is CRITICAL: Down: 2 Cathal Mooney BFD is down to ganeti2023 as it's not configured host side. - The acknowledgement expires at: 2024-02-29 19:16:25. https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:17:18] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Introduce backfill releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003502 (https://phabricator.wikimedia.org/T356303) (owner: 10Ebernhardson)
[19:18:11] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus updater: Introduce backfill releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003502 (https://phabricator.wikimedia.org/T356303) (owner: 10Ebernhardson)
[19:19:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 100%: T355864 - Post migration repool of db2153', diff saved to https://phabricator.wikimedia.org/P56799 and previous config saved to /var/cache/conftool/dbconfig/20240214-191941-arnaudb.json
[19:19:47] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[19:19:47] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 25%: T355864 - Post migration repool of db2154', diff saved to https://phabricator.wikimedia.org/P56800 and previous config saved to /var/cache/conftool/dbconfig/20240214-191946-arnaudb.json
[19:23:00] <wikibugs>	 (PS1) Eevans: Bring restbase & aqs targets up to current [software/logstash-logback-encoder] - https://gerrit.wikimedia.org/r/1003509 (https://phabricator.wikimedia.org/T353550)
[19:23:47] <wikibugs>	 (CR) Eevans: [C: +2] Bring restbase & aqs targets up to current [software/logstash-logback-encoder] - https://gerrit.wikimedia.org/r/1003509 (https://phabricator.wikimedia.org/T353550) (owner: Eevans)
[19:23:50] <wikibugs>	 (CR) Eevans: [V: +2 C: +2] Bring restbase & aqs targets up to current [software/logstash-logback-encoder] - https://gerrit.wikimedia.org/r/1003509 (https://phabricator.wikimedia.org/T353550) (owner: Eevans)
[19:24:49] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on puppetserver2003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[19:26:06] <wikibugs>	 SRE, procurement, SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (andrea.denisse)
[19:26:24] <wikibugs>	 SRE, SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (andrea.denisse)
[19:26:37] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.18  refs T354436
[19:26:42] <stashbot>	 T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436
[19:27:31] <logmsgbot>	 !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@5c2dd00]: Deploying to updated target list — T353550
[19:27:37] <stashbot>	 T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550
[19:28:12] <logmsgbot>	 !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@5c2dd00]: Deploying to updated target list — T353550 (duration: 00m 41s)
[19:30:57] <logmsgbot>	 !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@5c2dd00]: Deploying to updated target list — T353550
[19:31:17] <logmsgbot>	 !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@5c2dd00]: Deploying to updated target list — T353550 (duration: 00m 20s)
[19:32:50] <wikibugs>	 SRE, Fundraising-Backlog, Wikimedia-Apache-configuration, serviceops: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (RLazarus)
[19:33:03] <wikibugs>	 (PS1) Cathal Mooney: Remove BFD from routed ganeti peerings on router side [homer/public] - https://gerrit.wikimedia.org/r/1003511 (https://phabricator.wikimedia.org/T300152)
[19:33:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) sync-puppet-ca.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:33:43] <wikibugs>	 (PS1) Eevans: Fix canary name typo [software/logstash-logback-encoder] - https://gerrit.wikimedia.org/r/1003512 (https://phabricator.wikimedia.org/T353550)
[19:34:01] <wikibugs>	 (CR) Eevans: [V: +2 C: +2] Fix canary name typo [software/logstash-logback-encoder] - https://gerrit.wikimedia.org/r/1003512 (https://phabricator.wikimedia.org/T353550) (owner: Eevans)
[19:34:13] <logmsgbot>	 !log brennen@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.18  refs T354436 (duration: 07m 35s)
[19:34:17] <stashbot>	 T354436: 1.42.0-wmf.18 deployment blockers - https://phabricator.wikimedia.org/T354436
[19:34:52] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 50%: T355864 - Post migration repool of db2154', diff saved to https://phabricator.wikimedia.org/P56801 and previous config saved to /var/cache/conftool/dbconfig/20240214-193451-arnaudb.json
[19:34:55] <jinxer-wm>	 (SystemdUnitFailed) firing: ferm.service on mw2282:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:34:57] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[19:35:38] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host restbase1036.mgmt.eqiad.wmnet with reboot policy FORCED
[19:35:47] <wikibugs>	 SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (Jclark-ctr) @Eevans sorry about missing that. kicking off image now
[19:35:53] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2380.codfw.wmnet with OS bullseye
[19:35:56] <logmsgbot>	 !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550
[19:36:05] <stashbot>	 T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550
[19:36:13] <logmsgbot>	 !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 16s)
[19:36:27] <wikibugs>	 SRE, ops-codfw, serviceops: Multiple hosts in codfw fail to PXE boot upon reimage - https://phabricator.wikimedia.org/T357539 (cmooney) Open→Resolved a:cmooney Yeah the issue here was the hosts being connected to the new switches, but still configured for the legacy vlan.  That's fine, bu...
[19:37:21] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2383.codfw.wmnet with OS bullseye
[19:38:06] <logmsgbot>	 !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550
[19:39:23] <logmsgbot>	 !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 01m 17s)
[19:41:23] <logmsgbot>	 !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550
[19:41:31] <stashbot>	 T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550
[19:42:08] <logmsgbot>	 !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 45s)
[19:42:42] <wikibugs>	 (PS2) JHathaway: exim: Avoid considering wikimedia domains as local in WMCS [puppet] - https://gerrit.wikimedia.org/r/1003498 (https://phabricator.wikimedia.org/T343925) (owner: Ladsgroup)
[19:42:48] <wikibugs>	 (CR) JHathaway: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1003498 (https://phabricator.wikimedia.org/T343925) (owner: Ladsgroup)
[19:43:09] <wikibugs>	 (PS1) Andrea Denisse: alert: Failover Icinga and Alertmanager to alert2001 [puppet] - https://gerrit.wikimedia.org/r/1003513 (https://phabricator.wikimedia.org/T333615)
[19:43:14] <logmsgbot>	 !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550
[19:43:29] <logmsgbot>	 !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 14s)
[19:43:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) sync-puppet-ca.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:43:41] <wikibugs>	 (PS1) Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - https://gerrit.wikimedia.org/r/1003514
[19:43:46] <wikibugs>	 (PS1) Dwisehaupt: Add wikihole redirect for donatewiki [puppet] - https://gerrit.wikimedia.org/r/1003515 (https://phabricator.wikimedia.org/T357436)
[19:44:52] <wikibugs>	 (CR) CI reject: [V: -1] cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - https://gerrit.wikimedia.org/r/1003514 (owner: Ebernhardson)
[19:46:08] <logmsgbot>	 !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550
[19:46:42] <logmsgbot>	 !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 34s)
[19:46:46] <stashbot>	 T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550
[19:47:31] <wikibugs>	 (CR) Bking: [C: +2] cloudelastic: Begin private IP migration for cloudelastic1007 [puppet] - https://gerrit.wikimedia.org/r/999088 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[19:48:06] <wikibugs>	 (CR) Bking: [C: +2] "Excellent, thanks for the review." [puppet] - https://gerrit.wikimedia.org/r/999088 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[19:49:05] <wikibugs>	 (PS1) Andrea Denisse: alert: Resolve alerts DNS queries to alert2001 [dns] - https://gerrit.wikimedia.org/r/1003516 (https://phabricator.wikimedia.org/T333615)
[19:49:11] <wikibugs>	 (CR) JHathaway: [C: +1] exim: Avoid considering wikimedia domains as local in WMCS [puppet] - https://gerrit.wikimedia.org/r/1003498 (https://phabricator.wikimedia.org/T343925) (owner: Ladsgroup)
[19:49:21] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good" [puppet] - https://gerrit.wikimedia.org/r/1003498 (https://phabricator.wikimedia.org/T343925) (owner: Ladsgroup)
[19:49:33] <wikibugs>	 (PS2) Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - https://gerrit.wikimedia.org/r/1003514
[19:49:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 75%: T355864 - Post migration repool of db2154', diff saved to https://phabricator.wikimedia.org/P56802 and previous config saved to /var/cache/conftool/dbconfig/20240214-194956-arnaudb.json
[19:50:04] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[19:50:19] <wikibugs>	 (CR) CI reject: [V: -1] cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - https://gerrit.wikimedia.org/r/1003514 (owner: Ebernhardson)
[19:50:24] <logmsgbot>	 !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550
[19:50:29] <logmsgbot>	 !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 05s)
[19:50:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cloudelastic1007.wikimedia.org
[19:50:54] <logmsgbot>	 !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550
[19:50:58] <logmsgbot>	 !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 04s)
[19:51:01] <wikibugs>	 SRE, Patch-For-Review, SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (andrea.denisse)
[19:51:07] <logmsgbot>	 !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550
[19:51:10] <logmsgbot>	 !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 03s)
[19:51:41] <wikibugs>	 (PS3) Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - https://gerrit.wikimedia.org/r/1003514
[19:52:31] <wikibugs>	 (CR) CI reject: [V: -1] cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - https://gerrit.wikimedia.org/r/1003514 (owner: Ebernhardson)
[19:52:56] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2282 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[19:53:05] <logmsgbot>	 !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550
[19:53:10] <stashbot>	 T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550
[19:53:13] <logmsgbot>	 !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 07s)
[19:53:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) sync-puppet-ca.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:53:46] <logmsgbot>	 !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550
[19:53:52] <logmsgbot>	 !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 06s)
[19:53:59] <logmsgbot>	 !log eevans@deploy2002 Started deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550
[19:54:05] <logmsgbot>	 !log eevans@deploy2002 Finished deploy [cassandra/logstash-logback-encoder@0521449]: Deploying to updated target list — T353550 (duration: 00m 05s)
[19:57:12] <wikibugs>	 SRE, Patch-For-Review, SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (andrea.denisse)
[19:58:33] <wikibugs>	 (PS2) Jdlrobson: New communities will not share scripts going forward [mediawiki-config] - https://gerrit.wikimedia.org/r/1003505 (https://phabricator.wikimedia.org/T331679)
[19:58:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) puppetserver.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:58:41] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[19:58:55] <wikibugs>	 (PS3) Jdlrobson: New communities will not share scripts going forward [mediawiki-config] - https://gerrit.wikimedia.org/r/1003505 (https://phabricator.wikimedia.org/T331679)
[19:59:13] <wikibugs>	 (PS4) Jdlrobson: New communities will not share scripts going forward [mediawiki-config] - https://gerrit.wikimedia.org/r/1003505 (https://phabricator.wikimedia.org/T331679)
[19:59:14] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching P{P:cassandra%rack = "a"} and A:restbase and A:eqiad: Restart to pickup logging jars — T353550 - eevans@cumin1002
[19:59:19] <stashbot>	 T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550
[20:01:39] <wikibugs>	 (CR) RLazarus: "Thanks Dallas! Adding Scott from my team to review and merge." [puppet] - https://gerrit.wikimedia.org/r/1003515 (https://phabricator.wikimedia.org/T357436) (owner: Dwisehaupt)
[20:02:08] <wikibugs>	 (PS4) Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - https://gerrit.wikimedia.org/r/1003514
[20:02:52] <wikibugs>	 (CR) CI reject: [V: -1] cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - https://gerrit.wikimedia.org/r/1003514 (owner: Ebernhardson)
[20:03:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) puppetserver.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:48] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1007.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
[20:04:56] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudelastic1007.wikimedia.org decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
[20:04:57] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:04:58] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudelastic1007.wikimedia.org
[20:05:02] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2154 (re)pooling @ 100%: T355864 - Post migration repool of db2154', diff saved to https://phabricator.wikimedia.org/P56803 and previous config saved to /var/cache/conftool/dbconfig/20240214-200501-arnaudb.json
[20:05:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 25%: T355864 - Post migration repool of db2175', diff saved to https://phabricator.wikimedia.org/P56804 and previous config saved to /var/cache/conftool/dbconfig/20240214-200507-arnaudb.json
[20:06:11] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[20:07:17] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr) @dcaro   here is an update from dell  I found a couple of online articles: What Does Uncorrectable Se...
[20:07:28] <wikibugs>	 (PS5) Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - https://gerrit.wikimedia.org/r/1003514
[20:08:15] <wikibugs>	 (CR) CI reject: [V: -1] cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - https://gerrit.wikimedia.org/r/1003514 (owner: Ebernhardson)
[20:09:38] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, DC-Ops, cloud-services-team (FY2023/2024-Q3-Q4): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (Jclark-ctr) https://www.minitool.com/lib/uncorrectable-sector-count.html https://community.wd.com/t/how-to-interp...
[20:09:56] <wikibugs>	 (PS6) Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - https://gerrit.wikimedia.org/r/1003514
[20:12:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[20:12:22] <wikibugs>	 (CR) Scott French: [C: -1] "Thanks, Dallas! Two quick comments." [puppet] - https://gerrit.wikimedia.org/r/1003515 (https://phabricator.wikimedia.org/T357436) (owner: Dwisehaupt)
[20:12:39] <wikibugs>	 (PS7) BCornwall: fifo-log-demux: Decouple service from nginx/ats [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905)
[20:13:34] <jinxer-wm>	 (SystemdUnitFailed) resolved: (3) puppetserver.service on puppetserver2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:13:37] <wikibugs>	 (CR) BCornwall: fifo-log-demux: Decouple service from nginx/ats (1 comment) [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: BCornwall)
[20:14:43] <wikibugs>	 (PS7) Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - https://gerrit.wikimedia.org/r/1003514
[20:16:04] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1007 to private IPs - bking@cumin2002"
[20:16:57] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: migrate cloudelastic1007 to private IPs - bking@cumin2002"
[20:16:57] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:17:37] <wikibugs>	 (PS2) Dwisehaupt: Add wikihole redirect for donatewiki [puppet] - https://gerrit.wikimedia.org/r/1003515 (https://phabricator.wikimedia.org/T357436)
[20:18:12] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:20:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 50%: T355864 - Post migration repool of db2175', diff saved to https://phabricator.wikimedia.org/P56805 and previous config saved to /var/cache/conftool/dbconfig/20240214-202012-arnaudb.json
[20:20:14] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[20:20:21] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1007
[20:20:27] <wikibugs>	 (CR) Dwisehaupt: "Updated to resolve the issues." [puppet] - https://gerrit.wikimedia.org/r/1003515 (https://phabricator.wikimedia.org/T357436) (owner: Dwisehaupt)
[20:21:39] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1007
[20:22:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cloudelastic1007.eqiad.wmnet with OS bullseye
[20:22:36] <icinga-wm_>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[20:22:59] <wikibugs>	 (PS8) Ebernhardson: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - https://gerrit.wikimedia.org/r/1003514
[20:23:12] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[20:27:36] <icinga-wm_>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[20:28:12] <wikibugs>	 (CR) Ebernhardson: [C: +2] cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - https://gerrit.wikimedia.org/r/1003514 (owner: Ebernhardson)
[20:29:02] <wikibugs>	 (Merged) jenkins-bot: cirrus updater: Use same options for -backfill as normal releases [deployment-charts] - https://gerrit.wikimedia.org/r/1003514 (owner: Ebernhardson)
[20:31:34] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[20:31:43] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:34:41] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host restbase1036.mgmt.eqiad.wmnet with reboot policy FORCED
[20:35:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 75%: T355864 - Post migration repool of db2175', diff saved to https://phabricator.wikimedia.org/P56806 and previous config saved to /var/cache/conftool/dbconfig/20240214-203517-arnaudb.json
[20:35:19] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[20:36:13] <wikibugs>	 SRE, Patch-For-Review, SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (andrea.denisse)
[20:36:48] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudelastic1007.eqiad.wmnet with reason: host reimage
[20:36:51] <wikibugs>	 (PS1) BCornwall: ncmonitor: Remove useless apt-get require [puppet] - https://gerrit.wikimedia.org/r/1003524
[20:36:55] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[20:37:05] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:38:06] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1003439 (https://phabricator.wikimedia.org/T356788) (owner: Filippo Giunchedi)
[20:39:28] <wikibugs>	 (PS1) Scott French: httpbb: add donate.wikimedia.org redirect tests [puppet] - https://gerrit.wikimedia.org/r/1003525 (https://phabricator.wikimedia.org/T357436)
[20:39:30] <jinxer-wm>	 (KubernetesCalicoDown) firing: mw2379.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2379.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[20:39:46] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudelastic1007.eqiad.wmnet with reason: host reimage
[20:41:33] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host restbase1036.eqiad.wmnet with OS bullseye
[20:41:38] <wikibugs>	 SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host restbase1036.eqiad.wmnet with OS bullseye
[20:42:24] <wikibugs>	 (PS6) Bking: cloudelastic: Complete cloudelastic1007's migration [puppet] - https://gerrit.wikimedia.org/r/999091 (https://phabricator.wikimedia.org/T355617)
[20:42:30] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/999091 (https://phabricator.wikimedia.org/T355617) (owner: Bking)
[20:44:27] <wikibugs>	 (PS1) Eevans: cassandra: install git-fat to satisfy scap requirement [puppet] - https://gerrit.wikimedia.org/r/1003526 (https://phabricator.wikimedia.org/T353550)
[20:46:28] <wikibugs>	 (CR) Hashar: python-build: ensure frozen-requirements is exhaustive (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/941443 (https://phabricator.wikimedia.org/T342346) (owner: Hashar)
[20:46:49] <wikibugs>	 (PS1) Andrea Denisse: alert: Ensure the alert2001 host is reimaged with Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T333615)
[20:47:02] <icinga-wm_>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:48:34] <jinxer-wm>	 (ProbeDown) firing: (2) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:49:38] <wikibugs>	 (PS2) Andrea Denisse: alert: Ensure the alert2001 host is reimaged with Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T333615)
[20:50:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2175 (re)pooling @ 100%: T355864 - Post migration repool of db2175', diff saved to https://phabricator.wikimedia.org/P56807 and previous config saved to /var/cache/conftool/dbconfig/20240214-205021-arnaudb.json
[20:50:27] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[20:50:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 25%: T355864 - Post migration repool of db2176', diff saved to https://phabricator.wikimedia.org/P56808 and previous config saved to /var/cache/conftool/dbconfig/20240214-205027-arnaudb.json
[20:51:09] <inflatador>	 !log bking@puppetmaster1001 manually updating facts data for PCC T355617
[20:51:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:13] <stashbot>	 T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[20:51:38] <wikibugs>	 SRE, ops-eqiad, Cloud-VPS, cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408 (Andrew) It's back in service but only as of today.
[20:52:26] <wikibugs>	 (PS1) Andrea Denisse: alert: Ensure the alert1001 host is reimaged with Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003531 (https://phabricator.wikimedia.org/T333615)
[20:52:28] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching P{P:cassandra%rack = "a"} and A:restbase and A:eqiad: Restart to pickup logging jars — T353550 - eevans@cumin1002
[20:52:33] <stashbot>	 T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550
[20:53:34] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:54:49] <wikibugs>	 (PS3) Andrea Denisse: alert: Ensure the alert2001 host is reimaged with Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1003527 (https://phabricator.wikimedia.org/T333615)
[20:55:14] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:55:32] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:56:06] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.286 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:56:22] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51452 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[20:56:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002"
[20:56:29] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on restbase1036.eqiad.wmnet with reason: host reimage
[20:56:36] <wikibugs>	 SRE, Patch-For-Review, SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (andrea.denisse)
[20:56:53] <inflatador>	 !log bking@pcc-db1001.puppet-diffs.eqiad1.wikimedia.cloud updating puppet facts for PCC
[20:56:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:57:03] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching P{P:cassandra%rack = "b"} and A:restbase and A:eqiad: Restart to pickup logging jars — T353550 - eevans@cumin1002
[20:57:21] <wikibugs>	 (03CR) 10Scott French: "Current plan:" [puppet] - 10https://gerrit.wikimedia.org/r/1003525 (https://phabricator.wikimedia.org/T357436) (owner: 10Scott French)
[20:57:40] <wikibugs>	 10SRE, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 (10andrea.denisse)
[20:58:22] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] ncmonitor: Remove useless apt-get require [puppet] - 10https://gerrit.wikimedia.org/r/1003524 (owner: 10BCornwall)
[20:59:23] <logmsgbot>	 !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on restbase1036.eqiad.wmnet with reason: host reimage
[21:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T2100).
[21:00:05] <jouncebot>	 Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:31] <wikibugs>	 (03PS1) 10Andrea Denisse: Revert "grafana: Ensure the grafana2001 hosts uses Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/1003469
[21:01:05] <wikibugs>	 (03CR) 10Herron: [C: 03+1] thanos: run Thanos components in a systemd slice [puppet] - 10https://gerrit.wikimedia.org/r/1003439 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi)
[21:01:24] <Jdlrobson>	 o/ ptrdrny
[21:01:28] <Jdlrobson>	 o/ present
[21:02:37] <cjming>	 hi Jdlrobson: i can deploy for you
[21:02:43] <cjming>	 1 sec
[21:03:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003505 (https://phabricator.wikimedia.org/T331679) (owner: 10Jdlrobson)
[21:04:16] <cjming>	 not sure if there's something i need to do when a new dblist is added
[21:04:35] <wikibugs>	 (03Merged) 10jenkins-bot: New communities will not share scripts going forward [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003505 (https://phabricator.wikimedia.org/T331679) (owner: 10Jdlrobson)
[21:05:02] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:1003505|New communities will not share scripts going forward (T331679)]]
[21:05:08] <stashbot>	 T331679: Communities can disable sharing of site/user scripts between Vector and Vector 2022 skins - https://phabricator.wikimedia.org/T331679
[21:05:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 50%: T355864 - Post migration repool of db2176', diff saved to https://phabricator.wikimedia.org/P56810 and previous config saved to /var/cache/conftool/dbconfig/20240214-210531-arnaudb.json
[21:05:46] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[21:08:00] <logmsgbot>	 !log cjming@deploy2002 cjming and jdlrobson: Backport for [[gerrit:1003505|New communities will not share scripts going forward (T331679)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:08:05] <cjming>	 Jdlrobson: are you able to test?
[21:09:16] <wikibugs>	 (03PS1) 10Bking: cloudelastic: remove unneeded master eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1003535 (https://phabricator.wikimedia.org/T355617)
[21:09:56] <wikibugs>	 (03PS2) 10Bking: cloudelastic: remove unneeded master eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1003535 (https://phabricator.wikimedia.org/T355617)
[21:10:00] <Jdlrobson>	 yep
[21:11:38] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/999091 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[21:12:24] <Jdlrobson>	 need a bit more time on this one
[21:12:34] <cjming>	 np - take your time
[21:13:23] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1003535 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[21:14:14] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T352010)', diff saved to https://phabricator.wikimedia.org/P56811 and previous config saved to /var/cache/conftool/dbconfig/20240214-211413-ladsgroup.json
[21:14:28] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[21:14:33] <jinxer-wm>	 (KubernetesCalicoDown) firing: (3) mw2379.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[21:15:41] <Jdlrobson>	 still looking...
[21:18:43] <Jdlrobson>	 cjming: something is misbehaving but I'm not sure why. It's set up correctly
[21:18:54] <Jdlrobson>	 I am wondering if I forgot to register the dblist somewhere.
[21:20:18] <wikibugs>	 (03PS1) 10Jdlrobson: Register dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003542
[21:20:20] <Jdlrobson>	 cjming: i did it again..^
[21:20:27] <Jdlrobson>	 I need this one as well for the patch to work
[21:20:38] <Jdlrobson>	 (CI should really be detecting this)
[21:20:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 75%: T355864 - Post migration repool of db2176', diff saved to https://phabricator.wikimedia.org/P56812 and previous config saved to /var/cache/conftool/dbconfig/20240214-212038-arnaudb.json
[21:20:47] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[21:21:06] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Register dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003542 (owner: 10Jdlrobson)
[21:21:35] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/999091 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[21:22:03] <cjming>	 Jdlrobson: ok - should we revert and you can roll a new patch with the update?  or merge and rebase the new one?
[21:22:49] <Jdlrobson>	 What would be your preference?
[21:22:49] <cjming>	 er ... rebase, then merge your follow up patch?
[21:22:53] <Jdlrobson>	 yep
[21:23:20] <cjming>	 i'm fine with syncing if you don't think it'll cause an issue and i can backport the follow up one right away
[21:23:32] <Jdlrobson>	 hmm CI is being funky
[21:24:03] <wikibugs>	 (03PS2) 10Jdlrobson: Register dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003542
[21:24:11] <Jdlrobson>	 ^ cjming ok that should do it
[21:24:39] <cjming>	 so should i sync the current one?
[21:25:22] <Jdlrobson>	 cjming: no we want to wait
[21:25:26] <Jdlrobson>	 they need to go out together
[21:25:57] <cjming>	 ok - i'm not going to sync, and i'll scap backport them together
[21:26:08] <logmsgbot>	 !log cjming@deploy2002 Sync cancelled.
[21:26:08] <Jdlrobson>	 thanks!
[21:26:40] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003542 (owner: 10Jdlrobson)
[21:27:52] <wikibugs>	 (03Merged) 10jenkins-bot: Register dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003542 (owner: 10Jdlrobson)
[21:28:14] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:1003505|New communities will not share scripts going forward (T331679)]], [[gerrit:1003542|Register dblist]]
[21:28:18] <stashbot>	 T331679: Communities can disable sharing of site/user scripts between Vector and Vector 2022 skins - https://phabricator.wikimedia.org/T331679
[21:29:21] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P56813 and previous config saved to /var/cache/conftool/dbconfig/20240214-212920-ladsgroup.json
[21:29:41] <logmsgbot>	 !log cjming@deploy2002 cjming and jdlrobson: Backport for [[gerrit:1003505|New communities will not share scripts going forward (T331679)]], [[gerrit:1003542|Register dblist]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:30:06] <cjming>	 Jdlrobson: wanna try retesting?
[21:30:23] <Jdlrobson>	 cjming: yes pleaze
[21:30:56] <Jdlrobson>	 cjming: hurrah! please sync!
[21:30:59] <Jdlrobson>	 now it's working :)
[21:31:02] <cjming>	 yay!
[21:31:07] <logmsgbot>	 !log cjming@deploy2002 cjming and jdlrobson: Continuing with sync
[21:34:03] <wikibugs>	 (03PS2) 10C. Scott Ananian: Turn on Parsoid read views by default on wikitech Talk pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999061 (https://phabricator.wikimedia.org/T355374)
[21:35:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 100%: T355864 - Post migration repool of db2176', diff saved to https://phabricator.wikimedia.org/P56814 and previous config saved to /var/cache/conftool/dbconfig/20240214-213544-arnaudb.json
[21:35:54] <stashbot>	 T355864: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864
[21:36:41] <logmsgbot>	 !log eevans@cumin1002 END (FAIL) - Cookbook sre.cassandra.roll-restart (exit_code=99) for nodes matching P{P:cassandra%rack = "b"} and A:restbase and A:eqiad: Restart to pickup logging jars — T353550 - eevans@cumin1002
[21:36:48] <stashbot>	 T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550
[21:37:02] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+1] Turn on Parsoid read views by default on wikitech Talk pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999061 (https://phabricator.wikimedia.org/T355374) (owner: 10C. Scott Ananian)
[21:37:30] <cscott>	 who is doing the backport today?  i'm late to the party, but could i get a config change in?
[21:37:50] <cjming>	 hi cscott -- sure what's the patch number?
[21:38:21] <logmsgbot>	 !log cjming@deploy2002 Finished scap: Backport for [[gerrit:1003505|New communities will not share scripts going forward (T331679)]], [[gerrit:1003542|Register dblist]] (duration: 10m 06s)
[21:38:26] <stashbot>	 T331679: Communities can disable sharing of site/user scripts between Vector and Vector 2022 skins - https://phabricator.wikimedia.org/T331679
[21:38:26] <cjming>	 Jdlrobson: should be live!
[21:39:02] <cscott>	 cjming: https://gerrit.wikimedia.org/r/999061 i just added it to the calendar
[21:39:07] <Jdlrobson>	 cjming: thanks a bunch!
[21:39:12] <Jdlrobson>	 And thanks for the admin, was just about to do that
[21:40:05] <cjming>	 Jdlrobson: yw! glad it worked out
[21:40:25] <cjming>	 cscott: good timing - i'll do it now
[21:40:32] <cscott>	 there's no canary for wikitech (i learned during last week's backport) so i can't do much during the canary phase other than check that the other sites haven't been affected; it needs to deploy fully before i can check that the config change was effective on wikitech.
[21:40:46] <cscott>	 but i'll get set up to do those checks
[21:40:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999061 (https://phabricator.wikimedia.org/T355374) (owner: 10C. Scott Ananian)
[21:41:21] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching restbase1032.eqiad.wmnet: Restart to pickup logging jars — T353550 - eevans@cumin1002
[21:41:31] <wikibugs>	 (03Merged) 10jenkins-bot: Turn on Parsoid read views by default on wikitech Talk pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999061 (https://phabricator.wikimedia.org/T355374) (owner: 10C. Scott Ananian)
[21:41:31] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[21:41:40] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:41:48] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[21:41:54] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:999061|Turn on Parsoid read views by default on wikitech Talk pages (T355374)]]
[21:41:54] <cjming>	 cscott: sounds good
[21:41:58] <stashbot>	 T355374: Use Parsoid for DiscussionTools on wikitech - https://phabricator.wikimedia.org/T355374
[21:41:59] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:43:13] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003547
[21:43:23] <bd808>	 cscott: re wikitech pre-sync testing, T237773 is the blocker (or at least *a* blocker). I have some hope that will be resolved via T292707 before the heat death of the universe.
[21:43:23] <stashbot>	 T237773: Move Wikitech onto the production MW cluster - https://phabricator.wikimedia.org/T237773
[21:43:24] <stashbot>	 T292707: Migrate Wikitech to Kubernetes - https://phabricator.wikimedia.org/T292707
[21:43:27] <logmsgbot>	 !log cjming@deploy2002 cscott and cjming: Backport for [[gerrit:999061|Turn on Parsoid read views by default on wikitech Talk pages (T355374)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:43:41] <cjming>	 cscott: shall i sync?
[21:43:55] * subbu is excited about the milestone
[21:44:15] <cscott>	 cjming i'm just going to sanity check that the configuration on enwiki canary hasn't changed hang on
[21:44:25] <cjming>	 sure thing
[21:44:28] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P56815 and previous config saved to /var/cache/conftool/dbconfig/20240214-214427-ladsgroup.json
[21:45:25] <cscott>	 cjming ok, confirmed that i haven't broken enwiki at least, go ahead with the sync
[21:45:37] <cjming>	 alrighty
[21:45:42] <logmsgbot>	 !log cjming@deploy2002 cscott and cjming: Continuing with sync
[21:49:45] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003547 (owner: 10Ebernhardson)
[21:50:50] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: Update container image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003547 (owner: 10Ebernhardson)
[21:51:17] <cscott>	 whoops, my connection dropped.  cjming, ping me when sync is done?
[21:51:38] <cjming>	 cscott: sure thing - almost there
[21:51:59] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching restbase1032.eqiad.wmnet: Restart to pickup logging jars — T353550 - eevans@cumin1002
[21:52:04] <stashbot>	 T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550
[21:52:38] <logmsgbot>	 !log cjming@deploy2002 Finished scap: Backport for [[gerrit:999061|Turn on Parsoid read views by default on wikitech Talk pages (T355374)]] (duration: 10m 44s)
[21:52:43] <stashbot>	 T355374: Use Parsoid for DiscussionTools on wikitech - https://phabricator.wikimedia.org/T355374
[21:53:29] <cjming>	 cscott: should be live!
[21:53:30] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[21:53:39] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:54:08] <Kizule>	 Hi, do we have 5 minutes or so to deploy one config patch?
[21:54:19] <cjming>	 lol - sure
[21:54:26] <Kizule>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/997274
[21:54:45] <wikibugs>	 (03PS5) 10Zoranzoki21: throttle.php: Add throttle rule for editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997274 (https://phabricator.wikimedia.org/T356654)
[21:55:56] <wikibugs>	 (03CR) 10Scott French: [C: 03+1] "Thanks, Dallas. This looks good to me. I'll +2 and merge tomorrow during the deployment window." [puppet] - 10https://gerrit.wikimedia.org/r/1003515 (https://phabricator.wikimedia.org/T357436) (owner: 10Dwisehaupt)
[21:56:32] <cjming>	 Kizule: will do yours and then close windo
[21:56:35] <cjming>	 *window
[21:56:47] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[21:56:48] <Kizule>	 cjming: Sounds good, because that one doesn't need mwdebug.
[21:56:56] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:57:10] <wikibugs>	 (03CR) 10RLazarus: [C: 03+1] httpbb: add donate.wikimedia.org redirect tests [puppet] - 10https://gerrit.wikimedia.org/r/1003525 (https://phabricator.wikimedia.org/T357436) (owner: 10Scott French)
[21:57:25] <cjming>	 Kizule: so i'll just sync when it's ready
[21:57:28] <wikibugs>	 (03PS7) 10Ryan Kemper: cloudelastic: Complete cloudelastic1007's migration [puppet] - 10https://gerrit.wikimedia.org/r/999091 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[21:57:30] <wikibugs>	 (03PS6) 10Zoranzoki21: throttle.php: Add throttle rule for editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997274 (https://phabricator.wikimedia.org/T356654)
[21:57:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997274 (https://phabricator.wikimedia.org/T356654) (owner: 10Zoranzoki21)
[21:57:37] <Kizule>	 cjming: Okay
[21:58:21] <wikibugs>	 (03Merged) 10jenkins-bot: throttle.php: Add throttle rule for editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/997274 (https://phabricator.wikimedia.org/T356654) (owner: 10Zoranzoki21)
[21:58:47] <logmsgbot>	 !log cjming@deploy2002 Started scap: Backport for [[gerrit:997274|throttle.php: Add throttle rule for editathon (T356654)]]
[21:58:52] <stashbot>	 T356654: Request to remove account creation limit during edit-a-thon at WIT - https://phabricator.wikimedia.org/T356654
[21:59:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T352010)', diff saved to https://phabricator.wikimedia.org/P56816 and previous config saved to /var/cache/conftool/dbconfig/20240214-215934-ladsgroup.json
[21:59:36] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[21:59:40] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[21:59:50] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[22:00:04] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240214T2200)
[22:00:17] <logmsgbot>	 !log cjming@deploy2002 zoranzoki21 and cjming: Backport for [[gerrit:997274|throttle.php: Add throttle rule for editathon (T356654)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:00:20] <logmsgbot>	 !log cjming@deploy2002 zoranzoki21 and cjming: Continuing with sync
[22:00:40] <Kizule>	 cjming: I've added it to a calendar, so we can keep up with the procedure
[22:01:04] <wikibugs>	 (03CR) 10Bking: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/999091 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[22:02:15] <cjming>	 Kizule: thanks - should be live here shortly
[22:04:33] <wikibugs>	 (03CR) 10Bking: [C: 03+2] cloudelastic: Complete cloudelastic1007's migration [puppet] - 10https://gerrit.wikimedia.org/r/999091 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[22:07:18] <logmsgbot>	 !log cjming@deploy2002 Finished scap: Backport for [[gerrit:997274|throttle.php: Add throttle rule for editathon (T356654)]] (duration: 08m 31s)
[22:07:24] <stashbot>	 T356654: Request to remove account creation limit during edit-a-thon at WIT - https://phabricator.wikimedia.org/T356654
[22:07:38] <cjming>	 Kizule: and it's live
[22:07:55] <Kizule>	 Thanks cjming, I guess you can close the window now. :)
[22:08:10] <cjming>	 ya :)
[22:08:15] <cjming>	 !log end of UTC late backport window
[22:08:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:08:33] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:09:11] <cscott>	 could someone remind me of the difference between labswiki and wikitech as a key in InitialiseSettings.php ?
[22:10:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - bking@cumin2002"
[22:10:10] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudelastic1007.eqiad.wmnet with OS bullseye
[22:13:16] <urandom>	 !log restarting Cassandra: restbase/codfw, row b —  T353550 
[22:13:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:23] <stashbot>	 T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550
[22:13:29] <wikibugs>	 (03PS1) 10C. Scott Ananian: Correctly turn on Parsoid read views by default on wikitech Talk pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003551
[22:15:39] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+1] Correctly turn on Parsoid read views by default on wikitech Talk pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1003551 (owner: 10C. Scott Ananian)
[22:15:43] <icinga-wm_>	 PROBLEM - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is CRITICAL: connect to address 10.192.16.153 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[22:16:43] <icinga-wm_>	 RECOVERY - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is OK: TCP OK - 0.030 second response time on 10.192.16.153 port 9042 https://phabricator.wikimedia.org/T93886
[22:17:12] <wikibugs>	 (03PS2) 10C. Scott Ananian: Turn on Parsoid read views by default on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999062 (https://phabricator.wikimedia.org/T355566)
[22:18:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:19:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in cloudelastic
[22:19:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in cloudelastic
[22:20:11] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: cloudelastic1005*,cloudelastic1006* for IP migration - bking@cumin2002 - T355617
[22:20:14] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: cloudelastic1005*,cloudelastic1006* for IP migration - bking@cumin2002 - T355617
[22:20:15] <stashbot>	 T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[22:20:27] <icinga-wm_>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:22:46] <wikibugs>	 (03PS3) 10Bking: cloudelastic: remove unneeded master eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1003535 (https://phabricator.wikimedia.org/T355617)
[22:23:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[22:23:33] <jinxer-wm>	 (SystemdUnitFailed) firing: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:23:34] <jinxer-wm>	 (ProbeDown) resolved: (2) Service titan1001:443 has failed probes (http_thanos_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#titan1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:23:38] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job pint in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:26:55] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] cloudelastic: remove unneeded master eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1003535 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[22:27:28] <wikibugs>	 (03CR) 10Bking: [C: 03+2] cloudelastic: remove unneeded master eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1003535 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[22:33:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[22:34:03] <stashbot>	 T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[22:39:30] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching P{P:cassandra%rack = "c"} and A:restbase and A:codfw: Restart to pickup logging jars — T353550 - eevans@cumin1002
[22:39:35] <stashbot>	 T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550
[22:40:19] <wikibugs>	 (03PS1) 10Bking: cloudelastic: Begin private IP migration for cloudelastic1006 [puppet] - 10https://gerrit.wikimedia.org/r/1003557 (https://phabricator.wikimedia.org/T355617)
[22:41:39] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1003557 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[22:47:34] <wikibugs>	 (03PS1) 10Bking: cloudelastic: Complete cloudelastic1006's migration [puppet] - 10https://gerrit.wikimedia.org/r/1003558 (https://phabricator.wikimedia.org/T355617)
[22:48:03] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[22:48:07] <stashbot>	 T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[22:48:34] <jinxer-wm>	 (SystemdUnitFailed) resolved: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:49:25] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10; selector: name=cloudelastic1007.eqiad.wmnet
[22:49:37] <logmsgbot>	 !log bking@cumin2002 conftool action : set/weight=10; selector: name=cloudelastic1008.eqiad.wmnet
[22:50:05] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cloudelastic1007.eqiad.wmnet
[22:50:57] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cloudelastic1008.eqiad.wmnet
[22:51:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[22:54:55] <wikibugs>	 (03PS1) 10Bking: cloudelastic: Begin private IP migration for cloudelastic1005 [puppet] - 10https://gerrit.wikimedia.org/r/1003561 (https://phabricator.wikimedia.org/T355617)
[22:57:43] <wikibugs>	 (03PS1) 10Bking: cloudelastic: Complete cloudelastic1005's migration [puppet] - 10https://gerrit.wikimedia.org/r/1003563 (https://phabricator.wikimedia.org/T355617)
[22:58:52] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] cloudelastic: Complete cloudelastic1006's migration [puppet] - 10https://gerrit.wikimedia.org/r/1003558 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[22:59:05] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] cloudelastic: Begin private IP migration for cloudelastic1006 [puppet] - 10https://gerrit.wikimedia.org/r/1003557 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[22:59:21] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] cloudelastic: Begin private IP migration for cloudelastic1005 [puppet] - 10https://gerrit.wikimedia.org/r/1003561 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[22:59:30] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] cloudelastic: Complete cloudelastic1005's migration [puppet] - 10https://gerrit.wikimedia.org/r/1003563 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking)
[23:04:23] <wikibugs>	 10SRE, 10Fundraising-Backlog, 10Wikimedia-Apache-configuration, 10fundraising-tech-ops, and 2 others: Request donatewiki redirect - https://phabricator.wikimedia.org/T357436 (10Dwisehaupt)
[23:04:31] <wikibugs>	 (03CR) 10Scott French: "For completeness, the somewhat surprising redirect behavior in [0] is likely due to the `RewriteRule ^/wiki /w/index.php [L]` at [1] (comb" [puppet] - 10https://gerrit.wikimedia.org/r/1003525 (https://phabricator.wikimedia.org/T357436) (owner: 10Scott French)
[23:11:45] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T352010)', diff saved to https://phabricator.wikimedia.org/P56817 and previous config saved to /var/cache/conftool/dbconfig/20240214-231144-ladsgroup.json
[23:11:59] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[23:14:03] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new master settings - bking@cumin2002 - T355617
[23:14:08] <stashbot>	 T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[23:26:52] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P56818 and previous config saved to /var/cache/conftool/dbconfig/20240214-232651-ladsgroup.json
[23:32:27] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching P{P:cassandra%rack = "c"} and A:restbase and A:codfw: Restart to pickup logging jars — T353550 - eevans@cumin1002
[23:32:34] <stashbot>	 T353550: Cassandra (logstash) logging broken - https://phabricator.wikimedia.org/T353550
[23:34:55] <jinxer-wm>	 (SystemdUnitFailed) firing: ferm.service on mw2282:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:41:58] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P56819 and previous config saved to /var/cache/conftool/dbconfig/20240214-234157-ladsgroup.json
[23:57:04] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T352010)', diff saved to https://phabricator.wikimedia.org/P56820 and previous config saved to /var/cache/conftool/dbconfig/20240214-235703-ladsgroup.json
[23:57:06] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[23:57:13] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[23:57:20] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[23:57:26] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1243 (T352010)', diff saved to https://phabricator.wikimedia.org/P56821 and previous config saved to /var/cache/conftool/dbconfig/20240214-235725-ladsgroup.json