[00:00:18] <wikibugs>	 (PS1) Dzahn: Revert "stewards: migrate stewards1001 to puppet7" [puppet] - https://gerrit.wikimedia.org/r/973802
[00:00:29] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "stewards: migrate stewards1001 to puppet7" [puppet] - https://gerrit.wikimedia.org/r/973802 (owner: Dzahn)
[00:03:39] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.reimage for host stewards1001.eqiad.wmnet with OS bookworm
[00:03:54] <wikibugs>	 SRE, Infrastructure-Foundations, Stewards-and-global-tools, collaboration-services, vm-requests: VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet wi...
[00:04:05] <icinga-wm>	 PROBLEM - Check systemd state on puppetserver1003 is CRITICAL: CRITICAL - degraded: The following units failed: dump_cloud_ip_ranges.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:11:39] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on stewards1001.eqiad.wmnet with reason: host reimage
[00:14:20] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stewards1001.eqiad.wmnet with reason: host reimage
[00:16:09] <icinga-wm>	 RECOVERY - Disk space on centrallog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops
[00:21:00] <wikibugs>	 (Abandoned) BCornwall: Add Prometheus metrics for fifo-log-demux [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/967293 (https://phabricator.wikimedia.org/T345939) (owner: BCornwall)
[00:23:22] <wikibugs>	 (PS1) Dzahn: Revert "Revert "stewards: migrate stewards1001 to puppet7"" [puppet] - https://gerrit.wikimedia.org/r/973803
[00:27:39] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host stewards1001.eqiad.wmnet with OS bookworm
[00:27:52] <wikibugs>	 SRE, Infrastructure-Foundations, Stewards-and-global-tools, collaboration-services, vm-requests: VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1001 for host stewards1001.eqiad.wmnet with O...
[00:31:41] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-host for host stewards1001.eqiad.wmnet
[00:32:03] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1037 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:32:15] <wikibugs>	 (CR) Dzahn: [C: +2] Revert "Revert "stewards: migrate stewards1001 to puppet7"" [puppet] - https://gerrit.wikimedia.org/r/973803 (owner: Dzahn)
[00:33:56] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host stewards1001.eqiad.wmnet
[00:38:57] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/973413
[00:39:03] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/973413 (owner: TrainBranchBot)
[00:40:56] <wikibugs>	 (PS1) BCornwall: fifo-log-demux: Update project homepage [puppet] - https://gerrit.wikimedia.org/r/973887 (https://phabricator.wikimedia.org/T347623)
[00:42:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:43:43] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1038 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:58:56] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/973413 (owner: TrainBranchBot)
[01:02:50] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T351144 (phaultfinder)
[01:40:32] <wikibugs>	 (CR) Ryan Kemper: [C: +1] search-loader: remove references to search-loader[12]001 [puppet] - https://gerrit.wikimedia.org/r/973880 (https://phabricator.wikimedia.org/T351123) (owner: Bking)
[01:57:33] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1038 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[01:59:41] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1037 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:26:13] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1039.eqiad.wmnet with OS bookworm
[02:38:54] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:43:31] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1039.eqiad.wmnet with reason: host reimage
[02:46:25] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1039.eqiad.wmnet with reason: host reimage
[02:59:59] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1040.eqiad.wmnet with OS bookworm
[03:00:06] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T0300)
[03:07:38] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.42.0-wmf.5 [core] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/973414 (https://phabricator.wikimedia.org/T350081)
[03:07:44] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.42.0-wmf.5 [core] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/973414 (https://phabricator.wikimedia.org/T350081) (owner: TrainBranchBot)
[03:08:48] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1039.eqiad.wmnet with OS bookworm
[03:08:54] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:15] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm
[03:13:38] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage
[03:16:14] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: host reimage
[03:22:51] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.42.0-wmf.5 [core] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/973414 (https://phabricator.wikimedia.org/T350081) (owner: TrainBranchBot)
[03:22:58] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage
[03:25:55] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1041.eqiad.wmnet with reason: host reimage
[03:33:31] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1042.eqiad.wmnet with OS bookworm
[03:43:10] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1040.eqiad.wmnet with OS bookworm
[03:46:06] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1042.eqiad.wmnet with OS bookworm
[03:48:08] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1042.eqiad.wmnet with OS bookworm
[03:49:03] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1041.eqiad.wmnet with OS bookworm
[03:49:46] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bookworm
[03:53:54] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T0400)
[04:01:27] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: host reimage
[04:01:36] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.42.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/973892 (https://phabricator.wikimedia.org/T350081)
[04:01:38] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.42.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/973892 (https://phabricator.wikimedia.org/T350081) (owner: TrainBranchBot)
[04:02:32] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.42.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/973892 (https://phabricator.wikimedia.org/T350081) (owner: TrainBranchBot)
[04:02:57] <logmsgbot>	 !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.5  refs T350081
[04:03:01] <stashbot>	 T350081: 1.42.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T350081
[04:04:22] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1042.eqiad.wmnet with reason: host reimage
[04:22:23] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1043.eqiad.wmnet with OS bookworm
[04:22:52] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bookworm
[04:31:33] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1042.eqiad.wmnet with OS bookworm
[04:42:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:54:13] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.5  refs T350081 (duration: 51m 15s)
[04:54:17] <stashbot>	 T350081: 1.42.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T350081
[04:58:13] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1044.eqiad.wmnet with OS bookworm
[05:03:48] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1045.eqiad.wmnet with OS bookworm
[05:05:17] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1043.eqiad.wmnet with OS bookworm
[05:17:31] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: host reimage
[05:18:18] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1044.eqiad.wmnet with OS bookworm
[05:18:36] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1044.eqiad.wmnet with OS bookworm
[05:20:13] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1045.eqiad.wmnet with reason: host reimage
[05:42:48] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1045.eqiad.wmnet with OS bookworm
[05:45:31] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1044.eqiad.wmnet with OS bookworm
[06:09:59] <wikibugs>	 SRE-swift-storage, Move-Files-To-Commons, WMDE-TechWish-Maintenance, MW-1.42-notes (1.42.0-wmf.3; 2023-10-31), Wikimedia-production-error: FileBackendStore::ingestFreshFileStats: Could not stat file - https://phabricator.wikimedia.org/T348688 (Kizule) Open→Invalid Not happening anymor...
[06:13:43] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management: File not found: /v1/AUTH_mw/wikipedia-commons-local-public.7e/7/7e/EC02-0162-69_l_%2824374651802%29.jpg - https://phabricator.wikimedia.org/T348586 (Kizule) Open→Resolved Then let's close this in order to have less confusion. :)
[06:50:44] <wikibugs>	 (PS1) Muehlenhoff: Add dedicated insetup role for Buster [puppet] - https://gerrit.wikimedia.org/r/973896 (https://phabricator.wikimedia.org/T349619)
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T0700)
[07:00:04] <jouncebot>	 kormat, marostegui, and Amir1: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T0700). nyaa~
[07:00:54] <wikibugs>	 (CR) Marostegui: "Thanks, normally this requires a restart on sanitarium, but given it is on x1, we don't have to do it now, and it can be done whenever the" [puppet] - https://gerrit.wikimedia.org/r/973809 (https://phabricator.wikimedia.org/T350321) (owner: Dreamy Jazz)
[07:03:32] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2132,2160].codfw.wmnet,db[1119,1164,1217].eqiad.wmnet with reason: Switch
[07:03:49] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2132,2160].codfw.wmnet,db[1119,1164,1217].eqiad.wmnet with reason: Switch
[07:08:54] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:11:55] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1119 to m1 master [puppet] - https://gerrit.wikimedia.org/r/973351 (https://phabricator.wikimedia.org/T350022) (owner: Marostegui)
[07:27:57] <vgutierrez>	 !log include golang-github-mmatczuk-anyflag_0.0~git20231026.5f42d2f in apt.wm.org (bookworm)
[07:28:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:32] <wikibugs>	 (PS1) Slyngshede: Implement stricter permission checks [software/bitu] - https://gerrit.wikimedia.org/r/974070 (https://phabricator.wikimedia.org/T351143)
[07:39:34] <jynus>	 !log stop bacula dir (and puppet) at backup1001 T350022
[07:39:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:38] <stashbot>	 T350022: Switchover m1 master (db1164-> db1119) - https://phabricator.wikimedia.org/T350022
[07:40:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 2.9066639290823906s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:41:48] <jynus>	 prometheus job for bacula will complain, as it only has one job, which I stopped
[07:42:00] <jynus>	 will ack when it complains
[07:45:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.8177954390190108s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[07:48:20] <wikibugs>	 (PS1) Marostegui: db1164: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/974071 (https://phabricator.wikimedia.org/T349090)
[07:48:54] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:49:15] <wikibugs>	 (CR) Marostegui: [C: +2] db1164: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/974071 (https://phabricator.wikimedia.org/T349090) (owner: Marostegui)
[07:51:32] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/974070 (https://phabricator.wikimedia.org/T351143) (owner: Slyngshede)
[07:52:47] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] Implement stricter permission checks [software/bitu] - https://gerrit.wikimedia.org/r/974070 (https://phabricator.wikimedia.org/T351143) (owner: Slyngshede)
[07:53:54] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:57:52] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to stewards-users and group approver role for urbanecm - https://phabricator.wikimedia.org/T350834 (DMburugu) I've discussed this with @Urbanecm and I approve his access.
[07:59:36] <moritzm>	 !log installing dbus security updates on bullseye
[07:59:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: That opportune time is upon us again. Time for a UTC morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T0800).
[08:00:05] <jouncebot>	 apergos: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:16] <apergos>	 o/
[08:02:42] <apergos>	 who's running the window today?
[08:04:25] <apergos>	 Amir1 or urbanecm, either of you around?
[08:04:45] <marostegui>	 !log Failover m1 from db1164 to db1119 - T350022
[08:04:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:50] <stashbot>	 T350022: Switchover m1 master (db1164-> db1119) - https://phabricator.wikimedia.org/T350022
[08:05:03] <marostegui>	 all done
[08:05:16] <arnaudb>	 👏
[08:05:24] <jynus>	 should we merge the other patch?
[08:05:26] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[08:05:28] <marostegui>	 yep
[08:05:35] <marostegui>	 etherpad seems to be fine
[08:05:39] <marostegui>	 no restart required
[08:05:48] <wikibugs>	 (CR) Marostegui: [C: +2] dbbackups: Switchover master from db1164 to db1119 [puppet] - https://gerrit.wikimedia.org/r/969753 (https://phabricator.wikimedia.org/T350022) (owner: Jcrespo)
[08:06:12] <moritzm>	 !log installing nghttp2 security updates
[08:06:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:06:33] <jynus>	 will run puppet on backupmon, backup1001
[08:06:41] <jynus>	 when you tell me, marostegui
[08:06:45] <marostegui>	 jynus: go for it
[08:07:31] <wikibugs>	 (CR) Ayounsi: "Please make sure someone from Traffic (hello Sukhe :) ) had a look as well given how it's tied to critical parts of the infra (DNS)." [puppet] - https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: Majavah)
[08:08:09] <jynus>	 icinga checks should be updated now
[08:08:15] <jynus>	 bacula is still starting
[08:08:19] <icinga-wm>	 PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-debmonitor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:08:34] <marostegui>	 moritzm: ^that is probably because of the m1 switchover
[08:08:53] <jynus>	 maybe try restarting it?
[08:09:02] <marostegui>	 yeah
[08:09:12] <apergos>	 hrm with no backport deployment window runner, I feel uneasy just self-deploying anyways... guess I'll wait and see if one of them turns up
[08:09:13] <marostegui>	 first time it happens
[08:09:41] <icinga-wm>	 RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:21] <wikibugs>	 (CR) Ayounsi: "Change overall lgtm but I don't know enough about nftables to properly review it." [puppet] - https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: Majavah)
[08:10:22] <jynus>	 bacula should be back up
[08:10:33] <jynus>	 running a backup to confirm
[08:10:44] <marostegui>	 jynus: great, when done let me know, so I can reimage the old master
[08:11:05] <moritzm>	 k, I'm restarting cfssl-ocsprefresh-debmonitor.service to be on the safe side
[08:11:13] <marostegui>	 moritzm: I am doing it :)
[08:11:21] <marostegui>	 And it is taking ages btw
[08:12:29] <wikibugs>	 (PS1) Slyngshede: Stricter checking of user id when updating email address. [software/bitu] - https://gerrit.wikimedia.org/r/974073
[08:12:44] <jynus>	 12.69 G  OK       14-Nov-23 08:12 gerrit1003.wikimedia.org-Hourly-Fri-productionEqiad-gerrit-repo-data
[08:12:53] <jynus>	 ^ marostegui
[08:13:27] <jynus>	 everything looking good on my side
[08:13:29] <wikibugs>	 (PS2) Slyngshede: Stricter checking of user id when updating email address. [software/bitu] - https://gerrit.wikimedia.org/r/974073
[08:13:44] <icinga-wm>	 PROBLEM - MariaDB read only m1 #page on db1164 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[08:13:54] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:13:59] <jynus>	 oh
[08:14:10] <jynus>	 puppet didn't update?
[08:14:36] <marostegui>	 yeah I guess
[08:14:38] <marostegui>	 Sorry for the page
[08:15:19] <jynus>	 it is not that
[08:15:25] <jynus>	 it says "Could not connect to localhost:3306"
[08:15:41] <jynus>	 0 processes with command name 'mysqld' did it crash?
[08:15:44] <marostegui>	 No
[08:15:47] <marostegui>	 I stopped mysql
[08:15:48] <jynus>	 or just downtime
[08:15:57] <marostegui>	 but puppet didn't run on icinga yet, so notifications were enabled
[08:16:00] <jynus>	 ah, good, then the procedure itself worked
[08:16:35] <jynus>	 it was just the "maintenance" after the switch
[08:16:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1164.eqiad.wmnet with OS bookworm
[08:17:30] * Emperor arrives with first tea of the day
[08:18:38] <jynus>	 moritzm: did the debmonitor alert get fixed?
[08:18:50] <marostegui>	 yes, check irc
[08:18:56] <marostegui>	 it was fixed with the restart
[08:19:04] <marostegui>	 [09:09:41]  <+icinga-wm> RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:19:05] <moritzm>	 yeah, it recovered with the restart
[08:19:19] <marostegui>	 it took a while to restart though, I was surprised
[08:19:39] <moritzm>	 it usually takes >3 min
[08:19:50] <jynus>	 please add those to the docs: https://wikitech.wikimedia.org/wiki/MariaDB/misc#m1
[08:20:04] <jynus>	 so next time it is not a surprise
[08:20:39] <jynus>	 the prometheus exporter for bacula didn't recover, so doing another manual restart
[08:20:45] <marostegui>	 done
[08:20:51] <marostegui>	 (added to the docs)
[08:21:57] <jynus>	 I will add that too, although I think it is a bug in the daemon config for dependencies
[08:26:07] <jynus>	 actually, the exporter is ok, but I think there is some lag on the alerting
[08:26:22] <jynus>	 should recover after whatever is the window for checking
[08:27:26] <wikibugs>	 (PS1) Marostegui: Revert "dbbackups: Switchover master from db1164 to db1119" [puppet] - https://gerrit.wikimedia.org/r/973804
[08:27:33] <wikibugs>	 (CR) Marostegui: [C: -2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/973804 (owner: Marostegui)
[08:28:35] <wikibugs>	 (CR) Jcrespo: [C: +1] Revert "dbbackups: Switchover master from db1164 to db1119" [puppet] - https://gerrit.wikimedia.org/r/973804 (owner: Marostegui)
[08:28:48] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1164.eqiad.wmnet with reason: host reimage
[08:29:34] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [software/bitu] - https://gerrit.wikimedia.org/r/974073 (owner: Slyngshede)
[08:30:07] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] Stricter checking of user id when updating email address. [software/bitu] - https://gerrit.wikimedia.org/r/974073 (owner: Slyngshede)
[08:32:40] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1164.eqiad.wmnet with reason: host reimage
[08:37:35] <wikibugs>	 (CR) Muehlenhoff: [C: +1] P:bird::anycast: migrate to nftables (1 comment) [puppet] - https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: Majavah)
[08:41:01] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/973896 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:42:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:42:27] <wikibugs>	 (PS3) Elukey: services: upgrade changeprop jobqueue eqiad's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/971114 (https://phabricator.wikimedia.org/T348950)
[08:42:29] <wikibugs>	 (PS3) Elukey: changeprop: allow to define Kafka settings for Job Queues [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950)
[08:42:31] <wikibugs>	 (PS3) Elukey: changeprop: set num_workers to zero [deployment-charts] - https://gerrit.wikimedia.org/r/971225 (https://phabricator.wikimedia.org/T348950)
[08:44:29] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: add db1238 and prepare db1138 retirement [puppet] - https://gerrit.wikimedia.org/r/972507 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[08:46:12] <wikibugs>	 (PS1) Hashar: Archive repository [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/974104 (https://phabricator.wikimedia.org/T347623)
[08:46:24] <wikibugs>	 (CR) CI reject: [V: -1] Archive repository [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/974104 (https://phabricator.wikimedia.org/T347623) (owner: Hashar)
[08:46:58] <wikibugs>	 (CR) Hashar: [V: +2 C: +2] Archive repository [software/fifo-log-demux] - https://gerrit.wikimedia.org/r/974104 (https://phabricator.wikimedia.org/T347623) (owner: Hashar)
[08:52:41] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1164.eqiad.wmnet with OS bookworm
[08:52:42] <wikibugs>	 (PS1) Slyngshede: Version 0.0.3 - Block unauthorized access to keys. [software/bitu] - https://gerrit.wikimedia.org/r/974105 (https://phabricator.wikimedia.org/T351143)
[08:53:42] <wikibugs>	 (CR) Slyngshede: "Version 0.0.3 has already been deployed in production and test." [software/bitu] - https://gerrit.wikimedia.org/r/974105 (https://phabricator.wikimedia.org/T351143) (owner: Slyngshede)
[08:55:09] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, one nit inline" [software/bitu] - https://gerrit.wikimedia.org/r/974105 (https://phabricator.wikimedia.org/T351143) (owner: Slyngshede)
[08:56:17] <wikibugs>	 (PS2) Slyngshede: Version 0.0.3 - Block unauthorized access to keys. [software/bitu] - https://gerrit.wikimedia.org/r/974105 (https://phabricator.wikimedia.org/T351143)
[08:56:17] <godog>	 !log add 80g to prometheus/ops in eqiad
[08:56:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:25] <godog>	 !log add 80g to prometheus/k8s-ml-serve in eqiad
[08:56:26] <marostegui>	 jouncebot: now
[08:56:26] <jouncebot>	 For the next 0 hour(s) and 3 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T0800)
[08:56:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:30] <wikibugs>	 (CR) Slyngshede: Version 0.0.3 - Block unauthorized access to keys. (1 comment) [software/bitu] - https://gerrit.wikimedia.org/r/974105 (https://phabricator.wikimedia.org/T351143) (owner: Slyngshede)
[08:56:39] <wikibugs>	 (CR) Slyngshede: [V: +2 C: +2] Version 0.0.3 - Block unauthorized access to keys. [software/bitu] - https://gerrit.wikimedia.org/r/974105 (https://phabricator.wikimedia.org/T351143) (owner: Slyngshede)
[08:57:42] <wikibugs>	 Puppet, MediaModeration (MediaModeration 2.0), Trust and Safety Product Sprint (Sprint Bodhrán): [S] Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (Dreamy_Jazz)
[08:57:47] <wikibugs>	 (PS1) Marostegui: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/974106
[08:59:23] <wikibugs>	 (PS1) Marostegui: pc1013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/974107
[08:59:32] <wikibugs>	 (CR) Brouberol: [V: +1] Generate the netboot.cfg file to avoid typos impacting everyone (2 comments) [puppet] - https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol)
[08:59:52] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, one optional nit inline." [puppet] - https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) (owner: Majavah)
[09:00:01] <wikibugs>	 (CR) Marostegui: [C: +2] pc1013: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/974107 (owner: Marostegui)
[09:00:24] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add dedicated insetup role for Buster [puppet] - https://gerrit.wikimedia.org/r/973896 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[09:02:26] <wikibugs>	 (CR) Jcrespo: [C: +1] ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/974106 (owner: Marostegui)
[09:03:13] <wikibugs>	 (CR) Marostegui: [C: +2] ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - https://gerrit.wikimedia.org/r/974106 (owner: Marostegui)
[09:03:46] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on pc[2013-2014].codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Switch
[09:04:01] <wikibugs>	 (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc1014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974106 (owner: 10Marostegui)
[09:04:01] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc[2013-2014].codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Switch
[09:05:12] <logmsgbot>	 !log marostegui@deploy2002 Started scap: Backport for [[gerrit:974106|ProductionServices.php: Promote pc1014 to pc3 master]]
[09:06:44] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:974106|ProductionServices.php: Promote pc1014 to pc3 master]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:07:01] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[09:08:11] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] services: upgrade changeprop jobqueue eqiad's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/971114 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[09:11:55] <wikibugs>	 (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973805
[09:12:11] <wikibugs>	 (03PS1) 10Marostegui: Revert "pc1013: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/974126
[09:12:36] <logmsgbot>	 !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:974106|ProductionServices.php: Promote pc1014 to pc3 master]] (duration: 07m 24s)
[09:13:07] <wikibugs>	 (03PS1) 10Volans: sre.hosts.decommission: remove also from Puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/974108 (https://phabricator.wikimedia.org/T348319)
[09:13:34] <wikibugs>	 10Puppet, 10MediaModeration (MediaModeration 2.0), 10Trust and Safety Product Sprint (Sprint Bodhrán): [S] Add mediamoderation_scan to the private tables list on puppet - https://phabricator.wikimedia.org/T351095 (10Dreamy_Jazz) 05Open→03Resolved
[09:15:48] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Disable initial-import job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973866 (owner: 10Kosta Harlan)
[09:16:41] <wikibugs>	 (03CR) 10Volans: sre.hosts.decommission: remove also from Puppet7 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/974108 (https://phabricator.wikimedia.org/T348319) (owner: 10Volans)
[09:16:50] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: Disable initial-import job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973866 (owner: 10Kosta Harlan)
[09:18:13] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973805 (owner: 10Marostegui)
[09:18:58] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973805 (owner: 10Marostegui)
[09:19:04] <wikibugs>	 (03PS1) 10Elukey: services: add kafka base settings for cp-jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/974109
[09:19:25] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "pc1013: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/974126 (owner: 10Marostegui)
[09:19:34] <logmsgbot>	 !log marostegui@deploy2002 Started scap: Backport for [[gerrit:973805|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]]
[09:20:01] <wikibugs>	 (03PS3) 10Kosta Harlan: ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867
[09:20:34] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867 (owner: 10Kosta Harlan)
[09:20:38] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] services: add kafka base settings for cp-jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/974109 (owner: 10Elukey)
[09:20:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867 (owner: 10Kosta Harlan)
[09:20:58] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:973805|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:21:08] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[09:21:12] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10MatthewVernon) @Urbanecm_WMF I think this is awaiting confirmation from @KFrancis that an NDA has been signed (and that we have a legal name on file), per the comment from 1...
[09:22:36] <wikibugs>	 (03PS4) 10Kosta Harlan: ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867
[09:23:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867 (owner: 10Kosta Harlan)
[09:25:05] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:25:06] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
[09:25:33] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:26:18] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
[09:26:36] <logmsgbot>	 !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:973805|Revert "ProductionServices.php: Promote pc1014 to pc3 master"]] (duration: 07m 02s)
[09:27:01] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:27:41] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50861 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:28:09] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:28:15] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:28:41] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-staging-etcd2003.codfw.wmnet
[09:28:54] <wikibugs>	 (03PS3) 10Jbond: puppet.puppet.get_puppet_ca_hostname: return hardcoded start [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619)
[09:29:42] <wikibugs>	 (03PS1) 10Marostegui: ProductionServices.php: Promote pc2014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974113
[09:30:02] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[09:30:15] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: Maintenance
[09:30:37] <wikibugs>	 (03PS1) 10Marostegui: pc2013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/974114
[09:30:46] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] ProductionServices.php: Promote pc2014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974113 (owner: 10Marostegui)
[09:31:03] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on pc[2013-2014].codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Switch
[09:31:07] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] pc2013: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/974114 (owner: 10Marostegui)
[09:31:08] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc[2013-2014].codfw.wmnet,pc[1013-1014].eqiad.wmnet with reason: Switch
[09:31:32] <wikibugs>	 (03Merged) 10jenkins-bot: ProductionServices.php: Promote pc2014 to pc3 master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974113 (owner: 10Marostegui)
[09:32:00] <logmsgbot>	 !log marostegui@deploy2002 Started scap: Backport for [[gerrit:974113|ProductionServices.php: Promote pc2014 to pc3 master]]
[09:32:32] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[09:32:45] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1136.eqiad.wmnet with reason: Maintenance
[09:33:16] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[09:33:27] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:974113|ProductionServices.php: Promote pc2014 to pc3 master]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:33:30] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[09:33:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm feel free to merge or +1 and i will 😊" [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond)
[09:33:31] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[09:33:32] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[09:33:47] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[09:33:53] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1158 (T348183)', diff saved to https://phabricator.wikimedia.org/P53379 and previous config saved to /var/cache/conftool/dbconfig/20231114-093353-arnaudb.json
[09:33:55] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.hosts.reimage for host kubestage2002.codfw.wmnet with OS bullseye
[09:34:19] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10fgiunchedi)
[09:34:34] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[09:34:44] <wikibugs>	 (03PS5) 10Kosta Harlan: ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867
[09:35:35] <wikibugs>	 (03CR) 10Klausman: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/974115 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman)
[09:36:24] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867 (owner: 10Kosta Harlan)
[09:36:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T348183)', diff saved to https://phabricator.wikimedia.org/P53380 and previous config saved to /var/cache/conftool/dbconfig/20231114-093625-arnaudb.json
[09:36:30] <jayme>	 !log reimaging kubestage2002 to verify with puppet7
[09:36:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:43] <wikibugs>	 (03PS1) 10Jelto: gitlab_runner: unregister gitlab-runner1003 [puppet] - 10https://gerrit.wikimedia.org/r/974116 (https://phabricator.wikimedia.org/T344951)
[09:37:14] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: Enable and reschedule the daily updates job [deployment-charts] - 10https://gerrit.wikimedia.org/r/973867 (owner: 10Kosta Harlan)
[09:38:00] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[09:38:29] <wikibugs>	 (03CR) 10Klausman: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/449/con" [puppet] - 10https://gerrit.wikimedia.org/r/974115 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman)
[09:38:31] <logmsgbot>	 !log kharlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[09:38:31] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/974116 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[09:39:01] <wikibugs>	 (03CR) 10Klausman: [V: 03+1 C: 03+2] hiera: migrate ml-staging-etcd2003 to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/974115 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman)
[09:39:11] <logmsgbot>	 !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:974113|ProductionServices.php: Promote pc2014 to pc3 master]] (duration: 07m 11s)
[09:39:12] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] mariadb: add db1238 and prepare db1138 retirement (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/972507 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb)
[09:39:34] <wikibugs>	 (03PS1) 10Marostegui: Revert "ProductionServices.php: Promote pc2014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974127
[09:40:06] <wikibugs>	 (03PS1) 10Marostegui: Revert "pc2013: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/974128
[09:43:18] <wikibugs>	 (03PS6) 10Majavah: cloudlb: haproxy: migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087)
[09:43:20] <wikibugs>	 (03PS6) 10Majavah: P:bird::anycast: migrate to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087)
[09:43:22] <wikibugs>	 (03PS6) 10Majavah: hieradata: migrate codfw cloudlb to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973785 (https://phabricator.wikimedia.org/T351087)
[09:43:24] <wikibugs>	 (03PS6) 10Majavah: hieradata: migrate all cloudlb hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973806 (https://phabricator.wikimedia.org/T351087)
[09:43:46] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] gitlab_runner: unregister gitlab-runner1003 [puppet] - 10https://gerrit.wikimedia.org/r/974116 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[09:43:49] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:43:57] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:44:30] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "ProductionServices.php: Promote pc2014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974127 (owner: 10Marostegui)
[09:44:45] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "pc2013: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/974128 (owner: 10Marostegui)
[09:45:10] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "ProductionServices.php: Promote pc2014 to pc3 master" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974127 (owner: 10Marostegui)
[09:45:23] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] P:bird::anycast: migrate to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah)
[09:45:30] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-staging-etcd2003.codfw.wmnet
[09:45:36] <logmsgbot>	 !log marostegui@deploy2002 Started scap: Backport for [[gerrit:974127|Revert "ProductionServices.php: Promote pc2014 to pc3 master"]]
[09:45:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] cloudlb: haproxy: migrate to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah)
[09:46:15] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:46:23] <wikibugs>	 (03PS1) 10Kosta Harlan: ipoid: Remove timeZone property [deployment-charts] - 10https://gerrit.wikimedia.org/r/974119
[09:46:28] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Remove timeZone property [deployment-charts] - 10https://gerrit.wikimedia.org/r/974119 (owner: 10Kosta Harlan)
[09:47:03] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Backport for [[gerrit:974127|Revert "ProductionServices.php: Promote pc2014 to pc3 master"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:47:13] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] cloudlb: haproxy: migrate to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973781 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah)
[09:47:15] <wikibugs>	 (03Merged) 10jenkins-bot: ipoid: Remove timeZone property [deployment-charts] - 10https://gerrit.wikimedia.org/r/974119 (owner: 10Kosta Harlan)
[09:47:21] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:47:25] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50862 bytes in 8.354 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:47:27] <logmsgbot>	 !log marostegui@deploy2002 marostegui: Continuing with sync
[09:48:17] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.293 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:48:39] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] webperf::site: update to use multi root CA [puppet] - 10https://gerrit.wikimedia.org/r/973843 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[09:49:39] <wikibugs>	 (03PS1) 10Stevemunene: druid: remove druid100[4-6] from druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/974120
[09:50:35] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] Add a prometheus_instance parameter to prometheus::statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/973320 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis)
[09:51:32] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P53383 and previous config saved to /var/cache/conftool/dbconfig/20231114-095132-arnaudb.json
[09:51:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM modulo jobs configuration comments in related task" [puppet] - 10https://gerrit.wikimedia.org/r/973321 (https://phabricator.wikimedia.org/T343232) (owner: 10Btullis)
[09:52:00] <logmsgbot>	 !log jayme@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage
[09:52:40] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/450/con" [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah)
[09:53:02] <logmsgbot>	 !log marostegui@deploy2002 Finished scap: Backport for [[gerrit:974127|Revert "ProductionServices.php: Promote pc2014 to pc3 master"]] (duration: 07m 26s)
[09:53:22] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: provisioning db1238.eqiad.wmnet - T344036
[09:53:27] <stashbot>	 T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036
[09:53:36] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: provisioning db1238.eqiad.wmnet - T344036
[09:53:39] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: provisioning db1238.eqiad.wmnet - T344036
[09:53:54] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1238.eqiad.wmnet with reason: provisioning db1238.eqiad.wmnet - T344036
[09:54:16] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: unregister gitlab-runner1003 [puppet] - 10https://gerrit.wikimedia.org/r/974116 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[09:54:54] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestage2002.codfw.wmnet with reason: host reimage
[09:55:12] <wikibugs>	 (03CR) 10Filippo Giunchedi: Send metrics from Airflow analytics test (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu)
[09:55:50] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:56:02] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:57:00] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[09:57:03] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] use virtual db domain for CentralAuth, GlobalBlocking, OATHAuth (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: 10ArielGlenn)
[09:57:12] <wikibugs>	 (03PS1) 10Jelto: Revert "gitlab_runner: unregister gitlab-runner1003" [puppet] - 10https://gerrit.wikimedia.org/r/974129 (https://phabricator.wikimedia.org/T344951)
[10:01:07] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:02:11] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50861 bytes in 0.206 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:03:01] <jnuche>	 jouncebot: nowandnext
[10:03:02] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 56 minute(s)
[10:03:02] <jouncebot>	 In 0 hour(s) and 56 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1100)
[10:03:43] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:03:49] <jnuche>	 train presync failed last night, rerunning it now
[10:03:56] <logmsgbot>	 !log jnuche@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.5  refs T350081
[10:04:01] <stashbot>	 T350081: 1.42.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T350081
[10:04:41] <wikibugs>	 (03PS1) 10Gerrit maintenance bot: mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/973418 (https://phabricator.wikimedia.org/T351184)
[10:04:54] <wikibugs>	 10Puppet, 10Wikidata, 10Wikidata Analytics, 10wmde-wikidata-tech, 10Technical-Debt: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072 (10Lucas_Werkmeister_WMDE)
[10:05:29] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-role for role: etcd::v3::ml_etcd::staging
[10:06:39] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P53384 and previous config saved to /var/cache/conftool/dbconfig/20231114-100638-arnaudb.json
[10:07:51] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s4 T351184
[10:07:55] <stashbot>	 T351184: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T351184
[10:08:17] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s4 T351184
[10:08:43] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Set db1160 with weight 0 T351184', diff saved to https://phabricator.wikimedia.org/P53385 and previous config saved to /var/cache/conftool/dbconfig/20231114-100843-arnaudb.json
[10:10:31] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] Revert "gitlab_runner: unregister gitlab-runner1003" [puppet] - 10https://gerrit.wikimedia.org/r/974129 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[10:10:36] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) Some additional information  * puppet7 agents can talk to both centrallog1002 and ce...
[10:10:44] <wikibugs>	 (03PS1) 10Hnowlan: page-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974122 (https://phabricator.wikimedia.org/T350708)
[10:11:23] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] Revert "gitlab_runner: unregister gitlab-runner1003" [puppet] - 10https://gerrit.wikimedia.org/r/974129 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[10:11:28] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] hiera: migrate ML staging etcd role to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/974121 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman)
[10:15:46] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: etcd::v3::ml_etcd::staging
[10:15:51] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:41] <wikibugs>	 (03PS1) 10Jelto: gitlab_runner: unregister gitlab-runner1004 [puppet] - 10https://gerrit.wikimedia.org/r/974123 (https://phabricator.wikimedia.org/T344951)
[10:17:07] <wikibugs>	 (03CR) 10ArielGlenn: use virtual db domain for CentralAuth, GlobalBlocking, OATHAuth (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971967 (https://phabricator.wikimedia.org/T348486) (owner: 10ArielGlenn)
[10:17:35] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.270 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:18:07] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/974123 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[10:21:43] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) from a very simple test this appears to only affect buster  ` # in the following eve...
[10:21:45] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T348183)', diff saved to https://phabricator.wikimedia.org/P53386 and previous config saved to /var/cache/conftool/dbconfig/20231114-102145-arnaudb.json
[10:21:47] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[10:21:49] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[10:22:01] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[10:22:07] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53387 and previous config saved to /var/cache/conftool/dbconfig/20231114-102206-arnaudb.json
[10:24:16] <logmsgbot>	 !log jnuche@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.5  refs T350081 (duration: 20m 19s)
[10:24:21] <stashbot>	 T350081: 1.42.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T350081
[10:24:40] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (10MatthewVernon)
[10:25:18] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53388 and previous config saved to /var/cache/conftool/dbconfig/20231114-102517-arnaudb.json
[10:25:42] <wikibugs>	 (03PS1) 10MVernon: admin: ngkountas to have a shell account in the restricted group [puppet] - 10https://gerrit.wikimedia.org/r/974125 (https://phabricator.wikimedia.org/T350779)
[10:25:48] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10klausman)
[10:25:58] <moritzm>	 !log imported 5.1.19+4.0.11-3+wmf2+bullseye1 to component/php74 for bullseye-wikimedia
[10:26:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:24] <logmsgbot>	 !log jnuche@deploy2002 Pruned MediaWiki: 1.42.0-wmf.3 (duration: 02m 06s)
[10:26:45] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-staging-ctrl2002.codfw.wmnet
[10:29:32] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] hiera: migrate ml-staging-ctrl2002.codfw.wmnet to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/974146 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman)
[10:33:43] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] mariadb: Promote db1160 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/973418 (https://phabricator.wikimedia.org/T351184) (owner: 10Gerrit maintenance bot)
[10:33:52] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-staging-ctrl2002.codfw.wmnet
[10:34:30] <arnaudb>	 !log Starting s4 eqiad failover from db1138 to db1160 - T351184
[10:34:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:34] <stashbot>	 T351184: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T351184
[10:36:05] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Promote db1160 to s4 primary T351184', diff saved to https://phabricator.wikimedia.org/P53389 and previous config saved to /var/cache/conftool/dbconfig/20231114-103601-arnaudb.json
[10:38:24] <moritzm>	 !log imported php-redis 5.3.2+4.3.0-2+deb11u1+wmf2+bullseye1 to component/php74 for bullseye-wikimedia
[10:38:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:38:35] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) Feels like this could be related to https://bugs.debian.org/cgi-bin/bugreport.cgi?bu...
[10:39:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'T351184 - weight mirror', diff saved to https://phabricator.wikimedia.org/P53390 and previous config saved to /var/cache/conftool/dbconfig/20231114-103941-arnaudb.json
[10:39:51] <stashbot>	 T351184: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T351184
[10:40:12] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-role for role: ml_k8s::staging::master
[10:40:24] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P53391 and previous config saved to /var/cache/conftool/dbconfig/20231114-104024-arnaudb.json
[10:41:40] <wikibugs>	 (03PS1) 10Elukey: profile::thanos: add new istio recording rule [puppet] - 10https://gerrit.wikimedia.org/r/974148 (https://phabricator.wikimedia.org/T302995)
[10:41:42] <wikibugs>	 (03PS1) 10Elukey: profile::pyrra::filesystem: add Lift Wing pilot [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995)
[10:42:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/974125 (https://phabricator.wikimedia.org/T350779) (owner: 10MVernon)
[10:42:49] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] gitlab_runner: unregister gitlab-runner1004 [puppet] - 10https://gerrit.wikimedia.org/r/974123 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[10:43:08] <wikibugs>	 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff)
[10:46:04] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'migrate db1138 to db1238 - T344036', diff saved to https://phabricator.wikimedia.org/P53392 and previous config saved to /var/cache/conftool/dbconfig/20231114-104603-arnaudb.json
[10:46:08] <stashbot>	 T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036
[10:46:30] <wikibugs>	 (03PS1) 10Kamila Součková: kube-state-metrics: enable Prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/974151 (https://phabricator.wikimedia.org/T264625)
[10:46:33] <wikibugs>	 (03PS2) 10Elukey: profile::pyrra::filesystem: add Lift Wing pilot [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995)
[10:48:00] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: add config to db1238 [puppet] - 10https://gerrit.wikimedia.org/r/973419 (https://phabricator.wikimedia.org/T344036)
[10:48:10] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: ml_k8s::staging::master
[10:48:29] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: unregister gitlab-runner1004 [puppet] - 10https://gerrit.wikimedia.org/r/974123 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[10:49:13] <wikibugs>	 (03PS2) 10Arnaudb: mariadb: add config to db1238 [puppet] - 10https://gerrit.wikimedia.org/r/973419 (https://phabricator.wikimedia.org/T344036)
[10:49:15] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/452/con" [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey)
[10:49:28] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: add config to db1238 [puppet] - 10https://gerrit.wikimedia.org/r/973419 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb)
[10:49:30] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10klausman)
[10:49:36] <wikibugs>	 (03PS3) 10Arnaudb: mariadb: add config to db1238 [puppet] - 10https://gerrit.wikimedia.org/r/973419 (https://phabricator.wikimedia.org/T344036)
[10:50:14] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] mariadb: add config to db1238 [puppet] - 10https://gerrit.wikimedia.org/r/973419 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb)
[10:50:45] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-staging2001.codfw.wmnet
[10:51:33] <wikibugs>	 (03PS1) 10Jelto: Revert "gitlab_runner: unregister gitlab-runner1004" [puppet] - 10https://gerrit.wikimedia.org/r/974134 (https://phabricator.wikimedia.org/T344951)
[10:54:34] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.mysql.clone of db1138.eqiad.wmnet onto db1238.eqiad.wmnet
[10:54:36] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] hiera: migrate ml-staging2001 to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/974152 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman)
[10:55:16] <moritzm>	 !log imported php-msgpack 2.1.2+0.5.7-2+wmf1+bullseye1 to component/php74 for bullseye-wikimedia
[10:55:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:31] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317', diff saved to https://phabricator.wikimedia.org/P53393 and previous config saved to /var/cache/conftool/dbconfig/20231114-105530-arnaudb.json
[10:56:41] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) >  > edit: or possibly this one https://github.com/rsyslog/rsyslog/issues/4035 ok i...
[10:57:21] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] admin: ngkountas to have a shell account in the restricted group [puppet] - 10https://gerrit.wikimedia.org/r/974125 (https://phabricator.wikimedia.org/T350779) (owner: 10MVernon)
[10:57:43] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host an-presto1001.eqiad.wmnet
[10:58:24] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-staging2001.codfw.wmnet
[10:58:57] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access to production to run maintenance scripts and inspect production MediaWiki tables for Nik Gkountas - https://phabricator.wikimedia.org/T350779 (10MatthewVernon) Open→Resolved a:MatthewVernon This is now done (modulo time for...
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1100)
[11:01:11] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch an-presto1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974153 (https://phabricator.wikimedia.org/T349619)
[11:01:53] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] Revert "gitlab_runner: unregister gitlab-runner1004" [puppet] - 10https://gerrit.wikimedia.org/r/974134 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[11:02:16] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] Revert "gitlab_runner: unregister gitlab-runner1004" [puppet] - 10https://gerrit.wikimedia.org/r/974134 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[11:04:42] <wikibugs>	 (03PS1) 10MVernon: admin: add urbanecm to stewards-users group [puppet] - 10https://gerrit.wikimedia.org/r/974154 (https://phabricator.wikimedia.org/T350834)
[11:05:41] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] admin: add urbanecm to stewards-users group [puppet] - 10https://gerrit.wikimedia.org/r/974154 (https://phabricator.wikimedia.org/T350834) (owner: 10MVernon)
[11:06:59] <urbanecm>	 Emperor: fyi, there's https://gerrit.wikimedia.org/r/c/operations/puppet/+/972911 by Daniel already ready :)
[11:07:33] <Emperor>	 doh
[11:07:41] <wikibugs>	 (03PS1) 10Jelto: gitlab_runner: unregister gitlab-runners in codfw [puppet] - 10https://gerrit.wikimedia.org/r/974155 (https://phabricator.wikimedia.org/T344951)
[11:08:13] <wikibugs>	 (03Abandoned) 10MVernon: admin: add urbanecm to stewards-users group [puppet] - 10https://gerrit.wikimedia.org/r/974154 (https://phabricator.wikimedia.org/T350834) (owner: 10MVernon)
[11:08:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch an-presto1001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974153 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[11:09:08] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/974155 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[11:09:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Sounds good" [cookbooks] - 10https://gerrit.wikimedia.org/r/974108 (https://phabricator.wikimedia.org/T348319) (owner: 10Volans)
[11:09:39] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-role for role: ml_k8s::staging::worker
[11:09:42] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] admin: add urbanecm to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn)
[11:10:04] <wikibugs>	 (03PS6) 10MVernon: admin: add urbanecm to stewards-users [puppet] - 10https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: 10Dzahn)
[11:10:29] <icinga-wm>	 PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:10:37] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53394 and previous config saved to /var/cache/conftool/dbconfig/20231114-111037-arnaudb.json
[11:10:39] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[11:10:49] <Emperor>	 I'll rebase the CR and then merge it
[11:10:50] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[11:10:53] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[11:11:01] <Emperor>	 (assuming CI still content)
[11:11:39] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] hiera: migrate ML staging worker role to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/974156 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman)
[11:12:57] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[11:13:11] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[11:13:17] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1174 (T348183)', diff saved to https://phabricator.wikimedia.org/P53395 and previous config saved to /var/cache/conftool/dbconfig/20231114-111316-arnaudb.json
[11:15:40] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: ml_k8s::staging::worker
[11:15:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T348183)', diff saved to https://phabricator.wikimedia.org/P53396 and previous config saved to /var/cache/conftool/dbconfig/20231114-111549-arnaudb.json
[11:15:54] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[11:16:12] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stewards-users and group approver role for urbanecm - https://phabricator.wikimedia.org/T350834 (10MatthewVernon) Open→Resolved a:DMburugu→MatthewVernon Done (once puppet has done its magic).
[11:17:01] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10fgiunchedi) I can confirm that e.g. bookworm hosts are sending syslog fine, e.g. titan1002:...
[11:17:53] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10klausman)
[11:18:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host an-presto1001.eqiad.wmnet
[11:19:51] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10fgiunchedi) Ditto bullseye:  ` centrallog2002:~$ tail -5 /srv/syslog/thanos-fe1001/syslog.l...
[11:19:55] <wikibugs>	 10SRE, 10SRE-Access-Requests: Add Hamid & Omari to analytics-product-users - https://phabricator.wikimedia.org/T351130 (10MatthewVernon) I think this request needs management approval? Which would be @OSefu-WMF for @Hghani and @kzimmerman for @OSefu-WMF.  Can you both approve the relevant request, please?
[11:21:34] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] page-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974122 (https://phabricator.wikimedia.org/T350708) (owner: 10Hnowlan)
[11:21:38] <wikibugs>	 (03PS1) 10Kamila Součková: kube-state-metrics: DRY network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625)
[11:22:14] <wikibugs>	 (03CR) 10Santiago Faci: [C: 03+1] "It looks good! Thanks!!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/974122 (https://phabricator.wikimedia.org/T350708) (owner: 10Hnowlan)
[11:22:17] <wikibugs>	 (03Merged) 10jenkins-bot: page-analytics: bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/974122 (https://phabricator.wikimedia.org/T350708) (owner: 10Hnowlan)
[11:23:55] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10Jelto)
[11:24:16] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Stewards-Onboarding-Tool, 10Stewards-and-global-tools, and 2 others: VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10Urbanecm)
[11:25:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host gitlab1003.wikimedia.org
[11:25:50] <wikibugs>	 (03PS1) 10Jcrespo: RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882)
[11:26:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo)
[11:26:55] <wikibugs>	 (03CR) 10Jcrespo: "This is my second attempt, and thanks to Riccardo's help, it looks much cleaner now!" [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo)
[11:28:11] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch gitlab1003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974160 (https://phabricator.wikimedia.org/T349619)
[11:28:51] <wikibugs>	 (03CR) 10Jcrespo: "Unit test works for me locally, Could I be missing a dependency for CI?" [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo)
[11:29:17] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.8 point update - https://phabricator.wikimedia.org/T348327 (10MoritzMuehlenhoff)
[11:29:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch gitlab1003 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974160 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[11:30:40] <wikibugs>	 (03CR) 10MVernon: RemoteExecution: Add comments and fix a few lint errors (1 comment) [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo)
[11:30:56] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P53397 and previous config saved to /var/cache/conftool/dbconfig/20231114-113055-arnaudb.json
[11:31:11] <wikibugs>	 (03PS2) 10Kamila Součková: kube-state-metrics: DRY network policy [deployment-charts] - 10https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625)
[11:32:56] <wikibugs>	 (03CR) 10Jcrespo: "Let's focus on the real fix first (the next patch), then the things we find along the way, otherwise we will never finish :-D" [software/transferpy] - 10https://gerrit.wikimedia.org/r/972729 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo)
[11:34:00] <wikibugs>	 (03PS2) 10Jcrespo: RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882)
[11:34:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo)
[11:34:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host gitlab1003.wikimedia.org
[11:35:50] <wikibugs>	 (03CR) 10Jcrespo: "that ain't it" [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo)
[11:36:46] <wikibugs>	 (03PS3) 10Jcrespo: RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882)
[11:37:27] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, I'll let Keith vote though" [puppet] - 10https://gerrit.wikimedia.org/r/974148 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey)
[11:37:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo)
[11:38:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/974151 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková)
[11:38:43] <wikibugs>	 10sre-alert-triage, 10Data-Platform-SRE: Alert triage: overdue alert [critical] The following units failed: wikidatardf-lexemes-dumps.service - https://phabricator.wikimedia.org/T343896 (10LSobanski) The alert has since recovered but looking at the names in the linked change I'm adding Data Platform SRE to rev...
[11:40:09] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::presto::server
[11:40:47] <wikibugs>	 (03CR) 10Jcrespo: "Ah, it is the cumin version, it is hardcoded." [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo)
[11:41:35] <wikibugs>	 (03PS4) 10Jcrespo: RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882)
[11:42:29] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch analytics_cluster::presto::server to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974162 (https://phabricator.wikimedia.org/T349619)
[11:44:11] <wikibugs>	 (03CR) 10Jcrespo: "I will add a "cumin>=4.2.0" I guess?" [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo)
[11:45:13] <moritzm>	 !log imported xdebug 3.0.3+2.9.8+2.8.1+2.5.5-0+deb11u1+wmf1+bullseye1 to component/php74 for bullseye-wikimedia
[11:45:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:56] <wikibugs>	 (03PS5) 10Jcrespo: RemoteExecution: Remove cumin logged errors from low level execution [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882)
[11:46:02] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P53398 and previous config saved to /var/cache/conftool/dbconfig/20231114-114602-arnaudb.json
[11:46:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_cluster::presto::server to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974162 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[11:51:08] <wikibugs>	 (03PS4) 10Volans: puppet.puppet.get_puppet_ca_hostname: return hardcoded start [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[11:51:20] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[11:53:55] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:56:17] <wikibugs>	 (03CR) 10Jcrespo: "This should be ready now for review. I don't expect you to OK the merge as it is, just to sanity check and confirm this is the right ap" [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo)
[11:56:18] <jinxer-wm>	 (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[11:59:49] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1013 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:01:08] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T348183)', diff saved to https://phabricator.wikimedia.org/P53399 and previous config saved to /var/cache/conftool/dbconfig/20231114-120108-arnaudb.json
[12:01:10] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[12:01:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::presto::server
[12:01:23] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[12:01:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1191 (T348183)', diff saved to https://phabricator.wikimedia.org/P53400 and previous config saved to /var/cache/conftool/dbconfig/20231114-120129-arnaudb.json
[12:01:33] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[12:02:04] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:04:02] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T348183)', diff saved to https://phabricator.wikimedia.org/P53401 and previous config saved to /var/cache/conftool/dbconfig/20231114-120401-arnaudb.json
[12:04:09] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[12:05:32] <wikibugs>	 10SRE, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists: stewards1001 / stewards2001: Enable API access for Mailman3 - https://phabricator.wikimedia.org/T351202 (10Urbanecm)
[12:05:35] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply
[12:06:03] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply
[12:06:36] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply
[12:06:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: gitlab
[12:06:50] <wikibugs>	 10SRE, 10Stewards-Onboarding-Tool, 10Wikimedia-Mailing-lists: stewards1001 / stewards2001: Enable API access for Mailman3 - https://phabricator.wikimedia.org/T351202 (10Urbanecm) FTR, I'm currently working on automating the various MediaWiki accesses (group membership, accounts on private wikis, etc.), but I...
[12:07:03] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply
[12:07:04] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:08:05] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/page-analytics: apply
[12:08:31] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply
[12:08:33] <icinga-wm>	 RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:08:59] <wikibugs>	 10SRE, 10SRE-Access-Requests: Add Hamid & Omari to analytics-product-users - https://phabricator.wikimedia.org/T351130 (10OSefu-WMF) Approved!
[12:09:17] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch gitlab to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974163 (https://phabricator.wikimedia.org/T349619)
[12:11:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch gitlab to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974163 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:11:22] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) >>! In T351181#9329892, @jbond wrote: >>  >> edit: or possibly this one https://gith...
[12:11:25] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm, tests on gitlab1003 were ok" [puppet] - 10https://gerrit.wikimedia.org/r/974163 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:13:05] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] kube-state-metrics: enable Prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/974151 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková)
[12:13:55] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:15:13] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "I would propose to create a calico networkpolicy instead to not have to introduce another use of kubernetesMasters.cidrs (ideally that" [deployment-charts] - 10https://gerrit.wikimedia.org/r/974158 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková)
[12:16:18] <jinxer-wm>	 (KubernetesCalicoDown) resolved: kubestage2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:17:14] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: gitlab
[12:19:08] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P53402 and previous config saved to /var/cache/conftool/dbconfig/20231114-121908-arnaudb.json
[12:19:29] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "thanks for all the work CR looks good but some minor things around style guide issues and code placement" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[12:20:44] <logmsgbot>	 !log jayme@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestage2002.codfw.wmnet with OS bullseye
[12:22:08] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/974108 (https://phabricator.wikimedia.org/T348319) (owner: 10Volans)
[12:22:30] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[12:22:47] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet.puppet.get_puppet_ca_hostname: return hardcoded start [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[12:23:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] webperf::site: update to use multi root CA [puppet] - 10https://gerrit.wikimedia.org/r/973843 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[12:29:39] <wikibugs>	 (03Merged) 10jenkins-bot: puppet.puppet.get_puppet_ca_hostname: return hardcoded start [software/spicerack] - 10https://gerrit.wikimedia.org/r/971957 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[12:32:22] <wikibugs>	 (03PS1) 10Btullis: Increase the size of the innodb pool on analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/974164 (https://phabricator.wikimedia.org/T284150)
[12:33:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::misc::analytics::backup
[12:33:44] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] kube-state-metrics: enable Prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/974151 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková)
[12:34:15] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P53403 and previous config saved to /var/cache/conftool/dbconfig/20231114-123414-arnaudb.json
[12:35:14] <wikibugs>	 (03PS1) 10Btullis: Enable notifications for new analytics_meta hosts [puppet] - 10https://gerrit.wikimedia.org/r/974165 (https://phabricator.wikimedia.org/T284150)
[12:35:25] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974164 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis)
[12:35:46] <wikibugs>	 (03PS1) 10Hashar: Add a banner for the 2023 developer survey [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/974166 (https://phabricator.wikimedia.org/T351109)
[12:36:00] <wikibugs>	 (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974165 (https://phabricator.wikimedia.org/T284150) (owner: 10Btullis)
[12:36:16] <wikibugs>	 (03Merged) 10jenkins-bot: kube-state-metrics: enable Prometheus scraping [deployment-charts] - 10https://gerrit.wikimedia.org/r/974151 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková)
[12:36:28] <wikibugs>	 (03CR) 10Hashar: [C: 04-1] "-1 since the link to the google form is a placeholder." [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/974166 (https://phabricator.wikimedia.org/T351109) (owner: 10Hashar)
[12:36:55] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] Plugin to process Puppet Catalog Compiler results [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/969981 (owner: 10Hashar)
[12:37:15] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[12:37:28] <wikibugs>	 (03Merged) 10jenkins-bot: Plugin to process Puppet Catalog Compiler results [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/969981 (owner: 10Hashar)
[12:37:51] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[12:38:55] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] profile::thanos: add new istio recording rule [puppet] - 10https://gerrit.wikimedia.org/r/974148 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey)
[12:39:39] <wikibugs>	 (03PS1) 10Btullis: Promote an-mariadb1001 to be the new primary for analytics_meta [puppet] - 10https://gerrit.wikimedia.org/r/974167 (https://phabricator.wikimedia.org/T284150)
[12:41:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::misc::analytics::backup
[12:42:31] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[12:42:59] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[12:45:22] <wikibugs>	 (03CR) 10Jbond: peopleweb: migrate role to puppet 7 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/973855 (owner: 10Dzahn)
[12:46:00] <wikibugs>	 (03PS1) 10Sergio Gimeno: GrowthExperiments: enable AddLink backend for 16,17th rounds of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974169 (https://phabricator.wikimedia.org/T308142)
[12:46:13] <logmsgbot>	 !log hashar@deploy2002 Started deploy [gerrit/gerrit@a087269]: Plugin to process Puppet Catalog Compiler results - https://gerrit.wikimedia.org/r/969981
[12:46:13] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch mariadb::misc::analytics::backup to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974170 (https://phabricator.wikimedia.org/T349619)
[12:46:17] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [gerrit/gerrit@a087269]: Plugin to process Puppet Catalog Compiler results - https://gerrit.wikimedia.org/r/969981 (duration: 00m 04s)
[12:47:40] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[12:48:21] <logmsgbot>	 !log hashar@deploy2002 Started deploy [gerrit/gerrit@a087269]: Plugin to process Puppet Catalog Compiler results - https://gerrit.wikimedia.org/r/969981
[12:48:28] <logmsgbot>	 !log hashar@deploy2002 Finished deploy [gerrit/gerrit@a087269]: Plugin to process Puppet Catalog Compiler results - https://gerrit.wikimedia.org/r/969981 (duration: 00m 07s)
[12:48:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch mariadb::misc::analytics::backup to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974170 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:49:22] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T348183)', diff saved to https://phabricator.wikimedia.org/P53404 and previous config saved to /var/cache/conftool/dbconfig/20231114-124921-arnaudb.json
[12:49:23] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[12:49:26] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[12:49:36] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[12:49:43] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1194 (T348183)', diff saved to https://phabricator.wikimedia.org/P53405 and previous config saved to /var/cache/conftool/dbconfig/20231114-124942-arnaudb.json
[12:51:05] <wikibugs>	 (03PS1) 10Kamila Součková: kube-state-metrics: enable in codfw + staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/974171 (https://phabricator.wikimedia.org/T264625)
[12:51:25] <wikibugs>	 (03PS1) 10Btullis: WIP - Temporarily disable the production jobs that write to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/974172 (https://phabricator.wikimedia.org/T284150)
[12:51:27] <wikibugs>	 (03PS1) 10Btullis: WIP Re-enable the production pipelines that write to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/974173 (https://phabricator.wikimedia.org/T284150)
[12:52:15] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T348183)', diff saved to https://phabricator.wikimedia.org/P53406 and previous config saved to /var/cache/conftool/dbconfig/20231114-125214-arnaudb.json
[12:52:40] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::misc::analytics::backup
[12:55:05] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[12:55:09] <wikibugs>	 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (10Marostegui) I think we are going to need to tweak this a bit more: ` -rw-rw---- 1 mysql mysql   61G Nov 14 12:44 syslog.ibd `  61GB is quite large for what this is, t...
[12:55:50] <wikibugs>	 (03PS1) 10Majavah: P:openstack: galera: fix firewall port [puppet] - 10https://gerrit.wikimedia.org/r/974175 (https://phabricator.wikimedia.org/T351061)
[12:56:19] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:56:39] <wikibugs>	 (03PS2) 10Hnowlan: rest-gateway: increase resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/972746
[12:56:47] <wikibugs>	 10SRE, 10serviceops: Rebuild PHP 7.4 packages for Bullseye - https://phabricator.wikimedia.org/T350767 (10MoritzMuehlenhoff)
[12:57:25] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/454/con" [puppet] - 10https://gerrit.wikimedia.org/r/974175 (https://phabricator.wikimedia.org/T351061) (owner: 10Majavah)
[13:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1300)
[13:00:40] <wikibugs>	 (03CR) 10Volans: [C: 03+2] sre.hosts.decommission: remove also from Puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/974108 (https://phabricator.wikimedia.org/T348319) (owner: 10Volans)
[13:00:42] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] rest-gateway: increase resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/972746 (owner: 10Hnowlan)
[13:02:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: analytics_cluster::mariadb
[13:03:54] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openstack: galera: fix firewall port [puppet] - 10https://gerrit.wikimedia.org/r/974175 (https://phabricator.wikimedia.org/T351061) (owner: 10Majavah)
[13:04:56] <wikibugs>	 (03PS1) 10Jbond: nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094)
[13:04:59] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.decommission: remove also from Puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/974108 (https://phabricator.wikimedia.org/T348319) (owner: 10Volans)
[13:05:16] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bookworm
[13:05:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond)
[13:06:00] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host sretest1004.eqiad.wmnet
[13:06:11] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch analytics_cluster::mariadb to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974178 (https://phabricator.wikimedia.org/T349619)
[13:07:21] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P53407 and previous config saved to /var/cache/conftool/dbconfig/20231114-130721-arnaudb.json
[13:09:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch analytics_cluster::mariadb to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974178 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[13:10:02] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol2005-dev.codfw.wmnet
[13:10:09] <wikibugs>	 (03CR) 10Sergio Gimeno: [C: 03+1] IP Masking: Set expiryAfterDays to 1 day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973172 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm)
[13:11:00] <wikibugs>	 (03CR) 10Majavah: "duplicate of https://gerrit.wikimedia.org/r/c/operations/puppet/+/973782?" [puppet] - 10https://gerrit.wikimedia.org/r/973171 (owner: 10Jbond)
[13:11:05] <wikibugs>	 (03PS2) 10Jbond: nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094)
[13:12:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond)
[13:14:04] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 12 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond)
[13:14:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: analytics_cluster::mariadb
[13:17:07] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab_runner: unregister gitlab-runners in codfw [puppet] - 10https://gerrit.wikimedia.org/r/974155 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[13:17:23] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[13:17:58] <wikibugs>	 (03PS4) 10Elukey: changeprop: set num_workers to zero [deployment-charts] - 10https://gerrit.wikimedia.org/r/971225 (https://phabricator.wikimedia.org/T348950)
[13:19:05] <wikibugs>	 (03PS3) 10Jbond: nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094)
[13:19:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: releases
[13:19:24] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] changeprop: set num_workers to zero [deployment-charts] - 10https://gerrit.wikimedia.org/r/971225 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[13:19:57] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host sretest1004.eqiad.wmnet
[13:20:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond)
[13:20:02] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host sretest1004.eqiad.wmnet
[13:20:21] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-cache2003.codfw.wmnet
[13:21:24] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch releases to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974179 (https://phabricator.wikimedia.org/T349619)
[13:21:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10Volans)
[13:22:06] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Infrastructure, and 2 others: Update reimage cookbooks to work with puppet7 - https://phabricator.wikimedia.org/T348319 (10Volans) 05In progress→03Resolved This is now done.
[13:22:20] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/transferpy] - 10https://gerrit.wikimedia.org/r/974159 (https://phabricator.wikimedia.org/T330882) (owner: 10Jcrespo)
[13:22:24] <wikibugs>	 (03PS2) 10Btullis: Switch datahub to use the new an-mariadb servers instead of an-coord [deployment-charts] - 10https://gerrit.wikimedia.org/r/972823 (https://phabricator.wikimedia.org/T284150)
[13:22:28] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P53408 and previous config saved to /var/cache/conftool/dbconfig/20231114-132227-arnaudb.json
[13:22:44] <wikibugs>	 (03PS1) 10Jelto: Revert "gitlab_runner: unregister gitlab-runners in codfw" [puppet] - 10https://gerrit.wikimedia.org/r/974135 (https://phabricator.wikimedia.org/T344951)
[13:22:52] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] hiera: Migrate ml-cache2003.codfw.wmnet to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/974180 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman)
[13:24:02] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch releases to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974179 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[13:24:28] <wikibugs>	 (03PS4) 10Jbond: nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094)
[13:25:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond)
[13:26:09] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] Revert "gitlab_runner: unregister gitlab-runners in codfw" [puppet] - 10https://gerrit.wikimedia.org/r/974135 (https://phabricator.wikimedia.org/T344951) (owner: 10Jelto)
[13:26:33] <wikibugs>	 (03PS5) 10Jbond: nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094)
[13:26:43] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-cache2003.codfw.wmnet
[13:26:48] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 12 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond)
[13:26:53] <icinga-wm>	 PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:27:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond)
[13:28:59] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/459/console" [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond)
[13:29:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: releases
[13:30:19] <icinga-wm>	 RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[13:30:48] <logmsgbot>	 !log taavi@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host cloudcontrol2005-dev.codfw.wmnet
[13:32:55] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[13:33:46] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] hiera: migrate ml-cache2003.codfw.wmnet to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/974182 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman)
[13:34:18] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-host for host ml-cache1003.eqiad.wmnet
[13:37:34] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T348183)', diff saved to https://phabricator.wikimedia.org/P53409 and previous config saved to /var/cache/conftool/dbconfig/20231114-133734-arnaudb.json
[13:37:36] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[13:37:39] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[13:37:50] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[13:37:56] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1202 (T348183)', diff saved to https://phabricator.wikimedia.org/P53410 and previous config saved to /var/cache/conftool/dbconfig/20231114-133755-arnaudb.json
[13:38:21] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ml-cache1003.eqiad.wmnet
[13:39:33] <wikibugs>	 (03PS6) 10Jbond: nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094)
[13:40:28] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T348183)', diff saved to https://phabricator.wikimedia.org/P53411 and previous config saved to /var/cache/conftool/dbconfig/20231114-134028-arnaudb.json
[13:41:28] <logmsgbot>	 !log klausman@cumin1001 START - Cookbook sre.puppet.migrate-role for role: ml_cache::storage
[13:42:11] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop: sync
[13:42:26] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[13:42:59] <wikibugs>	 (03Abandoned) 10Jbond: bird::anycast: move firewall rules to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/973171 (owner: 10Jbond)
[13:43:07] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync
[13:43:22] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync
[13:43:26] <wikibugs>	 (03CR) 10Jbond: bird::anycast: move firewall rules to firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973171 (owner: 10Jbond)
[13:43:49] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch phab-test1001 to insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/974184 (https://phabricator.wikimedia.org/T349619)
[13:43:56] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] P:bird::anycast: migrate to nftables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah)
[13:44:05] <wikibugs>	 (03CR) 10Klausman: [C: 03+2] hiera: migrate ML cache/cassandara role to Puppet v7 [puppet] - 10https://gerrit.wikimedia.org/r/974183 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman)
[13:44:15] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10serviceops-radar, 10Patch-For-Review, 10Puppet (Puppet 7.0): expose_puppet_certs:  Services will need to trust the new ca - https://phabricator.wikimedia.org/T340741 (10jbond)
[13:45:11] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1138.eqiad.wmnet onto db1238.eqiad.wmnet
[13:47:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah)
[13:47:49] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/460/con" [puppet] - 10https://gerrit.wikimedia.org/r/973840 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[13:48:25] <logmsgbot>	 !log klausman@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: ml_cache::storage
[13:50:43] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/973840 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[13:51:13] <wikibugs>	 (03CR) 10Brouberol: "I have reverted the recent changes on the subnet files, that I will get to in another CR. This one was getting out of hand." [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[13:51:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/973741 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi)
[13:53:52] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/973841 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[13:55:20] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/973842 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[13:55:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P53412 and previous config saved to /var/cache/conftool/dbconfig/20231114-135534-arnaudb.json
[13:55:43] <wikibugs>	 (03CR) 10Klausman: [V: 03+1 C: 03+2] hiera: Cleanup of temporary overrides for Puppet v7 migration [puppet] - 10https://gerrit.wikimedia.org/r/974185 (https://phabricator.wikimedia.org/T349619) (owner: 10Klausman)
[13:57:14] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10klausman)
[13:57:27] <icinga-wm>	 PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv6: Connect - Init7, AS13030/IPv4: Active - Init7 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:57:34] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/463/con" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[13:59:26] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] profile::pyrra::filesystem: add Lift Wing pilot [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey)
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1400)
[14:00:05] <jouncebot>	 apergos: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:07] <wikibugs>	 (03PS65) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059)
[14:00:24] <Lucas_WMDE>	 I can’t deploy, sorry
[14:00:26] <apergos>	 no it's not, I think I removed it from the calendar
[14:00:38] <Lucas_WMDE>	 I still see it there
[14:00:50] <apergos>	 ok seriously? every single thing I touch these days I do wrong
[14:00:56] <apergos>	 trying again to remove it
[14:02:01] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/464/con" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[14:03:25] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 15%: Host failed to be depooled properly', diff saved to https://phabricator.wikimedia.org/P53413 and previous config saved to /var/cache/conftool/dbconfig/20231114-140325-arnaudb.json
[14:04:03] <apergos>	 all right Lucas_WMDE it is now gone, I am sure of it
[14:04:12] <Lucas_WMDE>	 yay ^^
[14:04:14] <Lucas_WMDE>	 nothing to deploy then
[14:04:37] <urbanecm>	 apergos: and i was thinking "why are we cancelling the window" :))
[14:04:46] <urbanecm>	 someone mind me stealing it?
[14:04:51] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hosts.reimage for host an-druid1004.eqiad.wmnet with OS bullseye
[14:05:02] <apergos>	 I sure don't mind :-D
[14:05:32] <wikibugs>	 (03PS4) 10Urbanecm: IP Masking: Set expiryAfterDays to 1 day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973172 (https://phabricator.wikimedia.org/T344695)
[14:05:40] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] IP Masking: Set expiryAfterDays to 1 day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973172 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm)
[14:06:34] <wikibugs>	 (03Merged) 10jenkins-bot: IP Masking: Set expiryAfterDays to 1 day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973172 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm)
[14:06:39] <wikibugs>	 (03PS1) 10Urbanecm: IP Masking: Expire temporary accounts in 1 year [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974143 (https://phabricator.wikimedia.org/T344695)
[14:06:46] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] IP Masking: Expire temporary accounts in 1 year [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974143 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm)
[14:07:28] <wikibugs>	 10SRE, 10SRE-Access-Requests: Add Hamid & Omari to analytics-product-users - https://phabricator.wikimedia.org/T351130 (10MatthewVernon) Thanks. I just need @kzimmerman to approve your access and then I can proceed.
[14:08:48] <wikibugs>	 (03PS1) 10Urbanecm: TempUser: Fix unchecked array access for optional key [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974191
[14:08:58] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] TempUser: Fix unchecked array access for optional key [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974191 (owner: 10Urbanecm)
[14:09:20] <wikibugs>	 (03PS1) 10Urbanecm: IP Masking: Add expireTemporaryAccounts.php [extensions/CentralAuth] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974144 (https://phabricator.wikimedia.org/T344695)
[14:10:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P53414 and previous config saved to /var/cache/conftool/dbconfig/20231114-141041-arnaudb.json
[14:10:56] <icinga-wm>	 PROBLEM - Check systemd state on puppetserver1001 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:11:29] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] IP Masking: Add expireTemporaryAccounts.php [extensions/CentralAuth] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974144 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm)
[14:15:43] <wikibugs>	 (03PS66) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059)
[14:16:50] <wikibugs>	 (03CR) 10Jbond: "LGTM just some changes on the rspec" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[14:17:15] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/465/con" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[14:18:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 30%: Host failed to be depooled properly', diff saved to https://phabricator.wikimedia.org/P53415 and previous config saved to /var/cache/conftool/dbconfig/20231114-141830-arnaudb.json
[14:18:42] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/973739 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi)
[14:18:44] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Release 1.15.14 [debs/pybal] (1.15-stretch) - 10https://gerrit.wikimedia.org/r/973732 (https://phabricator.wikimedia.org/T348837) (owner: 10Vgutierrez)
[14:19:09] <wikibugs>	 (03PS67) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059)
[14:19:32] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10Jdforrester-WMF)
[14:20:22] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10Jdforrester-WMF)
[14:20:33] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (10Jdforrester-WMF) a:05Jdforrester-WMF→03None
[14:20:51] <logmsgbot>	 !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host sretest1004.eqiad.wmnet
[14:20:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet
[14:20:56] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.dhcp for host sretest1004.eqiad.wmnet
[14:21:13] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/466/con" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[14:22:11] <logmsgbot>	 !log brouberol@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-druid1004.eqiad.wmnet with reason: host reimage
[14:23:11] <wikibugs>	 (03Merged) 10jenkins-bot: IP Masking: Expire temporary accounts in 1 year [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974143 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm)
[14:23:58] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "Final diff for netboot.cfg: https://phabricator.wikimedia.org/P53293" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[14:24:40] <wikibugs>	 (03Merged) 10jenkins-bot: TempUser: Fix unchecked array access for optional key [core] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974191 (owner: 10Urbanecm)
[14:24:41] <wikibugs>	 (03Merged) 10jenkins-bot: IP Masking: Add expireTemporaryAccounts.php [extensions/CentralAuth] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974144 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm)
[14:24:52] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-druid1004.eqiad.wmnet with reason: host reimage
[14:25:14] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:974143|IP Masking: Expire temporary accounts in 1 year (T344695)]], [[gerrit:974191|TempUser: Fix unchecked array access for optional key]], [[gerrit:974144|IP Masking: Add expireTemporaryAccounts.php (T344695)]]
[14:25:47] <stashbot>	 urbanecm@deploy2002: Failed to log message to wiki. Somebody should check the error logs.
[14:25:47] <stashbot>	 T344695: [IP Masking] Expire temporary accounts in 1 year - https://phabricator.wikimedia.org/T344695
[14:25:48] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T348183)', diff saved to https://phabricator.wikimedia.org/P53416 and previous config saved to /var/cache/conftool/dbconfig/20231114-142547-arnaudb.json
[14:25:49] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[14:25:54] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[14:26:02] <urbanecm>	 okay... 
[14:26:02] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[14:26:10] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1227 (T348183)', diff saved to https://phabricator.wikimedia.org/P53417 and previous config saved to /var/cache/conftool/dbconfig/20231114-142608-arnaudb.json
[14:26:29] <urbanecm>	 seems transient (and thank you, TheresNoTime, for https://bash.toolforge.org/quip/CGD9XYIBa_6PSCT9HbBu :D)
[14:26:39] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:974143|IP Masking: Expire temporary accounts in 1 year (T344695)]], [[gerrit:974191|TempUser: Fix unchecked array access for optional key]], [[gerrit:974144|IP Masking: Add expireTemporaryAccounts.php (T344695)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:26:41] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1104.eqiad.wmnet
[14:26:41] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1104.eqiad.wmnet
[14:26:44] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[14:26:54] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[14:27:33] <logmsgbot>	 !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1046.eqiad.wmnet with OS bookworm
[14:28:03] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] openstack: update to use multiroot CA [puppet] - 10https://gerrit.wikimedia.org/r/973840 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[14:28:32] <fabfur>	 !log swapped cp1104 <-> cp1079 (T349244)
[14:28:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:28:36] <stashbot>	 T349244: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244
[14:28:55] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:28:56] <wikibugs>	 (03CR) 10JMeybohm: "@Bking: This is what caused the diff you saw yesterday. Would you be so kind to rebase, merge and deploy?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/972725 (owner: 10Bking)
[14:29:00] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
[14:29:03] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "should be ok to do any time, i ended up backporting the relevant code" [puppet] - 10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm)
[14:29:26] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:29:34] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] kube-state-metrics: enable in codfw + staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/974171 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková)
[14:30:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - bking@cumin2002"
[14:30:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:30:06] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts search-loader2001.codfw.wmnet,search-loader1001.eqiad.wmnet
[14:30:22] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T348183)', diff saved to https://phabricator.wikimedia.org/P53418 and previous config saved to /var/cache/conftool/dbconfig/20231114-143021-arnaudb.json
[14:30:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] toolforge: update to use trsuted ca path [puppet] - 10https://gerrit.wikimedia.org/r/973841 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[14:30:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::kubeadm: migrate to trusted ca path [puppet] - 10https://gerrit.wikimedia.org/r/973842 (https://phabricator.wikimedia.org/T340741) (owner: 10Jbond)
[14:31:18] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.remove-downtime for cp1105.eqiad.wmnet
[14:31:18] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for cp1105.eqiad.wmnet
[14:31:41] <wikibugs>	 (03CR) 10Volans: "Makes sense to me but I'll leave it to the experts ;)" [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[14:32:17] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:974143|IP Masking: Expire temporary accounts in 1 year (T344695)]], [[gerrit:974191|TempUser: Fix unchecked array access for optional key]], [[gerrit:974144|IP Masking: Add expireTemporaryAccounts.php (T344695)]] (duration: 07m 03s)
[14:32:21] <stashbot>	 T344695: [IP Masking] Expire temporary accounts in 1 year - https://phabricator.wikimedia.org/T344695
[14:32:29] <fabfur>	 !log swapped cp1105 <-> cp1080 (T349244)
[14:32:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:14] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] rest-gateway: increase resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/972746 (owner: 10Hnowlan)
[14:33:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur)
[14:33:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 45%: Host failed to be depooled properly', diff saved to https://phabricator.wikimedia.org/P53420 and previous config saved to /var/cache/conftool/dbconfig/20231114-143335-arnaudb.json
[14:34:03] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: increase resource limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/972746 (owner: 10Hnowlan)
[14:34:54] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] kube-state-metrics: enable in codfw + staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/974171 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková)
[14:37:38] <wikibugs>	 (03Merged) 10jenkins-bot: kube-state-metrics: enable in codfw + staging-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/974171 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková)
[14:38:29] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[14:38:38] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[14:38:55] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:43] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] "Ack, thanks John" [puppet] - 10https://gerrit.wikimedia.org/r/973739 (https://phabricator.wikimedia.org/T331512) (owner: 10Filippo Giunchedi)
[14:41:45] <wikibugs>	 (03PS68) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059)
[14:42:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: 10Brouberol)
[14:42:38] <logmsgbot>	 !log brouberol@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-druid1004.eqiad.wmnet with OS bullseye
[14:42:46] <wikibugs>	 (03PS1) 10Daimona Eaytoy: prod: Enable $wgCampaignEventsEnableParticipantQuestions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974200 (https://phabricator.wikimedia.org/T347607)
[14:43:24] <wikibugs>	 (03PS69) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059)
[14:44:35] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[14:44:48] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[14:44:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: NTP: alert on ntp/time errors (1 comment) [alerts] - 10https://gerrit.wikimedia.org/r/973306 (owner: 10Slyngshede)
[14:45:18] <wikibugs>	 (03PS1) 10Eevans: install_server: configure aqs1011 for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/974201 (https://phabricator.wikimedia.org/T347738)
[14:45:27] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[14:45:28] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P53421 and previous config saved to /var/cache/conftool/dbconfig/20231114-144528-arnaudb.json
[14:45:39] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[14:46:19] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Logging, 10Patch-For-Review, 10Puppet (Puppet 7.0): Switch rsyslog to use the new PKI infrastructure - https://phabricator.wikimedia.org/T347565 (10jbond)
[14:46:47] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[14:46:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Next steps for Puppet 7 - https://phabricator.wikimedia.org/T330490 (10jbond)
[14:46:58] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[14:47:34] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): volatile: We need to configure the volatile endpoint on puppetserveres - https://phabricator.wikimedia.org/T341056 (10jbond) 05In progress→03Resolved a:03jbond volatile is now synced to all pupp...
[14:48:40] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 60%: Host failed to be depooled properly', diff saved to https://phabricator.wikimedia.org/P53423 and previous config saved to /var/cache/conftool/dbconfig/20231114-144840-arnaudb.json
[14:48:55] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:49:04] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[14:49:26] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:49:29] <wikibugs>	 (03CR) 10Herron: "this is great!  please see a few comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey)
[14:49:32] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] "LGTM. We should be careful rolling this out even if it should be atomic as the nameservers are on bird as well. If you want someone to rol" [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah)
[14:50:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: prometheus-puppet-agent-stats: this timer sometime fails (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971946 (owner: 10Jbond)
[14:50:11] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::codfw1dev::backups
[14:50:11] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[14:51:32] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] "The privacy team has given us the go-ahead for this change: https://phabricator.wikimedia.org/T349910#9325309" [puppet] - 10https://gerrit.wikimedia.org/r/969341 (https://phabricator.wikimedia.org/T349910) (owner: 10Btullis)
[14:51:45] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[14:52:13] <wikibugs>	 (03PS1) 10Jbond: wmcs::openstack::codfw1dev::backups: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974202 (https://phabricator.wikimedia.org/T349619)
[14:52:24] <wikibugs>	 (03CR) 10Elukey: [V: 03+1] profile::pyrra::filesystem: add Lift Wing pilot (3 comments) [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey)
[14:52:28] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[14:52:41] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[14:53:02] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::openstack::codfw1dev::backups: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/974202 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond)
[14:53:09] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[14:53:19] <wikibugs>	 (03CR) 10Herron: [C: 03+1] profile::thanos: add new istio recording rule [puppet] - 10https://gerrit.wikimedia.org/r/974148 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey)
[14:53:55] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:55:17] <wikibugs>	 (03CR) 10Herron: profile::pyrra::filesystem: add Lift Wing pilot (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey)
[14:55:24] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] Switch phab-test1001 to insetup::buster [puppet] - 10https://gerrit.wikimedia.org/r/974184 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[14:56:06] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] "It's possible we can decommission this host, but let's merge this for now and we'll work on clarifying what to do with it." [puppet] - 10https://gerrit.wikimedia.org/r/974184 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[14:56:16] <wikibugs>	 (03CR) 10Herron: profile::pyrra::filesystem: add Lift Wing pilot (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey)
[14:57:18] <wikibugs>	 (03PS1) 10Elukey: services: remove num_workers from cp-jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/974204
[14:57:59] <wikibugs>	 (03PS70) 10Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone [puppet] - 10https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059)
[14:58:03] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: wmcs::openstack::codfw1dev::backups
[15:00:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P53425 and previous config saved to /var/cache/conftool/dbconfig/20231114-150034-arnaudb.json
[15:02:46] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::thanos: add new istio recording rule [puppet] - 10https://gerrit.wikimedia.org/r/974148 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey)
[15:03:40] <wikibugs>	 (03CR) 10Majavah: [V: 03+1 C: 03+2] P:bird::anycast: migrate to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973782 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah)
[15:03:45] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 75%: Host failed to be depooled properly', diff saved to https://phabricator.wikimedia.org/P53426 and previous config saved to /var/cache/conftool/dbconfig/20231114-150345-arnaudb.json
[15:05:34] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:10:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: mariadb::analytics_replica
[15:10:51] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] services: remove num_workers from cp-jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/974204 (owner: 10Elukey)
[15:13:30] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mobileapps: move traffic to mw on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/973182 (https://phabricator.wikimedia.org/T350846)
[15:13:39] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch mariadb::analytics_replica to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974206 (https://phabricator.wikimedia.org/T349619)
[15:13:49] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mw-api-int: double the number of replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/973183 (https://phabricator.wikimedia.org/T350846)
[15:15:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T348183)', diff saved to https://phabricator.wikimedia.org/P53427 and previous config saved to /var/cache/conftool/dbconfig/20231114-151541-arnaudb.json
[15:15:43] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[15:15:46] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[15:15:56] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch mariadb::analytics_replica to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974206 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[15:15:57] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[15:16:03] <wikibugs>	 (03PS2) 10Muehlenhoff: Switch mariadb::analytics_replica to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974206 (https://phabricator.wikimedia.org/T349619)
[15:16:03] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1236 (T348183)', diff saved to https://phabricator.wikimedia.org/P53428 and previous config saved to /var/cache/conftool/dbconfig/20231114-151602-arnaudb.json
[15:16:24] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudrabbit1003']
[15:16:47] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudrabbit1003']
[15:16:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudrabbit1003']
[15:17:03] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudrabbit1003']
[15:17:10] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudrabbit1003']
[15:17:45] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudrabbit1003']
[15:18:26] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1044']
[15:18:50] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 90%: Host failed to be depooled properly', diff saved to https://phabricator.wikimedia.org/P53430 and previous config saved to /var/cache/conftool/dbconfig/20231114-151850-arnaudb.json
[15:19:57] <wikibugs>	 (03PS1) 10Hnowlan: trafficserver: restore traffic to page-analytics [puppet] - 10https://gerrit.wikimedia.org/r/974207 (https://phabricator.wikimedia.org/T350708)
[15:20:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch phab-test1001 to insetup::buster (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974184 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[15:20:24] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043']
[15:21:24] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.presto.roll-restart-workers (exit_code=0) for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[15:22:50] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1046']
[15:22:56] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
[15:23:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/974201 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans)
[15:23:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: mariadb::analytics_replica
[15:23:32] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
[15:25:56] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1044']
[15:26:13] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] mobileapps: move traffic to mw on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/973182 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto)
[15:26:19] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1044']
[15:27:01] <wikibugs>	 (03Merged) 10jenkins-bot: mobileapps: move traffic to mw on k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/973182 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto)
[15:28:02] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1043']
[15:28:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043']
[15:29:12] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[15:29:32] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[15:29:44] <icinga-wm>	 RECOVERY - Check systemd state on puppetserver1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:30:17] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm)
[15:30:23] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[15:30:50] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-4].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm)
[15:32:07] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[15:32:18] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[15:33:44] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P53431 and previous config saved to /var/cache/conftool/dbconfig/20231114-153344-arnaudb.json
[15:33:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db1127 (re)pooling @ 100%: Host failed to be depooled properly', diff saved to https://phabricator.wikimedia.org/P53432 and previous config saved to /var/cache/conftool/dbconfig/20231114-153355-arnaudb.json
[15:33:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host vrts1002.eqiad.wmnet
[15:34:44] <wikibugs>	 (03PS3) 10Elukey: profile::pyrra::filesystem: add Lift Wing pilot [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995)
[15:34:54] <icinga-wm>	 PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:34:55] <wikibugs>	 (03CR) 10Elukey: profile::pyrra::filesystem: add Lift Wing pilot (2 comments) [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey)
[15:35:26] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1044']
[15:35:46] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1044']
[15:36:48] <wikibugs>	 (03CR) 10Bking: [C: 03+2] search-loader: remove references to search-loader[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/973880 (https://phabricator.wikimedia.org/T351123) (owner: 10Bking)
[15:37:19] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1043']
[15:37:28] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043']
[15:38:19] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1044']
[15:38:23] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1043']
[15:38:35] <wikibugs>	 (03PS3) 10Hnowlan: api-gateway, rest-gateway: drop envoy-future, use latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973776 (https://phabricator.wikimedia.org/T324130)
[15:39:08] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1044']
[15:39:14] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1046']
[15:39:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043']
[15:39:23] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1044']
[15:39:24] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1043']
[15:39:47] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1046']
[15:39:59] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1044']
[15:40:19] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043']
[15:40:27] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1043']
[15:40:29] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1046']
[15:40:31] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1044']
[15:40:37] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "LGTM thanks for piloting this! 🚀" [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey)
[15:41:19] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] api-gateway, rest-gateway: drop envoy-future, use latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973776 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan)
[15:41:26] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Jhancock.wm)
[15:42:11] <wikibugs>	 (03Merged) 10jenkins-bot: api-gateway, rest-gateway: drop envoy-future, use latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/973776 (https://phabricator.wikimedia.org/T324130) (owner: 10Hnowlan)
[15:42:30] <icinga-wm>	 PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE, AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:42:42] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch vrts1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974210 (https://phabricator.wikimedia.org/T349619)
[15:43:44] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM requested for search-loader - https://phabricator.wikimedia.org/T346273 (10bking) 05Open→03Resolved a:03bking This is done...closing out ticket.
[15:44:28] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch vrts1002 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974210 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[15:44:39] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10Jhancock.wm)
[15:46:49] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[15:47:05] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[15:48:28] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1044.eqiad.wmnet with OS bookworm
[15:48:29] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bookworm
[15:48:51] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P53433 and previous config saved to /var/cache/conftool/dbconfig/20231114-154850-arnaudb.json
[15:49:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host vrts1002.eqiad.wmnet
[15:49:39] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] install_server: configure aqs1011 for partition reuse [puppet] - 10https://gerrit.wikimedia.org/r/974201 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans)
[15:50:14] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[15:50:33] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[15:51:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Apply Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/974211 (https://phabricator.wikimedia.org/T346039)
[15:51:09] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] profile::pyrra::filesystem: add Lift Wing pilot [puppet] - 10https://gerrit.wikimedia.org/r/974149 (https://phabricator.wikimedia.org/T302995) (owner: 10Elukey)
[15:53:07] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[15:53:23] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[15:53:55] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:59:53] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1044.eqiad.wmnet with OS bookworm
[15:59:53] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1043.eqiad.wmnet with OS bookworm
[16:00:05] <jouncebot>	 eoghan, jelto, and arnoldokoth: gettimeofday() says it's time for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1600)
[16:00:31] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1044.eqiad.wmnet with OS bookworm
[16:00:36] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bookworm
[16:01:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: insetup::serviceops_collab
[16:02:38] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch insetup::serviceops_collab to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974213 (https://phabricator.wikimedia.org/T349619)
[16:03:57] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T348183)', diff saved to https://phabricator.wikimedia.org/P53434 and previous config saved to /var/cache/conftool/dbconfig/20231114-160356-arnaudb.json
[16:03:59] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[16:04:02] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[16:04:12] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[16:04:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch insetup::serviceops_collab to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974213 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[16:06:19] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[16:06:32] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[16:07:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:08:55] <logmsgbot>	 !log brennen@deploy2002 Started deploy [phabricator/deployment@0b76984]: test deploy to phab2002 for T350876
[16:08:58] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[16:08:59] <stashbot>	 T350876: Deploy Phabricator/Phorge 2023-11-14 - https://phabricator.wikimedia.org/T350876
[16:09:11] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[16:09:25] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-4].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm)
[16:09:27] <logmsgbot>	 !log brennen@deploy2002 Finished deploy [phabricator/deployment@0b76984]: test deploy to phab2002 for T350876 (duration: 00m 32s)
[16:09:53] <wikibugs>	 (03PS1) 10Elukey: profile::pyrra::filesystem: remove grouping for lift wing [puppet] - 10https://gerrit.wikimedia.org/r/974214
[16:09:56] <logmsgbot>	 !log brennen@deploy2002 Started deploy [phabricator/deployment@0b76984]: deploy to phab1004 for T350876
[16:09:56] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-4].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm)
[16:10:07] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[16:11:00] <logmsgbot>	 !log brennen@deploy2002 Finished deploy [phabricator/deployment@0b76984]: deploy to phab1004 for T350876 (duration: 01m 04s)
[16:11:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: insetup::serviceops_collab
[16:11:37] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[16:11:51] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[16:11:57] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2108 (T348183)', diff saved to https://phabricator.wikimedia.org/P53435 and previous config saved to /var/cache/conftool/dbconfig/20231114-161157-arnaudb.json
[16:12:30] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[16:12:56] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install restbase20[28-35] - https://phabricator.wikimedia.org/T349758 (10Jhancock.wm)
[16:13:02] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (10Jhancock.wm)
[16:13:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Issues which should be fixed by puppet7 upgrade - https://phabricator.wikimedia.org/T351104 (10jbond) p:05Triage→03Medium
[16:13:18] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): syslog tls clients failing to connect to centrallog2002 post puppet7 migration - https://phabricator.wikimedia.org/T351181 (10jbond) p:05Triage→03High
[16:14:07] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bookworm
[16:14:23] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage
[16:15:41] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Platform-SRE: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (10Jhancock.wm)
[16:16:17] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T348183)', diff saved to https://phabricator.wikimedia.org/P53436 and previous config saved to /var/cache/conftool/dbconfig/20231114-161617-arnaudb.json
[16:16:34] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10SRE Observability, 10observability: Q1:rack/setup/install logging-hd200[1-3] - https://phabricator.wikimedia.org/T349834 (10Jhancock.wm)
[16:17:13] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Infrastructure-Foundations: Q2:rack/setup/install ganeti203[34] - https://phabricator.wikimedia.org/T349926 (10Jhancock.wm)
[16:17:23] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1044.eqiad.wmnet with reason: host reimage
[16:19:09] <wikibugs>	 (03PS1) 10MVernon: swift: migrate one node to envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616)
[16:21:54] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[16:25:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me (Not sure if you're aware but I added a script to the puppetdb hosts to check whether a server is compatible with nftable" [puppet] - 10https://gerrit.wikimedia.org/r/973785 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah)
[16:25:20] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Maps, 10Puppet-Infrastructure, and 2 others: Postgres puppet modules use MD5 for users by default - https://phabricator.wikimedia.org/T300048 (10jbond) 05Open→03Resolved a:03jbond going to close this as i think it's resolved but please reopen if not
[16:26:27] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.reboot-single for host vrts1002.eqiad.wmnet
[16:26:38] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] hieradata: migrate codfw cloudlb to nftables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/973785 (https://phabricator.wikimedia.org/T351087) (owner: 10Majavah)
[16:28:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond)
[16:29:49] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond)
[16:30:00] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349873 (10Jhancock.wm) We received the servers and need racking details please. @Clement_Goubert  or @Joe Thank you!
[16:30:21] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1002.eqiad.wmnet
[16:30:41] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm)
[16:31:24] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P53437 and previous config saved to /var/cache/conftool/dbconfig/20231114-163123-arnaudb.json
[16:34:45] <logmsgbot>	 !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@017fbf1]: search: clean wcqs revision map
[16:35:14] <logmsgbot>	 !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@017fbf1]: search: clean wcqs revision map (duration: 00m 29s)
[16:35:37] <dcausse>	 thanks! ^
[16:37:18] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bookworm
[16:42:00] <wikibugs>	 10SRE, 10DBA, 10Infrastructure-Foundations, 10netops, and 2 others: librenms.syslog table size - https://phabricator.wikimedia.org/T349362 (10Ladsgroup) FWIW, the rows are almost all like this: ` +-----------+----------+----------+-------+------------+---------------------+---------+-----------------------...
[16:44:28] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1044.eqiad.wmnet with OS bookworm
[16:46:31] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P53438 and previous config saved to /var/cache/conftool/dbconfig/20231114-164630-arnaudb.json
[16:47:43] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[16:47:56] <wikibugs>	 (03PS1) 10Ladsgroup: beta: Set pagelinks migration to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974221 (https://phabricator.wikimedia.org/T351237)
[16:47:56] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[16:50:02] <logmsgbot>	 !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@0ae1184]: make cirrus index imports world readable in hdfs
[16:50:30] <logmsgbot>	 !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@0ae1184]: make cirrus index imports world readable in hdfs (duration: 00m 28s)
[16:53:40] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "seems to help!" [puppet] - 10https://gerrit.wikimedia.org/r/973847 (https://phabricator.wikimedia.org/T349695) (owner: 10FNegri)
[16:55:10] <wikibugs>	 (03Abandoned) 10Elukey: profile::pyrra::filesystem: remove grouping for lift wing [puppet] - 10https://gerrit.wikimedia.org/r/974214 (owner: 10Elukey)
[16:58:20] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] trafficserver: restore traffic to page-analytics [puppet] - 10https://gerrit.wikimedia.org/r/974207 (https://phabricator.wikimedia.org/T350708) (owner: 10Hnowlan)
[17:00:05] <jouncebot>	 jbond and rzl: Dear deployers, time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1700).
[17:00:05] <jouncebot>	 urbanecm: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:12] <urbanecm>	 here!
[17:00:29] <jbond>	 urbanecm: give me a sec
[17:00:33] <urbanecm>	 sure
[17:01:21] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] trafficserver: restore traffic to page-analytics [puppet] - 10https://gerrit.wikimedia.org/r/974207 (https://phabricator.wikimedia.org/T350708) (owner: 10Hnowlan)
[17:01:33] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] mediawiki: Run expireTemporaryAccounts.php daily [puppet] - 10https://gerrit.wikimedia.org/r/973167 (https://phabricator.wikimedia.org/T344695) (owner: 10Urbanecm)
[17:01:37] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T348183)', diff saved to https://phabricator.wikimedia.org/P53440 and previous config saved to /var/cache/conftool/dbconfig/20231114-170136-arnaudb.json
[17:01:39] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[17:01:52] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[17:01:56] <jbond>	 urbanecm: do you want me to deploy it anywhere specific so you can test?
[17:01:59] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2120 (T348183)', diff saved to https://phabricator.wikimedia.org/P53441 and previous config saved to /var/cache/conftool/dbconfig/20231114-170158-arnaudb.json
[17:01:59] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[17:02:08] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: sync
[17:02:29] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: sync
[17:02:38] <urbanecm>	 jbond: it'd run tomorrow anyway, so i don't think that's needed :). let's wait for puppet.
[17:02:54] <jbond>	 urbanecm: ack sgtm, then all done
[17:02:58] <urbanecm>	 thanks
[17:03:01] <jbond>	 np
[17:03:25] <hnowlan>	 jbond: did your merge happen to pick up my changes to profile::trafficserver? 
[17:04:00] <jbond>	 hnowlan: yes i just noticed it was your cr not mine i merged
[17:04:03] <jbond>	 sorry about that 
[17:04:09] <hnowlan>	 no worries, was just about to merge it 
[17:04:15] <wikibugs>	 (03PS1) 10Majavah: team-wmcs: Increment OpenstackAPIResponse threshold [alerts] - 10https://gerrit.wikimedia.org/r/974222
[17:04:15] <jbond>	 ok cool :)
[17:04:56] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] team-wmcs: Increment OpenstackAPIResponse threshold [alerts] - 10https://gerrit.wikimedia.org/r/974222 (owner: 10Majavah)
[17:05:09] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] team-wmcs: Increment OpenstackAPIResponse threshold [alerts] - 10https://gerrit.wikimedia.org/r/974222 (owner: 10Majavah)
[17:06:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] team-wmcs: Increment OpenstackAPIResponse threshold [alerts] - 10https://gerrit.wikimedia.org/r/974222 (owner: 10Majavah)
[17:06:21] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T348183)', diff saved to https://phabricator.wikimedia.org/P53442 and previous config saved to /var/cache/conftool/dbconfig/20231114-170621-arnaudb.json
[17:09:21] <wikibugs>	 (03PS2) 10Majavah: team-wmcs: Increment OpenstackAPIResponse threshold [alerts] - 10https://gerrit.wikimedia.org/r/974222
[17:09:35] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] team-wmcs: Increment OpenstackAPIResponse threshold [alerts] - 10https://gerrit.wikimedia.org/r/974222 (owner: 10Majavah)
[17:11:24] <wikibugs>	 (03Merged) 10jenkins-bot: team-wmcs: Increment OpenstackAPIResponse threshold [alerts] - 10https://gerrit.wikimedia.org/r/974222 (owner: 10Majavah)
[17:12:01] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1046.eqiad.wmnet with OS bookworm
[17:12:15] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bookworm
[17:14:05] <urbanecm>	 jbond: actually... i tried `run-puppet-agent` at `deployment-mwmaint02` (as beta's where we'd like to QA the job first), and i don't see the timer added there. is that beta being broken, or puppet code not done correctly?
[17:14:47] <jbond>	 urbanecm: just about to jump on a call, will check in ~30 mins if that's ok
[17:14:53] <urbanecm>	 absolutely.
[17:16:03] <wikibugs>	 (03CR) 10Krinkle: mc: Make it possible to use mcrouter server set by environment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01)
[17:16:34] <wikibugs>	 (03CR) 10Krinkle: mc: Make it possible to use mcrouter server set by environment (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/973838 (https://phabricator.wikimedia.org/T346690) (owner: 10D3r1ck01)
[17:18:09] <taavi>	 urbanecm: why is that timer running once for every wiki? are centralauth temporary accounts not global?
[17:19:37] <Amir1>	 it's complicated... At least for a transitionary period they need to be local as wikis don't want surprises 
[17:20:07] <urbanecm>	 taavi: original reason was that a temp account doesn't need to exist everywhere. but...not sure if we actually need to run it everywhere.
[17:20:19] <urbanecm>	 Amir1: i think technically, they'd be in `globaluser` no matter what?
[17:20:49] <Amir1>	 yeah, they'll be but they don't exist in every wiki
[17:20:58] <Amir1>	 even if they visit them
[17:21:15] <Amir1>	 but indeed it doesn't need to be on every wiki
[17:21:28] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P53444 and previous config saved to /var/cache/conftool/dbconfig/20231114-172127-arnaudb.json
[17:21:48] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1043.eqiad.wmnet with OS bookworm
[17:23:25] <urbanecm>	 taavi: it calls `AuthManager::revokeAccessForUser( UserIdentity $tempAcc )`, and i don't think i can construct a user identity for a temp account that doesn't exist locally. so, i think it needs to run everywhere.
[17:24:40] <taavi>	 urbanecm: ok, should expireTemporaryAccounts.php have some filters to only process accounts attached to that wiki in that case?
[17:25:15] <urbanecm>	 possibly yes
[17:27:28] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] [toolsdb] Lower innodb_buffer_pool_size [puppet] - 10https://gerrit.wikimedia.org/r/973847 (https://phabricator.wikimedia.org/T349695) (owner: 10FNegri)
[17:27:32] <taavi>	 the job seems to have made it to beta in the meantime, you probably ran puppet before the git-sync-upstream timer had run on deployment-puppetmaster
[17:27:58] <urbanecm>	 gotcha
[17:29:21] <Amir1>	 can I deploy a patch?
[17:31:12] <urbanecm>	 no objection from me
[17:32:55] <Amir1>	 coooolio
[17:33:06] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] beta: Set pagelinks migration to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974221 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup)
[17:33:49] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Set pagelinks migration to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974221 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup)
[17:36:34] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P53445 and previous config saved to /var/cache/conftool/dbconfig/20231114-173634-arnaudb.json
[17:42:21] <icinga-wm>	 PROBLEM - Check systemd state on ganeti1019 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:43:01] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: wmcs::openstack::codfw1dev::control
[17:43:59] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10hnowlan)
[17:45:11] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[17:45:47] <wikibugs>	 (03PS1) 10Jbond: wmcs::openstack::codfw1dev::control: migrate to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974224
[17:46:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::openstack::codfw1dev::control: migrate to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/974224 (owner: 10Jbond)
[17:47:06] <wikibugs>	 10SRE, 10Phabricator maintenance bot, 10collaboration-services, 10Release-Engineering-Team (Radar): phabricator maintenance bot should not add the SRE tag to (certain) subteam tasks any more - https://phabricator.wikimedia.org/T334294 (10Aklapper)
[17:48:56] <wikibugs>	 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10hnowlan)
[17:51:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T348183)', diff saved to https://phabricator.wikimedia.org/P53446 and previous config saved to /var/cache/conftool/dbconfig/20231114-175140-arnaudb.json
[17:51:43] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[17:51:45] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[17:51:56] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[17:52:02] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2121 (T348183)', diff saved to https://phabricator.wikimedia.org/P53447 and previous config saved to /var/cache/conftool/dbconfig/20231114-175202-arnaudb.json
[17:53:59] <wikibugs>	 (03PS1) 10Jbond: Revert "wmcs::openstack::codfw1dev::control: migrate to puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/974226
[17:54:11] <logmsgbot>	 !log jbond@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-role (exit_code=99) for role: wmcs::openstack::codfw1dev::control
[17:54:20] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "wmcs::openstack::codfw1dev::control: migrate to puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/974226 (owner: 10Jbond)
[17:55:23] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[17:55:24] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[17:56:24] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T348183)', diff saved to https://phabricator.wikimedia.org/P53448 and previous config saved to /var/cache/conftool/dbconfig/20231114-175623-arnaudb.json
[17:59:19] <wikibugs>	 (03CR) 10Jbond: nftables::service: Ensure we correctly check for ipv4 and ipv6 ips (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond)
[17:59:22] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] nftables::service: Ensure we correctly check for ipv4 and ipv6 ips [puppet] - 10https://gerrit.wikimedia.org/r/974176 (https://phabricator.wikimedia.org/T351094) (owner: 10Jbond)
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1800)
[18:01:24] <fabfur>	 splunk told me that I'm not oncall anymore, going to party! Nothing to report
[18:04:19] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1047.eqiad.wmnet with OS bookworm
[18:06:23] <wikibugs>	 (03PS7) 10Majavah: hieradata: migrate all cloudlb hosts to nftables [puppet] - 10https://gerrit.wikimedia.org/r/973806 (https://phabricator.wikimedia.org/T351087)
[18:06:25] <wikibugs>	 (03PS4) 10Majavah: Add wiki replica backends to conftool [puppet] - 10https://gerrit.wikimedia.org/r/973760 (https://phabricator.wikimedia.org/T300427)
[18:06:27] <wikibugs>	 (03PS4) 10Majavah: P:wmcs: wikireplicas: allow access from cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973777 (https://phabricator.wikimedia.org/T300427)
[18:06:30] <wikibugs>	 (03PS10) 10Majavah: Add wiki replicas to cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427)
[18:09:17] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 4 DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/973761 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah)
[18:11:12] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1048.eqiad.wmnet with OS bookworm
[18:11:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P53449 and previous config saved to /var/cache/conftool/dbconfig/20231114-181130-arnaudb.json
[18:14:09] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/468/con" [puppet] - https://gerrit.wikimedia.org/r/973760 (https://phabricator.wikimedia.org/T300427) (owner: Majavah)
[18:19:54] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1047.eqiad.wmnet with reason: host reimage
[18:22:28] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1047.eqiad.wmnet with reason: host reimage
[18:23:00] <wikibugs>	 SRE-swift-storage: Swift container for archived mariadb tables - https://phabricator.wikimedia.org/T350924 (Ladsgroup) The biggest problem for that is the reorgs, a lot of teams we set to own something might not exist in a couple of years, generally I think it's better to keep at the discretion of the DBA wh...
[18:26:37] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P53450 and previous config saved to /var/cache/conftool/dbconfig/20231114-182636-arnaudb.json
[18:27:50] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1048.eqiad.wmnet with reason: host reimage
[18:32:06] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1048.eqiad.wmnet with reason: host reimage
[18:33:38] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudvirt1046.eqiad.wmnet with OS bookworm
[18:36:31] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1011.eqiad.wmnet with OS bullseye
[18:41:43] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T348183)', diff saved to https://phabricator.wikimedia.org/P53451 and previous config saved to /var/cache/conftool/dbconfig/20231114-184142-arnaudb.json
[18:41:45] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[18:41:48] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[18:41:58] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[18:42:05] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2122 (T348183)', diff saved to https://phabricator.wikimedia.org/P53452 and previous config saved to /var/cache/conftool/dbconfig/20231114-184204-arnaudb.json
[18:45:53] <icinga-wm>	 RECOVERY - Check systemd state on ganeti1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:46:37] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T348183)', diff saved to https://phabricator.wikimedia.org/P53453 and previous config saved to /var/cache/conftool/dbconfig/20231114-184637-arnaudb.json
[18:50:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 44.64% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:50:51] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1011.eqiad.wmnet with reason: host reimage
[18:50:54] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1047.eqiad.wmnet with OS bookworm
[18:53:19] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1011.eqiad.wmnet with reason: host reimage
[18:53:26] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1048.eqiad.wmnet with OS bookworm
[18:53:55] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:54:51] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:55:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-int at codfw: 45.54% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[18:56:54] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1049.eqiad.wmnet with OS bookworm
[18:57:33] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1048 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[18:58:19] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-role for role: stewards
[19:00:05] <jouncebot>	 jeena and jnuche: Time to snap out of that daydream and deploy MediaWiki train - Utc-7+Utc-0 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T1900).
[19:01:43] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P53454 and previous config saved to /var/cache/conftool/dbconfig/20231114-190143-arnaudb.json
[19:04:33] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: stewards
[19:05:49] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[19:09:31] <jeena>	 The train is rolling
[19:09:45] <jeena>	 despite lack of bot messages
[19:12:43] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1049.eqiad.wmnet with reason: host reimage
[19:14:13] <logmsgbot>	 !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.5  refs T350081
[19:14:18] <stashbot>	 T350081: 1.42.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T350081
[19:15:39] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1049.eqiad.wmnet with reason: host reimage
[19:16:19] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aqs1011.eqiad.wmnet with OS bullseye
[19:16:50] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P53455 and previous config saved to /var/cache/conftool/dbconfig/20231114-191649-arnaudb.json
[19:18:08] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1050.eqiad.wmnet with OS bookworm
[19:22:23] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on moscovium.eqiad.wmnet with reason: maintenance
[19:22:36] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on moscovium.eqiad.wmnet with reason: maintenance
[19:25:07] <logmsgbot>	 !log sfaci@deploy2002 Started deploy [analytics/refinery@2f94afe]: Regular analytics weekly train [analytics/refinery@2f94afe0]
[19:31:56] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T348183)', diff saved to https://phabricator.wikimedia.org/P53456 and previous config saved to /var/cache/conftool/dbconfig/20231114-193156-arnaudb.json
[19:31:58] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[19:32:10] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[19:32:11] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[19:32:12] <logmsgbot>	 !log sfaci@deploy2002 Finished deploy [analytics/refinery@2f94afe]: Regular analytics weekly train [analytics/refinery@2f94afe0] (duration: 07m 04s)
[19:32:18] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2150 (T348183)', diff saved to https://phabricator.wikimedia.org/P53457 and previous config saved to /var/cache/conftool/dbconfig/20231114-193217-arnaudb.json
[19:33:45] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1050.eqiad.wmnet with reason: host reimage
[19:34:50] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[19:35:48] <logmsgbot>	 !log sfaci@deploy2002 Started deploy [analytics/refinery@2f94afe] (thin): Regular analytics weekly train THIN [analytics/refinery@2f94afe0]
[19:35:54] <logmsgbot>	 !log sfaci@deploy2002 Finished deploy [analytics/refinery@2f94afe] (thin): Regular analytics weekly train THIN [analytics/refinery@2f94afe0] (duration: 00m 06s)
[19:36:03] <logmsgbot>	 !log sfaci@deploy2002 Started deploy [analytics/refinery@2f94afe] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2f94afe0]
[19:36:36] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T348183)', diff saved to https://phabricator.wikimedia.org/P53458 and previous config saved to /var/cache/conftool/dbconfig/20231114-193635-arnaudb.json
[19:36:53] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1050.eqiad.wmnet with reason: host reimage
[19:39:18] <logmsgbot>	 !log sfaci@deploy2002 Finished deploy [analytics/refinery@2f94afe] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2f94afe0] (duration: 03m 14s)
[19:40:22] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1049.eqiad.wmnet with OS bookworm
[19:41:31] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1052.eqiad.wmnet with OS bookworm
[19:51:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P53459 and previous config saved to /var/cache/conftool/dbconfig/20231114-195141-arnaudb.json
[19:52:12] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-role for role: etherpad
[19:53:55] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:56:02] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Dzahn)
[19:57:05] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Dzahn) stewards: https://gerrit.wikimedia.org/r/c/operations/puppet/+/973863  peopleweb: https://gerrit.wikimedia.org/r/c/operations/puppet/+/973855  etherp...
[19:57:08] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1052.eqiad.wmnet with reason: host reimage
[19:57:10] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:57:39] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: etherpad
[19:59:54] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1050.eqiad.wmnet with OS bookworm
[20:01:06] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-jobrunner_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:01:42] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1052.eqiad.wmnet with reason: host reimage
[20:02:11] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-host for host doc2002.codfw.wmnet
[20:03:51] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudvirt1043']
[20:04:42] <logmsgbot>	 !log robh@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['cloudvirt1043']
[20:06:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P53460 and previous config saved to /var/cache/conftool/dbconfig/20231114-200648-arnaudb.json
[20:07:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:08:22] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host doc2002.codfw.wmnet
[20:09:23] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bullseye
[20:11:06] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1053.eqiad.wmnet with OS bookworm
[20:17:05] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.migrate-role for role: doc
[20:19:26] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner2002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:21:01] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[20:21:11] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:21:38] <icinga-wm>	 PROBLEM - Check systemd state on an-presto1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:21:38] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: doc
[20:21:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T348183)', diff saved to https://phabricator.wikimedia.org/P53461 and previous config saved to /var/cache/conftool/dbconfig/20231114-202154-arnaudb.json
[20:21:58] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[20:22:00] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[20:22:11] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[20:22:12] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:22:13] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[20:22:26] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[20:22:32] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2159 (T348183)', diff saved to https://phabricator.wikimedia.org/P53462 and previous config saved to /var/cache/conftool/dbconfig/20231114-202232-arnaudb.json
[20:24:16] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on people1004.eqiad.wmnet with reason: maintenance
[20:24:30] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on people1004.eqiad.wmnet with reason: maintenance
[20:24:48] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1052.eqiad.wmnet with OS bookworm
[20:25:29] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on people2003.codfw.wmnet with reason: maintenance
[20:25:40] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage
[20:25:42] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on people2003.codfw.wmnet with reason: maintenance
[20:25:56] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1054.eqiad.wmnet with OS bookworm
[20:26:51] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T348183)', diff saved to https://phabricator.wikimedia.org/P53463 and previous config saved to /var/cache/conftool/dbconfig/20231114-202650-arnaudb.json
[20:27:06] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1004 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:28:15] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1053.eqiad.wmnet with reason: host reimage
[20:29:46] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on doc2002.codfw.wmnet with reason: maintenance
[20:30:00] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on doc2002.codfw.wmnet with reason: maintenance
[20:30:27] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1055.eqiad.wmnet with OS bookworm
[20:31:41] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 0:10:00 on doc1003.eqiad.wmnet with reason: maintenance
[20:31:54] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on doc1003.eqiad.wmnet with reason: maintenance
[20:32:03] <logmsgbot>	 !log robh@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudvirt1043.eqiad.wmnet with OS bullseye
[20:33:05] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Dzahn)
[20:33:32] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bullseye
[20:39:33] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage
[20:40:14] <icinga-wm>	 PROBLEM - Check systemd state on doc2002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-host-data-sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:41:22] <icinga-wm>	 RECOVERY - Check systemd state on doc2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:41:33] <mutante>	 !log doc2002 - systemctl start rsync-doc-host-data-sync - failed unit after maintenance reboot
[20:41:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:41:57] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P53464 and previous config saved to /var/cache/conftool/dbconfig/20231114-204156-arnaudb.json
[20:42:09] <mutante>	 !log destroying phab-test1001.eqiad.wmnet - T351115
[20:42:09] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1054.eqiad.wmnet with reason: host reimage
[20:42:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:42:13] <stashbot>	 T351115: decom phab-test1001 - https://phabricator.wikimedia.org/T351115
[20:43:16] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission for hosts phab-test1001.eqiad.wmnet
[20:44:08] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage
[20:44:51] <wikibugs>	 (PS1) Dzahn: site/hiera: remove decom'ed phab-test1001 [puppet] - https://gerrit.wikimedia.org/r/974266 (https://phabricator.wikimedia.org/T351115)
[20:46:48] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage
[20:47:05] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1055.eqiad.wmnet with reason: host reimage
[20:47:30] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.dns.netbox
[20:47:34] <wikibugs>	 (CR) Brouberol: Generate the netboot.cfg file to avoid typos impacting everyone (2 comments) [puppet] - https://gerrit.wikimedia.org/r/973308 (https://phabricator.wikimedia.org/T351059) (owner: Brouberol)
[20:49:44] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage
[20:49:50] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: phab-test1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1001"
[20:51:02] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: phab-test1001.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1001"
[20:51:02] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:51:03] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts phab-test1001.eqiad.wmnet
[20:51:36] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:52:10] <wikibugs>	 (CR) Dzahn: [C: +2] site/hiera: remove decom'ed phab-test1001 [puppet] - https://gerrit.wikimedia.org/r/974266 (https://phabricator.wikimedia.org/T351115) (owner: Dzahn)
[20:54:57] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1053.eqiad.wmnet with OS bookworm
[20:55:40] <wikibugs>	 (CR) Dzahn: "thanks for merging this after manager approval :)" [puppet] - https://gerrit.wikimedia.org/r/972911 (https://phabricator.wikimedia.org/T350834) (owner: Dzahn)
[20:55:54] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to WMF LDAP group and deployment and analytics-privatedata-users shell access group for Grace (ecarg) - https://phabricator.wikimedia.org/T350918 (ATsay-WMF) I approve this as Grace's manager. Thanks!
[20:56:48] <wikibugs>	 SRE, Stewards-Onboarding-Tool, Wikimedia-Mailing-lists: stewards1001 / stewards2001: Enable API access for Mailman3 - https://phabricator.wikimedia.org/T351202 (Peachey88) {T279023}
[20:57:03] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P53465 and previous config saved to /var/cache/conftool/dbconfig/20231114-205703-arnaudb.json
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231114T2100)
[21:00:04] <jouncebot>	 Kizule, danisztls, ebernhardson, and jdrewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:40] <ebernhardson>	 \o
[21:00:42] <jan_drewniak>	 o/
[21:00:52] <danisztls>	 o/
[21:01:27] <wikibugs>	 (PS2) Jdrewniak: [Vector] enable Zebra CSS module on test wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/974264 (https://phabricator.wikimedia.org/T347711)
[21:03:55] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1054.eqiad.wmnet with OS bookworm
[21:07:35] <ebernhardson>	 so, who's running the deploy window?
[21:09:20] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1056.eqiad.wmnet with OS bookworm
[21:09:26] <wikibugs>	 (CR) Muehlenhoff: [C: +1] install_server: configure reuse for all aqs hosts [puppet] - https://gerrit.wikimedia.org/r/974259 (https://phabricator.wikimedia.org/T347738) (owner: Eevans)
[21:09:29] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1055.eqiad.wmnet with OS bookworm
[21:11:05] <urbanecm>	 ebernhardson: i prefer not to, but since there's no one else, let's do that
[21:11:36] <ebernhardson>	 urbanecm: I appreciate it, thanks
[21:11:39] <wikibugs>	 (CR) Urbanecm: [C: +2] [Zebra] Remove underline from pages with blank title [skins/Vector] (wmf/1.42.0-wmf.4) - https://gerrit.wikimedia.org/r/974227 (https://phabricator.wikimedia.org/T351119) (owner: Jdrewniak)
[21:12:01] <urbanecm>	 jan_drewniak: can you advise whether deploying https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/974264/ before the backport would be a good or bad idea?
[21:12:07] <urbanecm>	 (i assume bad, since they seem to be touching the same area)
[21:12:10] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T348183)', diff saved to https://phabricator.wikimedia.org/P53466 and previous config saved to /var/cache/conftool/dbconfig/20231114-211209-arnaudb.json
[21:12:10] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1043.eqiad.wmnet with OS bullseye
[21:12:12] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[21:12:16] <wikibugs>	 (CR) Eevans: [C: +2] install_server: configure reuse for all aqs hosts [puppet] - https://gerrit.wikimedia.org/r/974259 (https://phabricator.wikimedia.org/T347738) (owner: Eevans)
[21:12:21] <wikibugs>	 (PS2) Urbanecm: Deploy Reader Demographics 2 survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/974254 (https://phabricator.wikimedia.org/T344393) (owner: DDesouza)
[21:12:25] <wikibugs>	 (CR) Urbanecm: [C: +2] Deploy Reader Demographics 2 survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/974254 (https://phabricator.wikimedia.org/T344393) (owner: DDesouza)
[21:12:25] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[21:12:27] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[21:12:31] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53467 and previous config saved to /var/cache/conftool/dbconfig/20231114-211231-arnaudb.json
[21:12:52] <jan_drewniak>	 urbanecm: It'd be better if the vector patch goes first, then the config
[21:12:59] <urbanecm>	 okay, noted.
[21:13:09] <wikibugs>	 (Merged) jenkins-bot: Deploy Reader Demographics 2 survey on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/974254 (https://phabricator.wikimedia.org/T344393) (owner: DDesouza)
[21:13:16] <wikibugs>	 (PS2) Urbanecm: throttle.php: Cleanup old rules, add new one [mediawiki-config] - https://gerrit.wikimedia.org/r/973369 (https://phabricator.wikimedia.org/T351002) (owner: Zoranzoki21)
[21:13:19] <wikibugs>	 (CR) Urbanecm: [C: +2] throttle.php: Cleanup old rules, add new one [mediawiki-config] - https://gerrit.wikimedia.org/r/973369 (https://phabricator.wikimedia.org/T351002) (owner: Zoranzoki21)
[21:13:51] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/973369 (https://phabricator.wikimedia.org/T351002) (owner: Zoranzoki21)
[21:14:04] <wikibugs>	 (Merged) jenkins-bot: throttle.php: Cleanup old rules, add new one [mediawiki-config] - https://gerrit.wikimedia.org/r/973369 (https://phabricator.wikimedia.org/T351002) (owner: Zoranzoki21)
[21:14:24] <wikibugs>	 (CR) Urbanecm: [C: +2] PageRerenderSerializer: Match stream name with conventions [extensions/CirrusSearch] (wmf/1.42.0-wmf.5) - https://gerrit.wikimedia.org/r/974228 (owner: Ebernhardson)
[21:14:29] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:974254|Deploy Reader Demographics 2 survey on enwiki (T344393)]], [[gerrit:973369|throttle.php: Cleanup old rules, add new one (T351002)]]
[21:14:45] <stashbot>	 T344393: Quicksurvey deployment for readers survey  - https://phabricator.wikimedia.org/T344393
[21:14:45] <stashbot>	 T351002: Lift IP cap on 2023-11-23 for Editathon Czechoslovakia - cs.wikipedia - https://phabricator.wikimedia.org/T351002
[21:15:34] <danisztls>	 urbanecm: regarding mine, there's nothing to test as it just increases coverage
[21:15:47] <urbanecm>	  ack
[21:15:50] <logmsgbot>	 !log urbanecm@deploy2002 dani and urbanecm and zoranzoki21: Backport for [[gerrit:974254|Deploy Reader Demographics 2 survey on enwiki (T344393)]], [[gerrit:973369|throttle.php: Cleanup old rules, add new one (T351002)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:15:52] <logmsgbot>	 !log urbanecm@deploy2002 dani and urbanecm and zoranzoki21: Continuing with sync
[21:15:55] <urbanecm>	 proceeding then
[21:16:20] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1043.eqiad.wmnet with OS bookworm
[21:17:01] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53468 and previous config saved to /var/cache/conftool/dbconfig/20231114-211700-arnaudb.json
[21:21:18] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:974254|Deploy Reader Demographics 2 survey on enwiki (T344393)]], [[gerrit:973369|throttle.php: Cleanup old rules, add new one (T351002)]] (duration: 06m 49s)
[21:21:24] <stashbot>	 T344393: Quicksurvey deployment for readers survey  - https://phabricator.wikimedia.org/T344393
[21:21:25] <stashbot>	 T351002: Lift IP cap on 2023-11-23 for Editathon Czechoslovakia - cs.wikipedia - https://phabricator.wikimedia.org/T351002
[21:21:35] <urbanecm>	 danisztls: should be live
[21:23:16] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage
[21:23:23] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1046.eqiad.wmnet with OS bookworm
[21:25:52] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1056.eqiad.wmnet with reason: host reimage
[21:26:48] <wikibugs>	 (03CR) 10Dzahn: "This VM has now been deleted" [puppet] - 10https://gerrit.wikimedia.org/r/974184 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[21:28:33] <wikibugs>	 (03Merged) 10jenkins-bot: [Zebra] Remove underline from pages with blank title [skins/Vector] (wmf/1.42.0-wmf.4) - 10https://gerrit.wikimedia.org/r/974227 (https://phabricator.wikimedia.org/T351119) (owner: 10Jdrewniak)
[21:29:01] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974169 (https://phabricator.wikimedia.org/T308142) (owner: 10Sergio Gimeno)
[21:29:47] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:974227|[Zebra] Remove underline from pages with blank title (T351119)]]
[21:29:52] <stashbot>	 T351119: Zebra - Pages with blank titles shouldn't have underlines  - https://phabricator.wikimedia.org/T351119
[21:30:37] <logmsgbot>	 !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage
[21:31:11] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm and jdrewniak: Backport for [[gerrit:974227|[Zebra] Remove underline from pages with blank title (T351119)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:31:29] <urbanecm>	 jan_drewniak: can you test the backport please?
[21:32:07] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P53469 and previous config saved to /var/cache/conftool/dbconfig/20231114-213207-arnaudb.json
[21:32:08] <jan_drewniak>	 Wybór Łysek i na sposób i bólu niż sekrecie sposób wiele innego liczba Chonan
[21:32:23] <wikibugs>	 (03Merged) 10jenkins-bot: PageRerenderSerializer: Match stream name with conventions [extensions/CirrusSearch] (wmf/1.42.0-wmf.5) - 10https://gerrit.wikimedia.org/r/974228 (owner: 10Ebernhardson)
[21:32:28] <jan_drewniak>	 Wow autocorrect dictation in Polish...
[21:32:43] <urbanecm>	 :D
[21:33:11] <urbanecm>	 Let me tell you that the bride has no secrets for the bride as soon as possible Chonan, according to translator
[21:33:32] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1043.eqiad.wmnet with reason: host reimage
[21:34:08] <jan_drewniak>	 urbanecm: in other words, patch looks good to sync :P 
[21:34:19] * urbanecm adds that to my dictionary
[21:34:21] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm and jdrewniak: Continuing with sync
[21:35:07] <ebernhardson>	 urbanecm: mine isn't testable, it changes a string which is only used in jobs related to page updates
[21:35:13] <urbanecm>	 ack
[21:35:18] <urbanecm>	 will deploy once it merges
[21:35:28] <urbanecm>	 oh, it merged
[21:35:43] <wikibugs>	 (03PS3) 10Urbanecm: [Vector] enable Zebra CSS module on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974264 (https://phabricator.wikimedia.org/T347711) (owner: 10Jdrewniak)
[21:35:50] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] [Vector] enable Zebra CSS module on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974264 (https://phabricator.wikimedia.org/T347711) (owner: 10Jdrewniak)
[21:36:39] <wikibugs>	 (03Merged) 10jenkins-bot: [Vector] enable Zebra CSS module on test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/974264 (https://phabricator.wikimedia.org/T347711) (owner: 10Jdrewniak)
[21:39:47] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:974227|[Zebra] Remove underline from pages with blank title (T351119)]] (duration: 09m 59s)
[21:39:52] <stashbot>	 T351119: Zebra - Pages with blank titles shouldn't have underlines  - https://phabricator.wikimedia.org/T351119
[21:40:42] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:974264|[Vector] enable Zebra CSS module on test wikis (T347711)]], [[gerrit:974228|PageRerenderSerializer: Match stream name with conventions]]
[21:40:46] <stashbot>	 T347711: [Zebra] Enable refactored Zebra on certain wikis for testing purposes - https://phabricator.wikimedia.org/T347711
[21:42:17] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm and jdrewniak and ebernhardson: Backport for [[gerrit:974264|[Vector] enable Zebra CSS module on test wikis (T347711)]], [[gerrit:974228|PageRerenderSerializer: Match stream name with conventions]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:42:25] <urbanecm>	 jan_drewniak: can you test please? :)
[21:42:45] <jan_drewniak>	 urbanecm: yup, I see it, looks good to sync :) 
[21:42:51] <urbanecm>	 good, syncing
[21:42:53] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm and jdrewniak and ebernhardson: Continuing with sync
[21:46:15] <wikibugs>	 (03PS1) 10Fabfur: haproxy: re-set varnish maxconn on all cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/974268 (https://phabricator.wikimedia.org/T310609)
[21:47:14] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P53470 and previous config saved to /var/cache/conftool/dbconfig/20231114-214713-arnaudb.json
[21:48:18] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:974264|[Vector] enable Zebra CSS module on test wikis (T347711)]], [[gerrit:974228|PageRerenderSerializer: Match stream name with conventions]] (duration: 07m 36s)
[21:48:23] <stashbot>	 T347711: [Zebra] Enable refactored Zebra on certain wikis for testing purposes - https://phabricator.wikimedia.org/T347711
[21:49:10] <urbanecm>	 should be all done! :)
[21:49:41] <ebernhardson>	 urbanecm: thanks again
[21:49:53] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 10 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/974268 (https://phabricator.wikimedia.org/T310609) (owner: 10Fabfur)
[21:49:55] <urbanecm>	 np
[21:52:21] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1056.eqiad.wmnet with OS bookworm
[22:00:23] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (10MoritzMuehlenhoff) @Dwisehaupt I think we have all data now except the hostname, see my earlier comment. crm1001 or something else?
[22:00:27] <logmsgbot>	 !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1043.eqiad.wmnet with OS bookworm
[22:02:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53471 and previous config saved to /var/cache/conftool/dbconfig/20231114-220220-arnaudb.json
[22:02:22] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[22:02:26] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[22:02:36] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[22:02:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53472 and previous config saved to /var/cache/conftool/dbconfig/20231114-220241-arnaudb.json
[22:05:38] <logmsgbot>	 !log andrew@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1046.eqiad.wmnet with OS bookworm
[22:07:18] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53473 and previous config saved to /var/cache/conftool/dbconfig/20231114-220717-arnaudb.json
[22:07:21] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[22:19:30] <logmsgbot>	 !log eevans@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host aqs1012.eqiad.wmnet with OS bullseye
[22:22:24] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P53474 and previous config saved to /var/cache/conftool/dbconfig/20231114-222224-arnaudb.json
[22:23:24] <wikibugs>	 (03PS1) 10Eevans: install_server: actually use the aqs reuse config (breakfix) [puppet] - 10https://gerrit.wikimedia.org/r/974274 (https://phabricator.wikimedia.org/T347738)
[22:24:07] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] install_server: actually use the aqs reuse config (breakfix) [puppet] - 10https://gerrit.wikimedia.org/r/974274 (https://phabricator.wikimedia.org/T347738) (owner: 10Eevans)
[22:30:48] <wikibugs>	 (03PS7) 10Krinkle: Enable $wgStatsTarget for new Stats lib for requests to kube-mw-debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite)
[22:30:52] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Enable $wgStatsTarget for new Stats lib for requests to kube-mw-debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite)
[22:31:10] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Enable $wgStatsTarget for new Stats lib for requests to kube-mw-debug (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite)
[22:32:43] <wikibugs>	 (03PS8) 10Krinkle: Enable $wgStatsTarget for requests to kube-mw-debug [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite)
[22:33:22] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[22:37:31] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P53476 and previous config saved to /var/cache/conftool/dbconfig/20231114-223730-arnaudb.json
[22:47:31] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/973213/471/" [puppet] - 10https://gerrit.wikimedia.org/r/973213 (https://phabricator.wikimedia.org/T327068) (owner: 10Dzahn)
[22:50:27] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop on prod confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/973213 (https://phabricator.wikimedia.org/T327068) (owner: 10Dzahn)
[22:52:37] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T348183)', diff saved to https://phabricator.wikimedia.org/P53477 and previous config saved to /var/cache/conftool/dbconfig/20231114-225236-arnaudb.json
[22:52:39] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[22:52:41] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[22:52:52] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[22:52:59] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2182 (T348183)', diff saved to https://phabricator.wikimedia.org/P53478 and previous config saved to /var/cache/conftool/dbconfig/20231114-225258-arnaudb.json
[22:53:08] <wikibugs>	 10SRE, 10Data Pipelines, 10Data-Engineering, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10Isaac) Realizing I never linked any code for this in case folks wanted to work with the data but here's an example where I'm trying to grab both sources:...
[22:53:55] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:54:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10vm-requests: eqiad: 1 VM requested for community-crm - https://phabricator.wikimedia.org/T349402 (10Dwisehaupt) Sorry, I forgot to respond to that. crm1001 is good.
[22:56:53] <icinga-wm>	 PROBLEM - Disk space on druid1009 is CRITICAL: DISK CRITICAL - free space: /srv 47486 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1009&var-datasource=eqiad+prometheus/ops
[22:57:15] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:58:19] <icinga-wm>	 PROBLEM - Disk space on druid1011 is CRITICAL: DISK CRITICAL - free space: /srv 51183 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1011&var-datasource=eqiad+prometheus/ops
[23:01:11] <icinga-wm>	 PROBLEM - Disk space on druid1010 is CRITICAL: DISK CRITICAL - free space: /srv 48820 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1010&var-datasource=eqiad+prometheus/ops
[23:05:49] <jinxer-wm>	 (HelmReleaseBadStatus) firing: Helm release kube-system/kube-state-metrics on k8s-staging@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=kube-system - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[23:08:57] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-jobrunner_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-jobrunner_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:11:17] <wikibugs>	 (03PS1) 10Dzahn: phabricator::main: add support for PHP versions other than 7.3 [puppet] - 10https://gerrit.wikimedia.org/r/974280 (https://phabricator.wikimedia.org/T327068)
[23:12:04] <wikibugs>	 (03PS1) 10Bking: query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355)
[23:14:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur)
[23:15:18] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[23:17:10] <wikibugs>	 (03PS1) 10JHathaway: puppetserver: change ssldir to a concat fragment [puppet] - 10https://gerrit.wikimedia.org/r/974282
[23:17:12] <wikibugs>	 (03PS1) 10JHathaway: puppetserver: cache code [puppet] - 10https://gerrit.wikimedia.org/r/974283 (https://phabricator.wikimedia.org/T350809)
[23:17:44] <wikibugs>	 (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/974282 (owner: 10JHathaway)
[23:17:55] <wikibugs>	 (03CR) 10JHathaway: "kindly review" [puppet] - 10https://gerrit.wikimedia.org/r/974283 (https://phabricator.wikimedia.org/T350809) (owner: 10JHathaway)
[23:20:27] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T348183)', diff saved to https://phabricator.wikimedia.org/P53479 and previous config saved to /var/cache/conftool/dbconfig/20231114-232026-arnaudb.json
[23:20:33] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[23:21:27] <wikibugs>	 (03CR) 10Cwhite: "Thank you for having a look and for the clarification!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955015 (https://phabricator.wikimedia.org/T240685) (owner: 10Cwhite)
[23:23:43] <wikibugs>	 (03PS2) 10Bking: query_service: add monitoring for ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355)
[23:23:52] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974281 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[23:26:30] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host aqs1012.eqiad.wmnet with OS bullseye
[23:28:37] <wikibugs>	 (03PS1) 10Dzahn: wmflib: add function to return PHP version based on distro version [puppet] - 10https://gerrit.wikimedia.org/r/974285
[23:29:09] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmflib: add function to return PHP version based on distro version [puppet] - 10https://gerrit.wikimedia.org/r/974285 (owner: 10Dzahn)
[23:33:07] <wikibugs>	 (03PS2) 10Dzahn: wmflib: add function to return PHP version based on distro version [puppet] - 10https://gerrit.wikimedia.org/r/974285
[23:33:43] <icinga-wm>	 PROBLEM - ensure kvm processes are running on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[23:34:51] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/974280/472/" [puppet] - 10https://gerrit.wikimedia.org/r/974280 (https://phabricator.wikimedia.org/T327068) (owner: 10Dzahn)
[23:35:33] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P53480 and previous config saved to /var/cache/conftool/dbconfig/20231114-233532-arnaudb.json
[23:37:17] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host aqs1012.eqiad.wmnet with OS bullseye
[23:37:49] <icinga-wm>	 RECOVERY - ensure kvm processes are running on cloudvirt1043 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[23:38:07] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop in prod confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/974280 (https://phabricator.wikimedia.org/T327068) (owner: 10Dzahn)
[23:47:43] <wikibugs>	 (03PS1) 10Dzahn: php: add templates to support php8.2 on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/974286 (https://phabricator.wikimedia.org/T327068)
[23:50:39] <wikibugs>	 (03PS3) 10MVernon: swift: migrate one node to envoy for TLS termination [puppet] - 10https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616)
[23:50:40] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P53481 and previous config saved to /var/cache/conftool/dbconfig/20231114-235039-arnaudb.json
[23:51:40] <wikibugs>	 (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/974215 (https://phabricator.wikimedia.org/T317616) (owner: 10MVernon)
[23:53:55] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure