[00:00:07] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[00:05:13] <wikibugs>	 (PS2) Stang: zhwiki: Create group ipblock-exempt-grantor [mediawiki-config] - https://gerrit.wikimedia.org/r/1005109 (https://phabricator.wikimedia.org/T357991)
[00:07:13] <wikibugs>	 (PS3) Stang: zhwiki: Create group ipblock-exempt-grantor [mediawiki-config] - https://gerrit.wikimedia.org/r/1005109 (https://phabricator.wikimedia.org/T357991)
[00:27:14] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[00:27:17] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[00:39:08] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1004703
[00:39:14] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1004703 (owner: TrainBranchBot)
[00:50:23] <wikibugs>	 SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893#9562035 (Eevans) >>! In T354893#9551121, @Eevans wrote: > @Jclark-ctr it looks like these hosts weren't allocated the additional IP addresses, do you know what is required to assi...
[00:59:07] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1004703 (owner: TrainBranchBot)
[01:22:02] <wikibugs>	 SRE, Data-Engineering, Data-Platform-SRE, serviceops-radar, Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088#9562078 (Ottomata) Just came across https://www.jikkou.io/docs/tutorials/get_started/ . Worth a look!  - https://www.jikkou.io/do...
[01:42:07] <icinga-wm_>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:42:41] <icinga-wm_>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:42:45] <icinga-wm_>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:42:45] <icinga-wm_>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:42:45] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:44:08] <wikibugs>	 (CR) Ssingh: fifo-log-demux: Decouple service from nginx/ats (2 comments) [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: BCornwall)
[01:55:17] <icinga-wm_>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:55:49] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:55:49] <icinga-wm_>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:55:49] <icinga-wm_>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:55:51] <icinga-wm_>	 RECOVERY - BFD status on cr2-eqdfw is OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:59:48] <wikibugs>	 (CR) RLazarus: [C: +2] Helm chart for k8s-controller-sidecars [deployment-charts] - https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[02:00:40] <wikibugs>	 (Merged) jenkins-bot: Helm chart for k8s-controller-sidecars [deployment-charts] - https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[02:01:03] <wikibugs>	 (CR) RLazarus: [C: +2] admin_ng: Install k8s-controller-sidecars [deployment-charts] - https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[02:03:55] <wikibugs>	 (Merged) jenkins-bot: admin_ng: Install k8s-controller-sidecars [deployment-charts] - https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[02:10:10] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[02:11:03] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[02:20:45] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[02:20:54] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[02:22:37] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[02:23:01] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[02:23:13] <icinga-wm_>	 RECOVERY - snapshot of s6 in codfw on backupmon1001 is OK: Last snapshot for s6 at codfw (db2097) taken on 2024-02-21 01:21:06 (622 GiB, +0.7 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:28:14] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2205.mgmt.codfw.wmnet with reboot policy FORCED
[02:29:15] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2204.mgmt.codfw.wmnet with reboot policy FORCED
[02:34:19] <wikibugs>	 (PS1) RLazarus: k8s-controller-sidecars: Add missing namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1005212 (https://phabricator.wikimedia.org/T348284)
[02:38:36] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:59] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2205.mgmt.codfw.wmnet with reboot policy FORCED
[02:41:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:48:35] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[02:49:17] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2204.mgmt.codfw.wmnet with reboot policy FORCED
[02:50:10] <wikibugs>	 (PS2) RLazarus: k8s-controller-sidecars: Add missing namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1005212 (https://phabricator.wikimedia.org/T348284)
[02:51:06] <wikibugs>	 (PS3) RLazarus: k8s-controller-sidecars: Add missing namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1005212 (https://phabricator.wikimedia.org/T348284)
[02:56:16] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[02:58:25] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2206 to codfw - jhancock@cumin2002"
[02:59:18] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2206 to codfw - jhancock@cumin2002"
[02:59:18] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:00:23] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2206.mgmt.codfw.wmnet with reboot policy FORCED
[03:00:33] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2207.mgmt.codfw.wmnet with reboot policy FORCED
[03:00:42] <wikibugs>	 (CR) RLazarus: [C: +2] "Self-merging this just so I don't leave the diffs unapplied overnight" [deployment-charts] - https://gerrit.wikimedia.org/r/1005212 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[03:00:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2208.mgmt.codfw.wmnet with reboot policy FORCED
[03:00:53] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2207.mgmt.codfw.wmnet with reboot policy FORCED
[03:00:54] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2208.mgmt.codfw.wmnet with reboot policy FORCED
[03:01:04] <jinxer-wm>	 (PuppetDisabled) firing: Puppet disabled on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=ganeti&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[03:01:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2207.mgmt.codfw.wmnet with reboot policy FORCED
[03:03:32] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2208.mgmt.codfw.wmnet with reboot policy FORCED
[03:03:36] <wikibugs>	 (Merged) jenkins-bot: k8s-controller-sidecars: Add missing namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1005212 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[03:07:47] <wikibugs>	 (CR) Samwilson: [C: +1] InitialiseSettings: Enable Edit Recovery on 3 projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1004736 (https://phabricator.wikimedia.org/T355548) (owner: Samtar)
[03:11:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:13:36] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:15:10] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2207.mgmt.codfw.wmnet with reboot policy FORCED
[03:21:19] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests, Phabricator: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9562208 (Bugreporter) We need to move the task subscriptions and assignments. Afterwards, to prevent confusion, we may consider disabling the brion account.
[03:21:37] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2206.mgmt.codfw.wmnet with reboot policy FORCED
[03:24:20] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests, Phabricator: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9562210 (Bugreporter) Alternatively we can keep the bvibber Phab account as the personal account and rename brion to something like bvibber-wmf so much le...
[03:25:01] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2208.mgmt.codfw.wmnet with reboot policy FORCED
[03:26:18] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[03:26:29] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[03:27:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[03:28:11] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[03:29:37] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[03:29:39] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2209 to codfw - jhancock@cumin2002"
[03:30:32] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2209 to codfw - jhancock@cumin2002"
[03:30:33] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:31:14] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2209.mgmt.codfw.wmnet with reboot policy FORCED
[03:33:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[03:35:30] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2210 to codfw - jhancock@cumin2002"
[03:36:20] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2210 to codfw - jhancock@cumin2002"
[03:36:21] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:37:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2210.mgmt.codfw.wmnet with reboot policy FORCED
[03:39:55] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[03:41:47] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2211 to codfw - jhancock@cumin2002"
[03:42:41] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2211 to codfw - jhancock@cumin2002"
[03:42:41] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:51:35] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2210.mgmt.codfw.wmnet with reboot policy FORCED
[03:52:41] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2211.mgmt.codfw.wmnet with reboot policy FORCED
[03:52:54] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2209.mgmt.codfw.wmnet with reboot policy FORCED
[03:53:31] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[03:54:29] <logmsgbot>	 !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[03:55:20] <logmsgbot>	 !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[03:55:42] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2212 to codfw - jhancock@cumin2002"
[03:56:36] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2212 to codfw - jhancock@cumin2002"
[03:56:36] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:57:08] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[03:58:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2212.mgmt.codfw.wmnet with reboot policy FORCED
[03:59:17] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2213 to codfw - jhancock@cumin2002"
[04:00:08] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2213 to codfw - jhancock@cumin2002"
[04:00:09] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:06:40] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[04:08:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2214 to codfw - jhancock@cumin2002"
[04:09:50] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2214 to codfw - jhancock@cumin2002"
[04:09:51] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:10:21] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2214.mgmt.codfw.wmnet with reboot policy FORCED
[04:12:47] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[04:14:25] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2211.mgmt.codfw.wmnet with reboot policy FORCED
[04:14:41] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2215 to codfw - jhancock@cumin2002"
[04:15:32] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2215 to codfw - jhancock@cumin2002"
[04:15:33] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:15:55] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2215.mgmt.codfw.wmnet with reboot policy FORCED
[04:18:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[04:18:45] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2212.mgmt.codfw.wmnet with reboot policy FORCED
[04:19:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2216 to codfw - jhancock@cumin2002"
[04:20:46] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2216 to codfw - jhancock@cumin2002"
[04:20:47] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:21:07] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2213.mgmt.codfw.wmnet with reboot policy FORCED
[04:21:26] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2216.mgmt.codfw.wmnet with reboot policy FORCED
[04:22:33] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[04:23:34] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2214.mgmt.codfw.wmnet with reboot policy FORCED
[04:24:36] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2217 to codfw - jhancock@cumin2002"
[04:25:28] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2217 to codfw - jhancock@cumin2002"
[04:25:29] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:25:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2217.mgmt.codfw.wmnet with reboot policy FORCED
[04:27:47] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[04:29:53] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2218 to codfw - jhancock@cumin2002"
[04:30:36] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[04:30:46] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2218 to codfw - jhancock@cumin2002"
[04:30:47] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:31:18] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2218.mgmt.codfw.wmnet with reboot policy FORCED
[04:31:56] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[04:32:46] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[04:34:44] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2219 to codfw - jhancock@cumin2002"
[04:35:35] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2219 to codfw - jhancock@cumin2002"
[04:35:36] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:36:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2219.mgmt.codfw.wmnet with reboot policy FORCED
[04:36:56] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2215.mgmt.codfw.wmnet with reboot policy FORCED
[04:39:06] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[04:41:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2220 to codfw - jhancock@cumin2002"
[04:41:55] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2220 to codfw - jhancock@cumin2002"
[04:41:55] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:42:28] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2220.mgmt.codfw.wmnet with reboot policy FORCED
[04:43:09] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2216.mgmt.codfw.wmnet with reboot policy FORCED
[04:51:52] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2218.mgmt.codfw.wmnet with reboot policy FORCED
[04:52:02] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2217.mgmt.codfw.wmnet with reboot policy FORCED
[04:58:47] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2219.mgmt.codfw.wmnet with reboot policy FORCED
[05:00:47] <icinga-wm_>	 RECOVERY - snapshot of s2 in codfw on backupmon1001 is OK: Last snapshot for s2 at codfw (db2097) taken on 2024-02-21 04:07:31 (1020 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[05:02:48] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2220.mgmt.codfw.wmnet with reboot policy FORCED
[05:06:20] * kart_ deploying MinT
[05:06:29] <wikibugs>	 (CR) KartikMistry: [C: +2] Update MinT to 2024-02-20-062448-production [deployment-charts] - https://gerrit.wikimedia.org/r/995170 (https://phabricator.wikimedia.org/T333969) (owner: KartikMistry)
[05:07:46] <wikibugs>	 (Merged) jenkins-bot: Update MinT to 2024-02-20-062448-production [deployment-charts] - https://gerrit.wikimedia.org/r/995170 (https://phabricator.wikimedia.org/T333969) (owner: KartikMistry)
[05:09:09] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[05:13:02] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[05:13:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2102.codfw.wmnet with reason: Maintenance
[05:13:48] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2102.codfw.wmnet with reason: Maintenance
[05:14:58] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[05:15:24] <wikibugs>	 (PS1) Marostegui: db2167: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1005217 (https://phabricator.wikimedia.org/T354826)
[05:20:52] <wikibugs>	 (CR) Marostegui: [C: +2] db2167: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1005217 (https://phabricator.wikimedia.org/T354826) (owner: Marostegui)
[05:21:07] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[05:21:41] <logmsgbot>	 !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s2
[05:21:46] <logmsgbot>	 !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s7
[05:23:01] <wikibugs>	 (PS1) Marostegui: clouddb1018: Upgrade to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1005218 (https://phabricator.wikimedia.org/T356838)
[05:23:22] <wikibugs>	 (CR) Marostegui: "The host is already depooled." [puppet] - https://gerrit.wikimedia.org/r/1005218 (https://phabricator.wikimedia.org/T356838) (owner: Marostegui)
[05:25:03] <wikibugs>	 (PS1) RLazarus: k8s-controller-sidecars: Bump the pod's memory [deployment-charts] - https://gerrit.wikimedia.org/r/1005219 (https://phabricator.wikimedia.org/T348284)
[05:33:56] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[05:37:43] <wikibugs>	 (PS1) Marostegui: es1026: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1005221 (https://phabricator.wikimedia.org/T358080)
[05:38:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1026 T358080', diff saved to https://phabricator.wikimedia.org/P57434 and previous config saved to /var/cache/conftool/dbconfig/20240221-053822-root.json
[05:38:28] <stashbot>	 T358080: Upgrade es2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358080
[05:39:04] <wikibugs>	 (CR) Marostegui: [C: +2] es1026: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1005221 (https://phabricator.wikimedia.org/T358080) (owner: Marostegui)
[05:39:55] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1026.eqiad.wmnet with OS bookworm
[05:41:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[05:41:31] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[05:41:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2103 (T355609)', diff saved to https://phabricator.wikimedia.org/P57435 and previous config saved to /var/cache/conftool/dbconfig/20240221-054136-marostegui.json
[05:41:42] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[05:42:09] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[05:45:00] <kart_>	 !log Updated MinT to 2024-02-20-062448-production (T333969, T354666)
[05:45:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:08] <stashbot>	 T333969: Enable Opus models for languages lacking other Machine Translation options - https://phabricator.wikimedia.org/T333969
[05:45:08] <stashbot>	 T354666: Enable MADLAD-400 in MinT test instance for Wikipedia languages not supported by other services - https://phabricator.wikimedia.org/T354666
[05:53:17] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1026.eqiad.wmnet with reason: host reimage
[05:55:46] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1026.eqiad.wmnet with reason: host reimage
[05:58:37] <wikibugs>	 (PS1) Marostegui: Revert "es1026: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1005083
[06:01:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:09:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T355609)', diff saved to https://phabricator.wikimedia.org/P57436 and previous config saved to /var/cache/conftool/dbconfig/20240221-060928-marostegui.json
[06:09:35] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[06:11:47] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1026.eqiad.wmnet with OS bookworm
[06:11:57] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es1026: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1005083 (owner: Marostegui)
[06:13:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 1%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57437 and previous config saved to /var/cache/conftool/dbconfig/20240221-061325-root.json
[06:24:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P57438 and previous config saved to /var/cache/conftool/dbconfig/20240221-062434-marostegui.json
[06:29:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57439 and previous config saved to /var/cache/conftool/dbconfig/20240221-062928-root.json
[06:39:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P57440 and previous config saved to /var/cache/conftool/dbconfig/20240221-063940-marostegui.json
[06:44:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57441 and previous config saved to /var/cache/conftool/dbconfig/20240221-064433-root.json
[06:46:58] <icinga-wm_>	 RECOVERY - snapshot of x1 in codfw on backupmon1001 is OK: Last snapshot for x1 at codfw (db2097) taken on 2024-02-21 06:04:25 (481 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:48:35] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:51:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:54:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T355609)', diff saved to https://phabricator.wikimedia.org/P57442 and previous config saved to /var/cache/conftool/dbconfig/20240221-065447-marostegui.json
[06:54:49] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2116.codfw.wmnet with reason: Maintenance
[06:54:53] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[06:55:03] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2116.codfw.wmnet with reason: Maintenance
[06:55:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T355609)', diff saved to https://phabricator.wikimedia.org/P57443 and previous config saved to /var/cache/conftool/dbconfig/20240221-065508-marostegui.json
[06:59:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57444 and previous config saved to /var/cache/conftool/dbconfig/20240221-065938-root.json
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T0700)
[07:01:04] <jinxer-wm>	 (PuppetDisabled) firing: Puppet disabled on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=ganeti&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[07:04:22] <jinxer-wm>	 (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:05:30] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:06:30] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:09:21] <jinxer-wm>	 (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:14:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57445 and previous config saved to /var/cache/conftool/dbconfig/20240221-071443-root.json
[07:15:32] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:15:32] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:22:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T355609)', diff saved to https://phabricator.wikimedia.org/P57446 and previous config saved to /var/cache/conftool/dbconfig/20240221-072255-marostegui.json
[07:23:01] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[07:27:02] <wikibugs>	 (CR) Muehlenhoff: [C: +2] profile::mariadb::wmf_root_client: Remove cumin1001 from allow list [puppet] - https://gerrit.wikimedia.org/r/1005106 (https://phabricator.wikimedia.org/T353419) (owner: Muehlenhoff)
[07:29:05] <wikibugs>	 (PS1) Muehlenhoff: Remove cumin1001 from list of Cumin masters [puppet] - https://gerrit.wikimedia.org/r/1005401 (https://phabricator.wikimedia.org/T353419)
[07:29:07] <wikibugs>	 (CR) Dom Walden: [C: +1] beta: Switch block schema to read-new/write-both mode [mediawiki-config] - https://gerrit.wikimedia.org/r/998625 (https://phabricator.wikimedia.org/T355034) (owner: Tim Starling)
[07:29:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57447 and previous config saved to /var/cache/conftool/dbconfig/20240221-072948-root.json
[07:32:46] <wikibugs>	 (PS1) Muehlenhoff: Configure cluster::management for Puppet 7 on the role level [puppet] - https://gerrit.wikimedia.org/r/1005402 (https://phabricator.wikimedia.org/T349619)
[07:38:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P57448 and previous config saved to /var/cache/conftool/dbconfig/20240221-073801-marostegui.json
[07:44:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57449 and previous config saved to /var/cache/conftool/dbconfig/20240221-074452-root.json
[07:51:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:53:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P57450 and previous config saved to /var/cache/conftool/dbconfig/20240221-075307-marostegui.json
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T0800).
[08:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[08:03:55] <wikibugs>	 (PS1) Samwilson: CommonSettings: Set $wgWikisourceHttpProxy [mediawiki-config] - https://gerrit.wikimedia.org/r/1005434 (https://phabricator.wikimedia.org/T357857)
[08:04:33] <wikibugs>	 (CR) Majavah: [C: +1] clouddb1018: Upgrade to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1005218 (https://phabricator.wikimedia.org/T356838) (owner: Marostegui)
[08:06:36] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Configure cluster::management for Puppet 7 on the role level [puppet] - https://gerrit.wikimedia.org/r/1005402 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:08:09] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9562410 (MoritzMuehlenhoff)
[08:08:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T355609)', diff saved to https://phabricator.wikimedia.org/P57451 and previous config saved to /var/cache/conftool/dbconfig/20240221-080814-marostegui.json
[08:08:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2130.codfw.wmnet with reason: Maintenance
[08:08:20] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[08:08:30] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2130.codfw.wmnet with reason: Maintenance
[08:08:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T355609)', diff saved to https://phabricator.wikimedia.org/P57452 and previous config saved to /var/cache/conftool/dbconfig/20240221-080836-marostegui.json
[08:09:32] <wikibugs>	 (PS1) Samwilson: InitializeSettings: Add Wikisource logging channel to prod and labs [mediawiki-config] - https://gerrit.wikimedia.org/r/1005435 (https://phabricator.wikimedia.org/T357857)
[08:15:34] <wikibugs>	 (PS1) Muehlenhoff: Switch backup1001 to Puppet 7 on the role level [puppet] - https://gerrit.wikimedia.org/r/1005436
[08:17:56] <wikibugs>	 (PS1) Muehlenhoff: Switch backup2001 to Puppet 7 on the role level [puppet] - https://gerrit.wikimedia.org/r/1005437
[08:19:52] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[08:20:06] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[08:20:08] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:20:24] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:20:31] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T357189)', diff saved to https://phabricator.wikimedia.org/P57454 and previous config saved to /var/cache/conftool/dbconfig/20240221-082029-arnaudb.json
[08:20:38] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[08:21:09] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2180,2188-2190].codfw.wmnet with reason: Silence for reboot T356240
[08:21:25] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2180,2188-2190].codfw.wmnet with reason: Silence for reboot T356240
[08:21:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[08:22:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 db2188 db2189 db2190 depool for T356240', diff saved to https://phabricator.wikimedia.org/P57455 and previous config saved to /var/cache/conftool/dbconfig/20240221-082219-arnaudb.json
[08:23:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-api-int (k8s) 1.18s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:23:33] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2180.codfw.wmnet
[08:23:33] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2188.codfw.wmnet
[08:23:34] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2190.codfw.wmnet
[08:23:34] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2189.codfw.wmnet
[08:28:07] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2188.codfw.wmnet
[08:28:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-api-int (k8s) 1.091s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:28:16] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2189.codfw.wmnet
[08:28:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T357189)', diff saved to https://phabricator.wikimedia.org/P57456 and previous config saved to /var/cache/conftool/dbconfig/20240221-082818-arnaudb.json
[08:28:20] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2190.codfw.wmnet
[08:28:24] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[08:28:33] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2180.codfw.wmnet
[08:29:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57457 and previous config saved to /var/cache/conftool/dbconfig/20240221-082935-arnaudb.json
[08:29:55] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2188 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57458 and previous config saved to /var/cache/conftool/dbconfig/20240221-082955-arnaudb.json
[08:30:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57459 and previous config saved to /var/cache/conftool/dbconfig/20240221-083006-arnaudb.json
[08:30:17] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57460 and previous config saved to /var/cache/conftool/dbconfig/20240221-083016-arnaudb.json
[08:36:57] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts sretest2005.codfw.wmnet
[08:37:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T355609)', diff saved to https://phabricator.wikimedia.org/P57461 and previous config saved to /var/cache/conftool/dbconfig/20240221-083731-marostegui.json
[08:37:37] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[08:41:44] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[08:43:06] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:43:06] <logmsgbot>	 !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts sretest2005.codfw.wmnet
[08:43:16] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations, netops, Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152#9562456 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `sretest2005.codfw.wmnet` - sretest2005.codfw...
[08:43:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P57462 and previous config saved to /var/cache/conftool/dbconfig/20240221-084325-arnaudb.json
[08:44:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57463 and previous config saved to /var/cache/conftool/dbconfig/20240221-084440-arnaudb.json
[08:45:00] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2188 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57464 and previous config saved to /var/cache/conftool/dbconfig/20240221-084459-arnaudb.json
[08:45:12] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57465 and previous config saved to /var/cache/conftool/dbconfig/20240221-084511-arnaudb.json
[08:45:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57466 and previous config saved to /var/cache/conftool/dbconfig/20240221-084521-arnaudb.json
[08:45:58] <wikibugs>	 SRE-swift-storage, MediaWiki-Uploading, User-revi: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820#9562459 (Bawolff) FWIW, i've been investigating this. It does seem to be happening...
[08:52:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P57467 and previous config saved to /var/cache/conftool/dbconfig/20240221-085238-marostegui.json
[08:58:31] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P57468 and previous config saved to /var/cache/conftool/dbconfig/20240221-085830-arnaudb.json
[08:58:57] <wikibugs>	 (CR) Marostegui: [C: +2] clouddb1018: Upgrade to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1005218 (https://phabricator.wikimedia.org/T356838) (owner: Marostegui)
[08:59:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57469 and previous config saved to /var/cache/conftool/dbconfig/20240221-085944-arnaudb.json
[09:00:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2188 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57470 and previous config saved to /var/cache/conftool/dbconfig/20240221-090004-arnaudb.json
[09:00:12] <hashar>	 !log Restarted CI Jenkins on contint2002 to update the timestamper plugin
[09:00:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57471 and previous config saved to /var/cache/conftool/dbconfig/20240221-090016-arnaudb.json
[09:00:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57472 and previous config saved to /var/cache/conftool/dbconfig/20240221-090026-arnaudb.json
[09:00:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:50] <wikibugs>	 SRE: Improve automation for the vendor maintenance calendar - https://phabricator.wikimedia.org/T357630#9562472 (Peachey88)
[09:05:49] <jinxer-wm>	 (PuppetDisabled) resolved: Puppet disabled on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=ganeti&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[09:06:14] <wikibugs>	 (PS1) Marostegui: mariadb: Remove pif_edits views [puppet] - https://gerrit.wikimedia.org/r/1005438 (https://phabricator.wikimedia.org/T356838)
[09:06:19] <wikibugs>	 SRE: Improve automation for the vendor maintenance calendar - https://phabricator.wikimedia.org/T357630#9562486 (ayounsi) See also {T230835}  Putting my clinic duty hat on: @andrea.denisse please assign a subteam and a priority.
[09:06:50] <logmsgbot>	 !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s7
[09:06:54] <logmsgbot>	 !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s2
[09:07:45] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P57473 and previous config saved to /var/cache/conftool/dbconfig/20240221-090744-marostegui.json
[09:09:46] <wikibugs>	 (PS1) Marostegui: es1030: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1005439 (https://phabricator.wikimedia.org/T358080)
[09:09:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1030 T358080', diff saved to https://phabricator.wikimedia.org/P57474 and previous config saved to /var/cache/conftool/dbconfig/20240221-090957-root.json
[09:10:03] <stashbot>	 T358080: Upgrade es2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358080
[09:10:56] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1030.eqiad.wmnet with OS bookworm
[09:10:58] <wikibugs>	 (CR) Marostegui: [C: +2] es1030: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1005439 (https://phabricator.wikimedia.org/T358080) (owner: Marostegui)
[09:13:38] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T357189)', diff saved to https://phabricator.wikimedia.org/P57475 and previous config saved to /var/cache/conftool/dbconfig/20240221-091337-arnaudb.json
[09:13:39] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[09:13:44] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[09:13:53] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[09:13:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T357189)', diff saved to https://phabricator.wikimedia.org/P57476 and previous config saved to /var/cache/conftool/dbconfig/20240221-091358-arnaudb.json
[09:14:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57477 and previous config saved to /var/cache/conftool/dbconfig/20240221-091449-arnaudb.json
[09:15:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2188 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57478 and previous config saved to /var/cache/conftool/dbconfig/20240221-091509-arnaudb.json
[09:15:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57479 and previous config saved to /var/cache/conftool/dbconfig/20240221-091521-arnaudb.json
[09:15:31] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57480 and previous config saved to /var/cache/conftool/dbconfig/20240221-091531-arnaudb.json
[09:22:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T355609)', diff saved to https://phabricator.wikimedia.org/P57481 and previous config saved to /var/cache/conftool/dbconfig/20240221-092251-marostegui.json
[09:22:53] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance
[09:22:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T357189)', diff saved to https://phabricator.wikimedia.org/P57482 and previous config saved to /var/cache/conftool/dbconfig/20240221-092256-arnaudb.json
[09:22:57] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[09:23:06] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[09:23:07] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance
[09:24:25] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1030.eqiad.wmnet with reason: host reimage
[09:26:55] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1030.eqiad.wmnet with reason: host reimage
[09:38:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P57484 and previous config saved to /var/cache/conftool/dbconfig/20240221-093802-arnaudb.json
[09:40:59] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bookworm
[09:42:28] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1030.eqiad.wmnet with OS bookworm
[09:43:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 1%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57485 and previous config saved to /var/cache/conftool/dbconfig/20240221-094319-root.json
[09:44:57] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2145.codfw.wmnet with reason: Maintenance
[09:45:10] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2145.codfw.wmnet with reason: Maintenance
[09:45:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T355609)', diff saved to https://phabricator.wikimedia.org/P57486 and previous config saved to /var/cache/conftool/dbconfig/20240221-094516-marostegui.json
[09:45:24] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[09:47:03] <wikibugs>	 SRE, SRE-Access-Requests, LDAP-Access-Requests, Phabricator, Patch-For-Review: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9562598 (MoritzMuehlenhoff) @bvibber Renaming the user name for SSH access will leave files in the old home inaccessible (we don't ne...
[09:47:58] <wikibugs>	 (PS3) Alexandros Kosiaris: conftool: Add mw-parsoid stanzas [puppet] - https://gerrit.wikimedia.org/r/1004151 (https://phabricator.wikimedia.org/T357392)
[09:48:02] <wikibugs>	 (PS4) Alexandros Kosiaris: service::catalog: Add mw-parsoid service [puppet] - https://gerrit.wikimedia.org/r/1004152 (https://phabricator.wikimedia.org/T357392)
[09:48:04] <wikibugs>	 (PS4) Alexandros Kosiaris: mw-parsoid: Add LVS backends on wikikube servers [puppet] - https://gerrit.wikimedia.org/r/1004153 (https://phabricator.wikimedia.org/T357392)
[09:48:06] <wikibugs>	 (PS4) Alexandros Kosiaris: mw-parsoid: Switch to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/1004154 (https://phabricator.wikimedia.org/T357392)
[09:48:08] <wikibugs>	 (PS4) Alexandros Kosiaris: mw-parsoid: Switch to production and have it page [puppet] - https://gerrit.wikimedia.org/r/1004155 (https://phabricator.wikimedia.org/T357392)
[09:49:13] <wikibugs>	 (CR) Muehlenhoff: [C: -1] "Needs some changes, see comments inline" [puppet] - https://gerrit.wikimedia.org/r/1005441 (https://phabricator.wikimedia.org/T358044) (owner: Ayounsi)
[09:49:40] <jinxer-wm>	 (KubernetesAPINotScrapable) firing: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[09:49:58] <wikibugs>	 (CR) JMeybohm: [C: +1] k8s-controller-sidecars: Bump the pod's memory [deployment-charts] - https://gerrit.wikimedia.org/r/1005219 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[09:50:42] <wikibugs>	 sre-alert-triage, SRE Observability (FY2023/2024-Q3): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255#9562611 (LSobanski) There are now four other similar alerts that are over a month old:  Linting problems found for EnvoyRuntimeAdminOverrid...
[09:52:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2031 T358080', diff saved to https://phabricator.wikimedia.org/P57487 and previous config saved to /var/cache/conftool/dbconfig/20240221-095205-root.json
[09:52:11] <stashbot>	 T358080: Upgrade es2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358080
[09:53:07] <wikibugs>	 (PS1) Marostegui: es2031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1005448 (https://phabricator.wikimedia.org/T358080)
[09:53:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P57488 and previous config saved to /var/cache/conftool/dbconfig/20240221-095309-arnaudb.json
[09:53:34] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2031.codfw.wmnet with OS bookworm
[09:53:56] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[09:54:27] <wikibugs>	 (CR) Marostegui: [C: +2] es2031: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1005448 (https://phabricator.wikimedia.org/T358080) (owner: Marostegui)
[09:55:37] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/999715 (owner: Muehlenhoff)
[09:56:16] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[09:58:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57489 and previous config saved to /var/cache/conftool/dbconfig/20240221-095823-root.json
[09:59:54] <wikibugs>	 (PS1) JMeybohm: kafka_shipper: Name omkafka actions to ingest metrics [puppet] - https://gerrit.wikimedia.org/r/1005449 (https://phabricator.wikimedia.org/T357616)
[10:05:05] <wikibugs>	 (CR) Ayounsi: [V: +1 C: +2] "reimage of sretest1003 worked fine." [puppet] - https://gerrit.wikimedia.org/r/994223 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[10:05:45] <wikibugs>	 (CR) JMeybohm: [C: +1] deployment_server: add mw-mcrouter service 1 [puppet] - https://gerrit.wikimedia.org/r/979339 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[10:05:48] <wikibugs>	 (CR) JMeybohm: [C: +1] Add namespace for mw-mcrouter service 2 [deployment-charts] - https://gerrit.wikimedia.org/r/979340 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[10:08:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T357189)', diff saved to https://phabricator.wikimedia.org/P57490 and previous config saved to /var/cache/conftool/dbconfig/20240221-100815-arnaudb.json
[10:08:17] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[10:08:21] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[10:08:29] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:08:31] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[10:08:35] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:08:49] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:08:56] <wikibugs>	 (PS1) Ayounsi: Routed Ganeti: move the tap v4 IP to Hiera [puppet] - https://gerrit.wikimedia.org/r/1005450 (https://phabricator.wikimedia.org/T300152)
[10:09:15] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:09:45] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, see also https://github.com/prometheus-community/rsyslog_exporter/pull/12#issuecomment-1956303298 for metrics that will be added" [puppet] - https://gerrit.wikimedia.org/r/1005449 (https://phabricator.wikimedia.org/T357616) (owner: JMeybohm)
[10:10:41] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1005450 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[10:11:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T355609)', diff saved to https://phabricator.wikimedia.org/P57491 and previous config saved to /var/cache/conftool/dbconfig/20240221-101111-marostegui.json
[10:11:17] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[10:11:25] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:11:45] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51453 bytes in 4.147 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:12:03] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS bookworm
[10:12:07] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:12:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2031.codfw.wmnet with reason: host reimage
[10:13:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57492 and previous config saved to /var/cache/conftool/dbconfig/20240221-101328-root.json
[10:14:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2031.codfw.wmnet with reason: host reimage
[10:15:12] <wikibugs>	 (PS2) Ayounsi: Routed Ganeti: move the tap v4 IP to Hiera [puppet] - https://gerrit.wikimedia.org/r/1005450 (https://phabricator.wikimedia.org/T300152)
[10:15:32] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1005450 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[10:16:27] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[10:16:39] <wikibugs>	 (CR) Clément Goubert: [C: +1] logstash_checker.py: Add ability to check all MediaWiki canaries at once [puppet] - https://gerrit.wikimedia.org/r/1003885 (https://phabricator.wikimedia.org/T357402) (owner: Ahmon Dancy)
[10:16:41] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[10:16:47] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1174 (T357189)', diff saved to https://phabricator.wikimedia.org/P57493 and previous config saved to /var/cache/conftool/dbconfig/20240221-101646-arnaudb.json
[10:16:52] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[10:18:27] <wikibugs>	 (PS3) Ayounsi: Routed Ganeti: move the tap v4 IP to Hiera [puppet] - https://gerrit.wikimedia.org/r/1005450 (https://phabricator.wikimedia.org/T300152)
[10:20:58] <wikibugs>	 (PS1) Ayounsi: Add .vscode to .gitignore [puppet] - https://gerrit.wikimedia.org/r/1005451
[10:22:27] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1005451 (owner: Ayounsi)
[10:22:51] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1005450 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[10:24:33] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T357189)', diff saved to https://phabricator.wikimedia.org/P57494 and previous config saved to /var/cache/conftool/dbconfig/20240221-102432-arnaudb.json
[10:24:38] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[10:25:45] <wikibugs>	 (CR) Fabfur: [C: +1] "lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1005451 (owner: Ayounsi)
[10:26:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P57495 and previous config saved to /var/cache/conftool/dbconfig/20240221-102618-marostegui.json
[10:26:39] <wikibugs>	 (CR) Ayounsi: [C: +2] Add .vscode to .gitignore [puppet] - https://gerrit.wikimedia.org/r/1005451 (owner: Ayounsi)
[10:28:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57496 and previous config saved to /var/cache/conftool/dbconfig/20240221-102833-root.json
[10:31:43] <wikibugs>	 (CR) Hnowlan: [C: +2] changeprop: clean up k8s jobrunner references [deployment-charts] - https://gerrit.wikimedia.org/r/1004066 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[10:32:41] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[10:32:58] <wikibugs>	 (Merged) jenkins-bot: changeprop: clean up k8s jobrunner references [deployment-charts] - https://gerrit.wikimedia.org/r/1004066 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[10:32:58] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[10:34:16] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[10:34:32] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[10:34:44] <wikibugs>	 (CR) Muehlenhoff: [C: +2] puppetdb: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/999715 (owner: Muehlenhoff)
[10:35:53] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2031.codfw.wmnet with OS bookworm
[10:35:56] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[10:36:27] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[10:36:35] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[10:37:04] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[10:39:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P57497 and previous config saved to /var/cache/conftool/dbconfig/20240221-103938-arnaudb.json
[10:41:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P57498 and previous config saved to /var/cache/conftool/dbconfig/20240221-104124-marostegui.json
[10:42:20] <wikibugs>	 (PS1) Marostegui: Revert "es2031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1005466
[10:43:04] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] service::catalog: Add mw-parsoid service [puppet] - https://gerrit.wikimedia.org/r/1004152 (https://phabricator.wikimedia.org/T357392) (owner: Alexandros Kosiaris)
[10:43:12] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] conftool: Add mw-parsoid stanzas [puppet] - https://gerrit.wikimedia.org/r/1004151 (https://phabricator.wikimedia.org/T357392) (owner: Alexandros Kosiaris)
[10:43:23] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] mw-parsoid: Add LVS backends on wikikube servers [puppet] - https://gerrit.wikimedia.org/r/1004153 (https://phabricator.wikimedia.org/T357392) (owner: Alexandros Kosiaris)
[10:43:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57499 and previous config saved to /var/cache/conftool/dbconfig/20240221-104339-root.json
[10:44:05] <wikibugs>	 (PS2) Hnowlan: users: add jwheeler to analytics_privatedata_users [puppet] - https://gerrit.wikimedia.org/r/1004187 (https://phabricator.wikimedia.org/T357731)
[10:44:07] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2031: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1005466 (owner: Marostegui)
[10:45:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57500 and previous config saved to /var/cache/conftool/dbconfig/20240221-104526-root.json
[10:48:35] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[10:51:20] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1004187 (https://phabricator.wikimedia.org/T357731) (owner: Hnowlan)
[10:52:38] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] openstack: nova-compute: persist compute node id [puppet] - https://gerrit.wikimedia.org/r/1005065 (https://phabricator.wikimedia.org/T357631) (owner: Arturo Borrero Gonzalez)
[10:54:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P57501 and previous config saved to /var/cache/conftool/dbconfig/20240221-105445-arnaudb.json
[10:55:55] <wikibugs>	 (PS1) STran: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1005457 (https://phabricator.wikimedia.org/T356736)
[10:56:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T355609)', diff saved to https://phabricator.wikimedia.org/P57502 and previous config saved to /var/cache/conftool/dbconfig/20240221-105630-marostegui.json
[10:56:33] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2146.codfw.wmnet with reason: Maintenance
[10:56:36] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[10:56:48] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2146.codfw.wmnet with reason: Maintenance
[10:56:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T355609)', diff saved to https://phabricator.wikimedia.org/P57503 and previous config saved to /var/cache/conftool/dbconfig/20240221-105654-marostegui.json
[10:58:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57504 and previous config saved to /var/cache/conftool/dbconfig/20240221-105844-root.json
[11:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1100)
[11:00:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57505 and previous config saved to /var/cache/conftool/dbconfig/20240221-110031-root.json
[11:01:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver2002.codfw.wmnet
[11:02:01] <wikibugs>	 (CR) Tchanders: [C: +2] ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1005457 (https://phabricator.wikimedia.org/T356736) (owner: STran)
[11:02:25] <jinxer-wm>	 (SystemdUnitFailed) firing: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:03:03] <wikibugs>	 (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1005457 (https://phabricator.wikimedia.org/T356736) (owner: STran)
[11:03:25] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:04:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver2002.codfw.wmnet
[11:05:04] <logmsgbot>	 !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply
[11:05:50] <logmsgbot>	 !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply
[11:06:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver1001.eqiad.wmnet
[11:07:10] <logmsgbot>	 !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply
[11:07:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:08:11] <logmsgbot>	 !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply
[11:08:35] <logmsgbot>	 !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[11:08:35] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:09:03] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1001.eqiad.wmnet
[11:09:29] <akosiaris>	 the httpbb mw-parsoid alerts can be ignored for now. I am still in the process of setting up the service. I didn't expect them to fire though.
[11:09:45] <logmsgbot>	 !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[11:09:52] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T357189)', diff saved to https://phabricator.wikimedia.org/P57506 and previous config saved to /var/cache/conftool/dbconfig/20240221-110951-arnaudb.json
[11:09:53] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[11:09:54] <logmsgbot>	 !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply
[11:09:56] <logmsgbot>	 !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply
[11:09:58] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2191-2193].codfw.wmnet,db1151.eqiad.wmnet with reason: Silence for reboot T356240
[11:10:06] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance
[11:10:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T357189)', diff saved to https://phabricator.wikimedia.org/P57507 and previous config saved to /var/cache/conftool/dbconfig/20240221-111012-arnaudb.json
[11:10:13] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2191-2193].codfw.wmnet,db1151.eqiad.wmnet with reason: Silence for reboot T356240
[11:10:22] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[11:10:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T356240 - depooling db2191 db2192 db2193 db1151', diff saved to https://phabricator.wikimedia.org/P57508 and previous config saved to /var/cache/conftool/dbconfig/20240221-111023-arnaudb.json
[11:11:43] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2191.codfw.wmnet
[11:11:44] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1151.eqiad.wmnet
[11:11:44] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2193.codfw.wmnet
[11:11:45] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2192.codfw.wmnet
[11:13:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57510 and previous config saved to /var/cache/conftool/dbconfig/20240221-111348-root.json
[11:15:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57511 and previous config saved to /var/cache/conftool/dbconfig/20240221-111536-root.json
[11:16:10] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2193.codfw.wmnet
[11:16:14] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2192.codfw.wmnet
[11:16:15] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2191.codfw.wmnet
[11:17:03] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1151.eqiad.wmnet
[11:18:05] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T357189)', diff saved to https://phabricator.wikimedia.org/P57512 and previous config saved to /var/cache/conftool/dbconfig/20240221-111805-arnaudb.json
[11:18:14] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[11:18:35] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:18:54] <wikibugs>	 (CR) Majavah: [C: +1] Add gitreview configuration [software/bitu] - https://gerrit.wikimedia.org/r/997809 (https://phabricator.wikimedia.org/T355180) (owner: Slyngshede)
[11:20:10] <wikibugs>	 (PS1) Jelto: etherpad: make exporter and blackbox checks configurable [puppet] - https://gerrit.wikimedia.org/r/1005458 (https://phabricator.wikimedia.org/T316421)
[11:24:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T355609)', diff saved to https://phabricator.wikimedia.org/P57513 and previous config saved to /var/cache/conftool/dbconfig/20240221-112408-marostegui.json
[11:24:15] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[11:24:37] <wikibugs>	 (PS2) Jelto: etherpad: make exporter and blackbox checks configurable [puppet] - https://gerrit.wikimedia.org/r/1005458 (https://phabricator.wikimedia.org/T316421)
[11:24:59] <wikibugs>	 (CR) Hnowlan: [C: +1] c-cqlsh is now deprecated; long live cqlsh-instance [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/1004235 (owner: Eevans)
[11:30:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57514 and previous config saved to /var/cache/conftool/dbconfig/20240221-113041-root.json
[11:32:09] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Added cassandra IPs for restbase10[34-42] - volans@cumin1002"
[11:32:35] <logmsgbot>	 !log volans@cumin1002 END (ERROR) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=97) generate netbox hiera data: "Added cassandra IPs for restbase10[34-42] - volans@cumin1002"
[11:32:52] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.dns.netbox
[11:33:12] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P57515 and previous config saved to /var/cache/conftool/dbconfig/20240221-113311-arnaudb.json
[11:34:26] <TheresNoTime>	 jouncebot: nowandnext
[11:34:26] <jouncebot>	 For the next 0 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1100)
[11:34:27] <jouncebot>	 In 2 hour(s) and 25 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1400)
[11:35:12] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Added cassandra IPs for restbase10[34-42] - volans@cumin1002"
[11:36:34] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Added cassandra IPs for restbase10[34-42] - volans@cumin1002"
[11:36:34] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:39:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P57516 and previous config saved to /var/cache/conftool/dbconfig/20240221-113914-marostegui.json
[11:45:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57517 and previous config saved to /var/cache/conftool/dbconfig/20240221-114546-root.json
[11:48:18] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P57518 and previous config saved to /var/cache/conftool/dbconfig/20240221-114817-arnaudb.json
[11:48:56] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57519 and previous config saved to /var/cache/conftool/dbconfig/20240221-114856-arnaudb.json
[11:49:10] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57520 and previous config saved to /var/cache/conftool/dbconfig/20240221-114909-arnaudb.json
[11:49:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57521 and previous config saved to /var/cache/conftool/dbconfig/20240221-114925-arnaudb.json
[11:54:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P57522 and previous config saved to /var/cache/conftool/dbconfig/20240221-115421-marostegui.json
[11:56:34] <icinga-wm_>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:56:58] <icinga-wm_>	 PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:57:38] <icinga-wm_>	 PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:58:30] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 80 connections established with conf2004.codfw.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal
[11:58:44] <icinga-wm_>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[11:59:27] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 84 connections established with conf1007.eqiad.wmnet:4001 (min=85) https://wikitech.wikimedia.org/wiki/PyBal
[11:59:27] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 114 connections established with conf1007.eqiad.wmnet:4001 (min=115) https://wikitech.wikimedia.org/wiki/PyBal
[11:59:39] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[11:59:40] <jinxer-wm>	 (KubernetesAPINotScrapable) resolved: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[12:00:13] <icinga-wm_>	 PROBLEM - PyBal connections to etcd on lvs2014 is CRITICAL: CRITICAL: 98 connections established with conf2004.codfw.wmnet:4001 (min=99) https://wikitech.wikimedia.org/wiki/PyBal
[12:00:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57523 and previous config saved to /var/cache/conftool/dbconfig/20240221-120051-root.json
[12:01:13] <akosiaris>	 !log restart pybal on lvs1020 to pick up mw-parsoid service. T357392
[12:01:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:01:37] <icinga-wm_>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:01:43] <stashbot>	 T357392: Create parsoid mediawiki deployment - https://phabricator.wikimedia.org/T357392
[12:02:01] <icinga-wm_>	 RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:02:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2026 T358080', diff saved to https://phabricator.wikimedia.org/P57524 and previous config saved to /var/cache/conftool/dbconfig/20240221-120202-root.json
[12:02:21] <akosiaris>	 !log restart pybal on lvs2014 to pick up mw-parsoid service. T357392
[12:02:23] <stashbot>	 T358080: Upgrade es2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358080
[12:02:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:03:24] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T357189)', diff saved to https://phabricator.wikimedia.org/P57525 and previous config saved to /var/cache/conftool/dbconfig/20240221-120324-arnaudb.json
[12:03:26] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[12:03:40] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance
[12:03:41] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[12:03:46] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2026.codfw.wmnet with OS bookworm
[12:03:46] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T357189)', diff saved to https://phabricator.wikimedia.org/P57526 and previous config saved to /var/cache/conftool/dbconfig/20240221-120345-arnaudb.json
[12:04:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57527 and previous config saved to /var/cache/conftool/dbconfig/20240221-120401-arnaudb.json
[12:04:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57528 and previous config saved to /var/cache/conftool/dbconfig/20240221-120414-arnaudb.json
[12:04:25] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1033.eqiad.wmnet with OS bookworm
[12:04:27] <icinga-wm_>	 RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 115 connections established with conf1007.eqiad.wmnet:4001 (min=115) https://wikitech.wikimedia.org/wiki/PyBal
[12:04:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57529 and previous config saved to /var/cache/conftool/dbconfig/20240221-120429-arnaudb.json
[12:05:13] <icinga-wm_>	 RECOVERY - PyBal connections to etcd on lvs2014 is OK: OK: 99 connections established with conf2004.codfw.wmnet:4001 (min=99) https://wikitech.wikimedia.org/wiki/PyBal
[12:07:49] <kart_>	 Deploying fix for cxserver..
[12:08:29] <icinga-wm_>	 RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 81 connections established with conf2004.codfw.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal
[12:08:43] <icinga-wm_>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:09:27] <icinga-wm_>	 RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 85 connections established with conf1007.eqiad.wmnet:4001 (min=85) https://wikitech.wikimedia.org/wiki/PyBal
[12:09:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T355609)', diff saved to https://phabricator.wikimedia.org/P57530 and previous config saved to /var/cache/conftool/dbconfig/20240221-120927-marostegui.json
[12:09:28] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1033
[12:09:30] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2153.codfw.wmnet with reason: Maintenance
[12:09:34] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[12:09:43] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2153.codfw.wmnet with reason: Maintenance
[12:09:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T355609)', diff saved to https://phabricator.wikimedia.org/P57531 and previous config saved to /var/cache/conftool/dbconfig/20240221-120949-marostegui.json
[12:09:55] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1033
[12:10:19] <akosiaris>	 !log restart pybal on lvs2013, lvs1019 to pick up mw-parsoid service. T357392
[12:10:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:28] <stashbot>	 T357392: Create parsoid mediawiki deployment - https://phabricator.wikimedia.org/T357392
[12:10:37] <wikibugs>	 (03Merged) 10jenkins-bot: mw-page-content-change-enrich: Switch to mw-api-int-async [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004156 (https://phabricator.wikimedia.org/T357785) (owner: 10Clément Goubert)
[12:10:40] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1003112 (https://phabricator.wikimedia.org/T356459) (owner: 10JHathaway)
[12:11:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T357189)', diff saved to https://phabricator.wikimedia.org/P57532 and previous config saved to /var/cache/conftool/dbconfig/20240221-121129-arnaudb.json
[12:11:36] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[12:12:19] <claime>	 !log mw-page-content-change-enrich: Switch to mw-api-int-async - T357785
[12:12:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:27] <stashbot>	 T357785: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785
[12:12:37] <icinga-wm_>	 RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[12:13:00] <wikibugs>	 10SRE, 10Data-Engineering, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785#9563142 (10BTullis) I'm happy for this change to go ahead. I'll keep an eye on the [[https://grafana-rw.wikimedia.org/d/K9x0c4aVk/flink...
[12:13:04] <wikibugs>	 (03CR) 10Muehlenhoff: "All backup-related hosts are fully migrated to Puppet 7 already :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1005437 (owner: 10Muehlenhoff)
[12:13:12] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:13:17] <wikibugs>	 (03PS5) 10Jelto: etherpad: make exporter and blackbox checks configurable [puppet] - 10https://gerrit.wikimedia.org/r/1005458 (https://phabricator.wikimedia.org/T316421)
[12:13:27] <wikibugs>	 (03Merged) 10jenkins-bot: Ganeti: pass the v4 and v6 IPs to the VM as fw_cfg [software/spicerack] - 10https://gerrit.wikimedia.org/r/1003491 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[12:13:42] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:14:24] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[12:14:45] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[12:15:07] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2026.codfw.wmnet with OS bookworm
[12:15:23] <wikibugs>	 (03CR) 10David Caro: toolforge: k8s: Do not log secrets to Puppet log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1005488 (owner: 10Majavah)
[12:15:32] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2026.codfw.wmnet with OS bookworm
[12:15:34] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:15:36] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "Just the question, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1005488 (owner: 10Majavah)
[12:15:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:16:05] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] toolforge: k8s: Do not log secrets to Puppet log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1005488 (owner: 10Majavah)
[12:16:09] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Add mw-parsoid [dns] - 10https://gerrit.wikimedia.org/r/1004138 (https://phabricator.wikimedia.org/T357392)
[12:16:31] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1005458 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto)
[12:18:14] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[12:18:46] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[12:18:49] <wikibugs>	 (03CR) 10Tim Starling: [C: 03+1] CommonSettings: Set $wgWikisourceHttpProxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005434 (https://phabricator.wikimedia.org/T357857) (owner: 10Samwilson)
[12:19:06] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Add mw-parsoid [dns] - 10https://gerrit.wikimedia.org/r/1004138 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris)
[12:19:06] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57533 and previous config saved to /var/cache/conftool/dbconfig/20240221-121906-arnaudb.json
[12:19:19] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57534 and previous config saved to /var/cache/conftool/dbconfig/20240221-121918-arnaudb.json
[12:19:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57535 and previous config saved to /var/cache/conftool/dbconfig/20240221-121934-arnaudb.json
[12:19:45] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[12:19:49] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[12:19:59] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:20:02] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[12:20:11] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:20:36] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[12:21:22] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2026.codfw.wmnet with OS bookworm
[12:21:43] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage
[12:21:47] <wikibugs>	 (03CR) 10Tim Starling: [C: 03+1] InitializeSettings: Add Wikisource logging channel to prod and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005435 (https://phabricator.wikimedia.org/T357857) (owner: 10Samwilson)
[12:22:50] <kart_>	 !log Updated cxserver to 2024-02-21-112101-production (T357769)
[12:22:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:55] <stashbot>	 T357769: cxserver "fetch segmented page content" API endpoint doesn't work for space-separated multi-word titles - https://phabricator.wikimedia.org/T357769
[12:22:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netbox, and 2 others: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428#9563208 (10ayounsi) {T358096} for the Cassandra/extra IPs usecase.
[12:24:07] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage
[12:24:21] <logmsgbot>	 !log akosiaris@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=mw-parsoid,name=codfw
[12:26:36] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P57536 and previous config saved to /var/cache/conftool/dbconfig/20240221-122636-arnaudb.json
[12:26:42] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "Thank you, please deploy at will" [puppet] - 10https://gerrit.wikimedia.org/r/1005437 (owner: 10Muehlenhoff)
[12:30:35] <wikibugs>	 (03PS5) 10Alexandros Kosiaris: mw-parsoid: Switch to production and have it page [puppet] - 10https://gerrit.wikimedia.org/r/1004155 (https://phabricator.wikimedia.org/T357392)
[12:30:38] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] "Will you take care of dropping the views too or should I?" [puppet] - 10https://gerrit.wikimedia.org/r/1005438 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui)
[12:30:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] mw-parsoid: Switch to production and have it page [puppet] - 10https://gerrit.wikimedia.org/r/1004155 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris)
[12:31:50] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9563252 (10Clement_Goubert)
[12:33:10] <wikibugs>	 10SRE, 10Data-Engineering, 10MW-on-K8s, 10serviceops: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785#9563249 (10Clement_Goubert) 05In progress→03Resolved I can confirm that mw-page-content-enrich now requests from mw-api-int (blue) and not the appserver...
[12:33:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9563260 (10MoritzMuehlenhoff)
[12:34:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57537 and previous config saved to /var/cache/conftool/dbconfig/20240221-123410-arnaudb.json
[12:34:24] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57538 and previous config saved to /var/cache/conftool/dbconfig/20240221-123423-arnaudb.json
[12:34:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57539 and previous config saved to /var/cache/conftool/dbconfig/20240221-123439-arnaudb.json
[12:36:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T355609)', diff saved to https://phabricator.wikimedia.org/P57540 and previous config saved to /var/cache/conftool/dbconfig/20240221-123615-marostegui.json
[12:36:28] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[12:36:45] <wikibugs>	 (03PS1) 10Muehlenhoff: acmechief: Remove obsolete entries from apt record [puppet] - 10https://gerrit.wikimedia.org/r/1005498 (https://phabricator.wikimedia.org/T331613)
[12:36:56] <wikibugs>	 (03PS2) 10Muehlenhoff: acmechief: Remove obsolete entries from apt record [puppet] - 10https://gerrit.wikimedia.org/r/1005498 (https://phabricator.wikimedia.org/T331613)
[12:38:18] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Figure out next steps for cergen in Puppet setup - https://phabricator.wikimedia.org/T357750#9563263 (10MoritzMuehlenhoff)
[12:39:38] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:41:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P57541 and previous config saved to /var/cache/conftool/dbconfig/20240221-124142-arnaudb.json
[12:44:14] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] sre.hosts.reimage: Fix dry-run failure [cookbooks] - 10https://gerrit.wikimedia.org/r/1005112 (owner: 10Clément Goubert)
[12:46:38] <wikibugs>	 (03CR) 10Marostegui: "I'd prefer if you merge this and drop the views too." [puppet] - 10https://gerrit.wikimedia.org/r/1005438 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui)
[12:48:08] <TheresNoTime>	 jouncebot: nowandnext
[12:48:08] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 11 minute(s)
[12:48:08] <jouncebot>	 In 1 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1400)
[12:48:49] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hosts.reimage: Fix dry-run failure [cookbooks] - 10https://gerrit.wikimedia.org/r/1005112 (owner: 10Clément Goubert)
[12:49:16] <wikibugs>	 (03PS4) 10Samtar: InitialiseSettings: Enable Edit Recovery on 3 projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004736 (https://phabricator.wikimedia.org/T355548)
[12:51:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P57542 and previous config saved to /var/cache/conftool/dbconfig/20240221-125121-marostegui.json
[12:51:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004736 (https://phabricator.wikimedia.org/T355548) (owner: 10Samtar)
[12:52:18] <wikibugs>	 (03PS2) 10Tim Starling: beta: Switch block schema to read-new/write-both mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998625 (https://phabricator.wikimedia.org/T355034)
[12:52:36] <wikibugs>	 (03Merged) 10jenkins-bot: InitialiseSettings: Enable Edit Recovery on 3 projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004736 (https://phabricator.wikimedia.org/T355548) (owner: 10Samtar)
[12:52:47] <wikibugs>	 (03CR) 10Tim Starling: [C: 03+2] beta: Switch block schema to read-new/write-both mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998625 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling)
[12:52:54] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1033.eqiad.wmnet with OS bookworm
[12:53:05] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9563291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1033.eqiad.wmnet with OS book...
[12:53:29] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Switch block schema to read-new/write-both mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998625 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling)
[12:53:55] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:1004736|InitialiseSettings: Enable Edit Recovery on 3 projects (T355548)]]
[12:54:00] <stashbot>	 T355548: Edit Recovery deployment - https://phabricator.wikimedia.org/T355548
[12:54:36] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] acmechief: Remove obsolete entries from apt record [puppet] - 10https://gerrit.wikimedia.org/r/1005498 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff)
[12:55:33] <logmsgbot>	 !log samtar@deploy2002 samtar: Backport for [[gerrit:1004736|InitialiseSettings: Enable Edit Recovery on 3 projects (T355548)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:55:41] * TheresNoTime testing, a few minutes
[12:56:32] <wikibugs>	 10SRE, 10User-aborrero: reimage cookbook: failure when - https://phabricator.wikimedia.org/T358099#9563302 (10aborrero)
[12:56:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T357189)', diff saved to https://phabricator.wikimedia.org/P57543 and previous config saved to /var/cache/conftool/dbconfig/20240221-125648-arnaudb.json
[12:56:51] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[12:56:54] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[12:57:03] <wikibugs>	 10SRE, 10User-aborrero: reimage cookbook: failure when updating netbox data from puppetdb on cloudvirt1033 - https://phabricator.wikimedia.org/T358099#9563314 (10aborrero)
[12:57:05] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[12:57:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T357189)', diff saved to https://phabricator.wikimedia.org/P57544 and previous config saved to /var/cache/conftool/dbconfig/20240221-125711-arnaudb.json
[12:57:54] <Daimona>	 !log T357007 Running mwscript /home/daimona/GenerateInvitationList.php --wiki=metawiki --listfile=/home/daimona/list.txt (same as current master)
[12:57:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:57:59] <stashbot>	 T357007: Generate Invitation Lists for Event Organizers - https://phabricator.wikimedia.org/T357007
[13:00:15] <logmsgbot>	 !log samtar@deploy2002 samtar: Continuing with sync
[13:02:48] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudvirt1033 - aborrero@cumin1002"
[13:03:24] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:03:37] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudvirt1033 - aborrero@cumin1002"
[13:04:01] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] acmechief: Remove obsolete entries from apt record [puppet] - 10https://gerrit.wikimedia.org/r/1005498 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff)
[13:04:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T357189)', diff saved to https://phabricator.wikimedia.org/P57545 and previous config saved to /var/cache/conftool/dbconfig/20240221-130450-arnaudb.json
[13:05:08] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[13:06:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P57546 and previous config saved to /var/cache/conftool/dbconfig/20240221-130628-marostegui.json
[13:07:52] <wikibugs>	 (03PS1) 10JMeybohm: New upstream version v1.0.0-8522c38 [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616)
[13:07:55] <jinxer-wm>	 (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:08:32] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:1004736|InitialiseSettings: Enable Edit Recovery on 3 projects (T355548)]] (duration: 14m 36s)
[13:08:43] <stashbot>	 T355548: Edit Recovery deployment - https://phabricator.wikimedia.org/T355548
[13:08:49] <wikibugs>	 (03PS2) 10JMeybohm: New upstream version v1.0.0-8522c38 [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616)
[13:11:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report
[13:11:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[13:11:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host apt2002.wikimedia.org
[13:11:41] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[13:13:47] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM apt2002.wikimedia.org - jmm@cumin2002"
[13:13:50] <wikibugs>	 (03CR) 10JMeybohm: "Not by this, though. This will enable scraping of rsyslog_action metrics for omkafka actions we define." [puppet] - 10https://gerrit.wikimedia.org/r/1005449 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm)
[13:14:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM apt2002.wikimedia.org - jmm@cumin2002"
[13:14:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:14:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache apt2002.wikimedia.org on all recursors
[13:14:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) apt2002.wikimedia.org on all recursors
[13:14:44] <wikibugs>	 (03CR) 10JMeybohm: "For (way too broad) PCC, see: https://puppet-compiler.wmflabs.org/output/1005449/1413/" [puppet] - 10https://gerrit.wikimedia.org/r/1005449 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm)
[13:15:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM apt2002.wikimedia.org - jmm@cumin2002"
[13:16:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM apt2002.wikimedia.org - jmm@cumin2002"
[13:16:05] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] kafka_shipper: Name omkafka actions to ingest metrics [puppet] - 10https://gerrit.wikimedia.org/r/1005449 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm)
[13:18:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host apt2002.wikimedia.org with OS bookworm
[13:19:09] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate apt repository to bookworm - https://phabricator.wikimedia.org/T331613#9563370 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host apt2002.wikimedia.org with OS bookworm
[13:19:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P57547 and previous config saved to /var/cache/conftool/dbconfig/20240221-131957-arnaudb.json
[13:21:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T355609)', diff saved to https://phabricator.wikimedia.org/P57548 and previous config saved to /var/cache/conftool/dbconfig/20240221-132134-marostegui.json
[13:21:37] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance
[13:21:41] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[13:21:50] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance
[13:21:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T355609)', diff saved to https://phabricator.wikimedia.org/P57549 and previous config saved to /var/cache/conftool/dbconfig/20240221-132156-marostegui.json
[13:22:16] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[13:22:37] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[13:32:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on apt2002.wikimedia.org with reason: host reimage
[13:34:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apt2002.wikimedia.org with reason: host reimage
[13:35:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P57550 and previous config saved to /var/cache/conftool/dbconfig/20240221-133503-arnaudb.json
[13:37:08] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9563419 (10aborrero)
[13:37:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove cumin1001 from list of Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/1005401 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff)
[13:38:15] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#8416726 (10aborrero)
[13:39:47] <wikibugs>	 10SRE, 10Data-Persistence, 10Infrastructure-Foundations: Re-IP db servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T354878#9563430 (10Marostegui)
[13:39:50] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2142.codfw.wmnet,db[1180,1213].eqiad.wmnet with reason: Silence for reboot T356240
[13:40:06] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2142.codfw.wmnet,db[1180,1213].eqiad.wmnet with reason: Silence for reboot T356240
[13:40:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T356240 - depooling db1180 db1213 db2142', diff saved to https://phabricator.wikimedia.org/P57551 and previous config saved to /var/cache/conftool/dbconfig/20240221-134015-arnaudb.json
[13:40:23] <Dreamy_Jazz>	 !log Re-started MediaModeration scanning script using `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30-no-render-now.txt` - See T351400
[13:40:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:40:28] <stashbot>	 T351400: Run the maintenance script scanning images in mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400
[13:40:45] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1180.eqiad.wmnet
[13:41:05] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1213.eqiad.wmnet
[13:41:17] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2142.codfw.wmnet
[13:41:42] <wikibugs>	 (03PS1) 10Muehlenhoff: Drop not obsolete motd for cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/1005511 (https://phabricator.wikimedia.org/T353419)
[13:42:38] <wikibugs>	 (03CR) 10Filippo Giunchedi: "LGTM, see inline" [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm)
[13:42:46] <wikibugs>	 (03PS2) 10Muehlenhoff: Drop obsolete motd for cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/1005511 (https://phabricator.wikimedia.org/T353419)
[13:44:17] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1213.eqiad.wmnet
[13:45:20] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1180.eqiad.wmnet
[13:45:30] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1033: move to single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1005513 (https://phabricator.wikimedia.org/T319184)
[13:46:06] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57552 and previous config saved to /var/cache/conftool/dbconfig/20240221-134605-arnaudb.json
[13:46:25] <jinxer-wm>	 (SystemdUnitFailed) firing: ferm.service on kubernetes2016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:46:39] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2142.codfw.wmnet
[13:47:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57553 and previous config saved to /var/cache/conftool/dbconfig/20240221-134724-arnaudb.json
[13:47:41] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:48:33] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:49:15] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] cloudvirt1033: move to single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1005513 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[13:49:50] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1033: move to single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1005513 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez)
[13:50:10] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T357189)', diff saved to https://phabricator.wikimedia.org/P57554 and previous config saved to /var/cache/conftool/dbconfig/20240221-135009-arnaudb.json
[13:50:11] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[13:50:16] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[13:50:25] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[13:50:32] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T357189)', diff saved to https://phabricator.wikimedia.org/P57555 and previous config saved to /var/cache/conftool/dbconfig/20240221-135031-arnaudb.json
[13:50:43] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1033.eqiad.wmnet with OS bookworm
[13:50:54] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9563463 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1033.eqiad.wmnet with OS...
[13:51:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:51:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) ferm.service on kubernetes2016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:52:05] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2016 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:56:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[13:59:14] <topranks>	 !log adding IRB anycast interface on private1-a-codfw vlan to lsw1-a4-codfw 
[13:59:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2057:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2057 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:59:56] * Lucas_WMDE will not be available during the backport+config window btw
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1400)
[14:00:05] <jouncebot>	 koi, anzx, and hoo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:52] * TheresNoTime can't deploy in this window today, sorry!
[14:01:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57556 and previous config saved to /var/cache/conftool/dbconfig/20240221-140110-arnaudb.json
[14:01:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:01:17] <koi>	 :(
[14:01:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T357189)', diff saved to https://phabricator.wikimedia.org/P57557 and previous config saved to /var/cache/conftool/dbconfig/20240221-140120-arnaudb.json
[14:01:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) ferm.service on kubernetes2016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:01:44] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[14:02:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57558 and previous config saved to /var/cache/conftool/dbconfig/20240221-140229-arnaudb.json
[14:03:16] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw2297 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:04:22] <TheresNoTime>	 lemme see if I can move things around
[14:05:02] <TheresNoTime>	 Okay, I can deploy. koi your patch is first
[14:05:08] <wikibugs>	 (PS4) Samtar: zhwiki: Create group ipblock-exempt-grantor [mediawiki-config] - https://gerrit.wikimedia.org/r/1005109 (https://phabricator.wikimedia.org/T357991) (owner: Stang)
[14:05:53] <logmsgbot>	 !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host apt2002.wikimedia.org with OS bookworm
[14:05:53] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host apt2002.wikimedia.org
[14:05:56] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:05:58] <wikibugs>	 SRE, Infrastructure-Foundations: Migrate apt repository to bookworm - https://phabricator.wikimedia.org/T331613#9563494 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host apt2002.wikimedia.org with OS bookworm executed with errors: - apt2002 (**FAIL**)   - Removed...
[14:06:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:06:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:07:40] <logmsgbot>	 !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage
[14:08:25] <jinxer-wm>	 (SystemdUnitFailed) firing: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:08:25] <claime>	 !log restarted ferm.service on kubernetes2055.codfw.wmnet mw2440.codfw.wmnet mw2297.codfw.wmnet kubernetes2016.codfw.wmnet - T354855
[14:08:27] <TheresNoTime>	 I'm just waiting a moment because of those jobqueue errors, there's quite a few
[14:08:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:08:30] <stashbot>	 T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855
[14:10:20] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage
[14:10:46] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1005109 (https://phabricator.wikimedia.org/T357991) (owner: Stang)
[14:10:51] <jinxer-wm>	 (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:11:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:11:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: (5) httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:11:42] <wikibugs>	 (PS3) JMeybohm: New upstream version v1.0.0-8522c38 [debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616)
[14:11:47] <wikibugs>	 (Merged) jenkins-bot: zhwiki: Create group ipblock-exempt-grantor [mediawiki-config] - https://gerrit.wikimedia.org/r/1005109 (https://phabricator.wikimedia.org/T357991) (owner: Stang)
[14:12:11] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:1005109|zhwiki: Create group ipblock-exempt-grantor (T357991)]]
[14:12:19] <stashbot>	 T357991: Create ipblock exempt granter group on zhwiki - https://phabricator.wikimedia.org/T357991
[14:12:23] <wikibugs>	 (CR) JMeybohm: New upstream version v1.0.0-8522c38 (1 comment) [debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616) (owner: JMeybohm)
[14:13:00] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9563515 (Jhancock.wm)
[14:13:38] <claime>	 These jobqueue errors started exactly when Dreamy_Jazz re-started the MediaModeration scanning script
[14:13:41] <logmsgbot>	 !log samtar@deploy2002 stang and samtar: Backport for [[gerrit:1005109|zhwiki: Create group ipblock-exempt-grantor (T357991)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:13:44] <TheresNoTime>	 koi: ready for testing on mwdebug
[14:13:49] <koi>	 looking
[14:14:07] <claime>	 I wonder if there's a link, hnowlan did you see that kind of correlation before when we had error spikes on the jobrunners
[14:14:20] <claime>	 It's all "Could not enqueue jobs" errors
[14:14:36] <wikibugs>	 (CR) JMeybohm: "recheck" [debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616) (owner: JMeybohm)
[14:15:04] <koi>	 TheresNoTime, lgtm
[14:15:07] <wikibugs>	 (PS1) Brouberol: Add a sidecar pod to superset for serving static assets [deployment-charts] - https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890) (owner: Btullis)
[14:15:08] <wikibugs>	 SRE, ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T357445#9563522 (Jhancock.wm) Open→Resolved a:Jhancock.wm alerts cleared
[14:15:10] <wikibugs>	 (CR) Brouberol: "LGTM, except for some required image tag updates in the helmfile values." [deployment-charts] - https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890) (owner: Btullis)
[14:15:15] <logmsgbot>	 !log samtar@deploy2002 stang and samtar: Continuing with sync
[14:15:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T355609)', diff saved to https://phabricator.wikimedia.org/P57559 and previous config saved to /var/cache/conftool/dbconfig/20240221-141523-marostegui.json
[14:15:29] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[14:15:34] <TheresNoTime>	 anzx: you're up next, just noticed that https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1005085 is a WIP?
[14:15:51] <jinxer-wm>	 (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[14:15:55] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:16:14] <wikibugs>	 (PS3) Anzx: cswiki, commonswiki, enwiki: Lift IP cap for WikiGap Editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/1005085
[14:16:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57560 and previous config saved to /var/cache/conftool/dbconfig/20240221-141615-arnaudb.json
[14:16:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P57561 and previous config saved to /var/cache/conftool/dbconfig/20240221-141627-arnaudb.json
[14:16:43] <anzx>	 TheresNoTime: now marked it as active 
[14:16:50] <TheresNoTime>	 ack :)
[14:16:58] <wikibugs>	 (PS4) Samtar: mywiki: create portal and draft namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/990077 (https://phabricator.wikimedia.org/T352424) (owner: Anzx)
[14:17:03] <wikibugs>	 (PS4) Samtar: cswiki, commonswiki, enwiki: Lift IP cap for WikiGap Editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/1005085 (owner: Anzx)
[14:17:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57562 and previous config saved to /var/cache/conftool/dbconfig/20240221-141734-arnaudb.json
[14:19:40] <wikibugs>	 (PS4) JMeybohm: New upstream version v1.0.0-8522c38 [debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616)
[14:20:43] <Dreamy_Jazz>	 claime: Where is the data for errors with the job queue?
[14:20:53] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new apt server in codfw - jmm@cumin2002 - T331613"
[14:20:58] <stashbot>	 T331613: Migrate apt repository to bookworm - https://phabricator.wikimedia.org/T331613
[14:21:05] <Dreamy_Jazz>	 Oh it might be logstash?
[14:21:16] <claime>	 Dreamy_Jazz: https://logstash.wikimedia.org/goto/215d37acfb142f299fd51816688e0ea6
[14:21:18] <claime>	 yep
[14:21:30] <Dreamy_Jazz>	 I hadn't seen anything on the grafana dashboards
[14:21:36] <Dreamy_Jazz>	 So wondered where it was
[14:21:38] <Dreamy_Jazz>	 Thanks.
[14:21:59] <claime>	 I'm not seeing anything in the jobqueue dashboard either
[14:22:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new apt server in codfw - jmm@cumin2002 - T331613"
[14:22:05] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2016 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:22:28] <Dreamy_Jazz>	 The MediaModeration scanning script is running fine according to the dashboard for it and the events should only be occurring on commonswiki if it was that script.
[14:22:49] <Dreamy_Jazz>	 https://grafana.wikimedia.org/d/STSXVVdSk/mediamoderation-photodna-stats?orgId=1&refresh=5m
[14:22:58] <claime>	 Dreamy_Jazz: ack
[14:23:17] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:1005109|zhwiki: Create group ipblock-exempt-grantor (T357991)]] (duration: 11m 05s)
[14:23:21] <TheresNoTime>	 koi: live
[14:23:22] <stashbot>	 T357991: Create ipblock exempt granter group on zhwiki - https://phabricator.wikimedia.org/T357991
[14:23:28] <koi>	 ty
[14:23:29] <hnowlan>	 sorry, looking now - wonder if this is a rerun of the eventgate issues we saw yesterday
[14:23:30] <claime>	 I was basing it on timing 
[14:23:49] <wikibugs>	 ops-codfw, serviceops: Degraded RAID on mw2442 - https://phabricator.wikimedia.org/T357380#9563553 (Jhancock.wm) SR185570210 requested replacement disk from dell
[14:24:06] <claime>	 hnowlan: did they manifest in some way on the eventgate grafana dashboard?
[14:24:06] <TheresNoTime>	 I intend to continue deploying, is that okay?
[14:24:18] <hnowlan>	 claime: annoyingly no, only on the envoy telemetry
[14:24:20] <claime>	 TheresNoTime: yeah yeah go ahead
[14:24:23] <hnowlan>	 looks clear on the eventgate side
[14:24:23] <TheresNoTime>	 :)
[14:24:30] <TheresNoTime>	 anzx: going to run your two patches together
[14:24:36] <anzx>	 Ok
[14:24:40] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host es2026.codfw.wmnet with OS bookworm
[14:24:53] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1005085 (owner: Anzx)
[14:24:55] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/990077 (https://phabricator.wikimedia.org/T352424) (owner: Anzx)
[14:25:15] <wikibugs>	 (CR) Ssingh: "Sorry, I forgot this in the earlier review: we will need you to be in the ops group as well here." [puppet] - https://gerrit.wikimedia.org/r/1005122 (owner: CDobbins)
[14:25:37] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: db2097 rebooted itself - https://phabricator.wikimedia.org/T357878#9563556 (Jhancock.wm) The last maintenance I'm aware of on that machine was on the 15th. We migrated the server to the new leaf switch. I am not aware of any reason it would have be...
[14:25:44] <wikibugs>	 (Merged) jenkins-bot: cswiki, commonswiki, enwiki: Lift IP cap for WikiGap Editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/1005085 (owner: Anzx)
[14:25:46] <wikibugs>	 (Merged) jenkins-bot: mywiki: create portal and draft namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/990077 (https://phabricator.wikimedia.org/T352424) (owner: Anzx)
[14:25:56] <wikibugs>	 SRE, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357944#9563560 (Jhancock.wm) Open→Resolved a:Jhancock.wm known issue with no impact.
[14:26:14] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:1005085|cswiki, commonswiki, enwiki: Lift IP cap for WikiGap Editathon]], [[gerrit:990077|mywiki: create portal and draft namespace (T352424)]]
[14:26:19] <stashbot>	 T352424: Create Portal and Draft namespaces in mywiki - https://phabricator.wikimedia.org/T352424
[14:26:29] <wikibugs>	 (PS17) MVernon: convert-disks: update cookbook to reimage ms-be with new partition schema [cookbooks] - https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[14:26:59] <wikibugs>	 (CR) MVernon: convert-disks: update cookbook to reimage ms-be with new partition schema (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[14:27:04] <hnowlan>	 looks like the spiking jobrunner errors are mostly cirrussearch related 
[14:27:43] <logmsgbot>	 !log samtar@deploy2002 samtar and anzx: Backport for [[gerrit:1005085|cswiki, commonswiki, enwiki: Lift IP cap for WikiGap Editathon]], [[gerrit:990077|mywiki: create portal and draft namespace (T352424)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:27:53] <anzx>	 TheresNoTime: testing 
[14:27:57] <TheresNoTime>	 ack
[14:28:26] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence-Backup: db2097 rebooted itself - https://phabricator.wikimedia.org/T357878#9563594 (jcrespo) Open→Resolved Thanks for the reply, the reboot happened on the 17, so no relation to that.  Host has been repopulated from backups, new stale backups gener...
[14:28:38] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:28:41] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2046.codfw.wmnet, kubernetes2007.codfw.wmnet, mw2420.codfw.wmnet, mw2378.codfw.wmnet, kubernetes2032.codfw.wmnet, mw2312.codfw.wmnet, mw2356.codfw.wmnet, mw2423.codfw.wmnet, kubernetes2024.codfw.wmnet, mw2447.codfw.wmnet, mw2370.codfw.wmnet, mw2421.codfw.wmnet, kubernetes2048.codfw.wmnet, kubernetes2028.codfw.wmnet, m
[14:28:41] <icinga-wm_>	 fw.wmnet, mw2437.codfw.wmnet, mw2445.codfw.wmnet, mw2381.codfw.wmnet, mw2435.codfw.wmnet, kubernetes2018.codfw.wmnet, mw2318.codfw.wmnet, kubernetes2050.codfw.wmnet, kubernetes2023.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2005.codfw.wmnet, mw2366.codfw.wmnet, mw2425.codfw.wmnet, mw2430.codfw.wmnet, kubernetes2042.codfw.wmnet, kubernetes2041.codfw.wmnet, kubernetes2053.codfw.wmnet, mw2354.codfw.wmnet, kubernetes2057.codfw.wmnet,
[14:28:41] <icinga-wm_>	 es2060.codfw.wmnet, mw2350.codfw.wmnet, kubernetes2058.codfw.wmnet, mw2282.codfw.wmnet, kubernetes2054.codfw.wmnet, kubernetes2020.codfw.wmnet, mw2436.codfw.wmnet, mw2310.codfw.wmnet, k https://wikitech.wikimedia.org/wiki/PyBal
[14:28:43] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet are marked down but pooled: thumbor_8800: Servers mw2424.codfw.wmnet, mw2420.codfw.wmnet, mw2378.codfw.wmnet, mw2294.codfw.wmnet, kubernetes2024.codfw.wmnet, mw2447.codfw.wmnet, mw2370.codfw.wmnet, kubernetes2034.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2048.codfw.wmnet, kubernetes2016.codfw.w
[14:28:43] <icinga-wm_>	 435.codfw.wmnet, kubernetes2018.codfw.wmnet, mw2297.codfw.wmnet, kubernetes2050.codfw.wmnet, mw2431.codfw.wmnet, kubernetes2055.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2025.codfw.wmnet, kubernetes2030.codfw.wmnet, kubernetes2039.codfw.wmnet, kubernetes2054.codfw.wmnet, mw2434.codfw.wmnet, kubernetes2020.codfw.wmnet, mw2449.codfw.wmnet, mw2368.codfw.wmnet, mw2356.codfw.wmnet, mw2429.codfw.wmnet, mw24
[14:28:43] <icinga-wm_>	 wmnet, kubernetes2042.codfw.wmnet, kubernetes2013.codfw.wmnet, mw2406.codfw.wmnet, mw2267.codfw.wmnet, kubernetes2044.codfw.wmnet, mw2317.codfw.wmnet, kubernetes2051.codfw.wmnet, mw2380 https://wikitech.wikimedia.org/wiki/PyBal
[14:28:55] <sukhe>	 er
[14:28:56] <TheresNoTime>	 ....
[14:28:57] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:28:58] <anzx>	 TheresNoTime: looks good 
[14:29:00] <claime>	 yikes
[14:29:05] <sukhe>	 page!
[14:29:19] <sukhe>	 I have ACKed
[14:29:25] <TheresNoTime>	 anzx: not going to continue the sync at the moment per ^
[14:29:30] <sukhe>	 yeah thanks
[14:29:38] <hnowlan>	 queues are full 
[14:29:41] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:29:43] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:29:44] <hnowlan>	 in codfw
[14:29:59] <claime>	 thumbor queues?
[14:30:02] <hnowlan>	 qps is up 10x or so 
[14:30:03] <hnowlan>	 yeah
[14:30:19] <hnowlan>	 for the short term I'll add more replicas
[14:30:23] <jelto>	 thanks for acking, I'm also looking
[14:30:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P57563 and previous config saved to /var/cache/conftool/dbconfig/20240221-143030-marostegui.json
[14:30:34] <sukhe>	 hnowlan: thanks and looks like quite the spike
[14:30:55] <wikibugs>	 (PS1) Hnowlan: thumbor: add replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1005517
[14:31:00] <hnowlan>	 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1005517
[14:31:06] <hnowlan>	 heh bit redundant
[14:31:08] <wikibugs>	 (CR) Ssingh: [C: +1] thumbor: add replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1005517 (owner: Hnowlan)
[14:31:17] <claime>	 big ghostscript spike apparently
[14:31:19] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] thumbor: add replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1005517 (owner: Hnowlan)
[14:31:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57564 and previous config saved to /var/cache/conftool/dbconfig/20240221-143120-arnaudb.json
[14:31:27] <hnowlan>	 can someone check the jobqueue for a spike also 
[14:31:28] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/1005517 (owner: Hnowlan)
[14:31:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P57565 and previous config saved to /var/cache/conftool/dbconfig/20240221-143133-arnaudb.json
[14:31:36] <wikibugs>	 (CR) MVernon: [C: +2] convert-disks: update cookbook to reimage ms-be with new partition schema [cookbooks] - https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[14:32:10] <cdanis>	 hnowlan: not sure if this helps but the increased / expensive codfw thumbor traffic looks to be about 80% ghostscript 20% djvu
[14:32:20] <wikibugs>	 (CR) Hnowlan: [C: +2] thumbor: add replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1005517 (owner: Hnowlan)
[14:32:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57566 and previous config saved to /var/cache/conftool/dbconfig/20240221-143239-arnaudb.json
[14:32:49] <hnowlan>	 that probably points to something automated 
[14:33:09] <wikibugs>	 (Merged) jenkins-bot: thumbor: add replicas [deployment-charts] - https://gerrit.wikimedia.org/r/1005517 (owner: Hnowlan)
[14:33:12] <anzx>	 TheresNoTime: ok will wait, would it be possible to take a look at T356686?
[14:33:12] <stashbot>	 T356686: or.wikipedia - Allowing only logged-in users with over 10 edits to create new articles - https://phabricator.wikimedia.org/T356686
[14:33:15] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw2297 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:33:18] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[14:33:21] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[14:33:30] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[14:33:38] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:33:50] <hnowlan>	 yeah queues already back down
[14:33:57] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:34:15] <TheresNoTime>	 anzx: I'll put T356686 on my todo for later :)
[14:34:21] <anzx>	 Thanks 
[14:34:22] <jelto>	 and pa.ge resolved again.
[14:34:25] <sukhe>	 thanks hnowlan
[14:34:41] <jinxer-wm>	 (NELNotReported) firing: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[14:34:46] <claime>	 I'm not seeing a job spike
[14:34:51] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[14:34:56] <hnowlan>	 yeah mediamoderation jobs didn't spike or anything
[14:35:06] <hnowlan>	 possibly some kind of commons bulk upload maybe 
[14:35:44] <wikibugs>	 (CR) Majavah: [C: +2] "OK! I will update here when this is deployed everywhere" [puppet] - https://gerrit.wikimedia.org/r/1005438 (https://phabricator.wikimedia.org/T356838) (owner: Marostegui)
[14:35:53] <wikibugs>	 (Merged) jenkins-bot: convert-disks: update cookbook to reimage ms-be with new partition schema [cookbooks] - https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[14:36:05] <wikibugs>	 (CR) Marostegui: "Thank you - much appreciated" [puppet] - https://gerrit.wikimedia.org/r/1005438 (https://phabricator.wikimedia.org/T356838) (owner: Marostegui)
[14:36:18] <TheresNoTime>	 sukhe: can I continue with the deployment window?
[14:36:39] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1005511 (https://phabricator.wikimedia.org/T353419) (owner: Muehlenhoff)
[14:36:55] <icinga-wm_>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[14:37:07] <claime>	 thumbor is still processing a lot of images but the queues are ok with the replicas increase
[14:37:16] <claime>	 I'd say you can go ahead TheresNoTime 
[14:37:18] <sukhe>	 ok
[14:37:21] <TheresNoTime>	 ack 
[14:37:24] <logmsgbot>	 !log samtar@deploy2002 samtar and anzx: Continuing with sync
[14:37:32] <sukhe>	 I was going to say wait a bit but should be fine 
[14:37:36] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for private1-a-codfw - cmooney@cumin1002"
[14:38:27] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for private1-a-codfw - cmooney@cumin1002"
[14:38:27] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:38:37] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:24] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to Superset for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T358091#9563663 (Pppery)
[14:39:41] <jinxer-wm>	 (NELNotReported) resolved: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported
[14:40:02] <logmsgbot>	 !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1033.eqiad.wmnet with OS bookworm
[14:40:10] <wikibugs>	 SRE, ops-codfw, serviceops: Issues reimaging servers in codfw - https://phabricator.wikimedia.org/T358001#9563665 (Jhancock.wm) @hnowlan I've replaced the network cable on both of these. These are both connected to a 1G switch so there is no SFP to replace in this case.   If this does not fix the iss...
[14:40:20] <wikibugs>	 SRE, Infrastructure-Foundations, cloud-services-team, netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9563666 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1033.eqiad.wmnet with OS book...
[14:42:04] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2026.codfw.wmnet with reason: host reimage
[14:42:26] <wikibugs>	 (CR) MVernon: [C: +2] convert-disks: update cookbook to reimage ms-be with new partition schema (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: Jbond)
[14:43:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:43:36] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[14:44:32] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[14:44:50] <hnowlan>	 I suspect the thumbor choke was a bunch of djvu/pdf files being thumbnailed all at once 
[14:44:56] <hnowlan>	 I dunno what causes a surge like that though 
[14:44:59] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2026.codfw.wmnet with reason: host reimage
[14:45:29] <claime>	 it's trending up again 
[14:45:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P57567 and previous config saved to /var/cache/conftool/dbconfig/20240221-144536-marostegui.json
[14:46:37] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:1005085|cswiki, commonswiki, enwiki: Lift IP cap for WikiGap Editathon]], [[gerrit:990077|mywiki: create portal and draft namespace (T352424)]] (duration: 20m 23s)
[14:46:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T357189)', diff saved to https://phabricator.wikimedia.org/P57568 and previous config saved to /var/cache/conftool/dbconfig/20240221-144641-arnaudb.json
[14:46:43] <stashbot>	 T352424: Create Portal and Draft namespaces in mywiki - https://phabricator.wikimedia.org/T352424
[14:46:43] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[14:46:43] <TheresNoTime>	 anzx: live, going to run those namespaceDupes now
[14:46:48] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[14:46:54] <icinga-wm_>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[14:46:56] <claime>	 hnowlan: thumbnailrender didn't spike, but run duration is going up
[14:46:56] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[14:47:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T357189)', diff saved to https://phabricator.wikimedia.org/P57569 and previous config saved to /var/cache/conftool/dbconfig/20240221-144702-arnaudb.json
[14:47:35] <TheresNoTime>	 !log [samtar@mwmaint2002 ~]$ mwscript namespaceDupes.php --wiki hewikinews --fix #T349581
[14:47:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:40] <stashbot>	 T349581: Create draft namespace and add namespaces aliases for hewikinews - https://phabricator.wikimedia.org/T349581
[14:47:57] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:48:03] <claime>	 here we go again
[14:48:05] <hnowlan>	 claime: yeah, thumbnailrender doesn't get called for sub-pages of doc
[14:48:07] <jelto>	 5xx rate is going up again for thumbor
[14:48:14] <hnowlan>	 last apply failed for thumbor btw 
[14:48:27] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[14:48:29] <hnowlan>	 trying again 
[14:48:33] <jelto>	 not enough resources?
[14:48:36] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:48:52] * kamila_ here if you need more hands
[14:48:57] <hnowlan>	 quota exceeded 
[14:49:04] <jelto>	 yes that's what I mean
[14:49:05] <sukhe>	 ACKed in
[14:49:06] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[14:49:07] <sukhe>	 again
[14:49:28] <anzx>	 TheresNoTime: also namespacedupes for mywiki, thanks
[14:49:28] <TheresNoTime>	 Would y'all like me to stop the handful of namespaceDupes runs I need to?
[14:49:30] <hnowlan>	 we can reintroduce expensive format throttling 
[14:51:23] <wikibugs>	 SRE, Content-Transform-Team, MW-on-K8s, Traffic, and 2 others: Create parsoid mediawiki deployment - https://phabricator.wikimedia.org/T357392#9563708 (akosiaris) Open→In progress p:Triage→Medium
[14:51:27] <claime>	 I don't think namespaceDupes has an impact on thumbnailing, does it?
[14:51:31] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9563710 (akosiaris)
[14:51:53] <claime>	 hnowlan: we may need to :/
[14:51:57] <wikibugs>	 (PS1) Hnowlan: thumbor: reenable expensive throttling [deployment-charts] - https://gerrit.wikimedia.org/r/1005519
[14:52:09] <hnowlan>	 it's broken but it's broken in our favour
[14:52:16] <wikibugs>	 (CR) Clément Goubert: [C: +1] thumbor: reenable expensive throttling [deployment-charts] - https://gerrit.wikimedia.org/r/1005519 (owner: Hnowlan)
[14:52:32] <TheresNoTime>	 anzx: all runs complete
[14:52:36] <TheresNoTime>	 (inc. mywiki)
[14:52:42] <hnowlan>	 I would very much like to know what is causing these spikes 
[14:52:44] <jelto>	 5xx going down again, similar to last time
[14:52:54] <anzx>	 TheresNoTime: thank you 
[14:52:57] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[14:52:57] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:52:59] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[14:53:14] <wikibugs>	 (CR) Hnowlan: [C: +2] thumbor: reenable expensive throttling [deployment-charts] - https://gerrit.wikimedia.org/r/1005519 (owner: Hnowlan)
[14:53:17] <TheresNoTime>	 hoo: would it be okay for you to reschedule your deployment? We're running over and ideally I think no more deploys would be good
[14:53:35] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for private1-a-codfw - cmooney@cumin1002"
[14:53:57] <claime>	 Yeah, I need to depool some stuff for the upcoming network migration as well, so if we could reschedule that deployment for a later window it'd be great
[14:54:08] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T357189)', diff saved to https://phabricator.wikimedia.org/P57570 and previous config saved to /var/cache/conftool/dbconfig/20240221-145407-arnaudb.json
[14:54:13] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[14:54:21] <wikibugs>	 (Merged) jenkins-bot: thumbor: reenable expensive throttling [deployment-charts] - https://gerrit.wikimedia.org/r/1005519 (owner: Hnowlan)
[14:54:26] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for private1-a-codfw - cmooney@cumin1002"
[14:54:26] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:54:58] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[14:55:15] <hoo>	 TheresNoTime: Sure… I guess I can go for the morning SWAT tomorrow
[14:55:28] <TheresNoTime>	 Appreciate it, thank you :-) 
[14:55:34] <claime>	 thanks :)
[14:55:49] <TheresNoTime>	 !log UTC afternoon backport window done
[14:55:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:16] <topranks>	 !log adding IRB anycast interface on private1-b-codfw vlan to spine and leaf switches codfw row B 
[14:57:46] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Drop obsolete motd for cumin1001 [puppet] - https://gerrit.wikimedia.org/r/1005511 (https://phabricator.wikimedia.org/T353419) (owner: Muehlenhoff)
[14:58:03] <wikibugs>	 (PS1) Hnowlan: thumbor: reduce per-pod memory limits [deployment-charts] - https://gerrit.wikimedia.org/r/1005520
[14:58:19] <hnowlan>	 ^ if someone has a sec, will make scale-ups easier
[14:58:32] <wikibugs>	 (CR) Clément Goubert: [C: +1] thumbor: reduce per-pod memory limits [deployment-charts] - https://gerrit.wikimedia.org/r/1005520 (owner: Hnowlan)
[14:58:37] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:58:39] <wikibugs>	 (CR) Ssingh: [C: +1] thumbor: reduce per-pod memory limits [deployment-charts] - https://gerrit.wikimedia.org/r/1005520 (owner: Hnowlan)
[14:59:04] <claime>	 Gimme a sec to drain a couple k8s nodes before deploying hnowlan please
[14:59:06] <claime>	 !log Draining kubernetes2025.codfw.wmnet kubernetes2026.codfw.wmnet for codfw A8 network migration - T355874
[14:59:08] <wikibugs>	 (CR) Hnowlan: [C: +2] thumbor: reduce per-pod memory limits [deployment-charts] - https://gerrit.wikimedia.org/r/1005520 (owner: Hnowlan)
[14:59:14] <hnowlan>	 claime: ack 
[15:00:03] <wikibugs>	 (Merged) jenkins-bot: thumbor: reduce per-pod memory limits [deployment-charts] - https://gerrit.wikimedia.org/r/1005520 (owner: Hnowlan)
[15:00:04] <jouncebot>	 Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1500)
[15:00:13] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[15:00:29] <claime>	 hnowlan: all good
[15:00:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T355609)', diff saved to https://phabricator.wikimedia.org/P57571 and previous config saved to /var/cache/conftool/dbconfig/20240221-150043-marostegui.json
[15:00:46] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance
[15:00:59] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance
[15:01:01] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance
[15:01:03] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance
[15:01:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T355609)', diff saved to https://phabricator.wikimedia.org/P57572 and previous config saved to /var/cache/conftool/dbconfig/20240221-150109-marostegui.json
[15:01:09] <claime>	 !log Depooling parse2004.codfw.wmnet parse2005.codfw.wmnet for codfw A8 network migration - T355874
[15:01:46] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[15:02:25] <logmsgbot>	 !log cgoubert@cumin2002 conftool action : set/pooled=inactive; selector: name=parse200(4|5).*
[15:06:14] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2026.codfw.wmnet with OS bookworm
[15:07:27] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[15:07:41] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch backup1001 to Puppet 7 on the role level [puppet] - https://gerrit.wikimedia.org/r/1005436 (owner: Muehlenhoff)
[15:07:59] <wikibugs>	 (PS11) ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396)
[15:09:11] <wikibugs>	 (CR) CI reject: [V: -1] sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn)
[15:09:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P57573 and previous config saved to /var/cache/conftool/dbconfig/20240221-150914-arnaudb.json
[15:09:25] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[15:09:43] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for private1-b-codfw - cmooney@cumin1002"
[15:10:12] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[15:10:34] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for private1-b-codfw - cmooney@cumin1002"
[15:10:35] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:12:09] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[15:12:22] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[15:12:23] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch backup2001 to Puppet 7 on the role level [puppet] - https://gerrit.wikimedia.org/r/1005437 (owner: Muehlenhoff)
[15:13:07] <wikibugs>	 (PS1) Marostegui: Revert "es2026: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1005474
[15:15:28] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "es2026: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1005474 (owner: Marostegui)
[15:18:27] <wikibugs>	 (CR) Majavah: [C: +2] "Views on all clouddb servers dropped." [puppet] - https://gerrit.wikimedia.org/r/1005438 (https://phabricator.wikimedia.org/T356838) (owner: Marostegui)
[15:18:55] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:18:57] <wikibugs>	 (CR) Marostegui: "thanks!!" [puppet] - https://gerrit.wikimedia.org/r/1005438 (https://phabricator.wikimedia.org/T356838) (owner: Marostegui)
[15:19:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57574 and previous config saved to /var/cache/conftool/dbconfig/20240221-151909-root.json
[15:19:32] <wikibugs>	 (PS2) Majavah: wikireplicas: maintain-views: try depooling host on lock failure [puppet] - https://gerrit.wikimedia.org/r/998356 (https://phabricator.wikimedia.org/T300427)
[15:20:18] <wikibugs>	 (CR) Majavah: "So the depool+retry part in this actually works. What does not work is closing connections, either the script or (my preference) the HAPro" [puppet] - https://gerrit.wikimedia.org/r/998356 (https://phabricator.wikimedia.org/T300427) (owner: Majavah)
[15:21:06] <wikibugs>	 SRE, ops-codfw, Cassandra, decommission-hardware: decommission restbase20[13-20] - https://phabricator.wikimedia.org/T356695#9563766 (Jhancock.wm) a:Jhancock.wm
[15:21:16] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host db2137.codfw.wmnet with OS bookworm
[15:23:07] <wikibugs>	 (PS12) ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396)
[15:23:37] <jinxer-wm>	 (JobUnavailable) resolved: (4) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:24:01] <wikibugs>	 (PS1) Muehlenhoff: Explicitly configure apt2002 for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1005524 (https://phabricator.wikimedia.org/T331613)
[15:24:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P57575 and previous config saved to /var/cache/conftool/dbconfig/20240221-152420-arnaudb.json
[15:26:46] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, just a nit inline" [debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616) (owner: JMeybohm)
[15:28:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T355609)', diff saved to https://phabricator.wikimedia.org/P57576 and previous config saved to /var/cache/conftool/dbconfig/20240221-152826-marostegui.json
[15:28:33] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[15:28:36] <wikibugs>	 (CR) Btullis: "Looks good overall. Couple of nitpicks." [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn)
[15:32:09] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Explicitly configure apt2002 for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1005524 (https://phabricator.wikimedia.org/T331613) (owner: Muehlenhoff)
[15:34:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57577 and previous config saved to /var/cache/conftool/dbconfig/20240221-153414-root.json
[15:35:04] <wikibugs>	 (PS1) CDanis: WIP: jaeger: include oauth config in Deployment [deployment-charts] - https://gerrit.wikimedia.org/r/1005546 (https://phabricator.wikimedia.org/T358111)
[15:37:18] <icinga-wm_>	 PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:38:58] <wikibugs>	 (CR) CDanis: "I'm going to send this patch upstream, but I wanted a quick check for nitpicks from you two first." [deployment-charts] - https://gerrit.wikimedia.org/r/1005546 (https://phabricator.wikimedia.org/T358111) (owner: CDanis)
[15:39:15] <wikibugs>	 (PS13) ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396)
[15:39:27] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T357189)', diff saved to https://phabricator.wikimedia.org/P57578 and previous config saved to /var/cache/conftool/dbconfig/20240221-153926-arnaudb.json
[15:39:29] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[15:39:42] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[15:39:53] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[15:40:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'T355874 - depooling db2146 db2106', diff saved to https://phabricator.wikimedia.org/P57579 and previous config saved to /var/cache/conftool/dbconfig/20240221-154056-arnaudb.json
[15:40:58] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:25:00 on db2146.codfw.wmnet with reason: T355874 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw
[15:41:01] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:25:00 on db2146.codfw.wmnet with reason: T355874 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw
[15:41:02] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:25:00 on db2106.codfw.wmnet with reason: T355874 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw
[15:41:13] <stashbot>	 T355874: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874
[15:41:15] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:25:00 on db2106.codfw.wmnet with reason: T355874 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw
[15:42:40] <wikibugs>	 (PS1) Fabfur: haproxy: configure extended logging (preparatory for Benthos) [puppet] - https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105)
[15:43:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P57580 and previous config saved to /var/cache/conftool/dbconfig/20240221-154333-marostegui.json
[15:44:48] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2137.codfw.wmnet with reason: host reimage
[15:46:05] <wikibugs>	 (CR) ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps (2 comments) [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: ArielGlenn)
[15:46:32] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[15:46:47] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[15:47:04] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/1004680 (owner: Filippo Giunchedi)
[15:47:40] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2137.codfw.wmnet with reason: host reimage
[15:49:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57581 and previous config saved to /var/cache/conftool/dbconfig/20240221-154918-root.json
[15:49:34] <wikibugs>	 (CR) Clément Goubert: [C: +2] api-gateway: Finish migration to mw-on-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1004735 (https://phabricator.wikimedia.org/T357907) (owner: Clément Goubert)
[15:50:48] <wikibugs>	 (Merged) jenkins-bot: api-gateway: Finish migration to mw-on-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1004735 (https://phabricator.wikimedia.org/T357907) (owner: Clément Goubert)
[15:51:48] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[15:52:04] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[15:52:13] <wikibugs>	 (PS5) JMeybohm: New package version 1.0.0+git20221110-1 [debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616)
[15:52:43] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to Superset for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T358091#9563891 (WMDE-leszek) I approve the request on WMDE's behalf.
[15:54:44] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[15:54:57] <wikibugs>	 (CR) JMeybohm: [C: +2] New package version 1.0.0+git20221110-1 [debs/prometheus-rsyslog-exporter] - https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616) (owner: JMeybohm)
[15:55:22] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[15:55:23] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[15:55:36] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[15:56:06] <wikibugs>	 ops-codfw, DC-Ops: db2137 and es2026 don't get an IP via PXE boot - https://phabricator.wikimedia.org/T357951#9563902 (wiki_willy) ++ @Jhancock.wm for visibility and in case any onsite support is needed
[15:57:59] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a8-codfw.mgmt with reason: prepping for server uplink migration codfw rack a8
[15:58:16] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a8-codfw.mgmt with reason: prepping for server uplink migration codfw rack a8
[15:58:22] <wikibugs>	 SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874#9563911 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c42ddc7f-d7d7-4ebc-9852-d3a5c7882e71) set by cmoon...
[15:58:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P57582 and previous config saved to /var/cache/conftool/dbconfig/20240221-155839-marostegui.json
[15:58:59] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 6 hosts with reason: Migrating servers in codfw rack A7 to lsw1-a7-codfw
[15:59:06] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 6 hosts with reason: Migrating servers in codfw rack A7 to lsw1-a7-codfw
[15:59:13] <wikibugs>	 SRE, ops-codfw, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874#9563912 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=da675508-2cc3-4974-a4ca-677deefc2dff) set by cmoon...
[16:00:05] <wikibugs>	 (PS1) Clément Goubert: Revert "api-gateway: Finish migration to mw-on-k8s" [deployment-charts] - https://gerrit.wikimedia.org/r/1005475
[16:01:10] <wikibugs>	 (CR) Clément Goubert: [C: +2] Revert "api-gateway: Finish migration to mw-on-k8s" [deployment-charts] - https://gerrit.wikimedia.org/r/1005475 (owner: Clément Goubert)
[16:02:00] <topranks>	 !log Commencing network maintenance migrating servers to new switch codfw rack A8 T355874
[16:02:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:06] <wikibugs>	 (Merged) jenkins-bot: Revert "api-gateway: Finish migration to mw-on-k8s" [deployment-charts] - https://gerrit.wikimedia.org/r/1005475 (owner: Clément Goubert)
[16:02:19] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1005527
[16:02:23] <stashbot>	 T355874: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874
[16:03:03] <wikibugs>	 (CR) JHathaway: [C: +2] etcd: disable the diff output for client config with passwords [puppet] - https://gerrit.wikimedia.org/r/1003112 (https://phabricator.wikimedia.org/T356459) (owner: JHathaway)
[16:03:22] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[16:03:39] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[16:04:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57583 and previous config saved to /var/cache/conftool/dbconfig/20240221-160423-root.json
[16:04:25] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[16:04:40] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[16:04:50] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[16:05:05] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance
[16:05:11] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2108 (T357189)', diff saved to https://phabricator.wikimedia.org/P57584 and previous config saved to /var/cache/conftool/dbconfig/20240221-160511-arnaudb.json
[16:05:37] <jayme>	 !log imported prometheus-rsyslog-exporter 1.0.0+git20221110-1 to buster,bullseye,bookworm - T357616
[16:06:06] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[16:06:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:25] <stashbot>	 T357616: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616
[16:09:08] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2137.codfw.wmnet with OS bookworm
[16:09:12] <wikibugs>	 ops-codfw, DC-Ops: db2137 and es2026 don't get an IP via PXE boot - https://phabricator.wikimedia.org/T357951#9564030 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host db2137.codfw.wmnet with OS bookworm completed: - db2137 (**WARN**)   - Removed from Puppet...
[16:10:41] <wikibugs>	 ops-codfw, DC-Ops: db2137 and es2026 don't get an IP via PXE boot - https://phabricator.wikimedia.org/T357951#9564034 (Marostegui) Open→Resolved a:cmooney All good, both hosts were reimaged fine. Thanks @cmooney for taking the time to explain and fix the issue.
[16:10:55] <icinga-wm_>	 PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[16:11:29] <jynus>	 I can reach it
[16:11:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57585 and previous config saved to /var/cache/conftool/dbconfig/20240221-161129-arnaudb.json
[16:11:37] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2106 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57586 and previous config saved to /var/cache/conftool/dbconfig/20240221-161136-arnaudb.json
[16:12:01] <jynus>	 let me check from the alert host
[16:12:10] <wikibugs>	 (03PS14) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396)
[16:12:20] <sukhe>	 IIRC we saw this go down and recover. the question is why though. (I know we don't run this)
[16:13:14] <jynus>	 so probably something with the outside network
[16:13:16] <jynus>	 ?
[16:13:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T355609)', diff saved to https://phabricator.wikimedia.org/P57587 and previous config saved to /var/cache/conftool/dbconfig/20240221-161345-marostegui.json
[16:13:48] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance
[16:13:50] <sukhe>	 jynus: this is hosted by Rackspace IIRC
[16:14:01] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[16:14:02] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance
[16:14:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T355609)', diff saved to https://phabricator.wikimedia.org/P57588 and previous config saved to /var/cache/conftool/dbconfig/20240221-161407-marostegui.json
[16:16:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T357189)', diff saved to https://phabricator.wikimedia.org/P57589 and previous config saved to /var/cache/conftool/dbconfig/20240221-161615-arnaudb.json
[16:16:33] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[16:16:54] <wikibugs>	 (03PS2) 10Fabfur: haproxy: configure extended logging (preparatory for Benthos) [puppet] - 10https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105)
[16:17:01] <icinga-wm_>	 RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 22.26 ms
[16:18:11] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] logstash_checker.py: Add ability to check all MediaWiki canaries at once [puppet] - 10https://gerrit.wikimedia.org/r/1003885 (https://phabricator.wikimedia.org/T357402) (owner: 10Ahmon Dancy)
[16:18:15] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1420/console" [puppet] - 10https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) (owner: 10Fabfur)
[16:18:34] <jynus>	 alert1001:~$ ping wikitech-static.wikimedia.org is now working
[16:19:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57590 and previous config saved to /var/cache/conftool/dbconfig/20240221-161928-root.json
[16:21:03] <icinga-wm_>	 PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[16:22:07] <wikibugs>	 (03PS3) 10Fabfur: haproxy: configure extended logging (preparatory for Benthos) [puppet] - 10https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105)
[16:22:41] <wikibugs>	 10SRE, 10MW-on-K8s, 10Scap, 10serviceops, 10Release-Engineering-Team (Now this 🫠): Find a way to address canary releases directly - https://phabricator.wikimedia.org/T358117#9564123 (10Clement_Goubert)
[16:23:47] <wikibugs>	 (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1421/co" [puppet] - 10https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) (owner: 10Fabfur)
[16:24:07] <wikibugs>	 10SRE, 10MW-on-K8s, 10Scap, 10serviceops, 10Release-Engineering-Team (Now this 🫠): Find a way to address canary releases directly - https://phabricator.wikimedia.org/T358117#9564137 (10Clement_Goubert) p:05Triage→03Medium
[16:24:40] <claime>	 !log Repooling parse2004.codfw.wmnet parse2005.codfw.wmnet following codfw A8 network migration - T355874
[16:24:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:50] <logmsgbot>	 !log cgoubert@cumin2002 conftool action : set/pooled=yes; selector: name=parse200(4|5).*
[16:24:54] <stashbot>	 T355874: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874
[16:24:58] <jynus>	 I confirm it is a Rackspace issue, as it is the last hop that fails, not anything in between
[16:25:18] <claime>	 !log Uncordoning kubernetes2025.codfw.wmnet kubernetes2026.codfw.wmnet following codfw A8 network migration - T355874
[16:25:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:26:00] <wikibugs>	 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874#9564146 (10cmooney) All hosts moved without issue, thanks Jenn!
[16:26:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57591 and previous config saved to /var/cache/conftool/dbconfig/20240221-162635-arnaudb.json
[16:26:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2106 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57592 and previous config saved to /var/cache/conftool/dbconfig/20240221-162641-arnaudb.json
[16:29:29] <wikibugs>	 10SRE, 10MW-on-K8s, 10Scap, 10serviceops, 10Release-Engineering-Team (Now this 🫠): Find a way to address canary releases directly - https://phabricator.wikimedia.org/T358117#9564163 (10Clement_Goubert)
[16:31:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P57593 and previous config saved to /var/cache/conftool/dbconfig/20240221-163122-arnaudb.json
[16:34:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57594 and previous config saved to /var/cache/conftool/dbconfig/20240221-163433-root.json
[16:35:52] <icinga-wm_>	 RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 22.31 ms
[16:41:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57595 and previous config saved to /var/cache/conftool/dbconfig/20240221-164140-arnaudb.json
[16:41:46] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2106 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57596 and previous config saved to /var/cache/conftool/dbconfig/20240221-164146-arnaudb.json
[16:41:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T355609)', diff saved to https://phabricator.wikimedia.org/P57597 and previous config saved to /var/cache/conftool/dbconfig/20240221-164150-marostegui.json
[16:42:18] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[16:46:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P57598 and previous config saved to /var/cache/conftool/dbconfig/20240221-164628-arnaudb.json
[16:46:54] <wikibugs>	 (03CR) 10C. Scott Ananian: Turn on Parsoid read views by default on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999062 (https://phabricator.wikimedia.org/T355566) (owner: 10C. Scott Ananian)
[16:46:58] <wikibugs>	 (03PS5) 10C. Scott Ananian: Turn on Parsoid read views by default on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999062 (https://phabricator.wikimedia.org/T355566)
[16:47:28] <wikibugs>	 (03PS1) 10Ssingh: P:dns::auth: update confd keys to reflect new schema [puppet] - 10https://gerrit.wikimedia.org/r/1005559 (https://phabricator.wikimedia.org/T347054)
[16:49:36] <wikibugs>	 (03PS2) 10Ssingh: P:dns::auth: update confd keys to reflect new schema [puppet] - 10https://gerrit.wikimedia.org/r/1005559 (https://phabricator.wikimedia.org/T347054)
[16:51:22] <wikibugs>	 (03PS3) 10Ssingh: P:dns::auth: update confd keys to reflect new schema [puppet] - 10https://gerrit.wikimedia.org/r/1005559 (https://phabricator.wikimedia.org/T347054)
[16:52:32] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1423/co" [puppet] - 10https://gerrit.wikimedia.org/r/1005559 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh)
[16:55:25] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9564307 (10Clement_Goubert)
[16:56:02] <wikibugs>	 (03CR) 10Volans: "Looks good! Hint for the failing test inline." [software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi)
[16:56:45] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57599 and previous config saved to /var/cache/conftool/dbconfig/20240221-165644-arnaudb.json
[16:56:51] <wikibugs>	 (03PS2) 10Btullis: Add an nginx reverse proxy to superset to help with serving static assets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890)
[16:56:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2106 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57600 and previous config saved to /var/cache/conftool/dbconfig/20240221-165651-arnaudb.json
[16:56:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P57601 and previous config saved to /var/cache/conftool/dbconfig/20240221-165657-marostegui.json
[16:57:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add an nginx reverse proxy to superset to help with serving static assets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis)
[16:58:22] <wikibugs>	 (03PS1) 10Jclark-ctr: add an-redacteddb1001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1005560 (https://phabricator.wikimedia.org/T355571)
[17:00:12] <wikibugs>	 (03CR) 10Jclark-ctr: [C: 03+2] add an-redacteddb1001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1005560 (https://phabricator.wikimedia.org/T355571) (owner: 10Jclark-ctr)
[17:01:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T357189)', diff saved to https://phabricator.wikimedia.org/P57602 and previous config saved to /var/cache/conftool/dbconfig/20240221-170134-arnaudb.json
[17:01:37] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[17:01:51] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance
[17:01:57] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2120 (T357189)', diff saved to https://phabricator.wikimedia.org/P57603 and previous config saved to /var/cache/conftool/dbconfig/20240221-170157-arnaudb.json
[17:01:58] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[17:09:08] <logmsgbot>	 !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-redacteddb1001.eqiad.wmnet with OS bullseye
[17:09:24] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9564382 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS bullseye
[17:12:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P57604 and previous config saved to /var/cache/conftool/dbconfig/20240221-171203-marostegui.json
[17:14:51] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T357790#9564401 (10VRiley-WMF) a:03VRiley-WMF
[17:15:08] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9564397 (10Jclark-ctr) a:05Jclark-ctr→03VRiley-WMF
[17:15:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T357189)', diff saved to https://phabricator.wikimedia.org/P57605 and previous config saved to /var/cache/conftool/dbconfig/20240221-171521-arnaudb.json
[17:15:27] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[17:18:52] <wikibugs>	 10SRE, 10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408#9564424 (10Jclark-ctr) 05Open→03Resolved closing ticket 7 days no faults
[17:27:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T355609)', diff saved to https://phabricator.wikimedia.org/P57606 and previous config saved to /var/cache/conftool/dbconfig/20240221-172709-marostegui.json
[17:27:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance
[17:27:18] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[17:27:25] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance
[17:27:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T355609)', diff saved to https://phabricator.wikimedia.org/P57607 and previous config saved to /var/cache/conftool/dbconfig/20240221-172731-marostegui.json
[17:30:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P57608 and previous config saved to /var/cache/conftool/dbconfig/20240221-173028-arnaudb.json
[17:34:22] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9564556 (10VRiley-WMF)
[17:45:03] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9564592 (10VRiley-WMF) 05Open→03In progress
[17:45:14] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9564594 (10Jclark-ctr) a:05VRiley-WMF→03BTullis @BTullis   this is a custom configuration and I am not having any luck with imaging  20 disk raid10.  if you can assist thank you
[17:45:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9564597 (10Jclark-ctr)
[17:45:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P57609 and previous config saved to /var/cache/conftool/dbconfig/20240221-174534-arnaudb.json
[17:48:43] <wikibugs>	 (03PS4) 10Kamila Součková: Create a shellbox deployment for videoscalers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309)
[17:48:45] <wikibugs>	 (03CR) 10Kamila Součková: "I thought about it and decided to go with video because video files typically contain audio, so it's a superset in my mind. I thought abou" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková)
[17:49:18] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] fifo-log-demux: Decouple service from nginx/ats (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall)
[17:49:53] <wikibugs>	 (03PS5) 10Kamila Součková: Create a shellbox deployment for videoscalers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309)
[17:50:18] <wikibugs>	 (03PS6) 10Kamila Součková: Create a shellbox deployment for videoscalers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309)
[17:56:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T355609)', diff saved to https://phabricator.wikimedia.org/P57610 and previous config saved to /var/cache/conftool/dbconfig/20240221-175601-marostegui.json
[17:56:07] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[17:59:56] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2057:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2057 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[18:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1800)
[18:00:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T357189)', diff saved to https://phabricator.wikimedia.org/P57611 and previous config saved to /var/cache/conftool/dbconfig/20240221-180041-arnaudb.json
[18:00:44] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[18:00:50] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[18:00:57] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance
[18:01:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2121 (T357189)', diff saved to https://phabricator.wikimedia.org/P57612 and previous config saved to /var/cache/conftool/dbconfig/20240221-180103-arnaudb.json
[18:03:19] <wikibugs>	 (03PS3) 10Btullis: Add an nginx reverse proxy to superset to help with serving static assets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890)
[18:04:23] <wikibugs>	 (03CR) 10Btullis: Add an nginx reverse proxy to superset to help with serving static assets (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis)
[18:04:49] <wikibugs>	 (03CR) 10Ssingh: fifo-log-demux: Decouple service from nginx/ats (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall)
[18:09:11] <wikibugs>	 (03CR) 10BCornwall: "I'm a little skeptical that we need a script for checking package versions. Don't we already have monitoring for continuous Puppet runs? T" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh)
[18:11:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P57613 and previous config saved to /var/cache/conftool/dbconfig/20240221-181107-marostegui.json
[18:12:01] <wikibugs>	 (03CR) 10Ssingh: "I am not sure how Puppet would be able to do that and also to send out alerts on IRC after checking and comparing the versions. If you hav" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh)
[18:15:09] <wikibugs>	 (03PS4) 10Btullis: Add an nginx reverse proxy to superset to help with serving static assets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890)
[18:17:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T357189)', diff saved to https://phabricator.wikimedia.org/P57614 and previous config saved to /var/cache/conftool/dbconfig/20240221-181729-arnaudb.json
[18:17:39] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[18:18:39] <wikibugs>	 (03CR) 10BCornwall: "I was referring to that more broad alert along the lines of "Puppet changing every run". If we specify/install a version here but somethin" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh)
[18:19:19] <wikibugs>	 (03PS1) 10Joal: Absent some reportupdater systemd-timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/1005565 (https://phabricator.wikimedia.org/T357419)
[18:20:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Absent some reportupdater systemd-timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/1005565 (https://phabricator.wikimedia.org/T357419) (owner: 10Joal)
[18:20:42] <wikibugs>	 (03CR) 10Ssingh: "Oh that way. Yeah but we don't want to have Puppet control over the installations of varnish. Doing so clears the cache and is generally t" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh)
[18:26:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P57615 and previous config saved to /var/cache/conftool/dbconfig/20240221-182614-marostegui.json
[18:28:20] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9564755 (10VRiley-WMF)
[18:29:48] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9561242 (10VRiley-WMF) These servers have been unracked and the decommission script has been run on them.
[18:30:03] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9564778 (10VRiley-WMF) 05In progress→03Resolved
[18:31:56] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] "-1 on introducing a new Icinga check, anything new should be in Prometheus. However I wonder whether it'd be a better idea to enforce the " [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh)
[18:32:37] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P57616 and previous config saved to /var/cache/conftool/dbconfig/20240221-183236-arnaudb.json
[18:35:47] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T357790#9564792 (10VRiley-WMF) 05Open→03In progress
[18:38:26] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T357790#9564807 (10VRiley-WMF)
[18:41:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T355609)', diff saved to https://phabricator.wikimedia.org/P57617 and previous config saved to /var/cache/conftool/dbconfig/20240221-184120-marostegui.json
[18:41:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2188.codfw.wmnet with reason: Maintenance
[18:41:30] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[18:41:30] <wikibugs>	 (03PS1) 10Jdlrobson: Remove Japanese Wikipedia from projects sharing user scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005569 (https://phabricator.wikimedia.org/T301212)
[18:41:38] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2188.codfw.wmnet with reason: Maintenance
[18:41:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T355609)', diff saved to https://phabricator.wikimedia.org/P57618 and previous config saved to /var/cache/conftool/dbconfig/20240221-184144-marostegui.json
[18:42:25] <wikibugs>	 (03CR) 10Ssingh: "That already exists in the varnishkafka deb package. This task is about making sure that the correct versions (individually) are installed" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh)
[18:43:26] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:44:12] <wikibugs>	 (03CR) 10RLazarus: [C: 03+2] k8s-controller-sidecars: Bump the pod's memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005219 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus)
[18:46:55] <wikibugs>	 (03Merged) 10jenkins-bot: k8s-controller-sidecars: Bump the pod's memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005219 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus)
[18:47:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P57619 and previous config saved to /var/cache/conftool/dbconfig/20240221-184743-arnaudb.json
[18:48:36] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[18:48:58] <wikibugs>	 (03CR) 10Ssingh: "I am not a big fan of this patch as well but I don't think "Puppet changing every run" and a message as broad as that is a good solution. " [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh)
[18:49:22] <wikibugs>	 (03PS1) 10Jdlrobson: Enable night mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005570 (https://phabricator.wikimedia.org/T357759)
[18:50:02] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] "The `Depends` field is enforced by Apt during package installation time, and it will refuse to install or upgrade any package in a way tha" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh)
[18:53:01] <wikibugs>	 (03CR) 10Ssingh: "The requirement here is that individual packages (starting with these but maybe more) should adhere to a fixed version definition. If we t" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh)
[18:55:29] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] "My understanding is that the issue this was trying to detect was that a `varnishkafka` version installed would not be compatible with the " [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh)
[18:59:23] <wikibugs>	 (03CR) 10Ssingh: "Right but only if we are talking about ordering/dependency, in which case I agree that varnishkafka should (does) specify a stricter order" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh)
[19:00:05] <jouncebot>	 jeena and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1900).
[19:02:29] <jeena>	 The train is blocked currently
[19:02:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T357189)', diff saved to https://phabricator.wikimedia.org/P57620 and previous config saved to /var/cache/conftool/dbconfig/20240221-190249-arnaudb.json
[19:02:52] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[19:02:56] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[19:03:05] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance
[19:03:12] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T357189)', diff saved to https://phabricator.wikimedia.org/P57621 and previous config saved to /var/cache/conftool/dbconfig/20240221-190311-arnaudb.json
[19:06:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T355609)', diff saved to https://phabricator.wikimedia.org/P57622 and previous config saved to /var/cache/conftool/dbconfig/20240221-190637-marostegui.json
[19:06:53] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[19:07:18] <wikibugs>	 (03PS1) 10Bking: rdf-streaming-updater: restore from savepoint (WIP) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005572 (https://phabricator.wikimedia.org/T348685)
[19:08:36] <wikibugs>	 (03CR) 10BCornwall: "IMO the alerting should not be on making sure arbitrary numbers match expectations but rather that the application itself is behaving as e" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh)
[19:11:36] <wikibugs>	 (03PS2) 10Joal: Absent some reportupdater systemd-timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/1005565 (https://phabricator.wikimedia.org/T357419)
[19:12:23] <wikibugs>	 (03PS1) 10Cathal Mooney: WIP: adjust reimage cookbook to clear switch caches for vms too [cookbooks] - 10https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421)
[19:12:47] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Absent some reportupdater systemd-timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/1005565 (https://phabricator.wikimedia.org/T357419) (owner: 10Joal)
[19:16:28] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T357189)', diff saved to https://phabricator.wikimedia.org/P57623 and previous config saved to /var/cache/conftool/dbconfig/20240221-191628-arnaudb.json
[19:16:38] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[19:18:14] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] fifo-log-demux: Decouple service from nginx/ats (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall)
[19:20:31] <wikibugs>	 (03PS3) 10Joal: Absent some reportupdater systemd-timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/1005565 (https://phabricator.wikimedia.org/T357419)
[19:21:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P57624 and previous config saved to /var/cache/conftool/dbconfig/20240221-192144-marostegui.json
[19:23:36] <jinxer-wm>	 (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:27:25] <wikibugs>	 (CR) Ssingh: fifo-log-demux: Decouple service from nginx/ats (1 comment) [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: BCornwall)
[19:31:36] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P57625 and previous config saved to /var/cache/conftool/dbconfig/20240221-193135-arnaudb.json
[19:31:48] <wikibugs>	 (CR) Ssingh: "Puppet does not upgrade the varnish package (or any other basically?) for us so it wouldn't notice anything if we uploaded an incorrect ve" [puppet] - https://gerrit.wikimedia.org/r/1005140 (owner: Ssingh)
[19:34:46] <wikibugs>	 (CR) Eevans: [V: +2 C: +2] c-cqlsh is now deprecated; long live cqlsh-instance [debs/cassandra-tools-wmf] - https://gerrit.wikimedia.org/r/1004235 (owner: Eevans)
[19:36:10] <wikibugs>	 (CR) Herron: [C: +1] grafana: provision thanos-downsample datasources [puppet] - https://gerrit.wikimedia.org/r/1004680 (owner: Filippo Giunchedi)
[19:36:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P57626 and previous config saved to /var/cache/conftool/dbconfig/20240221-193650-marostegui.json
[19:36:52] <wikibugs>	 (CR) Herron: [C: +1] grafana: provision thanos-downsample datasources (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1004680 (owner: Filippo Giunchedi)
[19:38:38] <wikibugs>	 (PS15) ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396)
[19:38:41] <inflatador>	 !log bking@deploy2002 deleting old flink data from thanos-swift T348685
[19:38:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:47] <stashbot>	 T348685: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685
[19:44:41] <wikibugs>	 (PS1) Bartosz Dziewoński: CentralAuthHooks::onGetUserBlock: Only run for reg. users [extensions/CentralAuth] (wmf/1.42.0-wmf.19) - https://gerrit.wikimedia.org/r/1005481 (https://phabricator.wikimedia.org/T358112)
[19:45:25] <MatmaRex>	 jeena: brennen: if you'd like to unblock the train, this can be backported ^
[19:46:39] <jeena>	 thanks, I will do that
[19:46:42] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P57627 and previous config saved to /var/cache/conftool/dbconfig/20240221-194641-arnaudb.json
[19:48:01] <wikibugs>	 SRE, User-aborrero: reimage cookbook: failure when updating netbox data from puppetdb on cloudvirt1033 - https://phabricator.wikimedia.org/T358099#9564998 (cmooney) p:Triage→Low a:cmooney Thanks @aborrero.  Yeah something strange happening, I began to look at this earlier and got the same thi...
[19:50:00] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by jhuneidi@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.42.0-wmf.19) - https://gerrit.wikimedia.org/r/1005481 (https://phabricator.wikimedia.org/T358112) (owner: Bartosz Dziewoński)
[19:51:57] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T355609)', diff saved to https://phabricator.wikimedia.org/P57628 and previous config saved to /var/cache/conftool/dbconfig/20240221-195157-marostegui.json
[19:52:04] <stashbot>	 T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[19:56:48] <wikibugs>	 (Merged) jenkins-bot: CentralAuthHooks::onGetUserBlock: Only run for reg. users [extensions/CentralAuth] (wmf/1.42.0-wmf.19) - https://gerrit.wikimedia.org/r/1005481 (https://phabricator.wikimedia.org/T358112) (owner: Bartosz Dziewoński)
[19:57:00] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, decommission-hardware: decommission ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T357790#9565036 (VRiley-WMF)
[19:57:14] <logmsgbot>	 !log jhuneidi@deploy2002 Started scap: Backport for [[gerrit:1005481|CentralAuthHooks::onGetUserBlock: Only run for reg. users (T358112)]]
[19:57:19] <stashbot>	 T358112: Special:Contributions for IP ranges fails with InvalidArgumentException, due to CentralAuth - https://phabricator.wikimedia.org/T358112
[19:58:43] <wikibugs>	 (PS16) ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396)
[19:58:45] <logmsgbot>	 !log jhuneidi@deploy2002 jhuneidi and matmarex: Backport for [[gerrit:1005481|CentralAuthHooks::onGetUserBlock: Only run for reg. users (T358112)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[19:59:23] <MatmaRex>	 jeena: thanks, looks fixed for me on mw.org
[20:01:48] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T357189)', diff saved to https://phabricator.wikimedia.org/P57629 and previous config saved to /var/cache/conftool/dbconfig/20240221-200148-arnaudb.json
[20:01:50] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[20:02:03] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance
[20:02:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T357189)', diff saved to https://phabricator.wikimedia.org/P57630 and previous config saved to /var/cache/conftool/dbconfig/20240221-200209-arnaudb.json
[20:02:10] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[20:03:14] <logmsgbot>	 !log jhuneidi@deploy2002 jhuneidi and matmarex: Continuing with sync
[20:03:26] <jeena>	 MatmaRex: thanks for checking!
[20:07:44] <wikibugs>	 SRE, MW-on-K8s, Scap, serviceops, and 2 others: Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402#9565070 (CodeReviewBot) dancy opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/219  Check bare metal and mw-...
[20:11:23] <logmsgbot>	 !log jhuneidi@deploy2002 Finished scap: Backport for [[gerrit:1005481|CentralAuthHooks::onGetUserBlock: Only run for reg. users (T358112)]] (duration: 14m 09s)
[20:11:45] <stashbot>	 T358112: Special:Contributions for IP ranges fails with InvalidArgumentException, due to CentralAuth - https://phabricator.wikimedia.org/T358112
[20:12:56] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.42.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/1005583 (https://phabricator.wikimedia.org/T354437)
[20:13:00] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.42.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/1005583 (https://phabricator.wikimedia.org/T354437) (owner: TrainBranchBot)
[20:13:41] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.42.0-wmf.19 [mediawiki-config] - https://gerrit.wikimedia.org/r/1005583 (https://phabricator.wikimedia.org/T354437) (owner: TrainBranchBot)
[20:14:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T357189)', diff saved to https://phabricator.wikimedia.org/P57631 and previous config saved to /var/cache/conftool/dbconfig/20240221-201400-arnaudb.json
[20:14:18] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[20:16:54] <logmsgbot>	 !log jhuneidi@deploy2002 scap failed: average error rate on 4/4 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details)
[20:17:11] <jeena>	 :O
[20:17:38] <brennen>	 oh boy
[20:17:53] <jeena>	 I've never had this happen before so...not sure how to proceed
[20:18:10] <jeena>	 roll back?
[20:19:00] <taavi>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003885 was just merged and seems very related
[20:19:08] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, decommission-hardware: decommission ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T357790#9565099 (VRiley-WMF)
[20:19:09] <taavi>	 jeena: did it give any more detail in the console?
[20:19:15] <jeena>	 oh that
[20:19:20] <sukhe>	 logstash has the error details
[20:19:47] <jeena>	 there are quite a few exceptions
[20:20:48] <jeena>	 looks like timeouts mostly though
[20:21:56] <brennen>	 i wonder if adding the mw-on-k8s canaries just caught timeouts that have been happening at every deploy?
[20:22:01] * brennen fumbles around in logstash
[20:22:37] * Zippybonzo wishes they had logstash access :(
[20:24:45] <taavi>	 I must clearly be missing something, but I'm not seeing any massive increase of errors anywhere
[20:25:49] * Zippybonzo doesn't know what errors to look for so can't really look for any but doesn't notice anything breaking
[20:26:05] <jeena>	 maybe it was just a problem with logstash? All the exceptions from scap are this: Timeout on connection while downloading logstash1023.eqiad.wmnet:9200/logstash-*/_search
[20:26:46] <taavi>	 that seems to be a problem about _querying_ logstash, not that the error rate in logstash has increased
[20:27:06] <taavi>	 which supports my theory that the recent logstash_checker patch broke something
[20:27:11] <brennen>	 yeah, that's a believable failure mode given the change here
[20:27:20] <brennen>	 cc: dancy, thcipriani 
[20:27:23] <jeena>	 yeah, I wasn't thinking that the error rate had increased
[20:28:00] <jeena>	 sorry if I made it sound like that
[20:28:27] <brennen>	 ah, gotcha, re: timeouts.
[20:29:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P57632 and previous config saved to /var/cache/conftool/dbconfig/20240221-202906-arnaudb.json
[20:29:12] <taavi>	 yeah
[20:29:23] <brennen>	 (i suppose it's also possible that there's no bug with the checker as such and the requests really did just timeout.)
[20:30:15] <taavi>	 anyhow, given that the error rate did not in fact skyrocket, continuing the deployment seems safe to me
[20:30:49] <brennen>	 probably, although i don't think we want to be operating without canaries in general.
[20:30:57] <taavi>	 yeah
[20:31:56] <taavi>	 so there was one backport after the checker patch was merged that succeeded, and now this deployment that failed
[20:31:59] <sukhe>	 there is still the question of why it was timing out
[20:32:04] <sukhe>	 s/was/is
[20:32:17] <jeena>	 So as far as I understand it there's no action to take to continue with deployment, it is already deployed
[20:32:41] <jeena>	 umm yeah although I don't think the backport should have affected this https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1005481
[20:32:44] * dancy reads
[20:33:03] <brennen>	 jeena: i don't think it will have continued beyond canaries.
[20:33:12] <taavi>	 jeena: I see wmf.18 when visiting Special:Version on commons, so definitely not deployed everywhere
[20:33:17] <jeena>	 oh okay
[20:33:36] <jeena>	 I was looking at the versions page
[20:34:06] <cscott>	 it's on officewiki fwiw
[20:34:13] <taavi>	 officewiki is group0
[20:34:15] <dancy>	 The mods I made to logstash_checker.py should not be causing this since scap itself doesn't have the necessary changes to activate the new behavior.
[20:34:35] <cscott>	 taavi i learn something new every day.  i thought it was group1.
[20:34:39] <taavi>	 my understanding of the change in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003885 is that it needs a related config change somewhere to actually take effect
[20:34:48] <jeena>	 shall I just do a re-run?
[20:35:40] <dancy>	 If scap ran to completion before, re-running isn't necessary 
[20:35:54] <taavi>	 I don't think it did?
[20:36:00] <jeena>	 it failed at the canary checks
[20:36:02] <taavi>	 it failed on the canary check phase
[20:36:14] <dancy>	 Ok..yes re-run
[20:36:50] <jeena>	 okay, trying again
[20:36:58] <thcipriani>	 looking at scap logs
[20:37:03] <dancy>	 When I get back to my desk I'll run logstash_checker.py manually to see how it behaves.
[20:38:28] <thcipriani>	 looks like there was a socket timeout
[20:38:47] <thcipriani>	 > __main__.CheckServiceError: Timeout on connection while downloading logstash1023.eqiad.wmnet:9200/logstash-*/_search
[20:39:13] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[20:39:19] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[20:39:58] <thcipriani>	 and that socket timeout caused all canaries to fail
[20:41:20] <wikibugs>	 (PS17) ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396)
[20:44:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P57633 and previous config saved to /var/cache/conftool/dbconfig/20240221-204415-arnaudb.json
[20:44:25] <brennen>	 makes sense.
[20:45:08] <thcipriani>	 logstash_checker.py still seems to work in the general case, afaict, running it from the command line.
[20:45:33] <thcipriani>	 network hiccup that we've never had before seems strange though
[20:46:12] <logmsgbot>	 !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.19  refs T354437
[20:46:17] <stashbot>	 T354437: 1.42.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T354437
[20:49:06] <jeena>	 canary check passed
[20:50:09] <thcipriani>	 yeah, running logstash_checker.py manually seems fine, too.
[20:51:05] <brennen>	 i'm not 100% sure this hasn't happened before.
[20:51:17] <brennen>	 definitely not often, but it kind of jogs a memory.
[20:53:33] <wikibugs>	 (PS1) Eevans: restbase: provision restbase1034-{a,b,c} (new) [puppet] - https://gerrit.wikimedia.org/r/1005590 (https://phabricator.wikimedia.org/T354560)
[20:53:35] <wikibugs>	 (PS1) Eevans: restbase: provision restbase1035-{a,b,c} (new) [puppet] - https://gerrit.wikimedia.org/r/1005591 (https://phabricator.wikimedia.org/T354560)
[20:53:37] <wikibugs>	 (PS1) Eevans: restbase: provision restbase1036-{a,b,c} (new) [puppet] - https://gerrit.wikimedia.org/r/1005592 (https://phabricator.wikimedia.org/T354560)
[20:53:39] <wikibugs>	 (PS1) Eevans: restbase: provision restbase1037-{a,b,c} (new) [puppet] - https://gerrit.wikimedia.org/r/1005593 (https://phabricator.wikimedia.org/T354560)
[20:53:41] <wikibugs>	 (PS1) Eevans: restbase: provision restbase1038-{a,b,c} (new) [puppet] - https://gerrit.wikimedia.org/r/1005594 (https://phabricator.wikimedia.org/T354560)
[20:53:43] <wikibugs>	 (PS1) Eevans: restbase: provision restbase1039-{a,b,c} (new) [puppet] - https://gerrit.wikimedia.org/r/1005595 (https://phabricator.wikimedia.org/T354560)
[20:53:45] <wikibugs>	 (PS1) Eevans: restbase: provision restbase1040-{a,b,c} (new) [puppet] - https://gerrit.wikimedia.org/r/1005596 (https://phabricator.wikimedia.org/T354560)
[20:53:47] <wikibugs>	 (PS1) Eevans: restbase: provision restbase1041-{a,b,c} (new) [puppet] - https://gerrit.wikimedia.org/r/1005597 (https://phabricator.wikimedia.org/T354560)
[20:53:49] <wikibugs>	 (PS1) Eevans: restbase: provision restbase1042-{a,b,c} (new) [puppet] - https://gerrit.wikimedia.org/r/1005598 (https://phabricator.wikimedia.org/T354560)
[20:54:48] <logmsgbot>	 !log jhuneidi@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.19  refs T354437 (duration: 08m 35s)
[20:55:03] <stashbot>	 T354437: 1.42.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T354437
[20:55:05] <jeena>	 thanks for the help everyone
[20:55:35] <wikibugs>	 SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893#9565196 (Eevans) Open→Resolved >>! In T354893#9563034, @Volans wrote: > @Eevans yes, we've done it already in T305568#7992643 :( > I've created the records for 3 cassandra...
[20:56:16] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, decommission-hardware: decommission ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T357790#9565199 (VRiley-WMF) In progress→Resolved
[20:59:22] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T357189)', diff saved to https://phabricator.wikimedia.org/P57634 and previous config saved to /var/cache/conftool/dbconfig/20240221-205922-arnaudb.json
[20:59:25] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[20:59:29] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[20:59:38] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance
[20:59:40] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[20:59:55] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[21:00:01] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T357189)', diff saved to https://phabricator.wikimedia.org/P57635 and previous config saved to /var/cache/conftool/dbconfig/20240221-210001-arnaudb.json
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T2100).
[21:00:04] <jouncebot>	 cscott, Jdlrobson, and anzx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:08] <wikibugs>	 (PS2) Anzx: cswiki, commonswiki, enwiki: fix IP cap date and IP for WikiGap Editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/1005476 (https://phabricator.wikimedia.org/T357978)
[21:00:58] <icinga-wm_>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:01:59] <Jdlrobson>	 here
[21:02:11] <anzx>	 o/
[21:02:26] <brennen>	 please hold one second
[21:02:41] <brennen>	 jeena: might want to look at that CentralAuth error real quick
[21:02:45] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Arthur Taylor - https://phabricator.wikimedia.org/T357147#9565223 (odimitrijevic) Approved
[21:02:49] <brennen>	 (just noticed it in logspam-watch)
[21:02:58] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ElineWMDE - https://phabricator.wikimedia.org/T357097#9565224 (odimitrijevic) Approved
[21:03:09] <jeena>	 looking
[21:03:20] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Analytics-privatedata-users for jwheeler - https://phabricator.wikimedia.org/T357731#9565225 (odimitrijevic) Approved
[21:03:36] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:04:10] <icinga-wm_>	 PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:04:15] <jeena>	 hmm, weird
[21:04:20] <dancy>	 brennen: T143982, T144033
[21:04:24] <stashbot>	 T143982: scap on beta cluster does not run anymore due to logstash being down - https://phabricator.wikimedia.org/T143982
[21:04:24] <stashbot>	 T144033: handle logstash timeouts separately from spikes in errors reported by logstash - https://phabricator.wikimedia.org/T144033
[21:04:34] <dancy>	 T143973
[21:04:34] <stashbot>	 T143973: beta-scap-eqiad failing: Timeout on connection while downloading deployment-logstash2.deployment-prep.eqiad.wmflabs:9200/logstash-*/_search - https://phabricator.wikimedia.org/T143973
[21:05:15] <brennen>	 jeena: not a huge spike of it, and all the inputs are spam.  probably worth flagging but i'm guessing it's ok for backports to go ahead.
[21:05:24] * brennen disappears into a meeting.
[21:05:32] <jeena>	 yeah I thought it looked like spam too
[21:05:42] <cscott>	 i'm here.
[21:06:25] <jeena>	 I can run the backports if no one is available
[21:08:53] <jeena>	 okay cscott yours is first in the list so I'll go ahead with that one
[21:09:09] <cscott>	 cool!
[21:09:18] <cscott>	 should be straightforward, i'll get setup to test
[21:09:57] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/999062 (https://phabricator.wikimedia.org/T355566) (owner: C. Scott Ananian)
[21:10:40] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T357189)', diff saved to https://phabricator.wikimedia.org/P57636 and previous config saved to /var/cache/conftool/dbconfig/20240221-211039-arnaudb.json
[21:10:42] <wikibugs>	 (Merged) jenkins-bot: Turn on Parsoid read views by default on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/999062 (https://phabricator.wikimedia.org/T355566) (owner: C. Scott Ananian)
[21:10:59] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[21:11:05] <logmsgbot>	 !log jhuneidi@deploy2002 Started scap: Backport for [[gerrit:999062|Turn on Parsoid read views by default on officewiki (T355566)]]
[21:11:22] <stashbot>	 T355566: Use Parsoid for read views on OfficeWiki by default - https://phabricator.wikimedia.org/T355566
[21:12:12] <dancy>	 jeena: When you have a moment, please copy-and-paste the relevant portion of the scap output into the description of T144033.
[21:12:12] <stashbot>	 T144033: handle logstash timeouts separately from spikes in errors reported by logstash - https://phabricator.wikimedia.org/T144033
[21:12:40] <logmsgbot>	 !log jhuneidi@deploy2002 cscott and jhuneidi: Backport for [[gerrit:999062|Turn on Parsoid read views by default on officewiki (T355566)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:12:41] <dancy>	 I've been working in that area of the code recently so now's the time to fix it.
[21:14:35] <jeena>	 dancy: done
[21:14:51] <jeena>	 cscott: ready for you to check on mwdebug
[21:15:58] <icinga-wm_>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[21:16:38] <Jdlrobson>	 jeena: FYI one of mine  https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1005570?usp=search is just a beta cluster change 
[21:16:48] <dancy>	 jeena: Thanks!
[21:17:03] <cscott>	 jeena ok checking
[21:17:12] <rzl>	 fyi backporteers -- you're about to see some SAL noise from a helmfile deploy I'm running, but it's unimpactful to anything you're doing
[21:17:29] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[21:17:30] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncmonitor1001.eqiad.wmnet with OS bookworm
[21:17:35] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, vm-requests: eqiad: 1 VM request for ncmonitor - https://phabricator.wikimedia.org/T356710#9565283 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncmonitor1001.eqiad.wmnet with OS bookworm
[21:17:49] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[21:18:06] <cscott>	 jeena looks good, ok to continue
[21:18:10] <logmsgbot>	 !log jhuneidi@deploy2002 cscott and jhuneidi: Continuing with sync
[21:18:22] <jeena>	 Jdlrobson: I can do both yours next
[21:18:36] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[21:18:54] <icinga-wm_>	 PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 44 probes of 804 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:19:00] <logmsgbot>	 !log rzl@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[21:19:43] <dancy>	 jeena: https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/219 will take care of some of the terribleness.
[21:20:10] <wikibugs>	 (CR) Jeena Huneidi: [C: +2] "backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1005570 (https://phabricator.wikimedia.org/T357759) (owner: Jdlrobson)
[21:20:28] <wikibugs>	 (CR) Jeena Huneidi: [C: +2] "backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1005569 (https://phabricator.wikimedia.org/T301212) (owner: Jdlrobson)
[21:21:22] <wikibugs>	 (Merged) jenkins-bot: Remove Japanese Wikipedia from projects sharing user scripts [mediawiki-config] - https://gerrit.wikimedia.org/r/1005569 (https://phabricator.wikimedia.org/T301212) (owner: Jdlrobson)
[21:21:27] <wikibugs>	 (CR) Eevans: [C: +2] restbase: provision restbase1034-{a,b,c} (new) [puppet] - https://gerrit.wikimedia.org/r/1005590 (https://phabricator.wikimedia.org/T354560) (owner: Eevans)
[21:21:34] <wikibugs>	 (Merged) jenkins-bot: Enable night mode on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1005570 (https://phabricator.wikimedia.org/T357759) (owner: Jdlrobson)
[21:24:14] <logmsgbot>	 !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[21:24:28] <logmsgbot>	 !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[21:25:46] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P57637 and previous config saved to /var/cache/conftool/dbconfig/20240221-212546-arnaudb.json
[21:26:24] <logmsgbot>	 !log jhuneidi@deploy2002 Finished scap: Backport for [[gerrit:999062|Turn on Parsoid read views by default on officewiki (T355566)]] (duration: 15m 19s)
[21:26:31] <stashbot>	 T355566: Use Parsoid for read views on OfficeWiki by default - https://phabricator.wikimedia.org/T355566
[21:27:24] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncmonitor1001.eqiad.wmnet with reason: host reimage
[21:27:31] <logmsgbot>	 !log jhuneidi@deploy2002 Started scap: Backport for [[gerrit:1005569|Remove Japanese Wikipedia from projects sharing user scripts (T301212)]], [[gerrit:1005570|Enable night mode on beta cluster (T357759)]]
[21:27:47] <jeena>	 Jdlrobson: I've started your backports
[21:27:56] <stashbot>	 T301212: Vector-2022.js should no longer load legacy Vector site and user scripts/styles - https://phabricator.wikimedia.org/T301212
[21:27:56] <stashbot>	 T357759: Deploy night mode on the minerva skin on test wiki - https://phabricator.wikimedia.org/T357759
[21:28:29] <Jdlrobson>	 jeena: thanks
[21:29:00] <logmsgbot>	 !log jhuneidi@deploy2002 jdlrobson and jhuneidi: Backport for [[gerrit:1005569|Remove Japanese Wikipedia from projects sharing user scripts (T301212)]], [[gerrit:1005570|Enable night mode on beta cluster (T357759)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:31:06] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncmonitor1001.eqiad.wmnet with reason: host reimage
[21:31:53] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[21:32:11] <logmsgbot>	 !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[21:32:38] <jeena>	 Jdlrobson: any checks you need to do?
[21:33:21] <Jdlrobson>	 jeena: yep shouldnt take long doing now
[21:33:32] <jeena>	 👍
[21:33:57] <icinga-wm_>	 RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 33 probes of 804 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[21:34:40] <Jdlrobson>	 jeena: yep looks good please sync
[21:34:44] <jeena>	 thanks!
[21:34:48] <logmsgbot>	 !log jhuneidi@deploy2002 jdlrobson and jhuneidi: Continuing with sync
[21:37:07] <jeena>	 anzx: are you around?
[21:37:21] <anzx>	 jeena: yes
[21:37:50] <jeena>	 okay, I'm going to go ahead and +2 your change
[21:38:39] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1385 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:39:18] <wikibugs>	 (CR) Jeena Huneidi: [C: +2] cswiki, commonswiki, enwiki: fix IP cap date and IP for WikiGap Editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/1005476 (https://phabricator.wikimedia.org/T357978) (owner: Anzx)
[21:39:27] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on mw1377 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[21:39:32] <Dreamy_Jazz>	 I have a config patch I'd like to deploy once everyone else has deployed their changes. I can self-serve.
[21:40:01] <wikibugs>	 (Merged) jenkins-bot: cswiki, commonswiki, enwiki: fix IP cap date and IP for WikiGap Editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/1005476 (https://phabricator.wikimedia.org/T357978) (owner: Anzx)
[21:40:52] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P57638 and previous config saved to /var/cache/conftool/dbconfig/20240221-214052-arnaudb.json
[21:41:15] <jeena>	 Dreamy_Jazz: I'll ping you when done
[21:41:21] <Dreamy_Jazz>	 Thanks!
[21:42:56] <logmsgbot>	 !log jhuneidi@deploy2002 Finished scap: Backport for [[gerrit:1005569|Remove Japanese Wikipedia from projects sharing user scripts (T301212)]], [[gerrit:1005570|Enable night mode on beta cluster (T357759)]] (duration: 15m 25s)
[21:43:04] <stashbot>	 T301212: Vector-2022.js should no longer load legacy Vector site and user scripts/styles - https://phabricator.wikimedia.org/T301212
[21:43:07] <stashbot>	 T357759: Deploy night mode on the minerva skin on test wiki - https://phabricator.wikimedia.org/T357759
[21:43:43] <logmsgbot>	 !log jhuneidi@deploy2002 Started scap: Backport for [[gerrit:1005476|cswiki, commonswiki, enwiki: fix IP cap date and IP for WikiGap Editathon (T357978)]]
[21:43:48] <stashbot>	 T357978: Lift IP cap for WikiGap Editathon - https://phabricator.wikimedia.org/T357978
[21:43:52] <anzx>	 jeena: nothing to test, you can sync it
[21:43:58] <jeena>	 okay
[21:44:11] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncmonitor1001.eqiad.wmnet with OS bookworm
[21:44:15] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10vm-requests: eqiad: 1 VM request for ncmonitor - https://phabricator.wikimedia.org/T356710#9565349 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncmonitor1001.eqiad.wmnet with OS bookworm completed: - ncmoni...
[21:45:11] <logmsgbot>	 !log jhuneidi@deploy2002 anzx and jhuneidi: Backport for [[gerrit:1005476|cswiki, commonswiki, enwiki: fix IP cap date and IP for WikiGap Editathon (T357978)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:46:04] <logmsgbot>	 !log jhuneidi@deploy2002 anzx and jhuneidi: Continuing with sync
[21:46:32] <Jdlrobson>	 Thanks jeena 
[21:46:44] <jeena>	 You're welcome!
[21:48:09] <wikibugs>	 (03PS1) 10Dreamy Jazz: Pin wgGlobalBlockingAllowGlobalAccountBlocks as false on WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005607 (https://phabricator.wikimedia.org/T356923)
[21:51:37] <wikibugs>	 (03PS2) 10Dreamy Jazz: Pin wgGlobalBlockingAllowGlobalAccountBlocks as false on WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005607 (https://phabricator.wikimedia.org/T356923)
[21:51:51] <urandom>	 !log bootstrapping Cassandra, restbase1034-{a,b,c} — T354560
[21:51:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:57] <stashbot>	 T354560: Provision new RESTBase cluster nodes: restbase10[34-42] - https://phabricator.wikimedia.org/T354560
[21:52:41] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1034.eqiad.wmnet with reason: Bootstrapping — T354560
[21:52:55] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1034.eqiad.wmnet with reason: Bootstrapping — T354560
[21:54:22] <wikibugs>	 (03PS1) 10Eevans: restbase: (phony) keys & certs for missing hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1005608 (https://phabricator.wikimedia.org/T354560)
[21:54:30] <logmsgbot>	 !log jhuneidi@deploy2002 Finished scap: Backport for [[gerrit:1005476|cswiki, commonswiki, enwiki: fix IP cap date and IP for WikiGap Editathon (T357978)]] (duration: 10m 47s)
[21:54:34] <anzx>	 jeena: thanks 
[21:54:37] <stashbot>	 T357978: Lift IP cap for WikiGap Editathon - https://phabricator.wikimedia.org/T357978
[21:54:46] <jeena>	 np
[21:54:56] <wikibugs>	 (03PS2) 10Eevans: restbase: (phony) keys & certs for missing/new hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1005608 (https://phabricator.wikimedia.org/T354560)
[21:54:59] <jeena>	 Dreamy_Jazz: backports are finished
[21:55:05] <Dreamy_Jazz>	 Thanks!
[21:55:29] <wikibugs>	 10SRE-swift-storage, 10MediaWiki-Uploading, 10User-revi: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820#9565400 (10Bawolff) maybe what is happening is that two assemble jobs are running at...
[21:55:59] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T357189)', diff saved to https://phabricator.wikimedia.org/P57639 and previous config saved to /var/cache/conftool/dbconfig/20240221-215558-arnaudb.json
[21:56:00] <wikibugs>	 (03PS3) 10Dreamy Jazz: Pin wgGlobalBlockingAllowGlobalAccountBlocks as false on WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005607 (https://phabricator.wikimedia.org/T356923)
[21:56:01] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[21:56:04] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[21:56:14] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[21:56:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T357189)', diff saved to https://phabricator.wikimedia.org/P57640 and previous config saved to /var/cache/conftool/dbconfig/20240221-215620-arnaudb.json
[21:57:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005607 (https://phabricator.wikimedia.org/T356923) (owner: 10Dreamy Jazz)
[21:58:03] <wikibugs>	 (03Merged) 10jenkins-bot: Pin wgGlobalBlockingAllowGlobalAccountBlocks as false on WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005607 (https://phabricator.wikimedia.org/T356923) (owner: 10Dreamy Jazz)
[21:58:25] <logmsgbot>	 !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:1005607|Pin wgGlobalBlockingAllowGlobalAccountBlocks as false on WMF wikis (T356923 T356924)]]
[21:58:34] <stashbot>	 T356923: Create a configuration value to control whether global account blocks are enabled - https://phabricator.wikimedia.org/T356923
[21:58:34] <stashbot>	 T356924: Deploy global account blocks to WMF wikis - https://phabricator.wikimedia.org/T356924
[21:58:54] <wikibugs>	 (03PS1) 10Ahmon Dancy: logstash_checker.py: Exit 10 if over error threshold [puppet] - 10https://gerrit.wikimedia.org/r/1005610 (https://phabricator.wikimedia.org/T144033)
[21:59:58] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1005607|Pin wgGlobalBlockingAllowGlobalAccountBlocks as false on WMF wikis (T356923 T356924)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:00:04] <jouncebot>	 Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T2200)
[22:00:43] <logmsgbot>	 !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync
[22:02:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2041*,elastic2042*,elastic2057*,elastic2063*,elastic2064*,elastic2077*,elastic2078*,elastic2092*,elastic2093*,elastic2094* for switch maintenance - bking@cumin2002 - T355860
[22:02:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2041*,elastic2042*,elastic2057*,elastic2063*,elastic2064*,elastic2077*,elastic2078*,elastic2092*,elastic2093*,elastic2094* for switch maintenance - bking@cumin2002 - T355860
[22:02:42] <stashbot>	 T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860
[22:04:40] <jinxer-wm>	 (KubernetesRsyslogDown) firing: rsyslog on kubernetes2057:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2057 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[22:08:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T357189)', diff saved to https://phabricator.wikimedia.org/P57641 and previous config saved to /var/cache/conftool/dbconfig/20240221-220807-arnaudb.json
[22:08:15] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[22:08:39] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1385 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[22:08:42] <logmsgbot>	 !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:1005607|Pin wgGlobalBlockingAllowGlobalAccountBlocks as false on WMF wikis (T356923 T356924)]] (duration: 10m 16s)
[22:08:51] <stashbot>	 T356923: Create a configuration value to control whether global account blocks are enabled - https://phabricator.wikimedia.org/T356923
[22:08:52] <stashbot>	 T356924: Deploy global account blocks to WMF wikis - https://phabricator.wikimedia.org/T356924
[22:09:27] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on mw1377 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[22:10:40] <ryankemper>	 !log [WDQS] T355868 Depooling `wdqs2024`, `wdqs2014`, `wdqs2010` in anticipation of row maintenance
[22:10:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:10:49] <stashbot>	 T355868: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868
[22:12:08] <Dreamy_Jazz>	 !log Evening UTC backport window done
[22:12:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:22] <logmsgbot>	 !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@8a290df]: new allowlisted endpoints for wdqs
[22:18:58] <wikibugs>	 (03CR) 10Eevans: [V: 03+2 C: 03+2] restbase: (phony) keys & certs for missing/new hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1005608 (https://phabricator.wikimedia.org/T354560) (owner: 10Eevans)
[22:20:36] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:20:37] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:23:14] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P57642 and previous config saved to /var/cache/conftool/dbconfig/20240221-222313-arnaudb.json
[22:25:44] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:25:50] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:29:04] <icinga-wm_>	 PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100%
[22:29:22] <logmsgbot>	 !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@8a290df]: new allowlisted endpoints for wdqs (duration: 11m 59s)
[22:37:48] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:38:20] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P57643 and previous config saved to /var/cache/conftool/dbconfig/20240221-223819-arnaudb.json
[22:41:48] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on mw1405:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:42:48] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on irc1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:43:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:43:48] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on chartmuseum1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:43:48] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on parse1015:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:43:54] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on apt2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:45:47] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10vm-requests: eqiad: 1 VM request for ncmonitor - https://phabricator.wikimedia.org/T356710#9565580 (10BCornwall) 05Open→03Resolved Thanks for the nudge. Puppet is applying now.
[22:47:49] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:50:57] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:51:04] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:51:48] <jinxer-wm>	 (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1366:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:53:26] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T357189)', diff saved to https://phabricator.wikimedia.org/P57644 and previous config saved to /var/cache/conftool/dbconfig/20240221-225326-arnaudb.json
[22:53:29] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[22:53:38] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[22:53:44] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance
[22:53:48] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on chartmuseum1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:53:51] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T357189)', diff saved to https://phabricator.wikimedia.org/P57645 and previous config saved to /var/cache/conftool/dbconfig/20240221-225350-arnaudb.json
[22:55:19] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "I think this is ok" [puppet] - 10https://gerrit.wikimedia.org/r/1004082 (owner: 10Majavah)
[22:58:48] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on parse1015:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[22:58:48] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on parse1022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:03:48] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on parse1019:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:13:48] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on parse1019:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:20:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2078-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[23:22:49] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:22:49] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on irc1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:23:50] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on apt2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:24:38] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:24:45] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:26:50] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T357189)', diff saved to https://phabricator.wikimedia.org/P57646 and previous config saved to /var/cache/conftool/dbconfig/20240221-232649-arnaudb.json
[23:26:55] <stashbot>	 T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[23:30:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (2) Elasticsearch instance elastic2063-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[23:35:18] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on irc1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:35:39] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (3) Elasticsearch instance elastic2063-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[23:37:05] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:37:11] <wikibugs>	 (03CR) 10Jeena Huneidi: [C: 03+1] logstash_checker.py: Exit 10 if over error threshold [puppet] - 10https://gerrit.wikimedia.org/r/1005610 (https://phabricator.wikimedia.org/T144033) (owner: 10Ahmon Dancy)
[23:37:12] <logmsgbot>	 !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:37:48] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1426:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:41:48] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on conf1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:41:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P57647 and previous config saved to /var/cache/conftool/dbconfig/20240221-234156-arnaudb.json
[23:42:48] <jinxer-wm>	 (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1426:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:47:49] <jinxer-wm>	 (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1426:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:52:48] <jinxer-wm>	 (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1426:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:56:48] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on maps1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:56:49] <jinxer-wm>	 (PuppetZeroResources) firing: (5) Puppet has failed generate resources on mw1366:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[23:57:04] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P57648 and previous config saved to /var/cache/conftool/dbconfig/20240221-235703-arnaudb.json