[00:00:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[00:05:13] (PS2) Stang: zhwiki: Create group ipblock-exempt-grantor [mediawiki-config] - https://gerrit.wikimedia.org/r/1005109 (https://phabricator.wikimedia.org/T357991)
[00:07:13] (PS3) Stang: zhwiki: Create group ipblock-exempt-grantor [mediawiki-config] - https://gerrit.wikimedia.org/r/1005109 (https://phabricator.wikimedia.org/T357991)
[00:27:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[00:27:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[00:39:08] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1004703
[00:39:14] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1004703 (owner: TrainBranchBot)
[00:50:23] SRE, ops-eqiad, DC-Ops, RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893#9562035 (Eevans) >>! In T354893#9551121, @Eevans wrote: > @Jclark-ctr it looks like these hosts weren't allocated the additional IP addresses, do you know what is required to assi...
[00:59:07] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1004703 (owner: TrainBranchBot)
[01:22:02] SRE, Data-Engineering, Data-Platform-SRE, serviceops-radar, Event-Platform: Configuration Management for Kafka settings - https://phabricator.wikimedia.org/T276088#9562078 (Ottomata) Just came across https://www.jikkou.io/docs/tutorials/get_started/ . Worth a look! - https://www.jikkou.io/do...
[01:42:07] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:42:41] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:42:45] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:42:45] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:42:45] PROBLEM - Router interfaces on cr2-drmrs is CRITICAL: CRITICAL: host 185.15.58.129, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:44:08] (CR) Ssingh: fifo-log-demux: Decouple service from nginx/ats (2 comments) [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: BCornwall)
[01:55:17] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:55:49] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 61, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[01:55:49] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:55:49] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:55:51] RECOVERY - BFD status on cr2-eqdfw is OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[01:59:48] (CR) RLazarus: [C: +2] Helm chart for k8s-controller-sidecars [deployment-charts] - 
https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[02:00:40] (Merged) jenkins-bot: Helm chart for k8s-controller-sidecars [deployment-charts] - https://gerrit.wikimedia.org/r/988847 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[02:01:03] (CR) RLazarus: [C: +2] admin_ng: Install k8s-controller-sidecars [deployment-charts] - https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[02:03:55] (Merged) jenkins-bot: admin_ng: Install k8s-controller-sidecars [deployment-charts] - https://gerrit.wikimedia.org/r/988848 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[02:10:10] !log rzl@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[02:11:03] !log rzl@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[02:20:45] !log rzl@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[02:20:54] !log rzl@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[02:22:37] !log rzl@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[02:23:01] !log rzl@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[02:23:13] RECOVERY - snapshot of s6 in codfw on backupmon1001 is OK: Last snapshot for s6 at codfw (db2097) taken on 2024-02-21 01:21:06 (622 GiB, +0.7 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[02:28:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2205.mgmt.codfw.wmnet with reboot policy FORCED
[02:29:15] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2204.mgmt.codfw.wmnet with reboot policy FORCED
[02:34:19] (PS1) RLazarus: k8s-controller-sidecars: Add missing namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1005212 (https://phabricator.wikimedia.org/T348284)
[02:38:36] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:39:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2205.mgmt.codfw.wmnet with reboot policy FORCED
[02:41:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:48:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[02:49:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2204.mgmt.codfw.wmnet with reboot policy FORCED
[02:50:10] (PS2) RLazarus: k8s-controller-sidecars: Add missing namespaces [deployment-charts] - https://gerrit.wikimedia.org/r/1005212 (https://phabricator.wikimedia.org/T348284)
[02:51:06] (PS3) RLazarus: 
k8s-controller-sidecars: Add missing namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1005212 (https://phabricator.wikimedia.org/T348284)
[02:56:16] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[02:58:25] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2206 to codfw - jhancock@cumin2002"
[02:59:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2206 to codfw - jhancock@cumin2002"
[02:59:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:00:23] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2206.mgmt.codfw.wmnet with reboot policy FORCED
[03:00:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2207.mgmt.codfw.wmnet with reboot policy FORCED
[03:00:42] (CR) RLazarus: [C: +2] "Self-merging this just so I don't leave the diffs unapplied overnight" [deployment-charts] - https://gerrit.wikimedia.org/r/1005212 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[03:00:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2208.mgmt.codfw.wmnet with reboot policy FORCED
[03:00:53] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2207.mgmt.codfw.wmnet with reboot policy FORCED
[03:00:54] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db2208.mgmt.codfw.wmnet with reboot policy FORCED
[03:01:04] (PuppetDisabled) firing: Puppet disabled on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=ganeti&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[03:01:36] !log jhancock@cumin2002 START - Cookbook 
sre.hosts.provision for host db2207.mgmt.codfw.wmnet with reboot policy FORCED
[03:03:32] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2208.mgmt.codfw.wmnet with reboot policy FORCED
[03:03:36] (Merged) jenkins-bot: k8s-controller-sidecars: Add missing namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1005212 (https://phabricator.wikimedia.org/T348284) (owner: RLazarus)
[03:07:47] (CR) Samwilson: [C: +1] InitialiseSettings: Enable Edit Recovery on 3 projects [mediawiki-config] - https://gerrit.wikimedia.org/r/1004736 (https://phabricator.wikimedia.org/T355548) (owner: Samtar)
[03:11:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:13:36] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:15:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2207.mgmt.codfw.wmnet with reboot policy FORCED
[03:21:19] SRE, SRE-Access-Requests, LDAP-Access-Requests, Phabricator: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9562208 (Bugreporter) We need to move the task subscriptions and assignments. Afterwards, to prevent confusion, we may consider disabling the brion account.
[03:21:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2206.mgmt.codfw.wmnet with reboot policy FORCED
[03:24:20] SRE, SRE-Access-Requests, LDAP-Access-Requests, Phabricator: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9562210 (Bugreporter) Alternatively we can keep the bvibber Phab account as the personal account and rename brion to something like bvibber-wmf so much le...
[03:25:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2208.mgmt.codfw.wmnet with reboot policy FORCED
[03:26:18] !log rzl@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[03:26:29] !log rzl@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[03:27:42] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[03:28:11] !log rzl@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[03:29:37] !log rzl@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[03:29:39] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2209 to codfw - jhancock@cumin2002"
[03:30:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2209 to codfw - jhancock@cumin2002"
[03:30:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:31:14] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2209.mgmt.codfw.wmnet with reboot policy FORCED
[03:33:36] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[03:35:30] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2210 to codfw - jhancock@cumin2002"
[03:36:20] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2210 to codfw - jhancock@cumin2002"
[03:36:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:37:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2210.mgmt.codfw.wmnet with reboot policy FORCED
[03:39:55] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[03:41:47] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2211 to codfw - jhancock@cumin2002"
[03:42:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2211 to codfw - jhancock@cumin2002"
[03:42:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:51:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2210.mgmt.codfw.wmnet 
with reboot policy FORCED
[03:52:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2211.mgmt.codfw.wmnet with reboot policy FORCED
[03:52:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2209.mgmt.codfw.wmnet with reboot policy FORCED
[03:53:31] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[03:54:29] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[03:55:20] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[03:55:42] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2212 to codfw - jhancock@cumin2002"
[03:56:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2212 to codfw - jhancock@cumin2002"
[03:56:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[03:57:08] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[03:58:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2212.mgmt.codfw.wmnet with reboot policy FORCED
[03:59:17] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2213 to codfw - jhancock@cumin2002"
[04:00:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2213 to codfw - jhancock@cumin2002"
[04:00:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:06:40] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[04:08:59] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2214 to codfw - jhancock@cumin2002"
[04:09:50] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2214 to codfw - jhancock@cumin2002"
[04:09:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:10:21] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2214.mgmt.codfw.wmnet with reboot policy FORCED
[04:12:47] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[04:14:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2211.mgmt.codfw.wmnet with reboot policy FORCED
[04:14:41] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2215 to codfw - jhancock@cumin2002"
[04:15:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2215 to codfw - jhancock@cumin2002"
[04:15:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:15:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2215.mgmt.codfw.wmnet with reboot policy FORCED
[04:18:01] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[04:18:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2212.mgmt.codfw.wmnet with reboot policy FORCED
[04:19:56] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2216 to codfw - jhancock@cumin2002"
[04:20:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2216 to codfw - jhancock@cumin2002"
[04:20:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:21:07] !log jhancock@cumin2002 
END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2213.mgmt.codfw.wmnet with reboot policy FORCED
[04:21:26] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2216.mgmt.codfw.wmnet with reboot policy FORCED
[04:22:33] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[04:23:34] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2214.mgmt.codfw.wmnet with reboot policy FORCED
[04:24:36] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2217 to codfw - jhancock@cumin2002"
[04:25:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2217 to codfw - jhancock@cumin2002"
[04:25:29] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:25:57] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2217.mgmt.codfw.wmnet with reboot policy FORCED
[04:27:47] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[04:29:53] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2218 to codfw - jhancock@cumin2002"
[04:30:36] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[04:30:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2218 to codfw - jhancock@cumin2002"
[04:30:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:31:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2218.mgmt.codfw.wmnet with reboot policy FORCED
[04:31:56] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[04:32:46] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[04:34:44] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2219 to codfw - jhancock@cumin2002"
[04:35:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2219 to codfw - jhancock@cumin2002"
[04:35:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:36:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2219.mgmt.codfw.wmnet with reboot policy FORCED
[04:36:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2215.mgmt.codfw.wmnet with reboot policy FORCED
[04:39:06] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[04:41:04] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2220 to codfw - jhancock@cumin2002"
[04:41:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2220 to codfw - jhancock@cumin2002"
[04:41:55] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[04:42:28] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2220.mgmt.codfw.wmnet with reboot policy FORCED
[04:43:09] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2216.mgmt.codfw.wmnet with reboot policy FORCED
[04:51:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2218.mgmt.codfw.wmnet with reboot policy FORCED
[04:52:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2217.mgmt.codfw.wmnet with reboot policy FORCED
[04:58:47] !log 
jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2219.mgmt.codfw.wmnet with reboot policy FORCED
[05:00:47] RECOVERY - snapshot of s2 in codfw on backupmon1001 is OK: Last snapshot for s2 at codfw (db2097) taken on 2024-02-21 04:07:31 (1020 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[05:02:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2220.mgmt.codfw.wmnet with reboot policy FORCED
[05:06:20] * kart_ deploying MinT
[05:06:29] (CR) KartikMistry: [C: +2] Update MinT to 2024-02-20-062448-production [deployment-charts] - https://gerrit.wikimedia.org/r/995170 (https://phabricator.wikimedia.org/T333969) (owner: KartikMistry)
[05:07:46] (Merged) jenkins-bot: Update MinT to 2024-02-20-062448-production [deployment-charts] - https://gerrit.wikimedia.org/r/995170 (https://phabricator.wikimedia.org/T333969) (owner: KartikMistry)
[05:09:09] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[05:13:02] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[05:13:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2102.codfw.wmnet with reason: Maintenance
[05:13:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2102.codfw.wmnet with reason: Maintenance
[05:14:58] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[05:15:24] (PS1) Marostegui: db2167: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1005217 (https://phabricator.wikimedia.org/T354826)
[05:20:52] (CR) Marostegui: [C: +2] db2167: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1005217 (https://phabricator.wikimedia.org/T354826) (owner: Marostegui)
[05:21:07] !log kartik@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/machinetranslation: apply
[05:21:41] !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s2
[05:21:46] !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s7
[05:23:01] (PS1) Marostegui: clouddb1018: Upgrade to MariaDB 10.6 [puppet] - https://gerrit.wikimedia.org/r/1005218 (https://phabricator.wikimedia.org/T356838)
[05:23:22] (CR) Marostegui: "The host is already depooled." [puppet] - https://gerrit.wikimedia.org/r/1005218 (https://phabricator.wikimedia.org/T356838) (owner: Marostegui)
[05:25:03] (PS1) RLazarus: k8s-controller-sidecars: Bump the pod's memory [deployment-charts] - https://gerrit.wikimedia.org/r/1005219 (https://phabricator.wikimedia.org/T348284)
[05:33:56] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[05:37:43] (PS1) Marostegui: es1026: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1005221 (https://phabricator.wikimedia.org/T358080)
[05:38:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1026 T358080', diff saved to https://phabricator.wikimedia.org/P57434 and previous config saved to /var/cache/conftool/dbconfig/20240221-053822-root.json
[05:38:28] T358080: Upgrade es2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358080
[05:39:04] (CR) Marostegui: [C: +2] es1026: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1005221 (https://phabricator.wikimedia.org/T358080) (owner: Marostegui)
[05:39:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1026.eqiad.wmnet with OS bookworm
[05:41:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[05:41:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2103.codfw.wmnet with reason: Maintenance
[05:41:37] 
!log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2103 (T355609)', diff saved to https://phabricator.wikimedia.org/P57435 and previous config saved to /var/cache/conftool/dbconfig/20240221-054136-marostegui.json
[05:41:42] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[05:42:09] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[05:45:00] !log Updated MinT to 2024-02-20-062448-production (T333969, T354666)
[05:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:08] T333969: Enable Opus models for languages lacking other Machine Translation options - https://phabricator.wikimedia.org/T333969
[05:45:08] T354666: Enable MADLAD-400 in MinT test instance for Wikipedia languages not supported by other services - https://phabricator.wikimedia.org/T354666
[05:53:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1026.eqiad.wmnet with reason: host reimage
[05:55:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1026.eqiad.wmnet with reason: host reimage
[05:58:37] (PS1) Marostegui: Revert "es1026: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1005083
[06:01:45] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:09:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T355609)', diff saved to https://phabricator.wikimedia.org/P57436 and previous config saved to /var/cache/conftool/dbconfig/20240221-060928-marostegui.json
[06:09:35] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[06:11:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1026.eqiad.wmnet with OS 
bookworm
[06:11:57] (CR) Marostegui: [C: +2] Revert "es1026: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1005083 (owner: Marostegui)
[06:13:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 1%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57437 and previous config saved to /var/cache/conftool/dbconfig/20240221-061325-root.json
[06:24:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P57438 and previous config saved to /var/cache/conftool/dbconfig/20240221-062434-marostegui.json
[06:29:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57439 and previous config saved to /var/cache/conftool/dbconfig/20240221-062928-root.json
[06:39:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103', diff saved to https://phabricator.wikimedia.org/P57440 and previous config saved to /var/cache/conftool/dbconfig/20240221-063940-marostegui.json
[06:44:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57441 and previous config saved to /var/cache/conftool/dbconfig/20240221-064433-root.json
[06:46:58] RECOVERY - snapshot of x1 in codfw on backupmon1001 is OK: Last snapshot for x1 at codfw (db2097) taken on 2024-02-21 06:04:25 (481 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[06:48:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:51:45] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals 
uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:54:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2103 (T355609)', diff saved to https://phabricator.wikimedia.org/P57442 and previous config saved to /var/cache/conftool/dbconfig/20240221-065447-marostegui.json
[06:54:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2116.codfw.wmnet with reason: Maintenance
[06:54:53] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[06:55:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2116.codfw.wmnet with reason: Maintenance
[06:55:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T355609)', diff saved to https://phabricator.wikimedia.org/P57443 and previous config saved to /var/cache/conftool/dbconfig/20240221-065508-marostegui.json
[06:59:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57444 and previous config saved to /var/cache/conftool/dbconfig/20240221-065938-root.json
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T0700)
[07:01:04] (PuppetDisabled) firing: Puppet disabled on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=ganeti&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[07:04:22] (PoolcounterFullQueues) firing: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:05:30] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:06:30] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:09:21] (PoolcounterFullQueues) resolved: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[07:14:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57445 and previous config saved to /var/cache/conftool/dbconfig/20240221-071443-root.json
[07:15:32] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:15:32] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:22:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T355609)', diff saved to https://phabricator.wikimedia.org/P57446 and previous config saved to /var/cache/conftool/dbconfig/20240221-072255-marostegui.json
[07:23:01] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[07:27:02] (CR) Muehlenhoff: [C: +2] 
profile::mariadb::wmf_root_client: Remove cumin1001 from allow list [puppet] - 10https://gerrit.wikimedia.org/r/1005106 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [07:29:05] (03PS1) 10Muehlenhoff: Remove cumin1001 from list of Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/1005401 (https://phabricator.wikimedia.org/T353419) [07:29:07] (03CR) 10Dom Walden: [C: 03+1] beta: Switch block schema to read-new/write-both mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998625 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling) [07:29:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57447 and previous config saved to /var/cache/conftool/dbconfig/20240221-072948-root.json [07:32:46] (03PS1) 10Muehlenhoff: Configure cluster::management for Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1005402 (https://phabricator.wikimedia.org/T349619) [07:38:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P57448 and previous config saved to /var/cache/conftool/dbconfig/20240221-073801-marostegui.json [07:44:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57449 and previous config saved to /var/cache/conftool/dbconfig/20240221-074452-root.json [07:51:45] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [07:53:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to 
https://phabricator.wikimedia.org/P57450 and previous config saved to /var/cache/conftool/dbconfig/20240221-075307-marostegui.json [08:00:05] Amir1 and Urbanecm: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T0800). [08:00:05] No Gerrit patches in the queue for this window AFAICS. [08:03:55] (03PS1) 10Samwilson: CommonSettings: Set $wgWikisourceHttpProxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005434 (https://phabricator.wikimedia.org/T357857) [08:04:33] (03CR) 10Majavah: [C: 03+1] clouddb1018: Upgrade to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1005218 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [08:06:36] (03CR) 10Muehlenhoff: [C: 03+2] Configure cluster::management for Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1005402 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [08:08:09] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9562410 (10MoritzMuehlenhoff) [08:08:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T355609)', diff saved to https://phabricator.wikimedia.org/P57451 and previous config saved to /var/cache/conftool/dbconfig/20240221-080814-marostegui.json [08:08:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2130.codfw.wmnet with reason: Maintenance [08:08:20] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [08:08:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2130.codfw.wmnet with reason: Maintenance [08:08:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2130 (T355609)', diff saved to https://phabricator.wikimedia.org/P57452 and previous config saved to 
/var/cache/conftool/dbconfig/20240221-080836-marostegui.json [08:09:32] (03PS1) 10Samwilson: InitializeSettings: Add Wikisource logging channel to prod and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005435 (https://phabricator.wikimedia.org/T357857) [08:15:34] (03PS1) 10Muehlenhoff: Switch backup1001 to Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1005436 [08:17:56] (03PS1) 10Muehlenhoff: Switch backup2001 to Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1005437 [08:19:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [08:20:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance [08:20:08] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:20:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:20:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1158 (T357189)', diff saved to https://phabricator.wikimedia.org/P57454 and previous config saved to /var/cache/conftool/dbconfig/20240221-082029-arnaudb.json [08:20:38] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [08:21:09] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2180,2188-2190].codfw.wmnet with reason: Silence for reboot T356240 [08:21:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2180,2188-2190].codfw.wmnet with reason: Silence for reboot T356240 [08:21:45] (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - 
https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [08:22:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 db2188 db2189 db2190 depool for T356240', diff saved to https://phabricator.wikimedia.org/P57455 and previous config saved to /var/cache/conftool/dbconfig/20240221-082219-arnaudb.json [08:23:15] (MediaWikiLatencyExceeded) firing: p75 latency high: eqiad mw-api-int (k8s) 1.18s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:23:33] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2180.codfw.wmnet [08:23:33] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2188.codfw.wmnet [08:23:34] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2190.codfw.wmnet [08:23:34] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2189.codfw.wmnet [08:28:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2188.codfw.wmnet [08:28:15] (MediaWikiLatencyExceeded) resolved: p75 latency high: eqiad mw-api-int (k8s) 1.091s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:28:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2189.codfw.wmnet [08:28:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 
(T357189)', diff saved to https://phabricator.wikimedia.org/P57456 and previous config saved to /var/cache/conftool/dbconfig/20240221-082818-arnaudb.json [08:28:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2190.codfw.wmnet [08:28:24] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [08:28:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2180.codfw.wmnet [08:29:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57457 and previous config saved to /var/cache/conftool/dbconfig/20240221-082935-arnaudb.json [08:29:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2188 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57458 and previous config saved to /var/cache/conftool/dbconfig/20240221-082955-arnaudb.json [08:30:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57459 and previous config saved to /var/cache/conftool/dbconfig/20240221-083006-arnaudb.json [08:30:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57460 and previous config saved to /var/cache/conftool/dbconfig/20240221-083016-arnaudb.json [08:36:57] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts sretest2005.codfw.wmnet [08:37:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T355609)', diff saved to https://phabricator.wikimedia.org/P57461 and previous config saved to /var/cache/conftool/dbconfig/20240221-083731-marostegui.json [08:37:37] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [08:41:44] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [08:43:06] !log 
ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:43:06] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts sretest2005.codfw.wmnet [08:43:16] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152#9562456 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin1002 for hosts: `sretest2005.codfw.wmnet` - sretest2005.codfw... [08:43:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P57462 and previous config saved to /var/cache/conftool/dbconfig/20240221-084325-arnaudb.json [08:44:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57463 and previous config saved to /var/cache/conftool/dbconfig/20240221-084440-arnaudb.json [08:45:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2188 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57464 and previous config saved to /var/cache/conftool/dbconfig/20240221-084459-arnaudb.json [08:45:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57465 and previous config saved to /var/cache/conftool/dbconfig/20240221-084511-arnaudb.json [08:45:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57466 and previous config saved to /var/cache/conftool/dbconfig/20240221-084521-arnaudb.json [08:45:58] 10SRE-swift-storage, 10MediaWiki-Uploading, 10User-revi: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". 
- https://phabricator.wikimedia.org/T200820#9562459 (10Bawolff) FWIW, i've been investigating this. It does seem to be happening... [08:52:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P57467 and previous config saved to /var/cache/conftool/dbconfig/20240221-085238-marostegui.json [08:58:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158', diff saved to https://phabricator.wikimedia.org/P57468 and previous config saved to /var/cache/conftool/dbconfig/20240221-085830-arnaudb.json [08:58:57] (03CR) 10Marostegui: [C: 03+2] clouddb1018: Upgrade to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1005218 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [08:59:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57469 and previous config saved to /var/cache/conftool/dbconfig/20240221-085944-arnaudb.json [09:00:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2188 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57470 and previous config saved to /var/cache/conftool/dbconfig/20240221-090004-arnaudb.json [09:00:12] !log Restarted CI Jenkins on contint2002 to update the timestamper plugin [09:00:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57471 and previous config saved to /var/cache/conftool/dbconfig/20240221-090016-arnaudb.json [09:00:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57472 and previous config saved to /var/cache/conftool/dbconfig/20240221-090026-arnaudb.json [09:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:50] 10SRE: Improve automation 
for the vendor maintenance calendar - https://phabricator.wikimedia.org/T357630#9562472 (10Peachey88) [09:05:49] (PuppetDisabled) resolved: Puppet disabled on ganeti2033:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=ganeti&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [09:06:14] (03PS1) 10Marostegui: mariadb: Remove pif_edits views [puppet] - 10https://gerrit.wikimedia.org/r/1005438 (https://phabricator.wikimedia.org/T356838) [09:06:19] 10SRE: Improve automation for the vendor maintenance calendar - https://phabricator.wikimedia.org/T357630#9562486 (10ayounsi) See also {T230835} Putting my clinic duty hat on: @andrea.denisse please assign a subteam and a priority. [09:06:50] !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s7 [09:06:54] !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s2 [09:07:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P57473 and previous config saved to /var/cache/conftool/dbconfig/20240221-090744-marostegui.json [09:09:46] (03PS1) 10Marostegui: es1030: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005439 (https://phabricator.wikimedia.org/T358080) [09:09:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1030 T358080', diff saved to https://phabricator.wikimedia.org/P57474 and previous config saved to /var/cache/conftool/dbconfig/20240221-090957-root.json [09:10:03] T358080: Upgrade es2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358080 [09:10:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1030.eqiad.wmnet with OS bookworm [09:10:58] (03CR) 10Marostegui: [C: 03+2] es1030: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005439 
(https://phabricator.wikimedia.org/T358080) (owner: 10Marostegui) [09:13:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1158 (T357189)', diff saved to https://phabricator.wikimedia.org/P57475 and previous config saved to /var/cache/conftool/dbconfig/20240221-091337-arnaudb.json [09:13:39] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:13:44] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [09:13:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [09:13:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1170 (T357189)', diff saved to https://phabricator.wikimedia.org/P57476 and previous config saved to /var/cache/conftool/dbconfig/20240221-091358-arnaudb.json [09:14:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2180 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57477 and previous config saved to /var/cache/conftool/dbconfig/20240221-091449-arnaudb.json [09:15:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2188 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57478 and previous config saved to /var/cache/conftool/dbconfig/20240221-091509-arnaudb.json [09:15:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2189 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57479 and previous config saved to /var/cache/conftool/dbconfig/20240221-091521-arnaudb.json [09:15:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2190 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57480 and previous config saved to /var/cache/conftool/dbconfig/20240221-091531-arnaudb.json [09:22:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2130 (T355609)', diff saved to https://phabricator.wikimedia.org/P57481 and previous config saved to /var/cache/conftool/dbconfig/20240221-092251-marostegui.json [09:22:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [09:22:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T357189)', diff saved to https://phabricator.wikimedia.org/P57482 and previous config saved to /var/cache/conftool/dbconfig/20240221-092256-arnaudb.json [09:22:57] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:23:06] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [09:23:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2141.codfw.wmnet with reason: Maintenance [09:24:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1030.eqiad.wmnet with reason: host reimage [09:26:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1030.eqiad.wmnet with reason: host reimage [09:38:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P57484 and previous config saved to /var/cache/conftool/dbconfig/20240221-093802-arnaudb.json [09:40:59] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS bookworm [09:42:28] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1030.eqiad.wmnet with OS bookworm [09:43:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 1%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57485 and previous config saved to /var/cache/conftool/dbconfig/20240221-094319-root.json [09:44:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 
6:00:00 on db2145.codfw.wmnet with reason: Maintenance [09:45:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2145.codfw.wmnet with reason: Maintenance [09:45:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T355609)', diff saved to https://phabricator.wikimedia.org/P57486 and previous config saved to /var/cache/conftool/dbconfig/20240221-094516-marostegui.json [09:45:24] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:47:03] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 10Phabricator, 10Patch-For-Review: Migrate dev user accounts for bvibber - https://phabricator.wikimedia.org/T358044#9562598 (10MoritzMuehlenhoff) @bvibber Renaming the user name for SSH access will leave files in the old home inaccessible (we don't ne... [09:47:58] (03PS3) 10Alexandros Kosiaris: conftool: Add mw-parsoid stanzas [puppet] - 10https://gerrit.wikimedia.org/r/1004151 (https://phabricator.wikimedia.org/T357392) [09:48:02] (03PS4) 10Alexandros Kosiaris: service::catalog: Add mw-parsoid service [puppet] - 10https://gerrit.wikimedia.org/r/1004152 (https://phabricator.wikimedia.org/T357392) [09:48:04] (03PS4) 10Alexandros Kosiaris: mw-parsoid: Add LVS backends on wikikube servers [puppet] - 10https://gerrit.wikimedia.org/r/1004153 (https://phabricator.wikimedia.org/T357392) [09:48:06] (03PS4) 10Alexandros Kosiaris: mw-parsoid: Switch to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1004154 (https://phabricator.wikimedia.org/T357392) [09:48:08] (03PS4) 10Alexandros Kosiaris: mw-parsoid: Switch to production and have it page [puppet] - 10https://gerrit.wikimedia.org/r/1004155 (https://phabricator.wikimedia.org/T357392) [09:49:13] (03CR) 10Muehlenhoff: [C: 04-1] "Needs some changes, see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/1005441 (https://phabricator.wikimedia.org/T358044) (owner: 10Ayounsi) [09:49:40] (KubernetesAPINotScrapable) firing: (2)
k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [09:49:58] (03CR) 10JMeybohm: [C: 03+1] k8s-controller-sidecars: Bump the pod's memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005219 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [09:50:42] 10sre-alert-triage, 10SRE Observability (FY2023/2024-Q3): Alert in need of triage: AlertLintProblem (instance localhost:9123) - https://phabricator.wikimedia.org/T354255#9562611 (10LSobanski) There are now four other similar alerts that are over a month old: Linting problems found for EnvoyRuntimeAdminOverrid... [09:52:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2031 T358080', diff saved to https://phabricator.wikimedia.org/P57487 and previous config saved to /var/cache/conftool/dbconfig/20240221-095205-root.json [09:52:11] T358080: Upgrade es2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358080 [09:53:07] (03PS1) 10Marostegui: es2031: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005448 (https://phabricator.wikimedia.org/T358080) [09:53:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170', diff saved to https://phabricator.wikimedia.org/P57488 and previous config saved to /var/cache/conftool/dbconfig/20240221-095309-arnaudb.json [09:53:34] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2031.codfw.wmnet with OS bookworm [09:53:56] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [09:54:27] (03CR) 10Marostegui: [C: 03+2] es2031: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1005448 (https://phabricator.wikimedia.org/T358080) (owner: 10Marostegui) [09:55:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/999715 (owner: 10Muehlenhoff) 
[09:56:16] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [09:58:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57489 and previous config saved to /var/cache/conftool/dbconfig/20240221-095823-root.json [09:59:54] (03PS1) 10JMeybohm: kafka_shipper: Name omkafka actions to ingest metrics [puppet] - 10https://gerrit.wikimedia.org/r/1005449 (https://phabricator.wikimedia.org/T357616) [10:05:05] (03CR) 10Ayounsi: [V: 03+1 C: 03+2] "reimage of sretest1003 worked fine." [puppet] - 10https://gerrit.wikimedia.org/r/994223 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [10:05:45] (03CR) 10JMeybohm: [C: 03+1] deployment_server: add mw-mcrouter service 1 [puppet] - 10https://gerrit.wikimedia.org/r/979339 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:05:48] (03CR) 10JMeybohm: [C: 03+1] Add namespace for mw-mcrouter service 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/979340 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [10:08:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170 (T357189)', diff saved to https://phabricator.wikimedia.org/P57490 and previous config saved to /var/cache/conftool/dbconfig/20240221-100815-arnaudb.json [10:08:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1171.eqiad.wmnet with reason: Maintenance [10:08:21] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [10:08:29] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:08:31] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1171.eqiad.wmnet with 
reason: Maintenance [10:08:35] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:08:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:08:56] (03PS1) 10Ayounsi: Routed Ganeti: move the tap v4 IP to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1005450 (https://phabricator.wikimedia.org/T300152) [10:09:15] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:09:45] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, see also https://github.com/prometheus-community/rsyslog_exporter/pull/12#issuecomment-1956303298 for metrics that will be added" [puppet] - 10https://gerrit.wikimedia.org/r/1005449 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [10:10:41] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1005450 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [10:11:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T355609)', diff saved to https://phabricator.wikimedia.org/P57491 and previous config saved to /var/cache/conftool/dbconfig/20240221-101111-marostegui.json [10:11:17] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:11:25] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:11:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51453 bytes in 4.147 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:12:03] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS bookworm [10:12:07] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:12:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2031.codfw.wmnet with reason: host reimage [10:13:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57492 and previous config saved to /var/cache/conftool/dbconfig/20240221-101328-root.json [10:14:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2031.codfw.wmnet with reason: host reimage [10:15:12] (03PS2) 10Ayounsi: Routed Ganeti: move the tap v4 IP to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1005450 (https://phabricator.wikimedia.org/T300152) [10:15:32] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1005450 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [10:16:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [10:16:39] (03CR) 10Clément Goubert: [C: 03+1] logstash_checker.py: Add ability to check all MediaWiki canaries at once [puppet] - 10https://gerrit.wikimedia.org/r/1003885 (https://phabricator.wikimedia.org/T357402) (owner: 10Ahmon Dancy) [10:16:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1174.eqiad.wmnet with reason: Maintenance [10:16:47] !log arnaudb@cumin1002 dbctl 
commit (dc=all): 'Depooling db1174 (T357189)', diff saved to https://phabricator.wikimedia.org/P57493 and previous config saved to /var/cache/conftool/dbconfig/20240221-101646-arnaudb.json [10:16:52] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [10:18:27] (03PS3) 10Ayounsi: Routed Ganeti: move the tap v4 IP to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/1005450 (https://phabricator.wikimedia.org/T300152) [10:20:58] (03PS1) 10Ayounsi: Add .vscode to .gitignore [puppet] - 10https://gerrit.wikimedia.org/r/1005451 [10:22:27] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1005451 (owner: 10Ayounsi) [10:22:51] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1005450 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [10:24:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T357189)', diff saved to https://phabricator.wikimedia.org/P57494 and previous config saved to /var/cache/conftool/dbconfig/20240221-102432-arnaudb.json [10:24:38] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [10:25:45] (03CR) 10Fabfur: [C: 03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1005451 (owner: 10Ayounsi) [10:26:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P57495 and previous config saved to /var/cache/conftool/dbconfig/20240221-102618-marostegui.json [10:26:39] (03CR) 10Ayounsi: [C: 03+2] Add .vscode to .gitignore [puppet] - 10https://gerrit.wikimedia.org/r/1005451 (owner: 10Ayounsi) [10:28:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57496 and previous config saved to /var/cache/conftool/dbconfig/20240221-102833-root.json [10:31:43] (03CR) 10Hnowlan: [C: 03+2] changeprop: clean up k8s jobrunner references [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004066 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [10:32:41] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [10:32:58] (03Merged) 10jenkins-bot: changeprop: clean up k8s jobrunner references [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004066 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [10:32:58] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [10:34:16] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply [10:34:32] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [10:34:44] (03CR) 10Muehlenhoff: [C: 03+2] puppetdb: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/999715 (owner: 10Muehlenhoff) [10:35:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2031.codfw.wmnet with OS bookworm [10:35:56] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [10:36:27] !log hnowlan@deploy2002 helmfile [eqiad] DONE 
helmfile.d/services/changeprop-jobqueue: apply [10:36:35] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [10:37:04] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [10:39:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P57497 and previous config saved to /var/cache/conftool/dbconfig/20240221-103938-arnaudb.json [10:41:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P57498 and previous config saved to /var/cache/conftool/dbconfig/20240221-104124-marostegui.json [10:42:20] (03PS1) 10Marostegui: Revert "es2031: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1005466 [10:43:04] (03CR) 10Alexandros Kosiaris: [C: 03+2] service::catalog: Add mw-parsoid service [puppet] - 10https://gerrit.wikimedia.org/r/1004152 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [10:43:12] (03CR) 10Alexandros Kosiaris: [C: 03+2] conftool: Add mw-parsoid stanzas [puppet] - 10https://gerrit.wikimedia.org/r/1004151 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [10:43:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] mw-parsoid: Add LVS backends on wikikube servers [puppet] - 10https://gerrit.wikimedia.org/r/1004153 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [10:43:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57499 and previous config saved to /var/cache/conftool/dbconfig/20240221-104339-root.json [10:44:05] (03PS2) 10Hnowlan: users: add jwheeler to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1004187 (https://phabricator.wikimedia.org/T357731) [10:44:07] (03CR) 10Marostegui: [C: 03+2] Revert 
"es2031: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1005466 (owner: 10Marostegui) [10:45:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57500 and previous config saved to /var/cache/conftool/dbconfig/20240221-104526-root.json [10:48:35] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [10:51:20] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1004187 (https://phabricator.wikimedia.org/T357731) (owner: 10Hnowlan) [10:52:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: nova-compute: persist compute node id [puppet] - 10https://gerrit.wikimedia.org/r/1005065 (https://phabricator.wikimedia.org/T357631) (owner: 10Arturo Borrero Gonzalez) [10:54:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P57501 and previous config saved to /var/cache/conftool/dbconfig/20240221-105445-arnaudb.json [10:55:55] (03PS1) 10STran: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005457 (https://phabricator.wikimedia.org/T356736) [10:56:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T355609)', diff saved to https://phabricator.wikimedia.org/P57502 and previous config saved to /var/cache/conftool/dbconfig/20240221-105630-marostegui.json [10:56:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2146.codfw.wmnet with reason: Maintenance [10:56:36] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:56:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db2146.codfw.wmnet with reason: Maintenance [10:56:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T355609)', diff saved to https://phabricator.wikimedia.org/P57503 and previous config saved to /var/cache/conftool/dbconfig/20240221-105654-marostegui.json [10:58:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57504 and previous config saved to /var/cache/conftool/dbconfig/20240221-105844-root.json [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1100) [11:00:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57505 and previous config saved to /var/cache/conftool/dbconfig/20240221-110031-root.json [11:01:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver2002.codfw.wmnet [11:02:01] (03CR) 10Tchanders: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005457 (https://phabricator.wikimedia.org/T356736) (owner: 10STran) [11:02:25] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:03:03] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005457 (https://phabricator.wikimedia.org/T356736) (owner: 10STran) [11:03:25] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:04:07] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host puppetserver2002.codfw.wmnet [11:05:04] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [11:05:50] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [11:06:14] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetserver1001.eqiad.wmnet [11:07:10] !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [11:07:25] (SystemdUnitFailed) firing: (2) httpbb_kubernetes_mw-parsoid_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:08:11] !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [11:08:35] !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [11:08:35] (SystemdUnitFailed) firing: (3) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:09:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetserver1001.eqiad.wmnet [11:09:29] the httpbb mw-parsoid alerts can be ignored for now. I am still in the process of setting up the service. I didn't expect them to fire though. 
[11:09:45] !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [11:09:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T357189)', diff saved to https://phabricator.wikimedia.org/P57506 and previous config saved to /var/cache/conftool/dbconfig/20240221-110951-arnaudb.json [11:09:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [11:09:54] !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [11:09:56] !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [11:09:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db[2191-2193].codfw.wmnet,db1151.eqiad.wmnet with reason: Silence for reboot T356240 [11:10:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1191.eqiad.wmnet with reason: Maintenance [11:10:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1191 (T357189)', diff saved to https://phabricator.wikimedia.org/P57507 and previous config saved to /var/cache/conftool/dbconfig/20240221-111012-arnaudb.json [11:10:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db[2191-2193].codfw.wmnet,db1151.eqiad.wmnet with reason: Silence for reboot T356240 [11:10:22] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [11:10:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T356240 - depooling db2191 db2192 db2193 db1151', diff saved to https://phabricator.wikimedia.org/P57508 and previous config saved to /var/cache/conftool/dbconfig/20240221-111023-arnaudb.json [11:11:43] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2191.codfw.wmnet [11:11:44] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1151.eqiad.wmnet [11:11:44] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade 
for db2193.codfw.wmnet [11:11:45] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2192.codfw.wmnet [11:13:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1030 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57510 and previous config saved to /var/cache/conftool/dbconfig/20240221-111348-root.json [11:15:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57511 and previous config saved to /var/cache/conftool/dbconfig/20240221-111536-root.json [11:16:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2193.codfw.wmnet [11:16:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2192.codfw.wmnet [11:16:15] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2191.codfw.wmnet [11:17:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1151.eqiad.wmnet [11:18:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T357189)', diff saved to https://phabricator.wikimedia.org/P57512 and previous config saved to /var/cache/conftool/dbconfig/20240221-111805-arnaudb.json [11:18:14] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [11:18:35] (SystemdUnitFailed) firing: (3) generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:18:54] (03CR) 10Majavah: [C: 03+1] Add gitreview configuration [software/bitu] - 10https://gerrit.wikimedia.org/r/997809 (https://phabricator.wikimedia.org/T355180) (owner: 10Slyngshede) [11:20:10] (03PS1) 10Jelto: etherpad: make exporter and blackbox checks configurable [puppet] - 
10https://gerrit.wikimedia.org/r/1005458 (https://phabricator.wikimedia.org/T316421) [11:24:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T355609)', diff saved to https://phabricator.wikimedia.org/P57513 and previous config saved to /var/cache/conftool/dbconfig/20240221-112408-marostegui.json [11:24:15] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:24:37] (03PS2) 10Jelto: etherpad: make exporter and blackbox checks configurable [puppet] - 10https://gerrit.wikimedia.org/r/1005458 (https://phabricator.wikimedia.org/T316421) [11:24:59] (03CR) 10Hnowlan: [C: 03+1] c-cqlsh is now deprecated; long live cqlsh-instance [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/1004235 (owner: 10Eevans) [11:30:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57514 and previous config saved to /var/cache/conftool/dbconfig/20240221-113041-root.json [11:32:09] !log volans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Added cassandra IPs for restbase10[34-42] - volans@cumin1002" [11:32:35] !log volans@cumin1002 END (ERROR) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=97) generate netbox hiera data: "Added cassandra IPs for restbase10[34-42] - volans@cumin1002" [11:32:52] !log volans@cumin1002 START - Cookbook sre.dns.netbox [11:33:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P57515 and previous config saved to /var/cache/conftool/dbconfig/20240221-113311-arnaudb.json [11:34:26] jouncebot: nowandnext [11:34:26] For the next 0 hour(s) and 25 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1100) [11:34:27] In 2 hour(s) and 25 minute(s): UTC afternoon backport window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1400) [11:35:12] !log volans@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Added cassandra IPs for restbase10[34-42] - volans@cumin1002" [11:36:34] !log volans@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Added cassandra IPs for restbase10[34-42] - volans@cumin1002" [11:36:34] !log volans@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:39:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P57516 and previous config saved to /var/cache/conftool/dbconfig/20240221-113914-marostegui.json [11:45:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57517 and previous config saved to /var/cache/conftool/dbconfig/20240221-114546-root.json [11:48:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P57518 and previous config saved to /var/cache/conftool/dbconfig/20240221-114817-arnaudb.json [11:48:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57519 and previous config saved to /var/cache/conftool/dbconfig/20240221-114856-arnaudb.json [11:49:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57520 and previous config saved to /var/cache/conftool/dbconfig/20240221-114909-arnaudb.json [11:49:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57521 and previous config 
saved to /var/cache/conftool/dbconfig/20240221-114925-arnaudb.json [11:54:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P57522 and previous config saved to /var/cache/conftool/dbconfig/20240221-115421-marostegui.json [11:56:34] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:56:58] PROBLEM - PyBal IPVS diff check on lvs2014 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:57:38] PROBLEM - PyBal IPVS diff check on lvs2013 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:58:30] PROBLEM - PyBal connections to etcd on lvs2013 is CRITICAL: CRITICAL: 80 connections established with conf2004.codfw.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal [11:58:44] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [11:59:27] PROBLEM - PyBal connections to etcd on lvs1019 is CRITICAL: CRITICAL: 84 connections established with conf1007.eqiad.wmnet:4001 (min=85) https://wikitech.wikimedia.org/wiki/PyBal [11:59:27] PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 114 connections established with conf1007.eqiad.wmnet:4001 (min=115) https://wikitech.wikimedia.org/wiki/PyBal [11:59:39] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:59:40] (KubernetesAPINotScrapable) resolved: (2) k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable [12:00:13] PROBLEM - PyBal connections to 
etcd on lvs2014 is CRITICAL: CRITICAL: 98 connections established with conf2004.codfw.wmnet:4001 (min=99) https://wikitech.wikimedia.org/wiki/PyBal [12:00:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57523 and previous config saved to /var/cache/conftool/dbconfig/20240221-120051-root.json [12:01:13] !log restart pybal on lvs1020 to pickup mw-parsoid service. T357392 [12:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:37] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:01:43] T357392: Create parsoid mediawiki deployment - https://phabricator.wikimedia.org/T357392 [12:02:01] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:02:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2026 T358080', diff saved to https://phabricator.wikimedia.org/P57524 and previous config saved to /var/cache/conftool/dbconfig/20240221-120202-root.json [12:02:21] !log restart pybal on lvs2014 to pickup mw-parsoid service. 
T357392 [12:02:23] T358080: Upgrade es2 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358080 [12:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T357189)', diff saved to https://phabricator.wikimedia.org/P57525 and previous config saved to /var/cache/conftool/dbconfig/20240221-120324-arnaudb.json [12:03:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [12:03:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1194.eqiad.wmnet with reason: Maintenance [12:03:41] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:03:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2026.codfw.wmnet with OS bookworm [12:03:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1194 (T357189)', diff saved to https://phabricator.wikimedia.org/P57526 and previous config saved to /var/cache/conftool/dbconfig/20240221-120345-arnaudb.json [12:04:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57527 and previous config saved to /var/cache/conftool/dbconfig/20240221-120401-arnaudb.json [12:04:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57528 and previous config saved to /var/cache/conftool/dbconfig/20240221-120414-arnaudb.json [12:04:25] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1033.eqiad.wmnet with OS bookworm [12:04:27] RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 115 connections established with conf1007.eqiad.wmnet:4001 (min=115) https://wikitech.wikimedia.org/wiki/PyBal [12:04:30] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57529 and previous config saved to /var/cache/conftool/dbconfig/20240221-120429-arnaudb.json [12:05:13] RECOVERY - PyBal connections to etcd on lvs2014 is OK: OK: 99 connections established with conf2004.codfw.wmnet:4001 (min=99) https://wikitech.wikimedia.org/wiki/PyBal [12:07:49] Deploying fix for cxserver.. [12:08:29] RECOVERY - PyBal connections to etcd on lvs2013 is OK: OK: 81 connections established with conf2004.codfw.wmnet:4001 (min=81) https://wikitech.wikimedia.org/wiki/PyBal [12:08:43] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:09:27] RECOVERY - PyBal connections to etcd on lvs1019 is OK: OK: 85 connections established with conf1007.eqiad.wmnet:4001 (min=85) https://wikitech.wikimedia.org/wiki/PyBal [12:09:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T355609)', diff saved to https://phabricator.wikimedia.org/P57530 and previous config saved to /var/cache/conftool/dbconfig/20240221-120927-marostegui.json [12:09:28] !log aborrero@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host cloudvirt1033 [12:09:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2153.codfw.wmnet with reason: Maintenance [12:09:34] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:09:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2153.codfw.wmnet with reason: Maintenance [12:09:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2153 (T355609)', diff saved to https://phabricator.wikimedia.org/P57531 and previous config saved to /var/cache/conftool/dbconfig/20240221-120949-marostegui.json [12:09:55] !log aborrero@cumin1002 END (PASS) - Cookbook 
sre.network.configure-switch-interfaces (exit_code=0) for host cloudvirt1033 [12:10:19] !log restart pybal on lvs2013, lvs 1019 to pickup mw-parsoid service. T357392 [12:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:28] T357392: Create parsoid mediawiki deployment - https://phabricator.wikimedia.org/T357392 [12:10:37] (03Merged) 10jenkins-bot: mw-page-content-change-enrich: Switch to mw-api-int-async [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004156 (https://phabricator.wikimedia.org/T357785) (owner: 10Clément Goubert) [12:10:40] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1003112 (https://phabricator.wikimedia.org/T356459) (owner: 10JHathaway) [12:11:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T357189)', diff saved to https://phabricator.wikimedia.org/P57532 and previous config saved to /var/cache/conftool/dbconfig/20240221-121129-arnaudb.json [12:11:36] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:12:19] !log mw-page-content-change-enrich: Switch to mw-api-int-async - T357785 [12:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:27] T357785: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785 [12:12:37] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:13:00] 10SRE, 10Data-Engineering, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785#9563142 (10BTullis) I'm happy for this change to go ahead. I'll keep an eye on the [[https://grafana-rw.wikimedia.org/d/K9x0c4aVk/flink... 
[12:13:04] (03CR) 10Muehlenhoff: "All backup-related hosts are fully migrated to Puppet 7 already :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1005437 (owner: 10Muehlenhoff) [12:13:12] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [12:13:17] (03PS5) 10Jelto: etherpad: make exporter and blackbox checks configurable [puppet] - 10https://gerrit.wikimedia.org/r/1005458 (https://phabricator.wikimedia.org/T316421) [12:13:27] (03Merged) 10jenkins-bot: Ganeti: pass the v4 and v6 IPs to the VM as fw_cfg [software/spicerack] - 10https://gerrit.wikimedia.org/r/1003491 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [12:13:42] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:14:24] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:14:45] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:15:07] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2026.codfw.wmnet with OS bookworm [12:15:23] (03CR) 10David Caro: toolforge: k8s: Do not log secrets to Puppet log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1005488 (owner: 10Majavah) [12:15:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2026.codfw.wmnet with OS bookworm [12:15:34] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [12:15:36] (03CR) 10David Caro: [C: 03+1] "Just the question, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1005488 (owner: 10Majavah) [12:15:57] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:16:05] (03CR) 10Majavah: [C: 03+2] toolforge: k8s: Do not log secrets to Puppet log (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1005488 (owner: 10Majavah) [12:16:09] (03PS2) 
10Alexandros Kosiaris: Add mw-parsoid [dns] - 10https://gerrit.wikimedia.org/r/1004138 (https://phabricator.wikimedia.org/T357392) [12:16:31] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1005458 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [12:18:14] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply [12:18:46] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [12:18:49] (03CR) 10Tim Starling: [C: 03+1] CommonSettings: Set $wgWikisourceHttpProxy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005434 (https://phabricator.wikimedia.org/T357857) (owner: 10Samwilson) [12:19:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add mw-parsoid [dns] - 10https://gerrit.wikimedia.org/r/1004138 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [12:19:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57533 and previous config saved to /var/cache/conftool/dbconfig/20240221-121906-arnaudb.json [12:19:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57534 and previous config saved to /var/cache/conftool/dbconfig/20240221-121918-arnaudb.json [12:19:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57535 and previous config saved to /var/cache/conftool/dbconfig/20240221-121934-arnaudb.json [12:19:45] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply [12:19:49] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply [12:19:59] !log cgoubert@deploy2002 helmfile [codfw] START 
helmfile.d/services/mw-page-content-change-enrich: apply [12:20:02] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply [12:20:11] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:20:36] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [12:21:22] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2026.codfw.wmnet with OS bookworm [12:21:43] !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage [12:21:47] (03CR) 10Tim Starling: [C: 03+1] InitializeSettings: Add Wikisource logging channel to prod and labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005435 (https://phabricator.wikimedia.org/T357857) (owner: 10Samwilson) [12:22:50] !log Updated cxserver to 2024-02-21-112101-production (T357769) [12:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:55] T357769: cxserver "fetch segmented page content" API endpoint doesn't work for space-separated multi-word titles - https://phabricator.wikimedia.org/T357769 [12:22:59] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netbox, and 2 others: Netbox: Add support for our complex host network setups in provision script - https://phabricator.wikimedia.org/T346428#9563208 (10ayounsi) {T358096} for the Cassandra/extra IPs usecase. 
[12:24:07] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage [12:24:21] !log akosiaris@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=mw-parsoid,name=codfw [12:26:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P57536 and previous config saved to /var/cache/conftool/dbconfig/20240221-122636-arnaudb.json [12:26:42] (03CR) 10Jcrespo: [C: 03+1] "Thank you, please deploy at will" [puppet] - 10https://gerrit.wikimedia.org/r/1005437 (owner: 10Muehlenhoff) [12:30:35] (03PS5) 10Alexandros Kosiaris: mw-parsoid: Switch to production and have it page [puppet] - 10https://gerrit.wikimedia.org/r/1004155 (https://phabricator.wikimedia.org/T357392) [12:30:38] (03CR) 10Majavah: [C: 03+1] "Will you take care of dropping the views too or should I?" [puppet] - 10https://gerrit.wikimedia.org/r/1005438 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [12:30:42] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] mw-parsoid: Switch to production and have it page [puppet] - 10https://gerrit.wikimedia.org/r/1004155 (https://phabricator.wikimedia.org/T357392) (owner: 10Alexandros Kosiaris) [12:31:50] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9563252 (10Clement_Goubert) [12:33:10] 10SRE, 10Data-Engineering, 10MW-on-K8s, 10serviceops: Migrate mw-page-content-change-enrich to mw-api-int - https://phabricator.wikimedia.org/T357785#9563249 (10Clement_Goubert) 05In progress→03Resolved I can confirm that mw-page-content-enrich now requests from mw-api-int (blue) and not the appserver... 
[12:33:59] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9563260 (10MoritzMuehlenhoff) [12:34:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2191 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57537 and previous config saved to /var/cache/conftool/dbconfig/20240221-123410-arnaudb.json [12:34:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57538 and previous config saved to /var/cache/conftool/dbconfig/20240221-123423-arnaudb.json [12:34:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2193 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57539 and previous config saved to /var/cache/conftool/dbconfig/20240221-123439-arnaudb.json [12:36:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T355609)', diff saved to https://phabricator.wikimedia.org/P57540 and previous config saved to /var/cache/conftool/dbconfig/20240221-123615-marostegui.json [12:36:28] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:36:45] (03PS1) 10Muehlenhoff: acmechief: Remove obsolete entries from apt record [puppet] - 10https://gerrit.wikimedia.org/r/1005498 (https://phabricator.wikimedia.org/T331613) [12:36:56] (03PS2) 10Muehlenhoff: acmechief: Remove obsolete entries from apt record [puppet] - 10https://gerrit.wikimedia.org/r/1005498 (https://phabricator.wikimedia.org/T331613) [12:38:18] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Figure out next steps for cergen in Puppet setup - https://phabricator.wikimedia.org/T357750#9563263 (10MoritzMuehlenhoff) [12:39:38] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit 
httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:41:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P57541 and previous config saved to /var/cache/conftool/dbconfig/20240221-124142-arnaudb.json [12:44:14] (03CR) 10Clément Goubert: [C: 03+2] sre.hosts.reimage: Fix dry-run failure [cookbooks] - 10https://gerrit.wikimedia.org/r/1005112 (owner: 10Clément Goubert) [12:46:38] (03CR) 10Marostegui: "I'd prefer if you merge this and drop the views too." [puppet] - 10https://gerrit.wikimedia.org/r/1005438 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [12:48:08] jouncebot: nowandnext [12:48:08] No deployments scheduled for the next 1 hour(s) and 11 minute(s) [12:48:08] In 1 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1400) [12:48:49] (03Merged) 10jenkins-bot: sre.hosts.reimage: Fix dry-run failure [cookbooks] - 10https://gerrit.wikimedia.org/r/1005112 (owner: 10Clément Goubert) [12:49:16] (03PS4) 10Samtar: InitialiseSettings: Enable Edit Recovery on 3 projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004736 (https://phabricator.wikimedia.org/T355548) [12:51:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P57542 and previous config saved to /var/cache/conftool/dbconfig/20240221-125121-marostegui.json [12:51:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004736 (https://phabricator.wikimedia.org/T355548) (owner: 10Samtar) [12:52:18] (03PS2) 10Tim Starling: beta: Switch block schema to read-new/write-both mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998625 (https://phabricator.wikimedia.org/T355034) [12:52:36] 
(03Merged) 10jenkins-bot: InitialiseSettings: Enable Edit Recovery on 3 projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004736 (https://phabricator.wikimedia.org/T355548) (owner: 10Samtar) [12:52:47] (03CR) 10Tim Starling: [C: 03+2] beta: Switch block schema to read-new/write-both mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998625 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling) [12:52:54] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1033.eqiad.wmnet with OS bookworm [12:53:05] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9563291 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1033.eqiad.wmnet with OS book... [12:53:29] (03Merged) 10jenkins-bot: beta: Switch block schema to read-new/write-both mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/998625 (https://phabricator.wikimedia.org/T355034) (owner: 10Tim Starling) [12:53:55] !log samtar@deploy2002 Started scap: Backport for [[gerrit:1004736|InitialiseSettings: Enable Edit Recovery on 3 projects (T355548)]] [12:54:00] T355548: Edit Recovery deployment - https://phabricator.wikimedia.org/T355548 [12:54:36] (03CR) 10Ayounsi: [C: 03+1] acmechief: Remove obsolete entries from apt record [puppet] - 10https://gerrit.wikimedia.org/r/1005498 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [12:55:33] !log samtar@deploy2002 samtar: Backport for [[gerrit:1004736|InitialiseSettings: Enable Edit Recovery on 3 projects (T355548)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:55:41] * TheresNoTime testing, a few minutes [12:56:32] 10SRE, 10User-aborrero: reimage cookbook: failure when - https://phabricator.wikimedia.org/T358099#9563302 (10aborrero) [12:56:49] !log arnaudb@cumin1002 dbctl commit 
(dc=all): 'Repooling after maintenance db1194 (T357189)', diff saved to https://phabricator.wikimedia.org/P57543 and previous config saved to /var/cache/conftool/dbconfig/20240221-125648-arnaudb.json [12:56:51] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [12:56:54] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:57:03] 10SRE, 10User-aborrero: reimage cookbook: failure when updating netbox data from puppetdb on cloudvirt1033 - https://phabricator.wikimedia.org/T358099#9563314 (10aborrero) [12:57:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance [12:57:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T357189)', diff saved to https://phabricator.wikimedia.org/P57544 and previous config saved to /var/cache/conftool/dbconfig/20240221-125711-arnaudb.json [12:57:54] !log T357007 Running mwscript /home/daimona/GenerateInvitationList.php --wiki=metawiki --listfile=/home/daimona/list.txt (same as current master) [12:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:59] T357007: Generate Invitation Lists for Event Organizers - https://phabricator.wikimedia.org/T357007 [13:00:15] !log samtar@deploy2002 samtar: Continuing with sync [13:02:48] !log aborrero@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "cloudvirt1033 - aborrero@cumin1002" [13:03:24] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:03:37] !log aborrero@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "cloudvirt1033 - aborrero@cumin1002" [13:04:01] (03CR) 10Muehlenhoff: 
[C: 03+2] acmechief: Remove obsolete entries from apt record [puppet] - 10https://gerrit.wikimedia.org/r/1005498 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [13:04:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T357189)', diff saved to https://phabricator.wikimedia.org/P57545 and previous config saved to /var/cache/conftool/dbconfig/20240221-130450-arnaudb.json [13:05:08] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [13:06:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P57546 and previous config saved to /var/cache/conftool/dbconfig/20240221-130628-marostegui.json [13:07:52] (03PS1) 10JMeybohm: New upstream version v1.0.0-8522c38 [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616) [13:07:55] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:08:32] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:1004736|InitialiseSettings: Enable Edit Recovery on 3 projects (T355548)]] (duration: 14m 36s) [13:08:43] T355548: Edit Recovery deployment - https://phabricator.wikimedia.org/T355548 [13:08:49] (03PS2) 10JMeybohm: New upstream version v1.0.0-8522c38 [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616) [13:11:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report [13:11:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0) [13:11:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host apt2002.wikimedia.org [13:11:41] !log jmm@cumin2002 
START - Cookbook sre.dns.netbox [13:13:47] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM apt2002.wikimedia.org - jmm@cumin2002" [13:13:50] (03CR) 10JMeybohm: "Not by this, though. This will enable scaping of rsyslog_action metrics for omkafka actions we define." [puppet] - 10https://gerrit.wikimedia.org/r/1005449 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [13:14:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM apt2002.wikimedia.org - jmm@cumin2002" [13:14:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:14:39] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache apt2002.wikimedia.org on all recursors [13:14:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) apt2002.wikimedia.org on all recursors [13:14:44] (03CR) 10JMeybohm: "For (way to broad) PCC, see: https://puppet-compiler.wmflabs.org/output/1005449/1413/" [puppet] - 10https://gerrit.wikimedia.org/r/1005449 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [13:15:10] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM apt2002.wikimedia.org - jmm@cumin2002" [13:16:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM apt2002.wikimedia.org - jmm@cumin2002" [13:16:05] (03CR) 10JMeybohm: [C: 03+2] kafka_shipper: Name omkafka actions to ingest metrics [puppet] - 10https://gerrit.wikimedia.org/r/1005449 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [13:18:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host apt2002.wikimedia.org with OS bookworm 
[13:19:09] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Migrate apt repository to bookworm - https://phabricator.wikimedia.org/T331613#9563370 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host apt2002.wikimedia.org with OS bookworm [13:19:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P57547 and previous config saved to /var/cache/conftool/dbconfig/20240221-131957-arnaudb.json [13:21:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T355609)', diff saved to https://phabricator.wikimedia.org/P57548 and previous config saved to /var/cache/conftool/dbconfig/20240221-132134-marostegui.json [13:21:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [13:21:41] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [13:21:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [13:21:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2170 (T355609)', diff saved to https://phabricator.wikimedia.org/P57549 and previous config saved to /var/cache/conftool/dbconfig/20240221-132156-marostegui.json [13:22:16] !log cmooney@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [13:22:37] !log cmooney@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [13:32:02] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on apt2002.wikimedia.org with reason: host reimage [13:34:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on apt2002.wikimedia.org with reason: host reimage [13:35:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1202', diff saved to https://phabricator.wikimedia.org/P57550 and previous config saved to /var/cache/conftool/dbconfig/20240221-133503-arnaudb.json [13:37:08] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9563419 (10aborrero) [13:37:34] (03CR) 10Muehlenhoff: [C: 03+2] Remove cumin1001 from list of Cumin masters [puppet] - 10https://gerrit.wikimedia.org/r/1005401 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [13:38:15] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#8416726 (10aborrero) [13:39:47] 10SRE, 10Data-Persistence, 10Infrastructure-Foundations: Re-IP db servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T354878#9563430 (10Marostegui) [13:39:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on db2142.codfw.wmnet,db[1180,1213].eqiad.wmnet with reason: Silence for reboot T356240 [13:40:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on db2142.codfw.wmnet,db[1180,1213].eqiad.wmnet with reason: Silence for reboot T356240 [13:40:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T356240 - depooling db1180 db1213 db2142', diff saved to https://phabricator.wikimedia.org/P57551 and previous config saved to /var/cache/conftool/dbconfig/20240221-134015-arnaudb.json [13:40:23] !log Re-started MediaModeration scanning script using `mwscript extensions/MediaModeration/maintenance/scanFilesInScanTable.php --wiki=commonswiki --use-jobqueue --sleep 30 --verbose 2>&1 | tee ~/scan-files-in-scan-table-commonswiki-sleep-30-no-render-now.txt` - See T351400 [13:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:28] T351400: Run the maintenance script scanning images in 
mediamoderation_scan on WMF wikis - https://phabricator.wikimedia.org/T351400 [13:40:45] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1180.eqiad.wmnet [13:41:05] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db1213.eqiad.wmnet [13:41:17] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2142.codfw.wmnet [13:41:42] (03PS1) 10Muehlenhoff: Drop not obsolete motd for cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/1005511 (https://phabricator.wikimedia.org/T353419) [13:42:38] (03CR) 10Filippo Giunchedi: "LGTM, see inline" [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [13:42:46] (03PS2) 10Muehlenhoff: Drop obsolete motd for cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/1005511 (https://phabricator.wikimedia.org/T353419) [13:44:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1213.eqiad.wmnet [13:45:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db1180.eqiad.wmnet [13:45:30] (03PS1) 10Arturo Borrero Gonzalez: cloudvirt1033: move to single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1005513 (https://phabricator.wikimedia.org/T319184) [13:46:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57552 and previous config saved to /var/cache/conftool/dbconfig/20240221-134605-arnaudb.json [13:46:25] (SystemdUnitFailed) firing: ferm.service on kubernetes2016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:46:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2142.codfw.wmnet [13:47:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 20%: 
Maintenance done', diff saved to https://phabricator.wikimedia.org/P57553 and previous config saved to /var/cache/conftool/dbconfig/20240221-134724-arnaudb.json [13:47:41] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:48:33] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:49:15] (03CR) 10David Caro: [C: 03+1] cloudvirt1033: move to single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1005513 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [13:49:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudvirt1033: move to single NIC setup [puppet] - 10https://gerrit.wikimedia.org/r/1005513 (https://phabricator.wikimedia.org/T319184) (owner: 10Arturo Borrero Gonzalez) [13:50:10] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T357189)', diff saved to https://phabricator.wikimedia.org/P57554 and previous config saved to /var/cache/conftool/dbconfig/20240221-135009-arnaudb.json [13:50:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance [13:50:16] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [13:50:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance [13:50:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T357189)', diff saved to https://phabricator.wikimedia.org/P57555 and previous config saved to /var/cache/conftool/dbconfig/20240221-135031-arnaudb.json [13:50:43] !log aborrero@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1033.eqiad.wmnet with OS bookworm [13:50:54] 10SRE, 10Infrastructure-Foundations, 
10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9563463 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by aborrero@cumin1002 for host cloudvirt1033.eqiad.wmnet with OS... [13:51:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:51:25] (SystemdUnitFailed) firing: (3) ferm.service on kubernetes2016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:52:05] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes2016 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:56:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:59:14] !log adding IRB anycast interface on private1-a-codfw vlan to lsw1-a4-codfw [13:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:40] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2057:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2057 - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:59:56] * Lucas_WMDE will not be available during the backport+config window btw [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1400) [14:00:05] koi, anzx, and hoo: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:52] * TheresNoTime can't deploy in this window today, sorry! [14:01:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57556 and previous config saved to /var/cache/conftool/dbconfig/20240221-140110-arnaudb.json [14:01:15] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:01:17] :( [14:01:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T357189)', diff saved to https://phabricator.wikimedia.org/P57557 and previous config saved to /var/cache/conftool/dbconfig/20240221-140120-arnaudb.json [14:01:25] (SystemdUnitFailed) firing: (4) ferm.service on kubernetes2016:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:01:44] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:02:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 50%: 
Maintenance done', diff saved to https://phabricator.wikimedia.org/P57558 and previous config saved to /var/cache/conftool/dbconfig/20240221-140229-arnaudb.json [14:03:16] PROBLEM - Check whether ferm is active by checking the default input chain on mw2297 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:04:22] lemme see if I can move things around [14:05:02] Okay, I can deploy. koi your patch is first [14:05:08] (03PS4) 10Samtar: zhwiki: Create group ipblock-exempt-grantor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005109 (https://phabricator.wikimedia.org/T357991) (owner: 10Stang) [14:05:53] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host apt2002.wikimedia.org with OS bookworm [14:05:53] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host apt2002.wikimedia.org [14:05:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:05:58] 10SRE, 10Infrastructure-Foundations: Migrate apt repository to bookworm - https://phabricator.wikimedia.org/T331613#9563494 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host apt2002.wikimedia.org with OS bookworm executed with errors: - apt2002 (**FAIL**) - Removed... 
[14:06:15] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:06:25] (SystemdUnitFailed) firing: (5) httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:07:40] !log aborrero@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage [14:08:25] (SystemdUnitFailed) firing: send_tile_invalidations.service on maps2009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:08:25] !log restarted ferm.service on kubernetes2055.codfw.wmnet mw2440.codfw.wmnet mw2297.codfw.wmnet kubernetes2016.codfw.wmnet - T354855 [14:08:27] I'm just waiting a moment because of those jobqueue errors, there's quite a few [14:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:30] T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855 [14:10:20] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1033.eqiad.wmnet with reason: host reimage [14:10:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005109 (https://phabricator.wikimedia.org/T357991) (owner: 10Stang) [14:10:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:11:15] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:11:25] (SystemdUnitFailed) resolved: (5) httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:11:42] (03PS3) 10JMeybohm: New upstream version v1.0.0-8522c38 [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616) [14:11:47] (03Merged) 10jenkins-bot: zhwiki: Create group ipblock-exempt-grantor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005109 (https://phabricator.wikimedia.org/T357991) (owner: 10Stang) [14:12:11] !log samtar@deploy2002 Started scap: Backport for [[gerrit:1005109|zhwiki: Create group ipblock-exempt-grantor (T357991)]] [14:12:19] T357991: Create ipblock exempt granter group on zhwiki - https://phabricator.wikimedia.org/T357991 [14:12:23] (03CR) 10JMeybohm: New upstream version v1.0.0-8522c38 (031 comment) [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [14:13:00] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9563515 (10Jhancock.wm) [14:13:38] These jobqueue errors started exactly when Dreamy_Jazz re-started the MediaModeration scanning script [14:13:41] !log samtar@deploy2002 
stang and samtar: Backport for [[gerrit:1005109|zhwiki: Create group ipblock-exempt-grantor (T357991)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:13:44] koi: ready for testing on mwdebug [14:13:49] looking [14:14:07] I wonder if there's a link, hnowlan did you see that kind of correlation before when we had error spikes on the jobrunners [14:14:20] It's all "Could not enqueue jobs" errors [14:14:36] (03CR) 10JMeybohm: "recheck" [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [14:15:04] TheresNoTime, lgtm [14:15:07] (03PS1) 10Brouberol: Add a sidecar pod to superset for serving static assets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis) [14:15:08] 10SRE, 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T357445#9563522 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm alerts cleared [14:15:10] (03CR) 10Brouberol: "LGTM, except for some required image tag updates in the helmfile values." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis) [14:15:15] !log samtar@deploy2002 stang and samtar: Continuing with sync [14:15:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T355609)', diff saved to https://phabricator.wikimedia.org/P57559 and previous config saved to /var/cache/conftool/dbconfig/20240221-141523-marostegui.json [14:15:29] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [14:15:34] anzx: you're up next, just noticed that https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1005085 is a WIP? 
[14:15:51] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:15:55] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:16:14] (03PS3) 10Anzx: cswiki, commonswiki, enwiki: Lift IP cap for WikiGap Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005085 [14:16:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57560 and previous config saved to /var/cache/conftool/dbconfig/20240221-141615-arnaudb.json [14:16:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P57561 and previous config saved to /var/cache/conftool/dbconfig/20240221-141627-arnaudb.json [14:16:43] TheresNoTime: now marked it as active [14:16:50] ack :) [14:16:58] (03PS4) 10Samtar: mywiki: create portal and draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990077 (https://phabricator.wikimedia.org/T352424) (owner: 10Anzx) [14:17:03] (03PS4) 10Samtar: cswiki, commonswiki, enwiki: Lift IP cap for WikiGap Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005085 (owner: 10Anzx) [14:17:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57562 and previous config saved to /var/cache/conftool/dbconfig/20240221-141734-arnaudb.json [14:19:40] (03PS4) 10JMeybohm: New upstream version v1.0.0-8522c38 [debs/prometheus-rsyslog-exporter] - 
10https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616) [14:20:43] claime: Where is the data for errors with the job queue? [14:20:53] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "new apt server in codfw - jmm@cumin2002 - T331613" [14:20:58] T331613: Migrate apt repository to bookworm - https://phabricator.wikimedia.org/T331613 [14:21:05] Oh it might be logstash? [14:21:16] Dreamy_Jazz: https://logstash.wikimedia.org/goto/215d37acfb142f299fd51816688e0ea6 [14:21:18] yep [14:21:30] I hadn't seen anything on the grafana dashboards [14:21:36] So wondered where it was [14:21:38] Thanks. [14:21:59] I'm not seeing anything in the jobqueue dashboard either [14:22:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "new apt server in codfw - jmm@cumin2002 - T331613" [14:22:05] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2016 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:22:28] The MediaModeration scanning script is running fine according to the dashboard for it and the events should only be occurring on commonswiki if it was that script. 
[14:22:49] https://grafana.wikimedia.org/d/STSXVVdSk/mediamoderation-photodna-stats?orgId=1&refresh=5m [14:22:58] Dreamy_Jazz: ack [14:23:17] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:1005109|zhwiki: Create group ipblock-exempt-grantor (T357991)]] (duration: 11m 05s) [14:23:21] koi: live [14:23:22] T357991: Create ipblock exempt granter group on zhwiki - https://phabricator.wikimedia.org/T357991 [14:23:28] ty [14:23:29] sorry, looking now - wonder if this is a rerun of the eventgate issues we saw yesterday [14:23:30] I was basing it on timing [14:23:49] 10ops-codfw, 10serviceops: Degraded RAID on mw2442 - https://phabricator.wikimedia.org/T357380#9563553 (10Jhancock.wm) SR185570210 requested replacement disk from dell [14:24:06] hnowlan: did they manifest in some way on the eventgate grafana dashboard? [14:24:06] I intend to continue deploying, is that okay? [14:24:18] claime: annoyingly no, only on the envoy telemetry [14:24:20] TheresNoTime: yeah yeah go ahead [14:24:23] looks clear on the eventgate side [14:24:23] :) [14:24:30] anzx: going to run your two patches together [14:24:36] Ok [14:24:40] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host es2026.codfw.wmnet with OS bookworm [14:24:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005085 (owner: 10Anzx) [14:24:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990077 (https://phabricator.wikimedia.org/T352424) (owner: 10Anzx) [14:25:15] (03CR) 10Ssingh: "Sorry, I forgot this in the earlier review: we will need you to be in the ops group as well here." 
[puppet] - 10https://gerrit.wikimedia.org/r/1005122 (owner: 10CDobbins) [14:25:37] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: db2097 rebooted itself - https://phabricator.wikimedia.org/T357878#9563556 (10Jhancock.wm) The last maintenance I'm aware of on that machine was on the 15th. We migrated the server to the new leaf switch. I am not aware of any reason it would have be... [14:25:44] (03Merged) 10jenkins-bot: cswiki, commonswiki, enwiki: Lift IP cap for WikiGap Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005085 (owner: 10Anzx) [14:25:46] (03Merged) 10jenkins-bot: mywiki: create portal and draft namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/990077 (https://phabricator.wikimedia.org/T352424) (owner: 10Anzx) [14:25:56] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T357944#9563560 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact. [14:26:14] !log samtar@deploy2002 Started scap: Backport for [[gerrit:1005085|cswiki, commonswiki, enwiki: Lift IP cap for WikiGap Editathon]], [[gerrit:990077|mywiki: create portal and draft namespace (T352424)]] [14:26:19] T352424: Create Portal and Draft namespaces in mywiki - https://phabricator.wikimedia.org/T352424 [14:26:29] (03PS17) 10MVernon: convert-disks: update cookbook to reimage ms-be with new partition schema [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [14:26:59] (03CR) 10MVernon: convert-disks: update cookbook to reimage ms-be with new partition schema (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [14:27:04] looks like the spiking jobrunner errors are mostly cirrussearch related [14:27:43] !log samtar@deploy2002 samtar and anzx: Backport for [[gerrit:1005085|cswiki, commonswiki, enwiki: Lift IP cap for WikiGap Editathon]], [[gerrit:990077|mywiki: create 
portal and draft namespace (T352424)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:27:53] TheresNoTime: testing [14:27:57] ack [14:28:26] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence-Backup: db2097 rebooted itself - https://phabricator.wikimedia.org/T357878#9563594 (10jcrespo) 05Open→03Resolved Thanks for the reply, the reboot happened on the 17, so no relation to that. Host has been repopulated from backups, new stale backups gener... [14:28:38] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:28:41] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2046.codfw.wmnet, kubernetes2007.codfw.wmnet, mw2420.codfw.wmnet, mw2378.codfw.wmnet, kubernetes2032.codfw.wmnet, mw2312.codfw.wmnet, mw2356.codfw.wmnet, mw2423.codfw.wmnet, kubernetes2024.codfw.wmnet, mw2447.codfw.wmnet, mw2370.codfw.wmnet, mw2421.codfw.wmnet, kubernetes2048.codfw.wmnet, kubernetes2028.codfw.wmnet, m [14:28:41] fw.wmnet, mw2437.codfw.wmnet, mw2445.codfw.wmnet, mw2381.codfw.wmnet, mw2435.codfw.wmnet, kubernetes2018.codfw.wmnet, mw2318.codfw.wmnet, kubernetes2050.codfw.wmnet, kubernetes2023.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2005.codfw.wmnet, mw2366.codfw.wmnet, mw2425.codfw.wmnet, mw2430.codfw.wmnet, kubernetes2042.codfw.wmnet, kubernetes2041.codfw.wmnet, kubernetes2053.codfw.wmnet, mw2354.codfw.wmnet, kubernetes2057.codfw.wmnet, [14:28:41] es2060.codfw.wmnet, mw2350.codfw.wmnet, kubernetes2058.codfw.wmnet, mw2282.codfw.wmnet, kubernetes2054.codfw.wmnet, kubernetes2020.codfw.wmnet, mw2436.codfw.wmnet, mw2310.codfw.wmnet, k https://wikitech.wikimedia.org/wiki/PyBal [14:28:43] PROBLEM - PyBal backends health check on lvs2013 is 
CRITICAL: PYBAL CRITICAL - CRITICAL - swift-https_443: Servers ms-fe2010.codfw.wmnet are marked down but pooled: thumbor_8800: Servers mw2424.codfw.wmnet, mw2420.codfw.wmnet, mw2378.codfw.wmnet, mw2294.codfw.wmnet, kubernetes2024.codfw.wmnet, mw2447.codfw.wmnet, mw2370.codfw.wmnet, kubernetes2034.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2048.codfw.wmnet, kubernetes2016.codfw.w [14:28:43] 435.codfw.wmnet, kubernetes2018.codfw.wmnet, mw2297.codfw.wmnet, kubernetes2050.codfw.wmnet, mw2431.codfw.wmnet, kubernetes2055.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2025.codfw.wmnet, kubernetes2030.codfw.wmnet, kubernetes2039.codfw.wmnet, kubernetes2054.codfw.wmnet, mw2434.codfw.wmnet, kubernetes2020.codfw.wmnet, mw2449.codfw.wmnet, mw2368.codfw.wmnet, mw2356.codfw.wmnet, mw2429.codfw.wmnet, mw24 [14:28:43] wmnet, kubernetes2042.codfw.wmnet, kubernetes2013.codfw.wmnet, mw2406.codfw.wmnet, mw2267.codfw.wmnet, kubernetes2044.codfw.wmnet, mw2317.codfw.wmnet, kubernetes2051.codfw.wmnet, mw2380 https://wikitech.wikimedia.org/wiki/PyBal [14:28:55] er [14:28:56] .... [14:28:57] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:28:58] TheresNoTime: looks good [14:29:00] yikes [14:29:05] page! [14:29:19] I have ACKed [14:29:25] anzx: not going to continue the sync at the moment per ^ [14:29:30] yeah thanks [14:29:38] queues are full [14:29:41] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:29:43] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:29:44] in codfw [14:29:59] thumbor queues? 
[14:30:02] qps is up 10x or so [14:30:03] yeah [14:30:19] for the short term I'll add more replicas [14:30:23] thanks for acking, I'm also looking [14:30:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P57563 and previous config saved to /var/cache/conftool/dbconfig/20240221-143030-marostegui.json [14:30:34] hnowlan: thanks and looks like quite the spike [14:30:55] (03PS1) 10Hnowlan: thumbor: add replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005517 [14:31:00] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1005517 [14:31:06] heh bit redundant [14:31:08] (03CR) 10Ssingh: [C: 03+1] thumbor: add replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005517 (owner: 10Hnowlan) [14:31:17] big ghostscript spike apparently [14:31:19] (03CR) 10Alexandros Kosiaris: [C: 03+1] thumbor: add replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005517 (owner: 10Hnowlan) [14:31:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1213 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57564 and previous config saved to /var/cache/conftool/dbconfig/20240221-143120-arnaudb.json [14:31:27] can someone check the jobqueue for a spike also [14:31:28] (03CR) 10Jelto: [C: 03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005517 (owner: 10Hnowlan) [14:31:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P57565 and previous config saved to /var/cache/conftool/dbconfig/20240221-143133-arnaudb.json [14:31:36] (03CR) 10MVernon: [C: 03+2] convert-disks: update cookbook to reimage ms-be with new partition schema [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [14:32:10] hnowlan: not sure if this helps but the increased / expensive codfw thumbor 
traffic looks to be about 80% ghostscript 20% djvu [14:32:20] (03CR) 10Hnowlan: [C: 03+2] thumbor: add replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005517 (owner: 10Hnowlan) [14:32:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57566 and previous config saved to /var/cache/conftool/dbconfig/20240221-143239-arnaudb.json [14:32:49] that probably points to something automated [14:33:09] (03Merged) 10jenkins-bot: thumbor: add replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005517 (owner: 10Hnowlan) [14:33:12] TheresNoTime: ok will wait, would be possible to take a look at T356686 [14:33:12] T356686: or.wikipedia - Allowing only logged-in users with over 10 edits to create new articles - https://phabricator.wikimedia.org/T356686 [14:33:15] RECOVERY - Check whether ferm is active by checking the default input chain on mw2297 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:33:18] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:33:21] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:33:30] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:33:38] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:33:50] yeah queues already back down [14:33:57] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:34:15] anzx: I'll put T356686 on my todo for later :) [14:34:21] Thanks [14:34:22] and pa.ge resolved again. [14:34:25] thanks hnowlan [14:34:41] (NELNotReported) firing: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [14:34:46] I'm not seeing a job spike [14:34:51] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:34:56] yeah mediamoderation jobs didn't spike or anything [14:35:06] possibly some kind of commons bulk upload maybe [14:35:44] (03CR) 10Majavah: [C: 03+2] "OK! I will update here when this is deployed everywhere" [puppet] - 10https://gerrit.wikimedia.org/r/1005438 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [14:35:53] (03Merged) 10jenkins-bot: convert-disks: update cookbook to reimage ms-be with new partition schema [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [14:36:05] (03CR) 10Marostegui: "Thank you - much appreciated" [puppet] - 10https://gerrit.wikimedia.org/r/1005438 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [14:36:18] sukhe: can I continue with the deployment window? 
[14:36:39] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1005511 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [14:36:55] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:37:07] thumbor is still processing a lot of images but the queues are ok with the replicas increase [14:37:16] I'd say you can go ahead TheresNoTime [14:37:18] ok [14:37:21] ack [14:37:24] !log samtar@deploy2002 samtar and anzx: Continuing with sync [14:37:32] I was going to say wait a bit but should be fine [14:37:36] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for private1-a-codfw - cmooney@cumin1002" [14:38:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for private1-a-codfw - cmooney@cumin1002" [14:38:27] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:38:37] (JobUnavailable) firing: (5) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:24] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T358091#9563663 (10Pppery) [14:39:41] (NELNotReported) resolved: NEL metrics not reported - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELNotReported [14:40:02] !log aborrero@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
cloudvirt1033.eqiad.wmnet with OS bookworm [14:40:10] 10SRE, 10ops-codfw, 10serviceops: Issues reimaging servers in codfw - https://phabricator.wikimedia.org/T358001#9563665 (10Jhancock.wm) @hnowlan I've replaced the network cable on both of these. These are both connected to a 1G switch so there is no SFP to replace in this case. If this does not fix the iss... [14:40:20] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10netops, and 2 others: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9563666 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by aborrero@cumin1002 for host cloudvirt1033.eqiad.wmnet with OS book... [14:42:04] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2026.codfw.wmnet with reason: host reimage [14:42:26] (03CR) 10MVernon: [C: 03+2] convert-disks: update cookbook to reimage ms-be with new partition schema (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/859470 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [14:43:25] (SystemdUnitFailed) firing: (2) send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:43:36] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:44:32] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:44:50] I suspect the thumbor choke was a bunch of djvu/pdf files being thumbnailed all at once [14:44:56] I dunno what causes a surge like that though [14:44:59] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2026.codfw.wmnet with reason: host reimage [14:45:29] it's trending up again [14:45:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P57567 and 
previous config saved to /var/cache/conftool/dbconfig/20240221-144536-marostegui.json [14:46:37] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:1005085|cswiki, commonswiki, enwiki: Lift IP cap for WikiGap Editathon]], [[gerrit:990077|mywiki: create portal and draft namespace (T352424)]] (duration: 20m 23s) [14:46:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T357189)', diff saved to https://phabricator.wikimedia.org/P57568 and previous config saved to /var/cache/conftool/dbconfig/20240221-144641-arnaudb.json [14:46:43] T352424: Create Portal and Draft namespaces in mywiki - https://phabricator.wikimedia.org/T352424 [14:46:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance [14:46:43] anzx: live, going to run those namespaceDupes now [14:46:48] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:46:54] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:46:56] hnowlan: thumbnailrender didn't spike, but run duration is going up [14:46:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1236.eqiad.wmnet with reason: Maintenance [14:47:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1236 (T357189)', diff saved to https://phabricator.wikimedia.org/P57569 and previous config saved to /var/cache/conftool/dbconfig/20240221-144702-arnaudb.json [14:47:35] !log [samtar@mwmaint2002 ~]$ mwscript namespaceDupes.php --wiki hewikinews --fix #T349581 [14:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:40] T349581: Create draft namespace and add namespaces aliases for hewikinews - https://phabricator.wikimedia.org/T349581 [14:47:57] (ProbeDown) firing: Service 
thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:48:03] here we go again [14:48:05] claime: yeah, thumbnailrender doesn't get called for sub-pages of doc [14:48:07] 5xx rate is going up again for thumbor [14:48:14] last apply failed for thumbor btw [14:48:27] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:48:29] trying again [14:48:33] not enough resources? [14:48:36] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:48:52] * kamila_ here if you need more hands [14:48:57] quota exceeded [14:49:04] yes that's what I mean [14:49:05] ACKed in [14:49:06] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:49:07] again [14:49:28] TheresNoTime: also namespacedupes for mywiki, thanks [14:49:28] Would y'all like me to stop the handful of namespaceDupes runs I need to? [14:49:30] we can reintroduce expensive format throttling [14:51:23] 10SRE, 10Content-Transform-Team, 10MW-on-K8s, 10Traffic, and 2 others: Create parsoid mediawiki deployment - https://phabricator.wikimedia.org/T357392#9563708 (10akosiaris) 05Open→03In progress p:05Triage→03Medium [14:51:27] I don't think namespaceDupes has an impact on thumbnailing, does it? 
[14:51:31] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9563710 (10akosiaris) [14:51:53] hnowlan: we may need to :/ [14:51:57] (03PS1) 10Hnowlan: thumbor: reenable expensive throttling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005519 [14:52:09] it's broken but it's broken in our favour [14:52:16] (03CR) 10Clément Goubert: [C: 03+1] thumbor: reenable expensive throttling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005519 (owner: 10Hnowlan) [14:52:32] anzx: all runs complete [14:52:36] (inc. mywiki) [14:52:42] I would very much like to know what is causing these spikes [14:52:44] 5xx going down again, similar to last time [14:52:54] TheresNoTime: thank you [14:52:57] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:52:57] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:52:59] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [14:53:14] (03CR) 10Hnowlan: [C: 03+2] thumbor: reenable expensive throttling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005519 (owner: 10Hnowlan) [14:53:17] hoo: would it be okay for you to reschedule your deployment? 
We're running over and ideally I think no more deploys would be good [14:53:35] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for private1-a-codfw - cmooney@cumin1002" [14:53:57] Yeah, I need to depool some stuff for the upcoming network migration as well, so if we could reschedule that deployment for a later window it'd be great [14:54:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T357189)', diff saved to https://phabricator.wikimedia.org/P57570 and previous config saved to /var/cache/conftool/dbconfig/20240221-145407-arnaudb.json [14:54:13] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:54:21] (03Merged) 10jenkins-bot: thumbor: reenable expensive throttling [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005519 (owner: 10Hnowlan) [14:54:26] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for private1-a-codfw - cmooney@cumin1002" [14:54:26] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:54:58] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [14:55:15] TheresNoTime: Sure… I guess I can go for the morning SWAT tomorrow [14:55:28] Appreciate it, thank you :-) [14:55:34] thanks :) [14:55:49] !log UTC afternoon backport window done [14:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:16] !log adding IRB anycast interface on private1-b-codfw vlan to spine and leaf switches codfw row B [14:57:46] (03CR) 10Muehlenhoff: [C: 03+2] Drop obsolete motd for cumin1001 [puppet] - 10https://gerrit.wikimedia.org/r/1005511 (https://phabricator.wikimedia.org/T353419) (owner: 10Muehlenhoff) [14:58:03] (03PS1) 10Hnowlan: thumbor: reduce per-pod memory 
limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005520 [14:58:19] ^ if someone has a sec, will make scale-ups easier [14:58:32] (03CR) 10Clément Goubert: [C: 03+1] thumbor: reduce per-pod memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005520 (owner: 10Hnowlan) [14:58:37] (JobUnavailable) firing: (5) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:39] (03CR) 10Ssingh: [C: 03+1] thumbor: reduce per-pod memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005520 (owner: 10Hnowlan) [14:59:04] Gimme a sec to drain a couple k8s nodes before deploying hnowlan please [14:59:06] !log Draining kubernetes2025.codfw.wmnet kubernetes2026.codfw.wmnet for codfw A8 network migration - T355874 [14:59:08] (03CR) 10Hnowlan: [C: 03+2] thumbor: reduce per-pod memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005520 (owner: 10Hnowlan) [14:59:14] claime: ack [15:00:03] (03Merged) 10jenkins-bot: thumbor: reduce per-pod memory limits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005520 (owner: 10Hnowlan) [15:00:04] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1500) [15:00:13] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:00:29] hnowlan: all good [15:00:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T355609)', diff saved to https://phabricator.wikimedia.org/P57571 and previous config saved to /var/cache/conftool/dbconfig/20240221-150043-marostegui.json [15:00:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance [15:00:59] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db2173.codfw.wmnet with reason: Maintenance [15:01:01] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance [15:01:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2186.codfw.wmnet with reason: Maintenance [15:01:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2173 (T355609)', diff saved to https://phabricator.wikimedia.org/P57572 and previous config saved to /var/cache/conftool/dbconfig/20240221-150109-marostegui.json [15:01:09] !log Depooling parse2004.codfw.wmnet parse2005.codfw.wmnet for codfw A8 network migration - T355874 [15:01:46] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [15:02:25] !log cgoubert@cumin2002 conftool action : set/pooled=inactive; selector: name=parse200(4|5).* [15:06:14] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2026.codfw.wmnet with OS bookworm [15:07:27] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [15:07:41] (03CR) 10Muehlenhoff: [C: 03+2] Switch backup1001 to Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1005436 (owner: 10Muehlenhoff) [15:07:59] (03PS11) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [15:09:11] (03CR) 10CI reject: [V: 04-1] sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [15:09:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P57573 and previous config saved to /var/cache/conftool/dbconfig/20240221-150914-arnaudb.json [15:09:25] !log hnowlan@deploy2002 
helmfile [codfw] START helmfile.d/services/thumbor: apply [15:09:43] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for private1-b-codfw - cmooney@cumin1002" [15:10:12] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:10:34] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for private1-b-codfw - cmooney@cumin1002" [15:10:35] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:12:09] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [15:12:22] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:12:23] (03CR) 10Muehlenhoff: [C: 03+2] Switch backup2001 to Puppet 7 on the role level [puppet] - 10https://gerrit.wikimedia.org/r/1005437 (owner: 10Muehlenhoff) [15:13:07] (03PS1) 10Marostegui: Revert "es2026: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1005474 [15:15:28] (03CR) 10Marostegui: [C: 03+2] Revert "es2026: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1005474 (owner: 10Marostegui) [15:18:27] (03CR) 10Majavah: [C: 03+2] "Views on all clouddb servers dropped." [puppet] - 10https://gerrit.wikimedia.org/r/1005438 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [15:18:55] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:18:57] (03CR) 10Marostegui: "thanks!!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1005438 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [15:19:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57574 and previous config saved to /var/cache/conftool/dbconfig/20240221-151909-root.json [15:19:32] (03PS2) 10Majavah: wikireplicas: maintain-views: try depooling host on lock failure [puppet] - 10https://gerrit.wikimedia.org/r/998356 (https://phabricator.wikimedia.org/T300427) [15:20:18] (03CR) 10Majavah: "So the depool+retry part in this actually works. What does not work is closing connections, either the script or (my preference) the HAPro" [puppet] - 10https://gerrit.wikimedia.org/r/998356 (https://phabricator.wikimedia.org/T300427) (owner: 10Majavah) [15:21:06] 10SRE, 10ops-codfw, 10Cassandra, 10decommission-hardware: decommission restbase20[13-20] - https://phabricator.wikimedia.org/T356695#9563766 (10Jhancock.wm) a:03Jhancock.wm [15:21:16] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host db2137.codfw.wmnet with OS bookworm [15:23:07] (03PS12) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [15:23:37] (JobUnavailable) resolved: (4) Reduced availability for job thanos-query in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:01] (03PS1) 10Muehlenhoff: Explicitly configure apt2002 for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1005524 (https://phabricator.wikimedia.org/T331613) [15:24:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236', diff saved to https://phabricator.wikimedia.org/P57575 and previous config saved to 
/var/cache/conftool/dbconfig/20240221-152420-arnaudb.json [15:26:46] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, just a nit inline" [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [15:28:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T355609)', diff saved to https://phabricator.wikimedia.org/P57576 and previous config saved to /var/cache/conftool/dbconfig/20240221-152826-marostegui.json [15:28:33] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [15:28:36] (03CR) 10Btullis: "Looks good overall. Couple of nitpicks." [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [15:32:09] (03CR) 10Muehlenhoff: [C: 03+2] Explicitly configure apt2002 for Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1005524 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [15:34:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57577 and previous config saved to /var/cache/conftool/dbconfig/20240221-153414-root.json [15:35:04] (03PS1) 10CDanis: WIP: jaeger: include oauth config in Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005546 (https://phabricator.wikimedia.org/T358111) [15:37:18] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:38:58] (03CR) 10CDanis: "I'm going to send this patch upstream, but I wanted a quick check for nitpicks from you two first." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1005546 (https://phabricator.wikimedia.org/T358111) (owner: 10CDanis) [15:39:15] (03PS13) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [15:39:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1236 (T357189)', diff saved to https://phabricator.wikimedia.org/P57578 and previous config saved to /var/cache/conftool/dbconfig/20240221-153926-arnaudb.json [15:39:29] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [15:39:42] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [15:39:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [15:40:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T355874 - depooling db2146 db2106', diff saved to https://phabricator.wikimedia.org/P57579 and previous config saved to /var/cache/conftool/dbconfig/20240221-154056-arnaudb.json [15:40:58] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:25:00 on db2146.codfw.wmnet with reason: T355874 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:41:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:25:00 on db2146.codfw.wmnet with reason: T355874 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:41:02] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:25:00 on db2106.codfw.wmnet with reason: T355874 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:41:13] T355874: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874 [15:41:15] !log arnaudb@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 0:25:00 on db2106.codfw.wmnet with reason: T355874 - Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw [15:42:40] (03PS1) 10Fabfur: haproxy: configure extended logging (preparatory for Benthos) [puppet] - 10https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) [15:43:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P57580 and previous config saved to /var/cache/conftool/dbconfig/20240221-154333-marostegui.json [15:44:48] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2137.codfw.wmnet with reason: host reimage [15:46:05] (03CR) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) (owner: 10ArielGlenn) [15:46:32] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [15:46:47] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance [15:47:04] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1004680 (owner: 10Filippo Giunchedi) [15:47:40] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2137.codfw.wmnet with reason: host reimage [15:49:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57581 and previous config saved to /var/cache/conftool/dbconfig/20240221-154918-root.json [15:49:34] (03CR) 10Clément Goubert: [C: 03+2] api-gateway: Finish migration to mw-on-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004735 (https://phabricator.wikimedia.org/T357907) (owner: 10Clément Goubert) [15:50:48] (03Merged) 10jenkins-bot: api-gateway: Finish migration to mw-on-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1004735 (https://phabricator.wikimedia.org/T357907) (owner: 10Clément Goubert) [15:51:48] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [15:52:04] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [15:52:13] (03PS5) 10JMeybohm: New package version 1.0.0+git20221110-1 [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616) [15:52:43] 10SRE, 10LDAP-Access-Requests: Grant Access to Superset for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T358091#9563891 (10WMDE-leszek) I approve the request on WMDE's behalf. 
[15:54:44] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [15:54:57] (03CR) 10JMeybohm: [C: 03+2] New package version 1.0.0+git20221110-1 [debs/prometheus-rsyslog-exporter] - 10https://gerrit.wikimedia.org/r/1005508 (https://phabricator.wikimedia.org/T357616) (owner: 10JMeybohm) [15:55:22] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [15:55:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [15:55:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance [15:56:06] 10ops-codfw, 10DC-Ops: db2137 and es2026 don't get an IP via PXE boot - https://phabricator.wikimedia.org/T357951#9563902 (10wiki_willy) ++ @Jhancock.wm for visibility and in case any onsite support is needed [15:57:59] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a8-codfw.mgmt with reason: prepping for server uplink migration codfw rack a8 [15:58:16] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-a-codfw,cr[1-2]-codfw,lsw1-a8-codfw.mgmt with reason: prepping for server uplink migration codfw rack a8 [15:58:22] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874#9563911 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c42ddc7f-d7d7-4ebc-9852-d3a5c7882e71) set by cmoon... 
[15:58:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P57582 and previous config saved to /var/cache/conftool/dbconfig/20240221-155839-marostegui.json [15:58:59] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 6 hosts with reason: Migrating servers in codfw rack A7 to lsw1-a7-codfw [15:59:06] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 6 hosts with reason: Migrating servers in codfw rack A7 to lsw1-a7-codfw [15:59:13] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874#9563912 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=da675508-2cc3-4974-a4ca-677deefc2dff) set by cmoon... [16:00:05] (03PS1) 10Clément Goubert: Revert "api-gateway: Finish migration to mw-on-k8s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005475 [16:01:10] (03CR) 10Clément Goubert: [C: 03+2] Revert "api-gateway: Finish migration to mw-on-k8s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005475 (owner: 10Clément Goubert) [16:02:00] !log Commencing network maintenance migrating servers to new switch codfw rack A8 T355874 [16:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:02:06] (03Merged) 10jenkins-bot: Revert "api-gateway: Finish migration to mw-on-k8s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005475 (owner: 10Clément Goubert) [16:02:19] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005527 [16:02:23] T355874: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874 [16:03:03] (03CR) 10JHathaway: [C: 03+2] etcd: disable the diff output for client config with passwords [puppet] - 
10https://gerrit.wikimedia.org/r/1003112 (https://phabricator.wikimedia.org/T356459) (owner: 10JHathaway) [16:03:22] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [16:03:39] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [16:04:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57583 and previous config saved to /var/cache/conftool/dbconfig/20240221-160423-root.json [16:04:25] !log cgoubert@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply [16:04:40] !log cgoubert@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [16:04:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [16:05:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2108.codfw.wmnet with reason: Maintenance [16:05:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2108 (T357189)', diff saved to https://phabricator.wikimedia.org/P57584 and previous config saved to /var/cache/conftool/dbconfig/20240221-160511-arnaudb.json [16:05:37] !log imported prometheus-rsyslog-exporter 1.0.0+git20221110-1 to buster,bullseye,bookworm - T357616 [16:06:06] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:25] T357616: Logs from containers sometimes not visible in logstash - https://phabricator.wikimedia.org/T357616 [16:09:08] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2137.codfw.wmnet with OS bookworm [16:09:12] 10ops-codfw, 10DC-Ops: db2137 and es2026 don't get an IP via PXE boot - https://phabricator.wikimedia.org/T357951#9564030 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host db2137.codfw.wmnet with OS bookworm completed: - db2137 (**WARN**) - Removed from Puppet... [16:10:41] 10ops-codfw, 10DC-Ops: db2137 and es2026 don't get an IP via PXE boot - https://phabricator.wikimedia.org/T357951#9564034 (10Marostegui) 05Open→03Resolved a:03cmooney All good, both hosts were reimaged fine. Thanks @cmooney for taking the time to explain and fix the issue. [16:10:55] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [16:11:29] I can reach it [16:11:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57585 and previous config saved to /var/cache/conftool/dbconfig/20240221-161129-arnaudb.json [16:11:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2106 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57586 and previous config saved to /var/cache/conftool/dbconfig/20240221-161136-arnaudb.json [16:12:01] let me check from the alert host [16:12:10] (03PS14) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [16:12:20] IIRC we saw this go down and recover. the question is why though. (I know we don't run this) [16:13:14] so probably something with the outside network [16:13:16] ? 
[16:13:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T355609)', diff saved to https://phabricator.wikimedia.org/P57587 and previous config saved to /var/cache/conftool/dbconfig/20240221-161345-marostegui.json [16:13:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance [16:13:50] jynus: this is hosted by Rackspace IIRC [16:14:01] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:14:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2174.codfw.wmnet with reason: Maintenance [16:14:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2174 (T355609)', diff saved to https://phabricator.wikimedia.org/P57588 and previous config saved to /var/cache/conftool/dbconfig/20240221-161407-marostegui.json [16:16:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T357189)', diff saved to https://phabricator.wikimedia.org/P57589 and previous config saved to /var/cache/conftool/dbconfig/20240221-161615-arnaudb.json [16:16:33] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:16:54] (03PS2) 10Fabfur: haproxy: configure extended logging (preparatory for Benthos) [puppet] - 10https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) [16:17:01] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 22.26 ms [16:18:11] (03CR) 10Clément Goubert: [C: 03+2] logstash_checker.py: Add ability to check all MediaWiki canaries at once [puppet] - 10https://gerrit.wikimedia.org/r/1003885 (https://phabricator.wikimedia.org/T357402) (owner: 10Ahmon Dancy) [16:18:15] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1420/console" [puppet] - 
10https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) (owner: 10Fabfur) [16:18:34] alert1001:~$ ping wikitech-static.wikimedia.org is now working [16:19:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57590 and previous config saved to /var/cache/conftool/dbconfig/20240221-161928-root.json [16:21:03] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [16:22:07] (03PS3) 10Fabfur: haproxy: configure extended logging (preparatory for Benthos) [puppet] - 10https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) [16:22:41] 10SRE, 10MW-on-K8s, 10Scap, 10serviceops, 10Release-Engineering-Team (Now this 🫠): Find a way to address canary releases directly - https://phabricator.wikimedia.org/T358117#9564123 (10Clement_Goubert) [16:23:47] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1421/co" [puppet] - 10https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) (owner: 10Fabfur) [16:24:07] 10SRE, 10MW-on-K8s, 10Scap, 10serviceops, 10Release-Engineering-Team (Now this 🫠): Find a way to address canary releases directly - https://phabricator.wikimedia.org/T358117#9564137 (10Clement_Goubert) p:05Triage→03Medium [16:24:40] !log Repooling parse2004.codfw.wmnet parse2005.codfw.wmnet following codfw A8 network migration - T355874 [16:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:50] !log cgoubert@cumin2002 conftool action : set/pooled=yes; selector: name=parse200(4|5).* [16:24:54] T355874: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874 [16:24:58] I confirm it is a Rackspace issue, as it is the last hop that fails, not anything in between 
[16:25:18] !log Uncordoning kubernetes2025.codfw.wmnet kubernetes2026.codfw.wmnet following codfw A8 network migration - T355874 [16:25:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:00] 10SRE, 10ops-codfw, 10DBA, 10Infrastructure-Foundations, 10netops: Migrate servers in codfw rack A8 from asw-a8-codfw to lsw1-a8-codfw - https://phabricator.wikimedia.org/T355874#9564146 (10cmooney) All hosts moved without issue, thanks Jenn! [16:26:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57591 and previous config saved to /var/cache/conftool/dbconfig/20240221-162635-arnaudb.json [16:26:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2106 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57592 and previous config saved to /var/cache/conftool/dbconfig/20240221-162641-arnaudb.json [16:29:29] 10SRE, 10MW-on-K8s, 10Scap, 10serviceops, 10Release-Engineering-Team (Now this 🫠): Find a way to address canary releases directly - https://phabricator.wikimedia.org/T358117#9564163 (10Clement_Goubert) [16:31:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P57593 and previous config saved to /var/cache/conftool/dbconfig/20240221-163122-arnaudb.json [16:34:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2026 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P57594 and previous config saved to /var/cache/conftool/dbconfig/20240221-163433-root.json [16:35:52] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 22.31 ms [16:41:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57595 and previous config saved to 
/var/cache/conftool/dbconfig/20240221-164140-arnaudb.json [16:41:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2106 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57596 and previous config saved to /var/cache/conftool/dbconfig/20240221-164146-arnaudb.json [16:41:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T355609)', diff saved to https://phabricator.wikimedia.org/P57597 and previous config saved to /var/cache/conftool/dbconfig/20240221-164150-marostegui.json [16:42:18] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:46:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P57598 and previous config saved to /var/cache/conftool/dbconfig/20240221-164628-arnaudb.json [16:46:54] (03CR) 10C. Scott Ananian: Turn on Parsoid read views by default on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999062 (https://phabricator.wikimedia.org/T355566) (owner: 10C. Scott Ananian) [16:46:58] (03PS5) 10C. 
Scott Ananian: Turn on Parsoid read views by default on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999062 (https://phabricator.wikimedia.org/T355566) [16:47:28] (03PS1) 10Ssingh: P:dns::auth: update confd keys to reflect new schema [puppet] - 10https://gerrit.wikimedia.org/r/1005559 (https://phabricator.wikimedia.org/T347054) [16:49:36] (03PS2) 10Ssingh: P:dns::auth: update confd keys to reflect new schema [puppet] - 10https://gerrit.wikimedia.org/r/1005559 (https://phabricator.wikimedia.org/T347054) [16:51:22] (03PS3) 10Ssingh: P:dns::auth: update confd keys to reflect new schema [puppet] - 10https://gerrit.wikimedia.org/r/1005559 (https://phabricator.wikimedia.org/T347054) [16:52:32] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1423/co" [puppet] - 10https://gerrit.wikimedia.org/r/1005559 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:55:25] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9564307 (10Clement_Goubert) [16:56:02] (03CR) 10Volans: "Looks good! Hint for the failing test inline." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/979040 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [16:56:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2146 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57599 and previous config saved to /var/cache/conftool/dbconfig/20240221-165644-arnaudb.json [16:56:51] (03PS2) 10Btullis: Add an nginx reverse proxy to superset to help with serving static assets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890) [16:56:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2106 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P57600 and previous config saved to /var/cache/conftool/dbconfig/20240221-165651-arnaudb.json [16:56:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P57601 and previous config saved to /var/cache/conftool/dbconfig/20240221-165657-marostegui.json [16:57:31] (03CR) 10CI reject: [V: 04-1] Add an nginx reverse proxy to superset to help with serving static assets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis) [16:58:22] (03PS1) 10Jclark-ctr: add an-redacteddb1001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1005560 (https://phabricator.wikimedia.org/T355571) [17:00:12] (03CR) 10Jclark-ctr: [C: 03+2] add an-redacteddb1001 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1005560 (https://phabricator.wikimedia.org/T355571) (owner: 10Jclark-ctr) [17:01:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T357189)', diff saved to https://phabricator.wikimedia.org/P57602 and previous config saved to /var/cache/conftool/dbconfig/20240221-170134-arnaudb.json [17:01:37] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 
0:00:00 on db2120.codfw.wmnet with reason: Maintenance [17:01:51] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2120.codfw.wmnet with reason: Maintenance [17:01:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2120 (T357189)', diff saved to https://phabricator.wikimedia.org/P57603 and previous config saved to /var/cache/conftool/dbconfig/20240221-170157-arnaudb.json [17:01:58] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [17:09:08] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-redacteddb1001.eqiad.wmnet with OS bullseye [17:09:24] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9564382 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS bullseye [17:12:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P57604 and previous config saved to /var/cache/conftool/dbconfig/20240221-171203-marostegui.json [17:14:51] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T357790#9564401 (10VRiley-WMF) a:03VRiley-WMF [17:15:08] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9564397 (10Jclark-ctr) a:05Jclark-ctr→03VRiley-WMF [17:15:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T357189)', diff saved to https://phabricator.wikimedia.org/P57605 and previous config saved to /var/cache/conftool/dbconfig/20240221-171521-arnaudb.json [17:15:27] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [17:18:52] 10SRE, 
10ops-eqiad, 10Cloud-VPS, 10cloud-services-team (Hardware): Cloudvirt1063.eqiad.wmnet overheating - https://phabricator.wikimedia.org/T353408#9564424 (10Jclark-ctr) 05Open→03Resolved closing ticket 7 days no faults [17:27:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T355609)', diff saved to https://phabricator.wikimedia.org/P57606 and previous config saved to /var/cache/conftool/dbconfig/20240221-172709-marostegui.json [17:27:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance [17:27:18] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:27:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2176.codfw.wmnet with reason: Maintenance [17:27:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2176 (T355609)', diff saved to https://phabricator.wikimedia.org/P57607 and previous config saved to /var/cache/conftool/dbconfig/20240221-172731-marostegui.json [17:30:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P57608 and previous config saved to /var/cache/conftool/dbconfig/20240221-173028-arnaudb.json [17:34:22] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9564556 (10VRiley-WMF) [17:45:03] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9564592 (10VRiley-WMF) 05Open→03In progress [17:45:14] 10SRE, 10ops-eqiad, 10DC-Ops: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9564594 (10Jclark-ctr) a:05VRiley-WMF→03BTullis @BTullis this is a custom configuration and i am not having any luck with imaging 20 disk raid10. 
if you can assist, thank you [17:45:16] 10SRE, 10ops-eqiad, 10DC-Ops: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9564597 (10Jclark-ctr) [17:45:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P57609 and previous config saved to /var/cache/conftool/dbconfig/20240221-174534-arnaudb.json [17:48:43] (03PS4) 10Kamila Součková: Create a shellbox deployment for videoscalers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) [17:48:45] (03CR) 10Kamila Součková: "I thought about it and decided to go with video because video files typically contain audio, so it's a superset in my mind. I thought abou" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) (owner: 10Kamila Součková) [17:49:18] (03CR) 10BCornwall: [V: 03+1] fifo-log-demux: Decouple service from nginx/ats (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall) [17:49:53] (03PS5) 10Kamila Součková: Create a shellbox deployment for videoscalers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) [17:50:18] (03PS6) 10Kamila Součková: Create a shellbox deployment for videoscalers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1003446 (https://phabricator.wikimedia.org/T357309) [17:56:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T355609)', diff saved to https://phabricator.wikimedia.org/P57610 and previous config saved to /var/cache/conftool/dbconfig/20240221-175601-marostegui.json [17:56:07] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:59:56] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2057:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2057 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1800) [18:00:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T357189)', diff saved to https://phabricator.wikimedia.org/P57611 and previous config saved to /var/cache/conftool/dbconfig/20240221-180041-arnaudb.json [18:00:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [18:00:50] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [18:00:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2121.codfw.wmnet with reason: Maintenance [18:01:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2121 (T357189)', diff saved to https://phabricator.wikimedia.org/P57612 and previous config saved to /var/cache/conftool/dbconfig/20240221-180103-arnaudb.json [18:03:19] (03PS3) 10Btullis: Add an nginx reverse proxy to superset to help with serving static assets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890) [18:04:23] (03CR) 10Btullis: Add an nginx reverse proxy to superset to help with serving static assets (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890) (owner: 10Btullis) [18:04:49] (03CR) 10Ssingh: fifo-log-demux: Decouple service from nginx/ats (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall) [18:09:11] (03CR) 10BCornwall: "I'm a little skeptical that we need a script for checking package versions. 
Don't we already have monitoring for continuous Puppet runs? T" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [18:11:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P57613 and previous config saved to /var/cache/conftool/dbconfig/20240221-181107-marostegui.json [18:12:01] (03CR) 10Ssingh: "I am not sure how Puppet would be able to do that and also to send out alerts on IRC after checking and comparing the versions. If you hav" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [18:15:09] (03PS4) 10Btullis: Add an nginx reverse proxy to superset to help with serving static assets [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005495 (https://phabricator.wikimedia.org/T357890) [18:17:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T357189)', diff saved to https://phabricator.wikimedia.org/P57614 and previous config saved to /var/cache/conftool/dbconfig/20240221-181729-arnaudb.json [18:17:39] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [18:18:39] (03CR) 10BCornwall: "I was referring to that more broad alert along the lines of "Puppet changing every run". If we specify/install a version here but somethin" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [18:19:19] (03PS1) 10Joal: Absent some reportupdater systemd-timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/1005565 (https://phabricator.wikimedia.org/T357419) [18:20:29] (03CR) 10CI reject: [V: 04-1] Absent some reportupdater systemd-timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/1005565 (https://phabricator.wikimedia.org/T357419) (owner: 10Joal) [18:20:42] (03CR) 10Ssingh: "Oh that way. Yeah but we don't want to have Puppet control over the installations of varnish. 
Doing so clears the cache and is generally t" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [18:26:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P57615 and previous config saved to /var/cache/conftool/dbconfig/20240221-182614-marostegui.json [18:28:20] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9564755 (10VRiley-WMF) [18:29:48] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9561242 (10VRiley-WMF) These servers have been unracked and ran the decommission script on them. [18:30:03] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission cloudelastic100[1-4].wikimedia.org - https://phabricator.wikimedia.org/T358046#9564778 (10VRiley-WMF) 05In progress→03Resolved [18:31:56] (03CR) 10Majavah: [C: 04-1] "-1 on introducing a new Icinga check, anything new should be in Prometheus. 
However I wonder whether it'd be a better idea to enforce the " [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [18:32:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P57616 and previous config saved to /var/cache/conftool/dbconfig/20240221-183236-arnaudb.json [18:35:47] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T357790#9564792 (10VRiley-WMF) 05Open→03In progress [18:38:26] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T357790#9564807 (10VRiley-WMF) [18:41:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T355609)', diff saved to https://phabricator.wikimedia.org/P57617 and previous config saved to /var/cache/conftool/dbconfig/20240221-184120-marostegui.json [18:41:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2188.codfw.wmnet with reason: Maintenance [18:41:30] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [18:41:30] (03PS1) 10Jdlrobson: Remove Japanese Wikipedia from projects sharing user scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005569 (https://phabricator.wikimedia.org/T301212) [18:41:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2188.codfw.wmnet with reason: Maintenance [18:41:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T355609)', diff saved to https://phabricator.wikimedia.org/P57618 and previous config saved to /var/cache/conftool/dbconfig/20240221-184144-marostegui.json [18:42:25] (03CR) 10Ssingh: "That already exists in the varnishkafka deb package. 
This task is about making sure that the correct versions (individually) are installed" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [18:43:26] (SystemdUnitFailed) firing: (2) send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:44:12] (03CR) 10RLazarus: [C: 03+2] k8s-controller-sidecars: Bump the pod's memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005219 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [18:46:55] (03Merged) 10jenkins-bot: k8s-controller-sidecars: Bump the pod's memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005219 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [18:47:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P57619 and previous config saved to /var/cache/conftool/dbconfig/20240221-184743-arnaudb.json [18:48:36] (PuppetZeroResources) firing: Puppet has failed generate resources on ncmonitor1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [18:48:58] (03CR) 10Ssingh: "I am not a big fan of this patch as well but I don't think "Puppet changing every run" and a message as broad as that is a good solution. 
" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [18:49:22] (03PS1) 10Jdlrobson: Enable night mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005570 (https://phabricator.wikimedia.org/T357759) [18:50:02] (03CR) 10Majavah: [C: 04-1] "The `Depends` field is enforced by Apt during package installation time, and it will refuse to install or upgrade any package in a way tha" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [18:53:01] (03CR) 10Ssingh: "The requirement here is that individual packages (starting with these but maybe more) should adhere to a fixed version definition. If we t" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [18:55:29] (03CR) 10Majavah: [C: 04-1] "My understanding is that the issue this was trying to detect was that a `varnishkafka` version installed would not be compatible with the " [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [18:59:23] (03CR) 10Ssingh: "Right but only if we are talking about ordering/dependency, in which case I agree that varnishkafka should (does) specify a stricter order" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [19:00:05] jeena and brennen: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T1900). 
[19:02:29] The train is blocked currently [19:02:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T357189)', diff saved to https://phabricator.wikimedia.org/P57620 and previous config saved to /var/cache/conftool/dbconfig/20240221-190249-arnaudb.json [19:02:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [19:02:56] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [19:03:05] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2122.codfw.wmnet with reason: Maintenance [19:03:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T357189)', diff saved to https://phabricator.wikimedia.org/P57621 and previous config saved to /var/cache/conftool/dbconfig/20240221-190311-arnaudb.json [19:06:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T355609)', diff saved to https://phabricator.wikimedia.org/P57622 and previous config saved to /var/cache/conftool/dbconfig/20240221-190637-marostegui.json [19:06:53] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:07:18] (03PS1) 10Bking: rdf-streaming-updater: restore from savepoint (WIP) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005572 (https://phabricator.wikimedia.org/T348685) [19:08:36] (03CR) 10BCornwall: "IMO the alerting should not be on making sure arbitrary numbers match expectations but rather that the application itself is behaving as e" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [19:11:36] (03PS2) 10Joal: Absent some reportupdater systemd-timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/1005565 (https://phabricator.wikimedia.org/T357419) [19:12:23] (03PS1) 10Cathal Mooney: WIP: adjust reimage cookbook to clear switch caches for vms too [cookbooks] - 
10https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) [19:12:47] (03CR) 10CI reject: [V: 04-1] Absent some reportupdater systemd-timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/1005565 (https://phabricator.wikimedia.org/T357419) (owner: 10Joal) [19:16:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T357189)', diff saved to https://phabricator.wikimedia.org/P57623 and previous config saved to /var/cache/conftool/dbconfig/20240221-191628-arnaudb.json [19:16:38] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [19:18:14] (03CR) 10BCornwall: [V: 03+1] fifo-log-demux: Decouple service from nginx/ats (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall) [19:20:31] (03PS3) 10Joal: Absent some reportupdater systemd-timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/1005565 (https://phabricator.wikimedia.org/T357419) [19:21:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P57624 and previous config saved to /var/cache/conftool/dbconfig/20240221-192144-marostegui.json [19:23:36] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:27:25] (03CR) 10Ssingh: fifo-log-demux: Decouple service from nginx/ats (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: 10BCornwall) [19:31:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P57625 and previous config saved to /var/cache/conftool/dbconfig/20240221-193135-arnaudb.json [19:31:48] 
(03CR) 10Ssingh: "Puppet does not upgrade the varnish package (or any other basically?) for us so it wouldn't notice anything if we uploaded an incorrect ve" [puppet] - 10https://gerrit.wikimedia.org/r/1005140 (owner: 10Ssingh) [19:34:46] (03CR) 10Eevans: [V: 03+2 C: 03+2] c-cqlsh is now deprecated; long live cqlsh-instance [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/1004235 (owner: 10Eevans) [19:36:10] (03CR) 10Herron: [C: 03+1] grafana: provision thanos-downsample datasources [puppet] - 10https://gerrit.wikimedia.org/r/1004680 (owner: 10Filippo Giunchedi) [19:36:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P57626 and previous config saved to /var/cache/conftool/dbconfig/20240221-193650-marostegui.json [19:36:52] (03CR) 10Herron: [C: 03+1] grafana: provision thanos-downsample datasources (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1004680 (owner: 10Filippo Giunchedi) [19:38:38] (03PS15) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [19:38:41] !log bking@deploy2002 deleting old flink data from thanos-swift T348685 [19:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:47] T348685: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685 [19:44:41] (03PS1) 10Bartosz Dziewoński: CentralAuthHooks::onGetUserBlock: Only run for reg. 
users [extensions/CentralAuth] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005481 (https://phabricator.wikimedia.org/T358112) [19:45:25] jeena: brennen: if you'd like to unblock the train, this can be backported ^ [19:46:39] thanks, I will do that [19:46:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P57627 and previous config saved to /var/cache/conftool/dbconfig/20240221-194641-arnaudb.json [19:48:01] 10SRE, 10User-aborrero: reimage cookbook: failure when updating netbox data from puppetdb on cloudvirt1033 - https://phabricator.wikimedia.org/T358099#9564998 (10cmooney) p:05Triage→03Low a:03cmooney Thanks @aborrero. Yeah something strange happening, I began to look at this earlier and got the same thi... [19:50:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jhuneidi@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005481 (https://phabricator.wikimedia.org/T358112) (owner: 10Bartosz Dziewoński) [19:51:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T355609)', diff saved to https://phabricator.wikimedia.org/P57628 and previous config saved to /var/cache/conftool/dbconfig/20240221-195157-marostegui.json [19:52:04] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:56:48] (03Merged) 10jenkins-bot: CentralAuthHooks::onGetUserBlock: Only run for reg. users [extensions/CentralAuth] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1005481 (https://phabricator.wikimedia.org/T358112) (owner: 10Bartosz Dziewoński) [19:57:00] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T357790#9565036 (10VRiley-WMF) [19:57:14] !log jhuneidi@deploy2002 Started scap: Backport for [[gerrit:1005481|CentralAuthHooks::onGetUserBlock: Only run for reg. 
users (T358112)]] [19:57:19] T358112: Special:Contributions for IP ranges fails with InvalidArgumentException, due to CentralAuth - https://phabricator.wikimedia.org/T358112 [19:58:43] (03PS16) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [19:58:45] !log jhuneidi@deploy2002 jhuneidi and matmarex: Backport for [[gerrit:1005481|CentralAuthHooks::onGetUserBlock: Only run for reg. users (T358112)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:59:23] jeena: thanks, looks fixed for me on mw.org [20:01:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T357189)', diff saved to https://phabricator.wikimedia.org/P57629 and previous config saved to /var/cache/conftool/dbconfig/20240221-200148-arnaudb.json [20:01:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [20:02:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2150.codfw.wmnet with reason: Maintenance [20:02:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T357189)', diff saved to https://phabricator.wikimedia.org/P57630 and previous config saved to /var/cache/conftool/dbconfig/20240221-200209-arnaudb.json [20:02:10] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [20:03:14] !log jhuneidi@deploy2002 jhuneidi and matmarex: Continuing with sync [20:03:26] MatmaRex: thanks for checking! [20:07:44] 10SRE, 10MW-on-K8s, 10Scap, 10serviceops, and 2 others: Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402#9565070 (10CodeReviewBot) dancy opened https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/219 Check bare metal and mw-... 
[20:11:23] !log jhuneidi@deploy2002 Finished scap: Backport for [[gerrit:1005481|CentralAuthHooks::onGetUserBlock: Only run for reg. users (T358112)]] (duration: 14m 09s) [20:11:45] T358112: Special:Contributions for IP ranges fails with InvalidArgumentException, due to CentralAuth - https://phabricator.wikimedia.org/T358112 [20:12:56] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005583 (https://phabricator.wikimedia.org/T354437) [20:13:00] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005583 (https://phabricator.wikimedia.org/T354437) (owner: 10TrainBranchBot) [20:13:41] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005583 (https://phabricator.wikimedia.org/T354437) (owner: 10TrainBranchBot) [20:14:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T357189)', diff saved to https://phabricator.wikimedia.org/P57631 and previous config saved to /var/cache/conftool/dbconfig/20240221-201400-arnaudb.json [20:14:18] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [20:16:54] !log jhuneidi@deploy2002 scap failed: average error rate on 4/4 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details) [20:17:11] :O [20:17:38] oh boy [20:17:53] I've never had this happen before so...not sure how to proceed [20:18:10] roll back? [20:19:00] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003885 was just merged and seems very related [20:19:08] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T357790#9565099 (10VRiley-WMF) [20:19:09] jeena: did it give any more detail in the console? 
[20:19:15] oh that [20:19:20] logstash has the error details [20:19:47] there are quite a few exceptions [20:20:48] looks like timeouts mostly though [20:21:56] i wonder if adding the mw-on-k8s canaries just caught timeouts that have been happening at every deploy? [20:22:01] * brennen fumbles around in logstash [20:22:37] * Zippybonzo wishes they had logstash access :( [20:24:45] I must clearly be missing something, but I'm not seeing any massive increase of errors anywhere [20:25:49] * Zippybonzo doesn't know what errors to look for so can't really look for any but doesn't notice anything breaking [20:26:05] maybe it was just a problem with logstash? All the exceptions from scap are this: Timeout on connection while downloading logstash1023.eqiad.wmnet:9200/logstash-*/_search [20:26:46] that seems to be a problem about _querying_ logstash, not that the error rate in logstash has increased [20:27:06] which supports my theory that the recent logstash_checker patch broke something [20:27:11] yeah, that's a believable failure mode given the change here [20:27:20] cc: dancy, thcipriani [20:27:23] yeah, I wasn't thinking that the error rate had increased [20:28:00] sorry if I made it sound like that [20:28:27] ah, gotcha, re: timeouts. [20:29:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P57632 and previous config saved to /var/cache/conftool/dbconfig/20240221-202906-arnaudb.json [20:29:12] yeah [20:29:23] (i suppose it's also possible that there's no bug with the checker as such and the requests really did just timeout.) [20:30:15] anyhow, given that the error rate did not in fact skyrocket, continuing the deployment seems safe to me [20:30:49] probably, although i don't think we want to be operating without canaries in general. 
[20:30:57] yeah [20:31:56] so there was one backport after the checker patch was merged that succeeded, and now this deployment that failed [20:31:59] there is still the question of why it was timing out [20:32:04] s/was/is [20:32:17] So as far as I understand it there's not action to take to continue with deployment, it is already deployed [20:32:41] umm yeah although I don't think the backport should have affected this https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1005481 [20:32:44] * dancy reads [20:33:03] jeena: i don't think it will have continued beyond canaries. [20:33:12] jeena: I see wmf.18 when visiting Special:Version on commons, so definitely not deployed everywhere [20:33:17] oh okay [20:33:36] I was looking at the versions page [20:34:06] it's on officewiki fwiw [20:34:13] officewiki is group0 [20:34:15] The mods I made to logstash_checker.py should not be causing this since scap itself doesn't have the necessary changes to activate the new behavior. [20:34:35] taavi i learn something new every day.  i thought it was group1. [20:34:39] my understanding of the change in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1003885 is that it it needs a related config change somewhere to actually take effect [20:34:48] shall I just do a re-run? [20:35:40] If scap ran to completion before, re-running isn't necessary [20:35:54] I don't think it did? [20:36:00] it failed at the canary checks [20:36:02] it failed on the canary check phase [20:36:14] Ok..yes re-run [20:36:50] okay, trying again [20:36:58] looking at scap logs [20:37:03] When I get back to my desk I'll run logstash_checker.py manually to see how it behaves. 
[20:38:28] looks like there was a socket timeout [20:38:47] > __main__.CheckServiceError: Timeout on connection while downloading logstash1023.eqiad.wmnet:9200/logstash-*/_search [20:39:13] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:39:19] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:39:58] and that socket timeout caused all canaries to fail [20:41:20] (03PS17) 10ArielGlenn: sql/xml dumps: add role for helper worker for wikidata full history dumps [puppet] - 10https://gerrit.wikimedia.org/r/993659 (https://phabricator.wikimedia.org/T252396) [20:44:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P57633 and previous config saved to /var/cache/conftool/dbconfig/20240221-204415-arnaudb.json [20:44:25] makes sense. [20:45:08] logstash_checker.py still seems to work in the general case, afaict, running it from the command line. [20:45:33] network hiccup that we've never had before seems strange though [20:46:12] !log jhuneidi@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.19 refs T354437 [20:46:17] T354437: 1.42.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T354437 [20:49:06] canary check passed [20:50:09] yeah, running logstash_checker.py manually seems fine, too. [20:51:05] i'm not 100% sure this hasn't happened before. [20:51:17] definitely not often, but it kind of jogs a memory. 
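Note on the canary failure discussed above: scap reported "average error rate on 4/4 canaries increased by 10x", but the underlying exception was a timeout while querying logstash, not a measured rise in errors. T144033 asks for those two cases to be handled separately. A minimal sketch of that separation follows. It is not scap's or logstash_checker.py's actual code: `CanaryVerdict`, `QueryTimeout`, `judge_canary`, the `fetch_rates` callable, and the baseline floor of 1.0 are all hypothetical names and values chosen for illustration, and the 10x threshold is taken only from the log message.

```python
from enum import Enum


class CanaryVerdict(Enum):
    PASS = "pass"
    ERROR_SPIKE = "error_spike"              # error rate really did increase
    CHECK_UNAVAILABLE = "check_unavailable"  # could not measure at all


class QueryTimeout(Exception):
    """Hypothetical: raised when the log backend does not answer in time."""


def judge_canary(fetch_rates, threshold=10.0):
    """Classify one canary.

    fetch_rates is a hypothetical callable returning a
    (before, after) pair of error rates, or raising QueryTimeout.
    """
    try:
        before, after = fetch_rates()
    except QueryTimeout:
        # A failed measurement is reported as its own outcome,
        # instead of being counted as an error spike.
        return CanaryVerdict.CHECK_UNAVAILABLE

    # Assumed floor so that a zero baseline does not divide by zero.
    baseline = max(before, 1.0)
    if after / baseline >= threshold:
        return CanaryVerdict.ERROR_SPIKE
    return CanaryVerdict.PASS


def steady():
    return (2.0, 3.0)


def spiking():
    return (2.0, 40.0)


def timing_out():
    raise QueryTimeout("no answer from log backend")


if __name__ == "__main__":
    for probe in (steady, spiking, timing_out):
        print(probe.__name__, judge_canary(probe).value)
```

With a third outcome available, a deploy tool could tell the operator "canary check could not be performed" rather than "error rate increased", which is the confusion that came up in this conversation. Whether an unavailable check should block the deploy is a separate policy decision that this sketch leaves open.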
[20:53:33] (03PS1) 10Eevans: restbase: provision restbase1034-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005590 (https://phabricator.wikimedia.org/T354560) [20:53:35] (03PS1) 10Eevans: restbase: provision restbase1035-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005591 (https://phabricator.wikimedia.org/T354560) [20:53:37] (03PS1) 10Eevans: restbase: provision restbase1036-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005592 (https://phabricator.wikimedia.org/T354560) [20:53:39] (03PS1) 10Eevans: restbase: provision restbase1037-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005593 (https://phabricator.wikimedia.org/T354560) [20:53:41] (03PS1) 10Eevans: restbase: provision restbase1038-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005594 (https://phabricator.wikimedia.org/T354560) [20:53:43] (03PS1) 10Eevans: restbase: provision restbase1039-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005595 (https://phabricator.wikimedia.org/T354560) [20:53:45] (03PS1) 10Eevans: restbase: provision restbase1040-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005596 (https://phabricator.wikimedia.org/T354560) [20:53:47] (03PS1) 10Eevans: restbase: provision restbase1041-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005597 (https://phabricator.wikimedia.org/T354560) [20:53:49] (03PS1) 10Eevans: restbase: provision restbase1042-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005598 (https://phabricator.wikimedia.org/T354560) [20:54:48] !log jhuneidi@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.19 refs T354437 (duration: 08m 35s) [20:55:03] T354437: 1.42.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T354437 [20:55:05] thanks for the help everyone [20:55:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase: Q3:rack/setup/install restbase10[34-42] - https://phabricator.wikimedia.org/T354893#9565196 (10Eevans) 05Open→03Resolved >>! 
In T354893#9563034, @Volans wrote: > @Eevans yes, we've done it already in T305568#7992643 :( > I've created the records for 3 cassandra... [20:56:16] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T357790#9565199 (10VRiley-WMF) 05In progress→03Resolved [20:59:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T357189)', diff saved to https://phabricator.wikimedia.org/P57634 and previous config saved to /var/cache/conftool/dbconfig/20240221-205922-arnaudb.json [20:59:25] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [20:59:29] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [20:59:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2159.codfw.wmnet with reason: Maintenance [20:59:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [20:59:55] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [21:00:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T357189)', diff saved to https://phabricator.wikimedia.org/P57635 and previous config saved to /var/cache/conftool/dbconfig/20240221-210001-arnaudb.json [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: Time to do the UTC late backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T2100). [21:00:04] cscott, Jdlrobson, and anzx: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:08] (03PS2) 10Anzx: cswiki, commonswiki, enwiki: fix IP cap date and IP for WikiGap Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005476 (https://phabricator.wikimedia.org/T357978) [21:00:58] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:01:59] here [21:02:11] o/ [21:02:26] please hold one second [21:02:41] jeena: might want to look at that CentralAuth error real quick [21:02:45] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Arthur Taylor - https://phabricator.wikimedia.org/T357147#9565223 (10odimitrijevic) Approved [21:02:49] (just noticed it in logspam-watch) [21:02:58] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ElineWMDE - https://phabricator.wikimedia.org/T357097#9565224 (10odimitrijevic) Approved [21:03:09] looking [21:03:20] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics-privatedata-users for jwheeler - https://phabricator.wikimedia.org/T357731#9565225 (10odimitrijevic) Approved [21:03:36] (SystemdUnitFailed) firing: (2) check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:04:10] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:04:15] hmm, weird [21:04:20] brennen: T143982, T144033 [21:04:24] T143982: scap on beta cluster does not run anymore due to logstash being down - https://phabricator.wikimedia.org/T143982 [21:04:24] T144033: handle logstash timeouts separately from spikes in errors reported by logstash - 
https://phabricator.wikimedia.org/T144033 [21:04:34] T143973 [21:04:34] T143973: beta-scap-eqiad failing: Timeout on connection while downloading deployment-logstash2.deployment-prep.eqiad.wmflabs:9200/logstash-*/_search - https://phabricator.wikimedia.org/T143973 [21:05:15] jeena: not a huge spike of it, and all the inputs are spam. probably worth flagging but i'm guessing it's ok for backports to go ahead. [21:05:24] * brennen disappears into a meeting. [21:05:32] yeah I thought it looked like spam too [21:05:42] i'm here. [21:06:25] I can run the backports if no one is available [21:08:53] okay cscott yours is first in the list so I'll go ahead with that one [21:09:09] cool! [21:09:18] should be straightforward, i'll get setup to test [21:09:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jhuneidi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999062 (https://phabricator.wikimedia.org/T355566) (owner: 10C. Scott Ananian) [21:10:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T357189)', diff saved to https://phabricator.wikimedia.org/P57636 and previous config saved to /var/cache/conftool/dbconfig/20240221-211039-arnaudb.json [21:10:42] (03Merged) 10jenkins-bot: Turn on Parsoid read views by default on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/999062 (https://phabricator.wikimedia.org/T355566) (owner: 10C. Scott Ananian) [21:10:59] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [21:11:05] !log jhuneidi@deploy2002 Started scap: Backport for [[gerrit:999062|Turn on Parsoid read views by default on officewiki (T355566)]] [21:11:22] T355566: Use Parsoid for read views on OfficeWiki by default - https://phabricator.wikimedia.org/T355566 [21:12:12] jeena: When you have a moment, please copy-and-paste the relevant portion of the scap output into the description of T144033. 
[21:12:12] T144033: handle logstash timeouts separately from spikes in errors reported by logstash - https://phabricator.wikimedia.org/T144033 [21:12:40] !log jhuneidi@deploy2002 cscott and jhuneidi: Backport for [[gerrit:999062|Turn on Parsoid read views by default on officewiki (T355566)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:12:41] I've been working in that area of the code recently so now's the time to fix it. [21:14:35] dancy: done [21:14:51] cscott: ready for you to check on mwdebug [21:15:58] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [21:16:38] jeena: FYI one of mine https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1005570?usp=search is just a beta cluster change [21:16:48] jeena: Thanks! [21:17:03] jeena ok checking [21:17:12] fyi backporteers -- you're about to see some SAL noise from a helmfile deploy I'm running, but it's unimpactful to anything you're doing [21:17:29] !log rzl@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [21:17:30] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host ncmonitor1001.eqiad.wmnet with OS bookworm [21:17:35] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10vm-requests: eqiad: 1 VM request for ncmonitor - https://phabricator.wikimedia.org/T356710#9565283 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host ncmonitor1001.eqiad.wmnet with OS bookworm [21:17:49] !log rzl@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [21:18:06] jeena looks good, ok to continue [21:18:10] !log jhuneidi@deploy2002 cscott and jhuneidi: Continuing with sync [21:18:22] Jdlrobson: I can do both yours next [21:18:36] !log rzl@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[21:18:54] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 44 probes of 804 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:19:00] !log rzl@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [21:19:43] jeena: https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/219 will take care of some of the terribleness. [21:20:10] (03CR) 10Jeena Huneidi: [C: 03+2] "backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005570 (https://phabricator.wikimedia.org/T357759) (owner: 10Jdlrobson) [21:20:28] (03CR) 10Jeena Huneidi: [C: 03+2] "backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005569 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [21:21:22] (03Merged) 10jenkins-bot: Remove Japanese Wikipedia from projects sharing user scripts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005569 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson) [21:21:27] (03CR) 10Eevans: [C: 03+2] restbase: provision restbase1034-{a,b,c} (new) [puppet] - 10https://gerrit.wikimedia.org/r/1005590 (https://phabricator.wikimedia.org/T354560) (owner: 10Eevans) [21:21:34] (03Merged) 10jenkins-bot: Enable night mode on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005570 (https://phabricator.wikimedia.org/T357759) (owner: 10Jdlrobson) [21:24:14] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [21:24:28] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[21:25:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P57637 and previous config saved to /var/cache/conftool/dbconfig/20240221-212546-arnaudb.json [21:26:24] !log jhuneidi@deploy2002 Finished scap: Backport for [[gerrit:999062|Turn on Parsoid read views by default on officewiki (T355566)]] (duration: 15m 19s) [21:26:31] T355566: Use Parsoid for read views on OfficeWiki by default - https://phabricator.wikimedia.org/T355566 [21:27:24] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ncmonitor1001.eqiad.wmnet with reason: host reimage [21:27:31] !log jhuneidi@deploy2002 Started scap: Backport for [[gerrit:1005569|Remove Japanese Wikipedia from projects sharing user scripts (T301212)]], [[gerrit:1005570|Enable night mode on beta cluster (T357759)]] [21:27:47] Jdlrobson: I've started your backports [21:27:56] T301212: Vector-2022.js should no longer load legacy Vector site and user scripts/styles - https://phabricator.wikimedia.org/T301212 [21:27:56] T357759: Deploy night mode on the minerva skin on test wiki - https://phabricator.wikimedia.org/T357759 [21:28:29] jeena: thanks [21:29:00] !log jhuneidi@deploy2002 jdlrobson and jhuneidi: Backport for [[gerrit:1005569|Remove Japanese Wikipedia from projects sharing user scripts (T301212)]], [[gerrit:1005570|Enable night mode on beta cluster (T357759)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:31:06] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ncmonitor1001.eqiad.wmnet with reason: host reimage [21:31:53] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [21:32:11] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [21:32:38] Jdlrobson: any checks you need to do? 
[21:33:21] jeena: yep shouldnt take long doing now [21:33:32] 👍 [21:33:57] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 33 probes of 804 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:34:40] jeena: yep looks good please sync [21:34:44] thanks! [21:34:48] !log jhuneidi@deploy2002 jdlrobson and jhuneidi: Continuing with sync [21:37:07] anzx: are you around? [21:37:21] jeena: yes [21:37:50] okay, I'm going to go ahead and +2 your change [21:38:39] PROBLEM - Check whether ferm is active by checking the default input chain on mw1385 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:39:18] (03CR) 10Jeena Huneidi: [C: 03+2] cswiki, commonswiki, enwiki: fix IP cap date and IP for WikiGap Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005476 (https://phabricator.wikimedia.org/T357978) (owner: 10Anzx) [21:39:27] PROBLEM - Check whether ferm is active by checking the default input chain on mw1377 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:39:32] I have a config patch I'd like to deploy once everyone else has deployed their changes. I can self-serve. 
[21:40:01] (03Merged) 10jenkins-bot: cswiki, commonswiki, enwiki: fix IP cap date and IP for WikiGap Editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005476 (https://phabricator.wikimedia.org/T357978) (owner: 10Anzx) [21:40:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P57638 and previous config saved to /var/cache/conftool/dbconfig/20240221-214052-arnaudb.json [21:41:15] Dreamy_Jazz: I'll ping you when done [21:41:21] Thanks! [21:42:56] !log jhuneidi@deploy2002 Finished scap: Backport for [[gerrit:1005569|Remove Japanese Wikipedia from projects sharing user scripts (T301212)]], [[gerrit:1005570|Enable night mode on beta cluster (T357759)]] (duration: 15m 25s) [21:43:04] T301212: Vector-2022.js should no longer load legacy Vector site and user scripts/styles - https://phabricator.wikimedia.org/T301212 [21:43:07] T357759: Deploy night mode on the minerva skin on test wiki - https://phabricator.wikimedia.org/T357759 [21:43:43] !log jhuneidi@deploy2002 Started scap: Backport for [[gerrit:1005476|cswiki, commonswiki, enwiki: fix IP cap date and IP for WikiGap Editathon (T357978)]] [21:43:48] T357978: Lift IP cap for WikiGap Editathon - https://phabricator.wikimedia.org/T357978 [21:43:52] jeena: nothing to test, you can sync it [21:43:58] okay [21:44:11] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ncmonitor1001.eqiad.wmnet with OS bookworm [21:44:15] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10vm-requests: eqiad: 1 VM request for ncmonitor - https://phabricator.wikimedia.org/T356710#9565349 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host ncmonitor1001.eqiad.wmnet with OS bookworm completed: - ncmoni... 
[21:45:11] !log jhuneidi@deploy2002 anzx and jhuneidi: Backport for [[gerrit:1005476|cswiki, commonswiki, enwiki: fix IP cap date and IP for WikiGap Editathon (T357978)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:46:04] !log jhuneidi@deploy2002 anzx and jhuneidi: Continuing with sync [21:46:32] Thanks jeena [21:46:44] You're welcome! [21:48:09] (03PS1) 10Dreamy Jazz: Pin wgGlobalBlockingAllowGlobalAccountBlocks as false on WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005607 (https://phabricator.wikimedia.org/T356923) [21:51:37] (03PS2) 10Dreamy Jazz: Pin wgGlobalBlockingAllowGlobalAccountBlocks as false on WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005607 (https://phabricator.wikimedia.org/T356923) [21:51:51] !log bootstrapping Cassandra, restbase1034-{a,b,c} — T354560 [21:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:57] T354560: Provision new RESTBase cluster nodes: restbase10[34-42] - https://phabricator.wikimedia.org/T354560 [21:52:41] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase1034.eqiad.wmnet with reason: Bootstrapping — T354560 [21:52:55] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1034.eqiad.wmnet with reason: Bootstrapping — T354560 [21:54:22] (03PS1) 10Eevans: restbase: (phony) keys & certs for missing hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1005608 (https://phabricator.wikimedia.org/T354560) [21:54:30] !log jhuneidi@deploy2002 Finished scap: Backport for [[gerrit:1005476|cswiki, commonswiki, enwiki: fix IP cap date and IP for WikiGap Editathon (T357978)]] (duration: 10m 47s) [21:54:34] jeena: thanks [21:54:37] T357978: Lift IP cap for WikiGap Editathon - https://phabricator.wikimedia.org/T357978 [21:54:46] np [21:54:56] (03PS2) 10Eevans: restbase: (phony) keys & certs for missing/new hosts [labs/private] - 
10https://gerrit.wikimedia.org/r/1005608 (https://phabricator.wikimedia.org/T354560) [21:54:59] Dreamy_Jazz: backports are finished [21:55:05] Thanks! [21:55:29] 10SRE-swift-storage, 10MediaWiki-Uploading, 10User-revi: FAILED: stashfailed: Could not read file "mwstore://local-swift-eqiad/local-temp/a/ac/15xi9btm14os.u9p1dr.1208681.webm.0". - https://phabricator.wikimedia.org/T200820#9565400 (10Bawolff) maybe what is happening is that two assemble jobs are running at... [21:55:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T357189)', diff saved to https://phabricator.wikimedia.org/P57639 and previous config saved to /var/cache/conftool/dbconfig/20240221-215558-arnaudb.json [21:56:00] (03PS3) 10Dreamy Jazz: Pin wgGlobalBlockingAllowGlobalAccountBlocks as false on WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005607 (https://phabricator.wikimedia.org/T356923) [21:56:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [21:56:04] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [21:56:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [21:56:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T357189)', diff saved to https://phabricator.wikimedia.org/P57640 and previous config saved to /var/cache/conftool/dbconfig/20240221-215620-arnaudb.json [21:57:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005607 (https://phabricator.wikimedia.org/T356923) (owner: 10Dreamy Jazz) [21:58:03] (03Merged) 10jenkins-bot: Pin wgGlobalBlockingAllowGlobalAccountBlocks as false on WMF wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1005607 (https://phabricator.wikimedia.org/T356923) 
(owner: 10Dreamy Jazz) [21:58:25] !log dreamyjazz@deploy2002 Started scap: Backport for [[gerrit:1005607|Pin wgGlobalBlockingAllowGlobalAccountBlocks as false on WMF wikis (T356923 T356924)]] [21:58:34] T356923: Create a configuration value to control whether global account blocks are enabled - https://phabricator.wikimedia.org/T356923 [21:58:34] T356924: Deploy global account blocks to WMF wikis - https://phabricator.wikimedia.org/T356924 [21:58:54] (03PS1) 10Ahmon Dancy: logstash_checker.py: Exit 10 if over error threshold [puppet] - 10https://gerrit.wikimedia.org/r/1005610 (https://phabricator.wikimedia.org/T144033) [21:59:58] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1005607|Pin wgGlobalBlockingAllowGlobalAccountBlocks as false on WMF wikis (T356923 T356924)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:00:04] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240221T2200) [22:00:43] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [22:02:23] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2041*,elastic2042*,elastic2057*,elastic2063*,elastic2064*,elastic2077*,elastic2078*,elastic2092*,elastic2093*,elastic2094* for switch maintenance - bking@cumin2002 - T355860 [22:02:27] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2041*,elastic2042*,elastic2057*,elastic2063*,elastic2064*,elastic2077*,elastic2078*,elastic2092*,elastic2093*,elastic2094* for switch maintenance - bking@cumin2002 - T355860 [22:02:42] T355860: Migrate servers in codfw rack B4 from asw-b4-codfw to lsw1-b4-codfw - https://phabricator.wikimedia.org/T355860 [22:04:40] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2057:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2057 
- https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:08:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T357189)', diff saved to https://phabricator.wikimedia.org/P57641 and previous config saved to /var/cache/conftool/dbconfig/20240221-220807-arnaudb.json [22:08:15] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [22:08:39] RECOVERY - Check whether ferm is active by checking the default input chain on mw1385 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:08:42] !log dreamyjazz@deploy2002 Finished scap: Backport for [[gerrit:1005607|Pin wgGlobalBlockingAllowGlobalAccountBlocks as false on WMF wikis (T356923 T356924)]] (duration: 10m 16s) [22:08:51] T356923: Create a configuration value to control whether global account blocks are enabled - https://phabricator.wikimedia.org/T356923 [22:08:52] T356924: Deploy global account blocks to WMF wikis - https://phabricator.wikimedia.org/T356924 [22:09:27] RECOVERY - Check whether ferm is active by checking the default input chain on mw1377 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:10:40] !log [WDQS] T355868 Depooling `wdqs2024`, `wdqs2014,` `wdqs2010` in anticipation of row maintenance [22:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:49] T355868: Migrate servers in codfw rack B2 from asw-b2-codfw to lsw1-b2-codfw - https://phabricator.wikimedia.org/T355868 [22:12:08] !log Evening UTC backport window done [22:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:22] !log ryankemper@deploy2002 Started deploy [wdqs/wdqs@8a290df]: new allowlisted endpoints for wdqs [22:18:58] (03CR) 10Eevans: [V: 03+2 C: 03+2] restbase: (phony) keys & certs for missing/new hosts [labs/private] - 10https://gerrit.wikimedia.org/r/1005608 
(https://phabricator.wikimedia.org/T354560) (owner: 10Eevans) [22:20:36] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:20:37] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:23:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P57642 and previous config saved to /var/cache/conftool/dbconfig/20240221-222313-arnaudb.json [22:25:44] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:25:50] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:29:04] PROBLEM - Host ps1-c3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [22:29:22] !log ryankemper@deploy2002 Finished deploy [wdqs/wdqs@8a290df]: new allowlisted endpoints for wdqs (duration: 11m 59s) [22:37:48] (PuppetZeroResources) firing: Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:38:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P57643 and previous config saved to /var/cache/conftool/dbconfig/20240221-223819-arnaudb.json [22:41:48] (PuppetZeroResources) firing: Puppet has failed generate resources on mw1405:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:42:48] (PuppetZeroResources) firing: Puppet has failed generate resources on irc1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:43:43] (SystemdUnitFailed) firing: 
(2) send_tile_invalidations.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:43:48] (PuppetZeroResources) firing: Puppet has failed generate resources on chartmuseum1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:43:48] (PuppetZeroResources) firing: Puppet has failed generate resources on parse1015:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:43:54] (PuppetZeroResources) firing: Puppet has failed generate resources on apt2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:45:47] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10vm-requests: eqiad: 1 VM request for ncmonitor - https://phabricator.wikimedia.org/T356710#9565580 (10BCornwall) 05Open→03Resolved Thanks for the nudge. Puppet is applying now. 
[22:47:49] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:50:57] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:51:04] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:51:48] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1366:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:53:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T357189)', diff saved to https://phabricator.wikimedia.org/P57644 and previous config saved to /var/cache/conftool/dbconfig/20240221-225326-arnaudb.json [22:53:29] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [22:53:38] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [22:53:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [22:53:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on chartmuseum1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:53:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T357189)', diff saved to https://phabricator.wikimedia.org/P57645 and previous config saved to /var/cache/conftool/dbconfig/20240221-225350-arnaudb.json [22:55:19] (03CR) 10Fabfur: [C: 03+1] "I think this is ok" [puppet] - 
10https://gerrit.wikimedia.org/r/1004082 (owner: 10Majavah) [22:58:48] (PuppetZeroResources) resolved: Puppet has failed generate resources on parse1015:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [22:58:48] (PuppetZeroResources) firing: Puppet has failed generate resources on parse1022:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:03:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on parse1019:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:13:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on parse1019:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:20:39] (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2078-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:22:49] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1398:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:22:49] (PuppetZeroResources) resolved: Puppet has failed generate resources on irc1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:23:50] (PuppetZeroResources) resolved: Puppet has failed generate resources on apt2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:24:38] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:24:45] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:26:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T357189)', diff saved to https://phabricator.wikimedia.org/P57646 and previous config saved to /var/cache/conftool/dbconfig/20240221-232649-arnaudb.json [23:26:55] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [23:30:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (2) Elasticsearch instance elastic2063-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:35:18] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on irc1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:35:39] (CirrusSearchNodeIndexingNotIncreasing) firing: (3) Elasticsearch instance elastic2063-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [23:37:05] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:37:11] (03CR) 10Jeena Huneidi: [C: 03+1] logstash_checker.py: Exit 10 if over error threshold [puppet] - 10https://gerrit.wikimedia.org/r/1005610 (https://phabricator.wikimedia.org/T144033) (owner: 10Ahmon Dancy) [23:37:12] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:37:48] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on mw1426:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:41:48] (PuppetZeroResources) firing: Puppet has failed generate resources on conf1007:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:41:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P57647 and previous config saved to /var/cache/conftool/dbconfig/20240221-234156-arnaudb.json [23:42:48] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1426:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:47:49] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on mw1426:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:52:48] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on mw1426:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:56:48] (PuppetZeroResources) firing: Puppet has failed generate resources on maps1006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:56:49] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on mw1366:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [23:57:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P57648 and previous config saved to /var/cache/conftool/dbconfig/20240221-235703-arnaudb.json