[00:00:34] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:03:46] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:05:18] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:06:00] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:06:31] !log eevans@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host restbase2028.codfw.wmnet
[00:06:34] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:12:25] SRE, LDAP-Access-Requests: Request to be added to the ldap/wmde group for ArthurTaylor - https://phabricator.wikimedia.org/T352653 (KFrancis) Hi @ArthurTaylor, please send your email address to kfrancis@wikimedia.org and I will put the NDA together and send to you for signing. Thanks!
[00:13:55] (PS2) Kimberly Sarabia: Remove readability survey tool [mediawiki-config] - https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337)
[00:23:07] (CR) Krinkle: [C: +1] "LGTM to deploy."
[mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: D3r1ck01)
[00:38:24] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/979961
[00:38:26] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/979961 (owner: TrainBranchBot)
[00:56:20] PROBLEM - cassandra-c service on restbase2028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:56:46] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/979961 (owner: TrainBranchBot)
[01:03:02] RECOVERY - Restbase root url on restbase2028 is OK: HTTP OK: HTTP/1.1 200 - 17816 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[01:03:49] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T352731 (phaultfinder)
[01:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:15:47] (CR) Dzahn: [V: +1 C: +2] "has been added to google doc by KFrancis" [puppet] - https://gerrit.wikimedia.org/r/980013 (https://phabricator.wikimedia.org/T348520) (owner: Dzahn)
[01:16:22] RECOVERY - cassandra-a SSL 10.192.16.237:7000 on restbase2028 is OK: SSL OK - Certificate restbase2028-a valid until 2025-12-03 21:32:59 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[01:17:59] (PuppetFailure) resolved: Puppet has failed on restbase2028:9100 - https://puppetboard.wikimedia.org/nodes?status=failed -
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:18:49] !log LDAP - added user xqt to group nda (T348520)
[01:18:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:53] T348520: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520
[01:20:42] SRE, LDAP-Access-Requests, WMF-NDA-Requests, Patch-For-Review: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (Dzahn) Open→Resolved >>! In T348520#9381464, @KFrancis wrote: > Done, thanks! Thank you as well! Also done on our side. @Xqt You have been a...
[01:41:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:39:04] (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:58:12] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:59:40] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T0300)
[03:00:20] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:09:04] (JobUnavailable) firing: (10) Reduced
availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:52:58] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:53:32] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:53:48] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:57:24] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:57:58] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:58:16] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T0400)
[04:01:56] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:02:28] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:02:44] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0:
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:07:12] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:07:48] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:08:24] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:37:58] (PS2) Andrew Bogott: Horizon: allow image uploading via horizon for users with glance admin [puppet] - https://gerrit.wikimedia.org/r/980021 (https://phabricator.wikimedia.org/T326818)
[04:38:00] (PS1) Andrew Bogott: cloud-init: make puppet optional [puppet] - https://gerrit.wikimedia.org/r/980079 (https://phabricator.wikimedia.org/T326818)
[04:47:34] (PS2) Andrew Bogott: cloud-init: make puppet optional [puppet] - https://gerrit.wikimedia.org/r/980079 (https://phabricator.wikimedia.org/T326818)
[06:03:03] SRE, ops-eqiad, DBA: Degraded RAID on db1199 - https://phabricator.wikimedia.org/T352238 (Marostegui) Thank you!
Icinga looks good now
[06:16:38] (PS1) Marostegui: Revert "mariadb: Promote db1119 to m5 master" [puppet] - https://gerrit.wikimedia.org/r/979695
[06:17:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2135,2160].codfw.wmnet,db[1119,1176,1217].eqiad.wmnet with reason: m5 master switch T352631
[06:17:16] T352631: Switchover m5 master db1119 -> db1176 - https://phabricator.wikimedia.org/T352631
[06:17:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2135,2160].codfw.wmnet,db[1119,1176,1217].eqiad.wmnet with reason: m5 master switch T352631
[06:20:28] (CR) Marostegui: [C: +2] Revert "mariadb: Promote db1119 to m5 master" [puppet] - https://gerrit.wikimedia.org/r/979695 (owner: Marostegui)
[06:23:50] !log Failover m5 from db1119 to db1176 - T352631
[06:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:54] T352631: Switchover m5 master db1119 -> db1176 - https://phabricator.wikimedia.org/T352631
[06:27:28] (PS1) Marostegui: db1119: To be decommissioned [puppet] - https://gerrit.wikimedia.org/r/980082
[06:28:33] (CR) Marostegui: [C: +2] db1119: To be decommissioned [puppet] - https://gerrit.wikimedia.org/r/980082 (owner: Marostegui)
[06:35:54] (PS1) Marostegui: db1176: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/980083
[06:36:27] (CR) Marostegui: [C: +2] db1176: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/980083 (owner: Marostegui)
[06:47:11] (PS1) Marostegui: site.pp: db1119 will be decommissioned [puppet] - https://gerrit.wikimedia.org/r/980084 (https://phabricator.wikimedia.org/T337206)
[06:47:46] (CR) Marostegui: [C: +2] site.pp: db1119 will be decommissioned [puppet] - https://gerrit.wikimedia.org/r/980084 (https://phabricator.wikimedia.org/T337206) (owner: Marostegui)
[06:51:58] (CR) Vgutierrez: [C: +2] hiera: Enable IPIP
on eqsin text|secondary LVS [puppet] - https://gerrit.wikimedia.org/r/979994 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[06:54:04] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:55:11] !log rolling restart of text|secondary LVS on eqsin effectively enabling IPIP encapsulation for ncredir@eqsin - T351069
[06:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:15] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[06:55:43] (PS1) Marostegui: wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/980087 (https://phabricator.wikimedia.org/T351864)
[06:58:20] SRE, Traffic, Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (Vgutierrez)
[06:59:05] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T0700)
[07:00:04] kormat, marostegui, and Amir1: (Dis)respected human, time to deploy Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T0700). Please do the needful.
[07:00:20] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:11:00] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:11:58] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:12:28] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:30] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:13:28] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:20:58] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 400 probes of 731 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:22:46] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:25:38] (PS1) Vgutierrez: hiera: Disable rp_filter for ncredir@drmrs [puppet] - https://gerrit.wikimedia.org/r/980272 (https://phabricator.wikimedia.org/T351069)
[07:25:40]
(PS1) Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@drmrs [puppet] - https://gerrit.wikimedia.org/r/980273 (https://phabricator.wikimedia.org/T351069)
[07:25:42] (PS1) Vgutierrez: hiera: Enable IPIP on text|secondary LVS in drmrs [puppet] - https://gerrit.wikimedia.org/r/980274 (https://phabricator.wikimedia.org/T351069)
[07:27:16] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/817/con" [puppet] - https://gerrit.wikimedia.org/r/980272 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[07:29:25] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/818/con" [puppet] - https://gerrit.wikimedia.org/r/980273 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[07:29:40] SRE-tools, Infrastructure-Foundations, Orchestrator: Add database host removal from Orchestrator to sre.hosts.decommission cookbook - https://phabricator.wikimedia.org/T287954 (Marostegui) Open→Declined I think we can probably decline this. Orchestrator removes the host itself after 14 days,...
[07:30:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:30:57] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/980274 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[07:31:38] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:36:04] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:37:20] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 46 probes of 731 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:40:32] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:43:28] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:49:22] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:52:50] (PS4) Awight: [beta] Enable FileImporter Codex mode on the beta cluster
[mediawiki-config] - https://gerrit.wikimedia.org/r/979988 (https://phabricator.wikimedia.org/T347453)
[07:56:42] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:00:05] Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T0800).
[08:00:05] bwang and awight: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:01:10] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:05:52] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:10:04] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:48] (CR) Muehlenhoff: "Ready for review. Not sure why it claims that PCC failed, the actual report is all fine."
[puppet] - https://gerrit.wikimedia.org/r/979897 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[08:11:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:17:07] (CR) Muehlenhoff: firewall::service: spelling fixes, add missing parameter comments (3 comments) [puppet] - https://gerrit.wikimedia.org/r/980022 (owner: Dzahn)
[08:22:45] (CR) TrainBranchBot: [C: +2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/979988 (https://phabricator.wikimedia.org/T347453) (owner: Awight)
[08:23:28] (Merged) jenkins-bot: [beta] Enable FileImporter Codex mode on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/979988 (https://phabricator.wikimedia.org/T347453) (owner: Awight)
[08:24:15] SRE, SRE-Unowned: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492 (fgiunchedi)
[08:24:27] SRE, Infrastructure-Foundations: Set nofail for raid0 recipes - https://phabricator.wikimedia.org/T350461 (fgiunchedi)
[08:25:34] (CR) MVernon: [C: +1] "matthew@tsk:~/puppet$ md5sum hieradata/hosts/dbproxy102{3,5}.yaml" [dns] - https://gerrit.wikimedia.org/r/980087 (https://phabricator.wikimedia.org/T351864) (owner: Marostegui)
[08:25:55] (CR) Marostegui: [C: +2] wmnet: Failover m2-master [dns] - https://gerrit.wikimedia.org/r/980087 (https://phabricator.wikimedia.org/T351864) (owner: Marostegui)
[08:26:12] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:26:29] !log Failover m2-master
dbproxy1023.eqiad.wmnet -> dbproxy1025.eqiad.wmnet T351864
[08:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:33] T351864: Migrate dbproxy hosts to Bookworm - https://phabricator.wikimedia.org/T351864
[08:30:08] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:09] (CR) Muehlenhoff: "This is because these were installed with insetup::serviceops and that role already defaults to Puppet 7. @Eric Ping me when you are aroun" [puppet] - https://gerrit.wikimedia.org/r/980049 (https://phabricator.wikimedia.org/T352468) (owner: Eevans)
[08:31:42] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:32:50] (PS1) Muehlenhoff: Fix Hadoop Cumin alias [puppet] - https://gerrit.wikimedia.org/r/980342 (https://phabricator.wikimedia.org/T352193)
[08:36:35] (CR) Muehlenhoff: [C: +2] Fix Hadoop Cumin alias [puppet] - https://gerrit.wikimedia.org/r/980342 (https://phabricator.wikimedia.org/T352193) (owner: Muehlenhoff)
[08:43:02] (CR) Muehlenhoff: restbase: migrate restbase2028 to puppet 7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980049 (https://phabricator.wikimedia.org/T352468) (owner: Eevans)
[08:52:09] (CR) David Caro: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/820/console" [puppet] - https://gerrit.wikimedia.org/r/979888 (https://phabricator.wikimedia.org/T350008) (owner: David Caro)
[08:53:59] (PuppetFailure) firing: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed -
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:56:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:57:14] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:57:59] (CR) Ilias Sarantopoulos: [C: +2] ml-services: fix rest gateway endpoint creation in article descriptions [deployment-charts] - https://gerrit.wikimedia.org/r/980002 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos)
[08:58:52] (Merged) jenkins-bot: ml-services: fix rest gateway endpoint creation in article descriptions [deployment-charts] - https://gerrit.wikimedia.org/r/980002 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos)
[08:59:39] !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:01:42] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:03:57] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[09:03:59] (PuppetFailure) resolved: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:04:23] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[09:05:48] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 58952
[09:06:42] (SystemdUnitFailed) resolved: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:06:44] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 58952
[09:10:31] (CR) Slyngshede: [C: +1] "Look good, much simpler."
[puppet] - https://gerrit.wikimedia.org/r/979897 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[09:12:11] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[09:12:26] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[09:12:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T348183)', diff saved to https://phabricator.wikimedia.org/P54147 and previous config saved to /var/cache/conftool/dbconfig/20231205-091232-arnaudb.json
[09:12:36] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[09:12:49] (PS5) Brouberol: Define a DNS A record for the dse k8s ingress gateway [dns] - https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639)
[09:16:41] (CR) Volans: [C: +1] "LGTM DNS and Netbox wise.
I'll leave to the service owners to review the procedure to setup the service and naming :)" [dns] - https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: Brouberol)
[09:19:06] (CR) Elukey: [C: +1] Define a DNS A record for the dse k8s ingress gateway [dns] - https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: Brouberol)
[09:19:38] (CR) Phuedx: Add stream config for *webuiactions via Metrics Platform (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298) (owner: Clare Ming)
[09:20:24] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:20:54] (CR) Phuedx: [C: +1] Define the corresponding stream for scroll (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: Kimberly Sarabia)
[09:20:56] (CR) Jgiannelos: [C: +2] Use zap for structured logs on tile pregeneration [software/tegola] (wmf/v0.19.x) - https://gerrit.wikimedia.org/r/978030 (https://phabricator.wikimedia.org/T347717) (owner: Jgiannelos)
[09:21:42] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:22:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T348183)', diff saved to https://phabricator.wikimedia.org/P54148 and previous config saved to /var/cache/conftool/dbconfig/20231205-092202-arnaudb.json
[09:22:06] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[09:22:15]
(Merged) jenkins-bot: Use zap for structured logs on tile pregeneration [software/tegola] (wmf/v0.19.x) - https://gerrit.wikimedia.org/r/978030 (https://phabricator.wikimedia.org/T347717) (owner: Jgiannelos)
[09:26:43] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:28:22] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:30:46] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:31:26] (CR) Brouberol: [C: +2] Define a DNS A record for the dse k8s ingress gateway [dns] - https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: Brouberol)
[09:32:12] (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:01] SRE-tools, Dumps-Generation, Infrastructure-Foundations, serviceops, and 2 others: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (Volans) Another datapoint for the mw*/parse* clusters, they will be migrated to be k8s hosts, that are su...
[09:35:42] RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:37:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P54149 and previous config saved to /var/cache/conftool/dbconfig/20231205-093709-arnaudb.json
[09:37:12] (SystemdUnitFailed) resolved: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:37:31] !log running authdns-update on dns1004.wikimedia.org - T352639
[09:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:34] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[09:42:23] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bookworm
[09:42:30] SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be1003.eqiad.wmnet with OS bookworm
[09:43:10] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:47:52] SRE-tools, Infrastructure-Foundations: spicerack.dnsdisc.Discovery should expose TTL - https://phabricator.wikimedia.org/T259875 (JMeybohm) I don't exactly recall thb. but I would imagine I wanted something like this in one of the pool/depool/service-route cookbooks to store the TTL, lower it, change wha...
[09:48:05] (03CR) 10Brouberol: [V: 03+1 C: 03+2] Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [09:48:44] 10SRE-tools, 10Infrastructure-Foundations: spicerack.dnsdisc.Discovery should expose TTL - https://phabricator.wikimedia.org/T259875 (10Volans) Ack, thanks for the info [09:51:02] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 63927 [09:52:16] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P54150 and previous config saved to /var/cache/conftool/dbconfig/20231205-095215-arnaudb.json [09:54:30] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 63927 [09:55:14] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:57:20] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage [10:02:07] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 15305 [10:02:24] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage [10:05:21] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15305 [10:05:49] (ConfdResourceFailed) firing: confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:07:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T348183)', diff saved to 
https://phabricator.wikimedia.org/P54151 and previous config saved to /var/cache/conftool/dbconfig/20231205-100722-arnaudb.json [10:07:24] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [10:07:27] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [10:07:34] RECOVERY - Check systemd state on ml-serve1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:38] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance [10:07:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T348183)', diff saved to https://phabricator.wikimedia.org/P54152 and previous config saved to /var/cache/conftool/dbconfig/20231205-100744-arnaudb.json [10:08:30] (03PS1) 10AikoChou: ml-services: update revert-risk images [deployment-charts] - 10https://gerrit.wikimedia.org/r/980345 (https://phabricator.wikimedia.org/T352181) [10:10:46] (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:12:12] RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 199, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:12:57] 10sre-alert-triage, 10Infrastructure-Foundations, 10netops: Alert in need of triage: BGP status (instance cr2-eqdfw) - https://phabricator.wikimedia.org/T351083 (10ayounsi) 05Open→03Resolved Deleted. 
[10:15:26] RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:15:47] (ConfdResourceFailed) firing: (3) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:17:17] (03CR) 10Klausman: [C: 03+1] ml-services: remove mlstaging ingress settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/979907 (owner: 10Elukey) [10:19:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T348183)', diff saved to https://phabricator.wikimedia.org/P54153 and previous config saved to /var/cache/conftool/dbconfig/20231205-101906-arnaudb.json [10:19:10] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [10:20:09] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be1003.eqiad.wmnet with OS bookworm [10:20:18] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be1003.eqiad.wmnet with OS bookworm completed: - moss-be1003 (**PASS**) - Downtimed on Icinga/Alertma... 
[10:20:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/979888 (https://phabricator.wikimedia.org/T350008) (owner: 10David Caro) [10:20:47] (ConfdResourceFailed) firing: (4) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:21:00] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1002.eqiad.wmnet with OS bookworm [10:21:06] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm [10:22:19] (03CR) 10Klausman: [C: 03+1] ml-services: update revert-risk images [deployment-charts] - 10https://gerrit.wikimedia.org/r/980345 (https://phabricator.wikimedia.org/T352181) (owner: 10AikoChou) [10:28:02] (03CR) 10Fabfur: [C: 03+1] "Looks good compared to I396792f6676415844ea29f3c3f656e8d2a77df1e" [puppet] - 10https://gerrit.wikimedia.org/r/980274 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:28:31] (03CR) 10Fabfur: [C: 03+1] "Looks good compared to I2aecbdfb71e1c51a61058d7eed66145899945600" [puppet] - 10https://gerrit.wikimedia.org/r/980273 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:28:53] (03CR) 10Fabfur: [C: 03+1] "Looks good compared to I5fb3fa081326475c355ecf251c712ae477bc2da1" [puppet] - 10https://gerrit.wikimedia.org/r/980272 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:31:24] (03PS1) 10Brouberol: Include the profile::lvs::realserver profile on the dse-k8s-roles [puppet] - 10https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639) [10:32:40] (03CR) 10Brouberol: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [10:33:53] (03PS23) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [10:34:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P54154 and previous config saved to /var/cache/conftool/dbconfig/20231205-103413-arnaudb.json [10:35:04] (03CR) 10AikoChou: [C: 03+2] ml-services: update revert-risk images [deployment-charts] - 10https://gerrit.wikimedia.org/r/980345 (https://phabricator.wikimedia.org/T352181) (owner: 10AikoChou) [10:35:59] (03Merged) 10jenkins-bot: ml-services: update revert-risk images [deployment-charts] - 10https://gerrit.wikimedia.org/r/980345 (https://phabricator.wikimedia.org/T352181) (owner: 10AikoChou) [10:36:35] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: update revert-risk images [deployment-charts] - 10https://gerrit.wikimedia.org/r/980345 (https://phabricator.wikimedia.org/T352181) (owner: 10AikoChou) [10:45:17] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [10:45:51] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [10:49:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P54155 and previous config saved to /var/cache/conftool/dbconfig/20231205-104919-arnaudb.json [10:53:04] (03PS1) 10Marostegui: dbproxy1023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/980350 (https://phabricator.wikimedia.org/T351864) [10:54:01] (03PS1) 10JMeybohm: CI: Update validate_envoy_config to use entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/980351 [10:54:17] (03CR) 10Marostegui: [C: 03+2] dbproxy1023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/980350 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui) [10:54:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS bookworm [10:55:05] (03PS1) 10Effie Mouzeli: scaffold: fix annoying tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/980352 [10:55:08] (03CR) 10JMeybohm: [C: 03+1] mw-jobrunner: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/980011 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [10:56:05] (03CR) 10JMeybohm: [C: 03+1] jobqueue: migrate a moderately weighty job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979396 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [10:59:28] (03PS3) 10Slyngshede: C:netbox switch Netbox-Next to use plain OIDC [puppet] - 10https://gerrit.wikimedia.org/r/978479 (https://phabricator.wikimedia.org/T351950) [11:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1100) [11:00:06] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:00:21] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:01:32] (03CR) 10Slyngshede: C:netbox switch Netbox-Next to use plain OIDC (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/978479 (https://phabricator.wikimedia.org/T351950) (owner: 10Slyngshede) [11:02:20] (03PS2) 10Effie Mouzeli: scaffold: minor fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/980352 [11:02:33] (03CR) 10Hnowlan: CI: Update validate_envoy_config to use entrypoint (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980351 (owner: 10JMeybohm) [11:02:48] (03PS3) 10Effie Mouzeli: scaffold: minor fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/980352 [11:02:52] !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-be1002.eqiad.wmnet with OS bookworm [11:03:04] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm executed with errors: - moss-be1002 (**FAIL**) - Downtimed on Ici... 
[11:03:18] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/821/con" [puppet] - 10https://gerrit.wikimedia.org/r/978479 (https://phabricator.wikimedia.org/T351950) (owner: 10Slyngshede) [11:03:26] (03PS2) 10Brouberol: Include the profile::lvs::realserver profile on the dse-k8s-roles [puppet] - 10https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639) [11:03:58] (03PS1) 10Brouberol: Remove the inference realserver pool from the dse cluster [puppet] - 10https://gerrit.wikimedia.org/r/980353 (https://phabricator.wikimedia.org/T352639) [11:04:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T348183)', diff saved to https://phabricator.wikimedia.org/P54156 and previous config saved to /var/cache/conftool/dbconfig/20231205-110426-arnaudb.json [11:04:28] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [11:04:30] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [11:04:30] (03CR) 10Hnowlan: [C: 03+2] mw-jobrunner: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/980011 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:04:42] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance [11:04:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T348183)', diff saved to https://phabricator.wikimedia.org/P54157 and previous config saved to /var/cache/conftool/dbconfig/20231205-110448-arnaudb.json [11:05:28] (03PS2) 10Brouberol: Remove the inference realserver pool from the dse cluster [puppet] - 10https://gerrit.wikimedia.org/r/980353 (https://phabricator.wikimedia.org/T352639) [11:05:31] (03Merged) 
10jenkins-bot: mw-jobrunner: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/980011 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:06:21] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 224, down: 2, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:06:31] (03PS2) 10Hnowlan: jobqueue: migrate a moderately weighty job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979396 (https://phabricator.wikimedia.org/T349796) [11:06:58] (03PS3) 10Brouberol: Include the profile::lvs::realserver profile on the dse-k8s-roles [puppet] - 10https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639) [11:07:06] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/980353 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [11:07:21] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP on text|secondary LVS in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/980274 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [11:07:35] (03CR) 10Vgutierrez: [V: 03+1] hiera: Enable IPIP on text|secondary LVS in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/980274 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [11:07:37] !log hnowlan@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [11:07:41] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Disable rp_filter for ncredir@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/980272 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [11:07:43] (03CR) 10Brouberol: [C: 03+2] Remove the inference realserver pool from the dse cluster [puppet] - 10https://gerrit.wikimedia.org/r/980353 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [11:07:50] !log hnowlan@deploy2002 helmfile [eqiad] [main] DONE 
helmfile.d/services/mw-jobrunner : sync [11:08:00] !log hnowlan@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [11:08:10] !log hnowlan@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [11:08:54] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1023.eqiad.wmnet with reason: host reimage [11:08:56] brouberol: ok to merge Brouberol: Remove the inference realserver pool from the dse cluster (84d1f24443) :? [11:09:14] (03CR) 10JMeybohm: CI: Update validate_envoy_config to use entrypoint (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980351 (owner: 10JMeybohm) [11:10:02] yes vgutierrez [11:10:41] (03CR) 10Hnowlan: [C: 03+2] jobqueue: migrate a moderately weighty job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979396 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:10:58] brouberol: done [11:11:28] (03Merged) 10jenkins-bot: jobqueue: migrate a moderately weighty job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979396 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [11:11:36] (03CR) 10Jelto: [C: 04-1] "one comment in-line" [puppet] - 10https://gerrit.wikimedia.org/r/979169 (https://phabricator.wikimedia.org/T348392) (owner: 10Dzahn) [11:11:43] (03PS2) 10JMeybohm: CI: Update validate_envoy_config to use entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/980351 [11:12:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1023.eqiad.wmnet with reason: host reimage [11:12:34] (03PS1) 10Peter Fischer: Bump version for cirrus-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/980354 [11:14:49] (03CR) 10Hnowlan: [C: 03+1] CI: Update validate_envoy_config to use entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/980351 (owner: 10JMeybohm) [11:15:31] !log hnowlan@deploy2002 helmfile [staging] START 
helmfile.d/services/changeprop-jobqueue: apply [11:15:48] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply [11:16:13] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [11:16:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T348183)', diff saved to https://phabricator.wikimedia.org/P54158 and previous config saved to /var/cache/conftool/dbconfig/20231205-111625-arnaudb.json [11:16:29] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [11:16:32] (03PS1) 10Urbanecm: User impact: update quantizeViews to process small series of view data [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979698 (https://phabricator.wikimedia.org/T352349) [11:16:37] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [11:16:38] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [11:17:03] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [11:17:34] (03CR) 10DCausse: [C: 03+1] Bump version for cirrus-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/980354 (owner: 10Peter Fischer) [11:20:26] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:20:58] (03CR) 10DCausse: [C: 03+2] Bump version for cirrus-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/980354 (owner: 10Peter Fischer) [11:21:40] (03CR) 10CI reject: [V: 04-1] CI: Update validate_envoy_config to use entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/980351 (owner: 10JMeybohm) [11:21:47] (03Merged) 10jenkins-bot: Bump version for 
cirrus-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/980354 (owner: 10Peter Fischer) [11:23:15] (03CR) 10David Caro: [V: 03+1 C: 03+2] codfw1dev: add smart_hosts [puppet] - 10https://gerrit.wikimedia.org/r/979888 (https://phabricator.wikimedia.org/T350008) (owner: 10David Caro) [11:24:07] (03CR) 10Clément Goubert: [C: 03+1] mw-api-int: increase replicas by 30% [deployment-charts] - 10https://gerrit.wikimedia.org/r/980032 (owner: 10Kamila Součková) [11:24:45] (03PS1) 10Urbanecm: Growth: Enable Welcome survey user research for ar/en/es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980357 (https://phabricator.wikimedia.org/T351266) [11:27:56] (03PS3) 10EoghanGaffney: [admin] Add user account for xiaoxiao to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/979389 (https://phabricator.wikimedia.org/T352098) [11:27:58] (03PS4) 10EoghanGaffney: [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918) [11:28:00] (03PS1) 10EoghanGaffney: [admin] Add ehughes ldap account [puppet] - 10https://gerrit.wikimedia.org/r/980358 (https://phabricator.wikimedia.org/T351387) [11:28:09] jouncebot: nowandnext [11:28:09] For the next 0 hour(s) and 31 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1100) [11:28:09] In 1 hour(s) and 31 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1300) [11:28:39] (03PS3) 10JMeybohm: CI: Update validate_envoy_config to use entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/980351 [11:29:48] (03CR) 10Ladsgroup: [C: 03+2] Bump ParserCache TTL back to 30 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: 10Ladsgroup) [11:30:01] (03PS1) 10Gmodena: mw-page-content-enrich: version bump. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/980359 (https://phabricator.wikimedia.org/T345806) [11:30:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: 10Ladsgroup) [11:30:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1023.eqiad.wmnet with OS bookworm [11:30:31] (03Merged) 10jenkins-bot: Bump ParserCache TTL back to 30 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: 10Ladsgroup) [11:30:43] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:979920|Bump ParserCache TTL back to 30 days (T280604)]] [11:30:47] T280604: Post-deployment: (partly) ramp parser cache retention back up - https://phabricator.wikimedia.org/T280604 [11:31:23] (03CR) 10Clément Goubert: [C: 03+1] Move mw api servers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/980040 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [11:31:32] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P54159 and previous config saved to /var/cache/conftool/dbconfig/20231205-113132-arnaudb.json [11:32:01] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:979920|Bump ParserCache TTL back to 30 days (T280604)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:32:35] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [11:32:44] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [11:33:24] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:36:32] (03CR) 10Clément Goubert: [C: 03+1] Move mw api servers to kubernetes workers [puppet] - 
10https://gerrit.wikimedia.org/r/980039 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [11:37:07] (03PS1) 10Marostegui: Revert "dbproxy1023: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/979699 [11:37:24] (03Abandoned) 10Elukey: ml-services: remove mlstaging ingress settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/979907 (owner: 10Elukey) [11:38:12] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1023: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/979699 (owner: 10Marostegui) [11:38:17] (03CR) 10Kamila Součková: [C: 03+2] mw-api-int: increase replicas by 30% [deployment-charts] - 10https://gerrit.wikimedia.org/r/980032 (owner: 10Kamila Součková) [11:38:30] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:979920|Bump ParserCache TTL back to 30 days (T280604)]] (duration: 07m 47s) [11:38:34] T280604: Post-deployment: (partly) ramp parser cache retention back up - https://phabricator.wikimedia.org/T280604 [11:39:13] (03Merged) 10jenkins-bot: mw-api-int: increase replicas by 30% [deployment-charts] - 10https://gerrit.wikimedia.org/r/980032 (owner: 10Kamila Součková) [11:40:17] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [11:40:18] (03Abandoned) 10Effie Mouzeli: scaffold: minor fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/980352 (owner: 10Effie Mouzeli) [11:40:30] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [11:40:41] !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [11:40:54] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [11:42:20] (03CR) 10Kamila Součková: [C: 03+2] mobileapps: 45% to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/976221 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [11:42:22] (03PS1) 10Peter Fischer: Add page-rerender-stream config to cirrus-streaming-updater 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/980360 [11:44:36] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the production Swift cluster - https://phabricator.wikimedia.org/T350020 (10jcrespo) a:05jcrespo→03None [11:44:56] (03PS1) 10Jcrespo: Prepare for 0.3.0 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/980362 (https://phabricator.wikimedia.org/T352655) [11:45:38] (03CR) 10Jcrespo: [C: 03+2] Implement batch deletion, restoration and query of files [software/mediabackups] - 10https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655) (owner: 10Jcrespo) [11:45:45] (03CR) 10Jcrespo: [C: 03+2] Prepare for 0.3.0 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/980362 (https://phabricator.wikimedia.org/T352655) (owner: 10Jcrespo) [11:46:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P54160 and previous config saved to /var/cache/conftool/dbconfig/20231205-114638-arnaudb.json [11:48:44] (03CR) 10Clément Goubert: mcrouter: add chart (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:50:14] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:netbox switch Netbox-Next to use plain OIDC [puppet] - 10https://gerrit.wikimedia.org/r/978479 (https://phabricator.wikimedia.org/T351950) (owner: 10Slyngshede) [11:50:40] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4049.ulsfo.wmnet [11:51:09] (03CR) 10Kamila Součková: [C: 03+2] Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/980039 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [11:51:19] (03CR) 10Kamila Součková: [C: 03+2] Move mw api servers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/980040 
(https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [11:51:36] !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:51:39] (03CR) 10DCausse: [C: 03+2] Add page-rerender-stream config to cirrus-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/980360 (owner: 10Peter Fischer) [11:51:55] !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:51:58] (03Merged) 10jenkins-bot: Move mw api servers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/980040 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [11:52:15] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [11:52:24] (03Merged) 10jenkins-bot: Add page-rerender-stream config to cirrus-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/980360 (owner: 10Peter Fischer) [11:52:40] (03PS1) 10Muehlenhoff: Switch cp4049 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980364 (https://phabricator.wikimedia.org/T349619) [11:53:22] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [11:54:45] (03CR) 10Muehlenhoff: [C: 03+2] Switch cp4049 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980364 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:01:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4049.ulsfo.wmnet [12:01:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T348183)', diff saved to https://phabricator.wikimedia.org/P54161 and previous config saved to /var/cache/conftool/dbconfig/20231205-120145-arnaudb.json [12:01:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [12:01:50] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - 
https://phabricator.wikimedia.org/T348183 [12:02:00] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance [12:02:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T348183)', diff saved to https://phabricator.wikimedia.org/P54162 and previous config saved to /var/cache/conftool/dbconfig/20231205-120206-arnaudb.json [12:02:17] (03PS8) 10Vgutierrez: traffic: Alert on configured and observed MSS mismatch [alerts] - 10https://gerrit.wikimedia.org/r/980280 (https://phabricator.wikimedia.org/T351069) [12:02:45] 10SRE, 10Traffic, 10Patch-For-Review: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (10CodeReviewBot) fabfur merged https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/1 Basic retry mechanism for specific kafka errors [12:04:11] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4039.ulsfo.wmnet [12:05:32] PROBLEM - Check systemd state on kubernetes2060 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:06:49] !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw2423.codfw.wmnet with OS bullseye [12:07:00] RECOVERY - Check systemd state on kubernetes2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:07:04] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:07:15] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:09:11] (03PS1) 10Muehlenhoff: Switch cp4039 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980365 (https://phabricator.wikimedia.org/T349619) [12:10:32] !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw2424.codfw.wmnet with OS bullseye [12:10:44] 10SRE, 
10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (10cmooney) [12:11:05] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (10cmooney) p:05Triage→03Medium [12:11:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T348183)', diff saved to https://phabricator.wikimedia.org/P54163 and previous config saved to /var/cache/conftool/dbconfig/20231205-121121-arnaudb.json [12:11:25] (03CR) 10JMeybohm: [C: 03+2] CI: Update validate_envoy_config to use entrypoint (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980351 (owner: 10JMeybohm) [12:11:28] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [12:13:12] (03CR) 10Muehlenhoff: [C: 03+2] Switch cp4039 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980365 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:16:17] (03PS1) 10Kosta Harlan: Add maintenance script to import existing files to scan table [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979700 (https://phabricator.wikimedia.org/T350863) [12:16:28] (03PS1) 10Kosta Harlan: Only allow drawing and bitmap media types to be scanned [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979701 (https://phabricator.wikimedia.org/T352234) [12:16:51] (03PS4) 10Brouberol: Include the profile::lvs::realserver profile on the dse-k8s-roles [puppet] - 10https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639) [12:17:02] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [12:18:02] !log 
kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw2434.codfw.wmnet with OS bullseye [12:18:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4039.ulsfo.wmnet [12:19:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:20:39] (03CR) 10Brouberol: [C: 03+2] Include the profile::lvs::realserver profile on the dse-k8s-roles [puppet] - 10https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [12:21:14] (03Merged) 10jenkins-bot: CI: Update validate_envoy_config to use entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/980351 (owner: 10JMeybohm) [12:22:06] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4051.ulsfo.wmnet [12:23:23] (03PS1) 10Muehlenhoff: Switch cp4051 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980367 (https://phabricator.wikimedia.org/T349619) [12:23:50] !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2423.codfw.wmnet with reason: host reimage [12:24:38] !log installing unbound bugfix updates from Bookworm point release [12:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [12:25:23] !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw2435.codfw.wmnet with OS bullseye [12:26:19] !log kamila@deploy2002 helmfile [codfw] START 
helmfile.d/services/mobileapps: apply [12:26:20] (03CR) 10Muehlenhoff: [C: 03+2] Switch cp4051 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980367 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:26:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P54164 and previous config saved to /var/cache/conftool/dbconfig/20231205-122628-arnaudb.json [12:26:44] PROBLEM - Check systemd state on kubernetes1039 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:26:45] (03PS1) 10Brouberol: Switch the k8s-ingress-dse LVS service in lvs_setup state [puppet] - 10https://gerrit.wikimedia.org/r/980368 (https://phabricator.wikimedia.org/T352639) [12:26:51] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2423.codfw.wmnet with reason: host reimage [12:27:03] !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [12:27:21] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [12:27:47] (03PS2) 10Brouberol: Switch the k8s-ingress-dse LVS service in lvs_setup state [puppet] - 10https://gerrit.wikimedia.org/r/980368 (https://phabricator.wikimedia.org/T352639) [12:28:24] (03PS1) 10Hnowlan: mw-jobrunner: bump replicas further [deployment-charts] - 10https://gerrit.wikimedia.org/r/980369 (https://phabricator.wikimedia.org/T349796) [12:28:33] !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2424.codfw.wmnet with reason: host reimage [12:28:34] !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw1463.eqiad.wmnet with OS bullseye [12:29:20] (03CR) 10Clément Goubert: [C: 03+1] mw-jobrunner: bump replicas further [deployment-charts] - 10https://gerrit.wikimedia.org/r/980369 
(https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [12:30:15] (03CR) 10Btullis: [C: 03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/980368 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [12:30:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4051.ulsfo.wmnet [12:30:56] RECOVERY - Check systemd state on kubernetes1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:30:56] RECOVERY - Check systemd state on mw2261 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:31:55] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2424.codfw.wmnet with reason: host reimage [12:32:06] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4042.ulsfo.wmnet [12:32:47] (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got worse - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [12:32:59] (03PS1) 10Ladsgroup: Set migration of pagelinks on large wikis of s5 to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980370 (https://phabricator.wikimedia.org/T351237) [12:33:35] jouncebot: nowandnext [12:33:35] No deployments scheduled for the next 0 hour(s) and 26 minute(s) [12:33:35] In 0 hour(s) and 26 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1300) [12:33:43] (03CR) 10Ladsgroup: [C: 03+2] Set migration of pagelinks on large wikis of s5 to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980370 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [12:34:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980370 
(https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [12:34:26] (03Merged) 10jenkins-bot: Set migration of pagelinks on large wikis of s5 to read new [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980370 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [12:34:42] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:980370|Set migration of pagelinks on large wikis of s5 to read new (T351237)]] [12:34:50] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [12:36:15] !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2434.codfw.wmnet with reason: host reimage [12:37:13] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:980370|Set migration of pagelinks on large wikis of s5 to read new (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:37:47] (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [12:38:45] (03PS1) 10Muehlenhoff: Switch cp4042 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980371 (https://phabricator.wikimedia.org/T349619) [12:39:41] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2434.codfw.wmnet with reason: host reimage [12:40:16] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [12:40:33] (03CR) 10Muehlenhoff: [C: 03+2] Switch cp4042 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980371 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:41:09] !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1463.eqiad.wmnet with reason: host reimage [12:41:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P54165 and previous config saved to 
/var/cache/conftool/dbconfig/20231205-124134-arnaudb.json [12:42:34] !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2435.codfw.wmnet with reason: host reimage [12:44:30] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1463.eqiad.wmnet with reason: host reimage [12:45:02] !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [12:45:55] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2423.codfw.wmnet with OS bullseye [12:45:56] !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [12:47:13] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:980370|Set migration of pagelinks on large wikis of s5 to read new (T351237)]] (duration: 12m 30s) [12:47:16] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [12:47:30] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2435.codfw.wmnet with reason: host reimage [12:50:17] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1076.eqiad.wmnet with OS bullseye [12:50:24] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1076.eqiad.wmnet with OS bullseye [12:50:34] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2424.codfw.wmnet with OS bullseye [12:52:47] (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [12:53:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4042.ulsfo.wmnet [12:54:28] (03PS1) 10Btullis: Bring 
an-coord1004 into service as a hadoop coordinator [puppet] - 10https://gerrit.wikimedia.org/r/980372 (https://phabricator.wikimedia.org/T336045) [12:55:29] (03PS2) 10Btullis: Bring an-coord1004 into service as a hadoop coordinator [puppet] - 10https://gerrit.wikimedia.org/r/980372 (https://phabricator.wikimedia.org/T336045) [12:56:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T348183)', diff saved to https://phabricator.wikimedia.org/P54167 and previous config saved to /var/cache/conftool/dbconfig/20231205-125641-arnaudb.json [12:56:43] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:56:45] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [12:56:58] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance [12:57:47] (Traffic bill over quota) resolved: (3) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [12:57:54] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: redis::misc::slave [12:58:23] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/978665 (owner: 10PipelineBot) [12:58:31] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2434.codfw.wmnet with OS bullseye [12:59:18] !log cmooney@cumin2002 START - Cookbook sre.dns.netbox [12:59:32] (03Merged) 10jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/978665 (owner: 10PipelineBot) [12:59:37] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/823/con" [puppet] 
- 10https://gerrit.wikimedia.org/r/980372 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1300) [13:00:40] PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:00:52] (03PS1) 10Jbond: installserver: add spec test for role [puppet] - 10https://gerrit.wikimedia.org/r/980375 [13:00:54] (03PS1) 10Jbond: installserver: test spec test fires [puppet] - 10https://gerrit.wikimedia.org/r/980376 [13:02:39] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1463.eqiad.wmnet with OS bullseye [13:03:37] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [13:04:08] !log cmooney@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update entry for sretest2003. - cmooney@cumin2002" [13:04:12] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [13:04:16] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1076.eqiad.wmnet with reason: host reimage [13:04:34] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [13:04:49] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance [13:04:55] !log cmooney@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update entry for sretest2003. 
- cmooney@cumin2002" [13:04:55] !log cmooney@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:05:03] (03CR) 10Jbond: "instead if this i think you just need to create a spec file in the modules/role/spec/classes[1] or move the yaml file so that it is in mod" [puppet] - 10https://gerrit.wikimedia.org/r/979119 (owner: 10Brouberol) [13:05:55] (03PS1) 10EoghanGaffney: [gitlab] Update BroadcastMessage class to new namespace [puppet] - 10https://gerrit.wikimedia.org/r/980377 (https://phabricator.wikimedia.org/T352433) [13:06:06] (03CR) 10Jbond: "not necessarily against this but just wanted to point out we can do this in rspec (copying below the same comment i added to https://gerri" [puppet] - 10https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [13:06:07] !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw1464.eqiad.wmnet with OS bullseye [13:06:34] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2435.codfw.wmnet with OS bullseye [13:07:35] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1076.eqiad.wmnet with reason: host reimage [13:07:41] (03PS1) 10Muehlenhoff: Switch redis::misc::slave to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980385 (https://phabricator.wikimedia.org/T349619) [13:07:42] RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 51.12 ms [13:08:53] !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw1465.eqiad.wmnet with OS bullseye [13:08:58] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:09:48] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1079.eqiad.wmnet with OS bullseye [13:09:54] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 
10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1079.eqiad.wmnet with OS bullseye [13:10:11] (03CR) 10Muehlenhoff: [C: 03+2] Switch redis::misc::slave to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980385 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:10:36] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1078.eqiad.wmnet with OS bullseye [13:10:46] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1078.eqiad.wmnet with OS bullseye [13:11:00] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:12:20] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [13:12:34] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [13:12:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T348183)', diff saved to https://phabricator.wikimedia.org/P54168 and previous config saved to /var/cache/conftool/dbconfig/20231205-131240-arnaudb.json [13:12:56] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [13:14:11] !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw1470.eqiad.wmnet with OS bullseye [13:14:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: redis::misc::slave [13:16:30] PROBLEM - Uncommitted DNS changes in Netbox on 
netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:16:40] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [13:18:34] !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1464.eqiad.wmnet with reason: host reimage [13:21:28] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1464.eqiad.wmnet with reason: host reimage [13:21:34] !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1465.eqiad.wmnet with reason: host reimage [13:22:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T348183)', diff saved to https://phabricator.wikimedia.org/P54169 and previous config saved to /var/cache/conftool/dbconfig/20231205-132200-arnaudb.json [13:22:15] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [13:23:51] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1079.eqiad.wmnet with reason: host reimage [13:24:21] !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1078.eqiad.wmnet with reason: host reimage [13:24:44] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be1079.eqiad.wmnet with reason: host reimage [13:24:59] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1465.eqiad.wmnet with reason: host reimage [13:26:02] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [13:26:42] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted 
DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:26:45] !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1470.eqiad.wmnet with reason: host reimage [13:27:07] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [13:27:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1076.eqiad.wmnet with OS bullseye [13:27:19] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1076.eqiad.wmnet with OS bullseye completed: - ms-be... [13:27:46] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1078.eqiad.wmnet with reason: host reimage [13:30:39] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1470.eqiad.wmnet with reason: host reimage [13:32:34] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (10cmooney) [13:34:12] PROBLEM - Check systemd state on ms-be1079 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:36:50] PROBLEM - Host ms-be1079 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P54171 and previous config saved to /var/cache/conftool/dbconfig/20231205-133706-arnaudb.json [13:38:16] RECOVERY - Host ms-be1079 is UP: PING OK - Packet loss = 0%, RTA = 5.51 ms 
[13:38:28] PROBLEM - Check systemd state on ms-be1079 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:56] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1464.eqiad.wmnet with OS bullseye [13:41:17] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4048.ulsfo.wmnet [13:43:08] RECOVERY - Check systemd state on ms-be1079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:43:42] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1465.eqiad.wmnet with OS bullseye [13:44:02] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [13:44:12] (03PS1) 10Muehlenhoff: Switch cp4048 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980386 (https://phabricator.wikimedia.org/T349619) [13:44:53] (03CR) 10Muehlenhoff: [C: 03+2] Switch cp4048 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980386 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [13:48:03] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1470.eqiad.wmnet with OS bullseye [13:48:07] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [13:48:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1079.eqiad.wmnet with OS bullseye [13:48:17] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host 
ms-be1079.eqiad.wmnet with OS bullseye completed: - ms-be... [13:48:52] 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [13:48:58] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [13:50:17] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001" [13:50:22] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1078.eqiad.wmnet with OS bullseye [13:50:26] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:28] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1078.eqiad.wmnet with OS bullseye completed: - ms-be... [13:51:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4048.ulsfo.wmnet [13:52:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P54172 and previous config saved to /var/cache/conftool/dbconfig/20231205-135213-arnaudb.json [13:53:46] (03CR) 10Jelto: [C: 03+1] "lgtm, thanks for the quick fix!" 
[puppet] - 10https://gerrit.wikimedia.org/r/980377 (https://phabricator.wikimedia.org/T352433) (owner: 10EoghanGaffney) [13:54:29] (03CR) 10EoghanGaffney: [C: 03+2] [gitlab] Update BroadcastMessage class to new namespace [puppet] - 10https://gerrit.wikimedia.org/r/980377 (https://phabricator.wikimedia.org/T352433) (owner: 10EoghanGaffney) [13:55:22] (03CR) 10D3r1ck01: "Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [13:55:47] (03PS1) 10Elukey: services: upgrade recommendation-api's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/980391 (https://phabricator.wikimedia.org/T349118) [13:58:01] (03PS1) 10Filippo Giunchedi: pontoon: enable remote syslog in o11y [puppet] - 10https://gerrit.wikimedia.org/r/980392 [13:59:00] (03CR) 10Jforrester: [C: 03+1] "LGTM! Is there a runbook for deploying this and knowing for sure the service is still working? If so, happy to deploy it for you." [deployment-charts] - 10https://gerrit.wikimedia.org/r/980391 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey) [14:00:08] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1400). [14:00:08] Urbanecm and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:36] I can deploy (unless kostajh wants to lead the window?) 
[14:00:39] (03CR) 10Elukey: services: upgrade recommendation-api's Docker image (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980391 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey) [14:00:58] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: enable remote syslog in o11y [puppet] - 10https://gerrit.wikimedia.org/r/980392 (owner: 10Filippo Giunchedi) [14:01:02] hi [14:01:04] PROBLEM - WDQS SPARQL on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:01:09] urbanecm: would be happy for you to deploy them [14:01:10] (03CR) 10Urbanecm: [C: 03+2] User impact: update quantizeViews to process small series of view data [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979698 (https://phabricator.wikimedia.org/T352349) (owner: 10Urbanecm) [14:01:11] (03CR) 10Jforrester: [C: 03+1] services: upgrade recommendation-api's Docker image (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980391 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey) [14:01:22] PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string http://www.w3.org/2001/XML... 
not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 428 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:01:24] (03CR) 10Btullis: [V: 03+1 C: 03+2] Bring an-coord1004 into service as a hadoop coordinator [puppet] - 10https://gerrit.wikimedia.org/r/980372 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [14:01:28] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:01:32] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1014.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1007.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs10 [14:01:32] .wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:01:52] kostajh: i am a bit confused by your message. do you want me to deploy, or do you want to? 
[14:02:00] (PuppetZeroResources) firing: Puppet has failed generate resources on mw1471:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:02:07] urbanecm: if you can deploy, that would be great [14:02:11] sure, no problem [14:02:18] (03CR) 10Urbanecm: [C: 03+2] Add maintenance script to import existing files to scan table [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979700 (https://phabricator.wikimedia.org/T350863) (owner: 10Kosta Harlan) [14:02:20] (03CR) 10Urbanecm: [C: 03+2] Only allow drawing and bitmap media types to be scanned [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979701 (https://phabricator.wikimedia.org/T352234) (owner: 10Kosta Harlan) [14:02:28] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1015.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1014.eqiad.wmnet, wdqs10 [14:02:28] .wmnet, wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:03:00] (03CR) 10Elukey: [C: 03+2] services: upgrade recommendation-api's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/980391 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey) [14:03:00] i assume you'll handle running the script after the deployment, right? 
[14:03:09] (03PS2) 10Urbanecm: Growth: Enable Welcome survey user research for ar/en/es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980357 (https://phabricator.wikimedia.org/T351266) [14:03:14] (03CR) 10Urbanecm: [C: 03+2] Growth: Enable Welcome survey user research for ar/en/es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980357 (https://phabricator.wikimedia.org/T351266) (owner: 10Urbanecm) [14:03:28] urbanecm: not immediately, probably tomorrow [14:03:29] !log installing cups security updates [14:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:37] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980357 (https://phabricator.wikimedia.org/T351266) (owner: 10Urbanecm) [14:03:48] fine with me, just double checking i'm not supposed to run it :) [14:03:55] (03Merged) 10jenkins-bot: Growth: Enable Welcome survey user research for ar/en/es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980357 (https://phabricator.wikimedia.org/T351266) (owner: 10Urbanecm) [14:04:10] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:980357|Growth: Enable Welcome survey user research for ar/en/es (T351266)]] [14:04:12] PROBLEM - WDQS SPARQL on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:04:16] T351266: enable the T342353 checkbox on the Welcome Survey allowing new account holders to consent to being contacted for design research - https://phabricator.wikimedia.org/T351266 [14:04:30] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:04:40] PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:05:48] RECOVERY - WDQS 
SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:05:59] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: sync [14:06:14] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: sync [14:06:27] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:980357|Growth: Enable Welcome survey user research for ar/en/es (T351266)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:07:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T348183)', diff saved to https://phabricator.wikimedia.org/P54173 and previous config saved to /var/cache/conftool/dbconfig/20231205-140720-arnaudb.json [14:07:22] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:07:24] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [14:07:24] !log urbanecm@deploy2002 urbanecm: Continuing with sync [14:07:36] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance [14:07:43] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T348183)', diff saved to https://phabricator.wikimedia.org/P54174 and previous config saved to /var/cache/conftool/dbconfig/20231205-140742-arnaudb.json [14:08:28] RECOVERY - WDQS SPARQL on wdqs1015 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:09:14] 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T352731 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact. 
[14:10:20] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:10:22] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 0.358 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:11:16] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:12:44] RECOVERY - WDQS SPARQL on wdqs1014 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:13:43] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:980357|Growth: Enable Welcome survey user research for ar/en/es (T351266)]] (duration: 09m 33s) [14:13:47] T351266: enable the T342353 checkbox on the Welcome Survey allowing new account holders to consent to being contacted for design research - https://phabricator.wikimedia.org/T351266 [14:14:17] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979698 (https://phabricator.wikimedia.org/T352349) (owner: Urbanecm) [14:14:21] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979700 (https://phabricator.wikimedia.org/T350863) (owner: Kosta Harlan) [14:14:25] (CR) TrainBranchBot: [C: +2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979701 (https://phabricator.wikimedia.org/T352234) (owner: Kosta Harlan) [14:14:27] Growth config change done. 
waiting on the backports to merge [14:14:36] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:15:12] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:43] 10SRE, 10serviceops, 10API Platform (RESTbase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Jdforrester-WMF) [14:15:56] 10SRE, 10ChangeProp, 10EventStreams, 10Image-Suggestion-API, and 4 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF) [14:15:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:16:04] (03PS1) 10Slyngshede: Package for Debian [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 [14:16:08] RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:17:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T348183)', diff saved to https://phabricator.wikimedia.org/P54175 and previous config saved to /var/cache/conftool/dbconfig/20231205-141701-arnaudb.json [14:17:06] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [14:17:19] (03PS2) 10Slyngshede: Package for Debian [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 [14:18:11] 
(03PS3) 10Slyngshede: Package for Debian [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 [14:19:20] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:20:27] (03CR) 10Volans: "Wouldn't be easier to have a debian branch for the debian/ directory like we do for some of our projects?" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 (owner: 10Slyngshede) [14:20:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:21:02] (ConfdResourceFailed) firing: (4) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:21:24] (03CR) 10Muehlenhoff: Package for Debian (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 (owner: 10Slyngshede) [14:21:37] almost there... [14:22:49] (03PS1) 10Jgiannelos: wikifeeds: Configure http client to use service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/980398 [14:22:52] PROBLEM - cassandra-c service on restbase2028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:23:28] (03Merged) 10jenkins-bot: User impact: update quantizeViews to process small series of view data [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979698 (https://phabricator.wikimedia.org/T352349) (owner: 10Urbanecm) [14:23:28] (03Merged) 10jenkins-bot: Add maintenance script to import existing files to scan table [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979700 (https://phabricator.wikimedia.org/T350863) (owner: 10Kosta Harlan) [14:23:34] here we go :) [14:23:37] (03Merged) 10jenkins-bot: Only allow drawing and bitmap media types to be scanned [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - 
10https://gerrit.wikimedia.org/r/979701 (https://phabricator.wikimedia.org/T352234) (owner: 10Kosta Harlan) [14:23:51] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:979698|User impact: update quantizeViews to process small series of view data (T352349)]], [[gerrit:979700|Add maintenance script to import existing files to scan table (T350863)]], [[gerrit:979701|Only allow drawing and bitmap media types to be scanned (T352234)]] [14:23:57] T352349: Impact Module: Views on articles you've edited graph - https://phabricator.wikimedia.org/T352349 [14:23:57] T350863: Create maintenance script to import all existing images to mediamoderation_scan table - https://phabricator.wikimedia.org/T350863 [14:23:58] T352234: MediaModerationFileProcessor::canScanFile incorrectly registers audio and video files as scannable when TimedMediaHandler extension is installed - https://phabricator.wikimedia.org/T352234 [14:24:30] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ceph2002.mgmt.codfw.wmnet with reboot policy FORCED [14:25:08] !log urbanecm@deploy2002 kharlan and urbanecm: Backport for [[gerrit:979698|User impact: update quantizeViews to process small series of view data (T352349)]], [[gerrit:979700|Add maintenance script to import existing files to scan table (T350863)]], [[gerrit:979701|Only allow drawing and bitmap media types to be scanned (T352234)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:25:27] kostajh: fyi ^^, not sure if the other one is testable. [14:25:52] urbanecm: not testable, really, until we run them for real [14:25:58] makes sense. 
[14:25:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [14:25:58] so just syncing would be great, thank you. [14:26:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ceph2002.mgmt.codfw.wmnet with reboot policy FORCED [14:26:03] Growth's fix works, so syncing [14:26:05] !log urbanecm@deploy2002 kharlan and urbanecm: Continuing with sync [14:27:22] (03CR) 10Volans: Package for Debian (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 (owner: 10Slyngshede) [14:27:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ceph2002.mgmt.codfw.wmnet with reboot policy FORCED [14:29:37] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ceph2002'] [14:30:07] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: redis::misc::master [14:30:54] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) [14:30:59] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) fixed ceph2002 [14:31:57] (03PS1) 10Muehlenhoff: Switch redis::misc::master to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980400 (https://phabricator.wikimedia.org/T349619) [14:32:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P54176 and previous config saved to /var/cache/conftool/dbconfig/20231205-143207-arnaudb.json [14:32:47] !log urbanecm@deploy2002 
Finished scap: Backport for [[gerrit:979698|User impact: update quantizeViews to process small series of view data (T352349)]], [[gerrit:979700|Add maintenance script to import existing files to scan table (T350863)]], [[gerrit:979701|Only allow drawing and bitmap media types to be scanned (T352234)]] (duration: 08m 55s) [14:32:52] and done [14:32:55] anything else? :) [14:32:59] thanks! [14:33:02] T352349: Impact Module: Views on articles you've edited graph - https://phabricator.wikimedia.org/T352349 [14:33:03] T350863: Create maintenance script to import all existing images to mediamoderation_scan table - https://phabricator.wikimedia.org/T350863 [14:33:03] T352234: MediaModerationFileProcessor::canScanFile incorrectly registers audio and video files as scannable when TimedMediaHandler extension is installed - https://phabricator.wikimedia.org/T352234 [14:33:12] np [14:34:13] (CR) Muehlenhoff: [C: +2] Switch redis::misc::master to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980400 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [14:34:58] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (Jhancock.wm) [14:35:10] (PS1) Brouberol: Add discovery records for the k8s-ingress-dse LVS service [dns] - https://gerrit.wikimedia.org/r/980404 (https://phabricator.wikimedia.org/T352639) [14:35:17] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:35:20] (CR) Jelto: [C: -1] "looks mostly good, but some typo and nitpick comments in-line 😊" [puppet] - https://gerrit.wikimedia.org/r/979912 (owner: EoghanGaffney) [14:36:43] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ceph2002'] [14:38:46] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:39:00] !log 
jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: redis::misc::master [14:39:05] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:09] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:40:11] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:40:26] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [14:40:33] (03PS1) 10Elukey: services: use 127.0.0.1 instead of localhost for rec-api's mw host [deployment-charts] - 10https://gerrit.wikimedia.org/r/980407 (https://phabricator.wikimedia.org/T349118) [14:41:18] !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:41:32] !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:41:55] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:42:37] (03CR) 10Elukey: [C: 03+2] services: use 127.0.0.1 instead of localhost for rec-api's mw host [deployment-charts] - 10https://gerrit.wikimedia.org/r/980407 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey) [14:43:59] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sessionstore2004-6 to codfw - jhancock@cumin2002" [14:44:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sessionstore2004-6 to codfw - jhancock@cumin2002" [14:44:52] !log jhancock@cumin2002 END 
(PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:45:13] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: sync [14:45:22] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: sync [14:46:57] (03CR) 10Hnowlan: [C: 03+1] wikifeeds: Configure http client to use service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/980398 (owner: 10Jgiannelos) [14:47:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P54177 and previous config saved to /var/cache/conftool/dbconfig/20231205-144714-arnaudb.json [14:48:33] (03PS1) 10Cathal Mooney: Add new codfw per-rack vlans to lvs2014 and move row B vlans [puppet] - 10https://gerrit.wikimedia.org/r/980409 (https://phabricator.wikimedia.org/T352758) [14:50:07] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, and 2 others: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (10cmooney) [14:51:58] (03CR) 10Slyngshede: Package for Debian (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 (owner: 10Slyngshede) [14:52:18] jouncebot: nowandnext [14:52:18] For the next 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1400) [14:52:18] In 1 hour(s) and 7 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1600) [14:52:34] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4043.ulsfo.wmnet [14:52:58] (03CR) 10Brouberol: [C: 03+2] Switch the k8s-ingress-dse LVS service in lvs_setup state [puppet] - 10https://gerrit.wikimedia.org/r/980368 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [14:54:05] (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:45] !log adding k8s-ingress-dse backend to LVS - T352639 [14:54:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:49] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 [14:55:25] (03PS1) 10Muehlenhoff: Switch cp4043 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980410 (https://phabricator.wikimedia.org/T349619) [14:55:43] (03CR) 10Volans: Package for Debian (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 (owner: 10Slyngshede) [14:55:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sessionstore2004.mgmt.codfw.wmnet with reboot policy FORCED [14:55:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sessionstore2005.mgmt.codfw.wmnet with reboot policy FORCED [14:55:49] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sessionstore2006.mgmt.codfw.wmnet with reboot policy FORCED [14:57:06] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync [14:57:33] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync [14:57:56] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: sync [14:58:05] !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw1471.eqiad.wmnet with OS bullseye [14:58:21] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: sync [14:58:30] (03PS2) 10Cathal Mooney: Add new codfw per-rack vlans to lvs2014 and move row B vlans [puppet] - 10https://gerrit.wikimedia.org/r/980409 (https://phabricator.wikimedia.org/T352758) [14:59:39] (03CR) 10Muehlenhoff: Package for Debian (031 comment) [software/debmonitor] - 
https://gerrit.wikimedia.org/r/980397 (owner: Slyngshede) [14:59:51] (CR) Muehlenhoff: [C: +2] Switch cp4043 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980410 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff) [15:00:21] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:01:21] !log cgoubert@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[1018,1020].eqiad.wmnet} and A:lvs (T352639) [15:01:29] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 [15:01:35] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:01:54] ^that's me and brouberol [15:02:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T348183)', diff saved to https://phabricator.wikimedia.org/P54178 and previous config saved to /var/cache/conftool/dbconfig/20231205-150220-arnaudb.json [15:02:23] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [15:02:25] PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 129 connections established with conf1007.eqiad.wmnet:4001 (min=130) https://wikitech.wikimedia.org/wiki/PyBal [15:02:27] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:02:33] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [15:02:37] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance [15:02:44] !log 
arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T348183)', diff saved to https://phabricator.wikimedia.org/P54179 and previous config saved to /var/cache/conftool/dbconfig/20231205-150243-arnaudb.json [15:02:59] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: Configure http client to use service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/980398 (owner: 10Jgiannelos) [15:03:54] (03Merged) 10jenkins-bot: wikifeeds: Configure http client to use service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/980398 (owner: 10Jgiannelos) [15:04:31] (03PS1) 10Hnowlan: thumbor: disable expensive throttling [deployment-charts] - 10https://gerrit.wikimedia.org/r/980413 (https://phabricator.wikimedia.org/T337649) [15:04:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4043.ulsfo.wmnet [15:05:21] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Papaul) @BTullis where you able to add those nodes to partman-early-command.sh ? [15:05:28] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sessionstore2006.mgmt.codfw.wmnet with reboot policy FORCED [15:06:00] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [15:06:06] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [15:06:32] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10BTullis) >>! In T349934#9383342, @Papaul wrote: > @BTullis where you able to add those nodes to partman-early-command.sh ? Oh sorry, I missed the ping. I'll add t... 
[15:06:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sessionstore2004.mgmt.codfw.wmnet with reboot policy FORCED [15:07:25] (PS1) Jgiannelos: wikifeeds: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/980414 [15:07:27] SRE, Traffic, Patch-For-Review: Add version flag to purged - https://phabricator.wikimedia.org/T347839 (CodeReviewBot) fabfur closed https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/2 Draft: Add version print option [15:07:35] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sessionstore2005.mgmt.codfw.wmnet with reboot policy FORCED [15:07:47] (PS2) Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@drmrs [puppet] - https://gerrit.wikimedia.org/r/980273 (https://phabricator.wikimedia.org/T351069) [15:08:06] (CR) Hnowlan: [C: +1] wikifeeds: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/980414 (owner: Jgiannelos) [15:08:30] (CR) Jgiannelos: [C: +2] wikifeeds: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/980414 (owner: Jgiannelos) [15:09:26] (Merged) jenkins-bot: wikifeeds: Bump chart version [deployment-charts] - https://gerrit.wikimedia.org/r/980414 (owner: Jgiannelos) [15:10:31] !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1471.eqiad.wmnet with reason: host reimage [15:11:07] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [15:11:14] (CR) Alexandros Kosiaris: [C: +1] thumbor: disable expensive throttling [deployment-charts] - https://gerrit.wikimedia.org/r/980413 (https://phabricator.wikimedia.org/T337649) (owner: Hnowlan) [15:11:57] !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.loadbalancer.restart-pybal (exit_code=1) rolling-restart of pybal on P{lvs[1018,1020].eqiad.wmnet} and A:lvs (T352639) [15:12:03] !log jgiannelos@deploy2002 
helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [15:12:06] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 [15:12:07] (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable IPIP encapsulation on ncredir@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/980273 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [15:12:21] (03CR) 10Hnowlan: [C: 03+2] thumbor: disable expensive throttling [deployment-charts] - 10https://gerrit.wikimedia.org/r/980413 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [15:12:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T348183)', diff saved to https://phabricator.wikimedia.org/P54180 and previous config saved to /var/cache/conftool/dbconfig/20231205-151255-arnaudb.json [15:13:05] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [15:13:17] (03Merged) 10jenkins-bot: thumbor: disable expensive throttling [deployment-charts] - 10https://gerrit.wikimedia.org/r/980413 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [15:13:36] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1471.eqiad.wmnet with reason: host reimage [15:14:08] 10SRE, 10Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701 (10lmata) p:05Triage→03Low [15:15:04] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10BTullis) >>! In T349934#9383342, @Papaul wrote: > @BTullis where you able to add those nodes to partman-early-command.sh ? Oh, I'm so sorry. I've made a mistake w... 
[15:15:42] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply [15:15:49] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host aqs2001.codfw.wmnet [15:15:51] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:16:05] !log Manually restarting pybal on lvs1020 - T352639 [15:16:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:18] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:18:25] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/828/console" [puppet] - 10https://gerrit.wikimedia.org/r/952488 (https://phabricator.wikimedia.org/T321099) (owner: 10Jforrester) [15:18:43] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:18:44] (03PS1) 10Muehlenhoff: Switch aqs2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980416 (https://phabricator.wikimedia.org/T349619) [15:20:05] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [15:21:14] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:21:16] (03CR) 10JMeybohm: mcrouter: add helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [15:21:36] (03CR) 10Muehlenhoff: [C: 03+2] Switch aqs2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980416 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:22:14] !log Manually restarting pybal on lvs1019 - T352639 [15:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:22] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 [15:25:05] 10SRE, 10Incident Tooling: Bridge wikimediastatus.net to Mastodon - 
https://phabricator.wikimedia.org/T336701 (TheresNoTime) Atlassian statuspage has [[ https://support.atlassian.com/statuspage/docs/enable-webhook-notifications/ | webhook support ]].. that might be easier than RSS? [15:26:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host aqs2001.codfw.wmnet [15:27:52] SRE, ops-codfw, DC-Ops, serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (Jhancock.wm) attempted to provision sessionstore2004 on the new lsw switch. needs further attention. [15:28:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P54181 and previous config saved to /var/cache/conftool/dbconfig/20231205-152801-arnaudb.json [15:28:09] (CR) JMeybohm: [V: +1 C: +2] wikifunctions: Drop beta monitoring [puppet] - https://gerrit.wikimedia.org/r/952488 (https://phabricator.wikimedia.org/T321099) (owner: Jforrester) [15:28:20] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sessionstore2005.mgmt.codfw.wmnet with reboot policy FORCED [15:28:35] (CR) Clément Goubert: [C: +1] Add discovery records for the k8s-ingress-dse LVS service [dns] - https://gerrit.wikimedia.org/r/980404 (https://phabricator.wikimedia.org/T352639) (owner: Brouberol) [15:28:46] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2006'] [15:29:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['sessionstore2006'] [15:29:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sessionstore2005.mgmt.codfw.wmnet with reboot policy FORCED [15:29:33] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2005'] [15:29:46] !log jhancock@cumin2002 END 
(FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['sessionstore2005'] [15:29:48] (03PS1) 10Brouberol: Fix cluster conftool selector for the k8s-ingress-dse LVS service [puppet] - 10https://gerrit.wikimedia.org/r/980417 (https://phabricator.wikimedia.org/T352639) [15:29:57] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:30:09] (03CR) 10Clément Goubert: [C: 03+1] Fix cluster conftool selector for the k8s-ingress-dse LVS service [puppet] - 10https://gerrit.wikimedia.org/r/980417 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [15:31:17] (03CR) 10Brouberol: [C: 03+2] Fix cluster conftool selector for the k8s-ingress-dse LVS service [puppet] - 10https://gerrit.wikimedia.org/r/980417 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [15:31:35] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1471.eqiad.wmnet with OS bullseye [15:31:59] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Jhancock.wm) [15:32:20] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [15:35:47] (ConfdResourceFailed) firing: (4) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:39:01] !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1002.eqiad.wmnet with OS bookworm [15:39:07] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm [15:40:47] (ConfdResourceFailed) firing: (4) confd 
resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:41:04] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff) [15:42:31] !log Manually restarting pybal on lvs1020 - T352639 [15:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:35] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 [15:42:47] !log installing monitoring-plugins bugfix updates from Bookworm point release [15:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:52] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/979118 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [15:43:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P54182 and previous config saved to /var/cache/conftool/dbconfig/20231205-154308-arnaudb.json [15:44:13] RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 130 connections established with conf1007.eqiad.wmnet:4001 (min=130) https://wikitech.wikimedia.org/wiki/PyBal [15:44:15] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [15:45:03] Obviously now it's alerting because the backend isn't responding [15:45:06] joy [15:45:47] (ConfdResourceFailed) resolved: (4) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - 
https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:45:59] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4040.ulsfo.wmnet [15:46:33] claime: \m/ [15:47:11] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker1006.eqiad.wmnet, dse-k8s-worker1007.eqiad.wmnet, dse-k8s-worker1005.eqiad.wmnet, dse-k8s-worker1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:47:49] (03PS1) 10Muehlenhoff: Switch cp4040 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980420 (https://phabricator.wikimedia.org/T349619) [15:48:10] (03PS2) 10Aqu: Rewrite metrics sent by Airflow [puppet] - 10https://gerrit.wikimedia.org/r/979118 (https://phabricator.wikimedia.org/T349532) [15:48:34] (03CR) 10Muehlenhoff: [C: 03+2] Switch cp4040 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980420 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:49:44] !log sudo confctl select "service=kubesvc,cluster=dse-k8s" set/pooled=inactive - T352639 [15:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:48] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 [15:50:33] (03CR) 10Hnowlan: [C: 03+2] mw-jobrunner: bump replicas further [deployment-charts] - 10https://gerrit.wikimedia.org/r/980369 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [15:50:54] (03PS3) 10Aqu: Rewrite metrics sent by Airflow [puppet] - 10https://gerrit.wikimedia.org/r/979118 (https://phabricator.wikimedia.org/T349532) [15:51:07] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker1006.eqiad.wmnet, dse-k8s-worker1007.eqiad.wmnet, dse-k8s-worker1005.eqiad.wmnet, dse-k8s-worker1008.eqiad.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [15:51:46] (03Merged) 10jenkins-bot: mw-jobrunner: bump replicas further [deployment-charts] - 10https://gerrit.wikimedia.org/r/980369 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [15:52:46] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/979118 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [15:53:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4040.ulsfo.wmnet [15:54:12] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784 (10cmooney) [15:55:47] (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [15:56:16] !log hnowlan@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [15:56:30] !log hnowlan@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [15:56:36] !log hnowlan@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [15:56:47] !log hnowlan@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [15:57:55] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:58:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T348183)', diff saved to https://phabricator.wikimedia.org/P54183 and previous config saved to /var/cache/conftool/dbconfig/20231205-155814-arnaudb.json [15:58:18] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: 
Maintenance [15:58:19] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [15:58:33] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [15:58:34] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:58:52] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [15:58:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T348183)', diff saved to https://phabricator.wikimedia.org/P54184 and previous config saved to /var/cache/conftool/dbconfig/20231205-155858-arnaudb.json [15:59:58] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding testhost2001 to codfw - jhancock@cumin2002" [16:00:04] eoghan, jelto, and arnoldokoth: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1600). 
[16:00:11] (03PS4) 10Samtar: .well-known: Add F-Droid signature to assetlinks.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959327 (https://phabricator.wikimedia.org/T346951) [16:00:27] (03CR) 10Slyngshede: Package for Debian (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 (owner: 10Slyngshede) [16:00:47] (ConfdResourceFailed) resolved: (3) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:00:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding testhost2001 to codfw - jhancock@cumin2002" [16:00:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:01:07] jouncebot: nowandnext [16:01:07] For the next 0 hour(s) and 58 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1600) [16:01:07] In 0 hour(s) and 58 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1700) [16:01:51] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host testhost2001.mgmt.codfw.wmnet with reboot policy FORCED [16:03:22] (03PS1) 10Elukey: Revert "services: upgrade recommendation-api's Docker image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979703 [16:03:32] (03CR) 10CI reject: [V: 04-1] Revert "services: upgrade recommendation-api's Docker image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979703 (owner: 10Elukey) [16:04:41] (03PS2) 10Elukey: Revert "services: upgrade recommendation-api's Docker image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979703 [16:06:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap 
backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959327 (https://phabricator.wikimedia.org/T346951) (owner: 10Samtar) [16:06:26] (03CR) 10Elukey: [C: 03+2] Revert "services: upgrade recommendation-api's Docker image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979703 (owner: 10Elukey) [16:06:49] (03Merged) 10jenkins-bot: .well-known: Add F-Droid signature to assetlinks.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959327 (https://phabricator.wikimedia.org/T346951) (owner: 10Samtar) [16:07:04] !log samtar@deploy2002 Started scap: Backport for [[gerrit:959327|.well-known: Add F-Droid signature to assetlinks.json (T346951)]] [16:07:08] T346951: assetlinks.json is missing F-Droid build signature - https://phabricator.wikimedia.org/T346951 [16:07:30] (03PS3) 10JMeybohm: Add new mesh module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/979094 (https://phabricator.wikimedia.org/T300033) [16:07:32] (03PS3) 10JMeybohm: Remove cergen certificate support from mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/979095 (https://phabricator.wikimedia.org/T300033) [16:07:34] (03PS1) 10JMeybohm: function-orchestrator: Update to latest mesh and ingress module [deployment-charts] - 10https://gerrit.wikimedia.org/r/980425 (https://phabricator.wikimedia.org/T300033) [16:07:43] (03PS1) 10Cathal Mooney: Add new codfw per-rack vlans to lvs2013 and move row B vlans [puppet] - 10https://gerrit.wikimedia.org/r/980426 (https://phabricator.wikimedia.org/T352784) [16:08:21] !log samtar@deploy2002 samtar: Backport for [[gerrit:959327|.well-known: Add F-Droid signature to assetlinks.json (T346951)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:08:42] !log samtar@deploy2002 samtar: Continuing with sync [16:08:44] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, and 2 others: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784 
(10cmooney) [16:08:53] (03PS1) 10Ssingh: P:dns::auth: add support for depooling recdns via confd [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) [16:09:10] !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync [16:09:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T348183)', diff saved to https://phabricator.wikimedia.org/P54185 and previous config saved to /var/cache/conftool/dbconfig/20231205-160920-arnaudb.json [16:09:24] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [16:09:35] !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync [16:09:52] !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: sync [16:10:05] (03CR) 10JMeybohm: [C: 03+1] jobqueue: migrate a heavyweight job to jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/979397 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:10:15] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:10:58] (03PS2) 10Hnowlan: jobqueue: migrate a heavyweight job to jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/979397 (https://phabricator.wikimedia.org/T349796) [16:11:48] !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: sync [16:13:00] (03CR) 10Hnowlan: [C: 03+2] jobqueue: migrate a heavyweight job to jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/979397 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:13:26] (03PS1) 10Brouberol: Rollback state of LVS k8s-ingress-dse to service_setup [puppet] - 
10https://gerrit.wikimedia.org/r/980428 (https://phabricator.wikimedia.org/T352639) [16:14:01] (03CR) 10Clément Goubert: [C: 03+1] Rollback state of LVS k8s-ingress-dse to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/980428 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [16:14:03] (03Merged) 10jenkins-bot: jobqueue: migrate a heavyweight job to jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/979397 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [16:14:39] (03CR) 10Brouberol: [C: 03+2] Rollback state of LVS k8s-ingress-dse to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/980428 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [16:14:58] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:959327|.well-known: Add F-Droid signature to assetlinks.json (T346951)]] (duration: 07m 53s) [16:15:03] T346951: assetlinks.json is missing F-Droid build signature - https://phabricator.wikimedia.org/T346951 [16:17:45] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [16:18:00] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! 
Nicely done" [alerts] - 10https://gerrit.wikimedia.org/r/980280 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [16:18:13] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [16:18:20] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [16:18:24] !log Rolling back k8s-ingress-dse - restarting pybal on lvs1020 - T352639 [16:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:28] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 [16:18:29] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:18:35] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:18:42] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [16:24:02] !log Rolling back k8s-ingress-dse - restarting pybal on lvs1019 - T352639 [16:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:06] T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639 [16:24:25] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:24:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P54186 and previous config saved to /var/cache/conftool/dbconfig/20231205-162426-arnaudb.json [16:24:56] (03PS8) 10Bernard Wang: Deploy VectorClientPreferences to beta and pl,fr,ca,fa,tr wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) [16:25:49] (03PS9) 10Bernard Wang: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) [16:26:23] (03CR) 10Bernard Wang: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: 10Bernard Wang) [16:34:10] !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1002.eqiad.wmnet with reason: host reimage [16:36:55] RECOVERY - cassandra-a CQL 10.192.16.237:9042 on restbase2028 is OK: TCP OK - 0.032 second response time on 10.192.16.237 port 9042 https://phabricator.wikimedia.org/T93886 [16:37:38] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1002.eqiad.wmnet with reason: host reimage [16:39:34] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P54187 and previous config saved to /var/cache/conftool/dbconfig/20231205-163933-arnaudb.json [16:39:44] (03PS1) 10Hnowlan: jobqueue: migrate a larger job (and one smaller one) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980433 (https://phabricator.wikimedia.org/T349796) [16:42:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host testhost2001.mgmt.codfw.wmnet with reboot policy FORCED [16:42:46] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['testhost2001'] [16:42:55] (03PS1) 10Filippo Giunchedi: rsyslog: update receiver config for version 8.2302 [puppet] - 10https://gerrit.wikimedia.org/r/980434 (https://phabricator.wikimedia.org/T351710) [16:44:49] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) I can do this @BTullis. np! 
[16:46:57] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) [16:47:20] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye [16:50:31] (03CR) 10Herron: [C: 03+1] rsyslog: update receiver config for version 8.2302 [puppet] - 10https://gerrit.wikimedia.org/r/980434 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi) [16:52:09] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bullseye [16:52:35] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye [16:54:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T348183)', diff saved to https://phabricator.wikimedia.org/P54188 and previous config saved to /var/cache/conftool/dbconfig/20231205-165439-arnaudb.json [16:54:42] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [16:54:44] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [16:54:57] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [16:55:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1206 (T348183)', diff saved to https://phabricator.wikimedia.org/P54189 and previous config saved to /var/cache/conftool/dbconfig/20231205-165503-arnaudb.json [17:00:04] jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1700). [17:00:04] No Gerrit patches in the queue for this window AFAICS. 
[17:00:36] !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be1002.eqiad.wmnet with OS bookworm [17:00:42] 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm completed: - moss-be1002 (**PASS**) - Removed from Puppet and Pup... [17:08:00] RECOVERY - cassandra-b SSL 10.192.16.238:7000 on restbase2028 is OK: SSL OK - Certificate restbase2028-b valid until 2025-12-03 21:33:01 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [17:11:05] !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bullseye [17:12:18] 10SRE-swift-storage, 10Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (10MatthewVernon) [17:13:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['testhost2001'] [17:14:18] 10SRE-swift-storage, 10Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (10hnowlan) >>! In T352744#9383998, @MatthewVernon wrote: > I think `ms-*` swift will fall foul of this too, via the wmf-rewrite middleware (which is using python's `urllib.request.build_opener` to talk t... 
[17:15:01] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['testhost2001'] [17:15:47] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['testhost2001'] [17:18:30] (03CR) 10JHathaway: apt_repo: validate preseed data with a JSON Schema (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [17:22:42] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/980434 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi) [17:24:00] (03CR) 10JMeybohm: [C: 03+1] jobqueue: migrate a larger job (and one smaller one) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980433 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [17:26:08] (03CR) 10Hnowlan: [C: 03+2] jobqueue: migrate a larger job (and one smaller one) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980433 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [17:26:58] (03Merged) 10jenkins-bot: jobqueue: migrate a larger job (and one smaller one) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980433 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan) [17:28:59] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [17:29:03] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye [17:29:25] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [17:29:26] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [17:29:47] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [17:42:21] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP on text|secondary LVS in drmrs [puppet] - 
10https://gerrit.wikimedia.org/r/980274 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [17:45:59] !log rolling restart of text|secondary LVS on drmrs, effectively enabling IPIP encapsulation for ncredir@drmrs - T351069 [17:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:02] T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 [17:47:15] (03PS1) 10Jdlrobson: [Zebra] Make .vector-column-start cache compatible [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979704 (https://phabricator.wikimedia.org/T347712) [17:49:25] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [17:50:01] 10SRE, 10Traffic, 10Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez) [17:52:53] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage [17:55:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T348183)', diff saved to https://phabricator.wikimedia.org/P54190 and previous config saved to /var/cache/conftool/dbconfig/20231205-175526-arnaudb.json [17:55:30] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [17:59:41] (03PS1) 10Btullis: Update the refinery version used by the refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/980445 (https://phabricator.wikimedia.org/T349121) [18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1800) [18:00:40] (03PS1) 10Samtar: IS: Set Phonos to Inline Audio Player mode on test.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980446 [18:01:13] (03CR) 10JHathaway: "thanks for the patch @jbond, this didn't work for me with
`bundle exec rspec modules/role` even after adding `include profile::installserv" [puppet] - 10https://gerrit.wikimedia.org/r/980375 (owner: 10Jbond) [18:01:39] ACKNOWLEDGEMENT - MD RAID on cp4052 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.128.0.12. Check system logs on 10.128.0.12 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T352795 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [18:02:03] 10SRE, 10ops-ulsfo: Degraded RAID on cp4052 - https://phabricator.wikimedia.org/T352795 (10ops-monitoring-bot) [18:02:23] huh [18:03:42] (03CR) 10Jdlrobson: [C: 04-1] Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: 10Bernard Wang) [18:05:35] cp4052 is being reimaged, we will see once it finishes [18:05:40] (depooled, nothing to worry about) [18:07:44] (03PS10) 10Bernard Wang: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) [18:10:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P54191 and previous config saved to /var/cache/conftool/dbconfig/20231205-181032-arnaudb.json [18:13:17] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS bullseye [18:13:51] 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [18:25:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P54192 and previous config saved to /var/cache/conftool/dbconfig/20231205-182539-arnaudb.json [18:40:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1206 (T348183)', diff saved to https://phabricator.wikimedia.org/P54193 and previous config saved to /var/cache/conftool/dbconfig/20231205-184045-arnaudb.json [18:40:48] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [18:40:50] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [18:41:02] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [18:41:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1207 (T348183)', diff saved to https://phabricator.wikimedia.org/P54194 and previous config saved to /var/cache/conftool/dbconfig/20231205-184108-arnaudb.json [18:41:54] 10SRE, 10ops-ulsfo: Degraded RAID on cp4052 - https://phabricator.wikimedia.org/T352795 (10ssingh) 05Open→03Invalid Resolved after cookbook finished. [18:50:45] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T348183)', diff saved to https://phabricator.wikimedia.org/P54195 and previous config saved to /var/cache/conftool/dbconfig/20231205-185044-arnaudb.json [18:50:49] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [18:51:33] (03PS2) 10Jbond: installserver: add spec test for role [puppet] - 10https://gerrit.wikimedia.org/r/980375 [18:51:54] (03PS2) 10Jbond: installserver: test spec test fires [puppet] - 10https://gerrit.wikimedia.org/r/980376 [18:54:05] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:56:37] (03CR) 10Jbond: "thanks for taking a look see inline" [puppet] - 
10https://gerrit.wikimedia.org/r/980375 (owner: 10Jbond) [19:00:21] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:01:04] (03CR) 10Dzahn: planet: add ensure parameter allowing to disable update jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979169 (https://phabricator.wikimedia.org/T348392) (owner: 10Dzahn) [19:05:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P54196 and previous config saved to /var/cache/conftool/dbconfig/20231205-190551-arnaudb.json [19:13:20] (03PS3) 10Dzahn: planet: add ensure parameter allowing to disable update jobs [puppet] - 10https://gerrit.wikimedia.org/r/979169 (https://phabricator.wikimedia.org/T348392) [19:20:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P54197 and previous config saved to /var/cache/conftool/dbconfig/20231205-192057-arnaudb.json [19:36:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T348183)', diff saved to https://phabricator.wikimedia.org/P54198 and previous config saved to /var/cache/conftool/dbconfig/20231205-193604-arnaudb.json [19:36:07] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [19:36:20] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [19:36:21] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [19:36:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1218 (T348183)', diff saved to https://phabricator.wikimedia.org/P54199 and 
previous config saved to /var/cache/conftool/dbconfig/20231205-193627-arnaudb.json [19:40:49] (03PS4) 10Dzahn: firewall::service: spelling fixes, add missing parameter comments [puppet] - 10https://gerrit.wikimedia.org/r/980022 [19:41:18] (03CR) 10Dzahn: firewall::service: spelling fixes, add missing parameter comments (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/980022 (owner: 10Dzahn) [19:41:39] (03CR) 10Dzahn: firewall::service: spelling fixes, add missing parameter comments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980022 (owner: 10Dzahn) [19:46:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T348183)', diff saved to https://phabricator.wikimedia.org/P54200 and previous config saved to /var/cache/conftool/dbconfig/20231205-194616-arnaudb.json [19:46:21] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [19:47:54] (03CR) 10Bking: wdqs: Monitor LDF endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:48:00] (03CR) 10Bking: [C: 03+2] wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:52:25] (03CR) 10Jbond: apt_repo: validate preseed data with a JSON Schema (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [19:57:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/980022 (owner: 10Dzahn) [19:58:53] (03PS1) 10Bking: wdqs: Improve regex for ldf check [puppet] - 10https://gerrit.wikimedia.org/r/980460 (https://phabricator.wikimedia.org/T347355) [20:01:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P54201 and previous config saved to 
/var/cache/conftool/dbconfig/20231205-200123-arnaudb.json [20:01:57] (03PS2) 10Ryan Kemper: wdqs: Improve regex for ldf check [puppet] - 10https://gerrit.wikimedia.org/r/980460 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [20:07:48] (03PS1) 10Jdrewniak: Fix nonzebra sticky container scrolling behavior and scrollable indicator [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/980467 (https://phabricator.wikimedia.org/T352464) [20:11:57] (03Abandoned) 10JHathaway: apt_repo: validate preseed data with a JSON Schema [puppet] - 10https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [20:14:04] (ProbeDown) firing: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:14:40] (03CR) 10JHathaway: installserver: add spec test for role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980375 (owner: 10Jbond) [20:16:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P54202 and previous config saved to /var/cache/conftool/dbconfig/20231205-201629-arnaudb.json [20:22:51] (03CR) 10Bking: [C: 03+2] wdqs: Improve regex for ldf check [puppet] - 10https://gerrit.wikimedia.org/r/980460 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [20:22:52] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:23:14] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:23:26] RECOVERY - Router interfaces on cr1-esams is 
OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:24:39] (03PS2) 10JHathaway: apt_repo: move hiera data into module, to allow for validation [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) [20:27:07] (03CR) 10CI reject: [V: 04-1] apt_repo: move hiera data into module, to allow for validation [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [20:31:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T348183)', diff saved to https://phabricator.wikimedia.org/P54203 and previous config saved to /var/cache/conftool/dbconfig/20231205-203136-arnaudb.json [20:31:38] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [20:31:41] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [20:31:52] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance [20:31:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1219 (T348183)', diff saved to https://phabricator.wikimedia.org/P54204 and previous config saved to /var/cache/conftool/dbconfig/20231205-203158-arnaudb.json [20:41:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T348183)', diff saved to https://phabricator.wikimedia.org/P54205 and previous config saved to /var/cache/conftool/dbconfig/20231205-204147-arnaudb.json [20:41:50] (03PS1) 10Andrew Bogott: Keystone: add logic to manage credential keys [puppet] - 10https://gerrit.wikimedia.org/r/980465 [20:41:52] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - 
https://phabricator.wikimedia.org/T348183 [20:41:52] (03PS1) 10Andrew Bogott: Keystone: turn on credential_key managementin eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/980486 [20:44:48] (03CR) 10CI reject: [V: 04-1] Keystone: add logic to manage credential keys [puppet] - 10https://gerrit.wikimedia.org/r/980465 (owner: 10Andrew Bogott) [20:45:03] (03CR) 10CI reject: [V: 04-1] Keystone: turn on credential_key managementin eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/980486 (owner: 10Andrew Bogott) [20:52:23] (03CR) 10Jbond: "i definitely prefer this approach, see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [20:53:50] !log bking@prometheus1006 reload prometheus-blackbox service T347355 [20:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:55] T347355: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 [20:55:40] (03PS2) 10Andrew Bogott: Keystone: add logic to manage credential keys [puppet] - 10https://gerrit.wikimedia.org/r/980465 [20:55:42] (03PS2) 10Andrew Bogott: Keystone: turn on credential_key managementin eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/980486 [20:56:50] (03CR) 10Jbond: apt_repo: move hiera data into module, to allow for validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [20:56:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P54206 and previous config saved to /var/cache/conftool/dbconfig/20231205-205654-arnaudb.json [20:58:36] !log bking@prometheus1006 disable puppet for troubleshooting T347355 [20:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time is upon us 
again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T2100). [21:00:04] tgr, James_F, and kimberly_sarabia: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:09] * James_F waves. [21:00:35] (03PS2) 10Jforrester: Revert "Do not try to use Thumbor on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972263 (https://phabricator.wikimedia.org/T344605) (owner: 10Gergő Tisza) [21:00:38] o/ [21:00:43] (03CR) 10Jforrester: [C: 03+2] Revert "Do not try to use Thumbor on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972263 (https://phabricator.wikimedia.org/T344605) (owner: 10Gergő Tisza) [21:00:51] tgr: Yours is easy. :-) [21:00:59] (03PS2) 10Jforrester: nlwikivoyage: Drop Listings extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980009 (https://phabricator.wikimedia.org/T352696) [21:01:05] Hello. 
[21:01:10] (03PS2) 10Jforrester: Drop Listings extension from Wikivoyages where unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980047 (https://phabricator.wikimedia.org/T352719) [21:01:26] (03Merged) 10jenkins-bot: Revert "Do not try to use Thumbor on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972263 (https://phabricator.wikimedia.org/T344605) (owner: 10Gergő Tisza) [21:01:33] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980009 (https://phabricator.wikimedia.org/T352696) (owner: 10Jforrester) [21:01:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980047 (https://phabricator.wikimedia.org/T352719) (owner: 10Jforrester) [21:02:23] (03Merged) 10jenkins-bot: nlwikivoyage: Drop Listings extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980009 (https://phabricator.wikimedia.org/T352696) (owner: 10Jforrester) [21:02:27] (03Merged) 10jenkins-bot: Drop Listings extension from Wikivoyages where unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980047 (https://phabricator.wikimedia.org/T352719) (owner: 10Jforrester) [21:02:44] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:972263|Revert "Do not try to use Thumbor on beta" (T344605)]], [[gerrit:980009|nlwikivoyage: Drop Listings extension (T352696)]], [[gerrit:980047|Drop Listings extension from Wikivoyages where unused (T352719)]] [21:02:50] T344605: deployment-prep needs a Thumbor instance - https://phabricator.wikimedia.org/T344605 [21:02:50] T352696: Undeploy Listings extension from Dutch Wikivoyage - https://phabricator.wikimedia.org/T352696 [21:02:51] T352719: Undeploy the Listings extension on 7 Wikivoyages on which it's entirely unused - https://phabricator.wikimedia.org/T352719 [21:02:54] kimberly_sarabia: I'll start the merge of your wmf.7 backports 
if that's OK? [21:03:35] James_F: Thanks [21:03:48] (03CR) 10Jforrester: [C: 03+2] [Zebra] Make .vector-column-start cache compatible [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979704 (https://phabricator.wikimedia.org/T347712) (owner: 10Jdlrobson) [21:03:56] (03CR) 10Jforrester: [C: 03+2] Fix nonzebra sticky container scrolling behavior and scrollable indicator [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/980467 (https://phabricator.wikimedia.org/T352464) (owner: 10Jdrewniak) [21:04:11] !log jforrester@deploy2002 tgr and jforrester: Backport for [[gerrit:972263|Revert "Do not try to use Thumbor on beta" (T344605)]], [[gerrit:980009|nlwikivoyage: Drop Listings extension (T352696)]], [[gerrit:980047|Drop Listings extension from Wikivoyages where unused (T352719)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:04:31] !log jforrester@deploy2002 tgr and jforrester: Continuing with sync [21:06:53] (03CR) 10Dzahn: [C: 03+2] firewall::service: spelling fixes, add missing parameter comments [puppet] - 10https://gerrit.wikimedia.org/r/980022 (owner: 10Dzahn) [21:07:42] (03CR) 10Jforrester: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: 10Bernard Wang) [21:07:56] (03PS11) 10Jforrester: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: 10Bernard Wang) [21:09:53] touching the firewall::service class which would be mildy scary but it's only comments [21:10:34] mutante: What could possibly go wrong? 
:-) [21:11:11] ;) https://en.wiktionary.org/wiki/jinx#Verb [21:11:22] https://en.wiktionary.org/wiki/reverse_jinx#English [21:11:29] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:972263|Revert "Do not try to use Thumbor on beta" (T344605)]], [[gerrit:980009|nlwikivoyage: Drop Listings extension (T352696)]], [[gerrit:980047|Drop Listings extension from Wikivoyages where unused (T352719)]] (duration: 08m 45s) [21:11:32] * James_F grins. [21:11:37] T344605: deployment-prep needs a Thumbor instance - https://phabricator.wikimedia.org/T344605 [21:11:37] T352696: Undeploy Listings extension from Dutch Wikivoyage - https://phabricator.wikimedia.org/T352696 [21:11:38] T352719: Undeploy the Listings extension on 7 Wikivoyages on which it's entirely unused - https://phabricator.wikimedia.org/T352719 [21:11:43] kimberly_sarabia: OK for me to deploy the 'VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis' change first? [21:12:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P54207 and previous config saved to /var/cache/conftool/dbconfig/20231205-211200-arnaudb.json [21:12:17] James_F: Yup. That's the most important one for us today [21:12:21] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: 10Bernard Wang) [21:12:24] Cool. 
[21:13:03] (03Merged) 10jenkins-bot: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: 10Bernard Wang) [21:13:19] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:980028|Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis (T351339)]] [21:13:22] T351339: Deploy client preferences to production beta features - https://phabricator.wikimedia.org/T351339 [21:13:59] (PuppetZeroResources) firing: Puppet has failed generate resources on wdqs1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:14:41] kimberly_sarabia: OK, can you please test on an mwdebug server and confirm we're OK to deploy? [21:14:59] (PuppetZeroResources) firing: Puppet has failed generate resources on wdqs1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:15:09] James_F: yes one moment [21:15:13] Sure. 
[21:15:37] (03PS1) 10Bking: Revert "wdqs: Monitor LDF endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/980468 [21:15:54] (03CR) 10CI reject: [V: 04-1] Revert "wdqs: Monitor LDF endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/980468 (owner: 10Bking) [21:16:06] ^^ fixing puppet errors now [21:18:09] 10SRE, 10SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (10Aklapper) @darthmon_wmde ping [21:18:34] thanks inflatador, I figured it's related [21:18:59] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:19:52] James_F: LGTM [21:19:57] Great. [21:19:58] !log jforrester@deploy2002 bwang and jforrester: Continuing with sync [21:21:13] (03Merged) 10jenkins-bot: [Zebra] Make .vector-column-start cache compatible [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979704 (https://phabricator.wikimedia.org/T347712) (owner: 10Jdlrobson) [21:21:29] (03Merged) 10jenkins-bot: Fix nonzebra sticky container scrolling behavior and scrollable indicator [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/980467 (https://phabricator.wikimedia.org/T352464) (owner: 10Jdrewniak) [21:21:56] (Once this config push finishes, I'll do those two now they've landed.) 
[21:24:59] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on wdqs1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:25:54] (03PS1) 10Bking: Revert "wdqs: Improve regex for ldf check" [puppet] - 10https://gerrit.wikimedia.org/r/980469 [21:26:15] (03CR) 10Bking: [C: 03+2] Revert "wdqs: Improve regex for ldf check" [puppet] - 10https://gerrit.wikimedia.org/r/980469 (owner: 10Bking) [21:26:17] (03CR) 10Bking: [V: 03+2 C: 03+2] Revert "wdqs: Improve regex for ldf check" [puppet] - 10https://gerrit.wikimedia.org/r/980469 (owner: 10Bking) [21:27:03] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:980028|Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis (T351339)]] (duration: 13m 44s) [21:27:07] T351339: Deploy client preferences to production beta features - https://phabricator.wikimedia.org/T351339 [21:27:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T348183)', diff saved to https://phabricator.wikimedia.org/P54208 and previous config saved to /var/cache/conftool/dbconfig/20231205-212707-arnaudb.json [21:27:10] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [21:27:11] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [21:27:13] Finally! 
[21:27:24] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [21:27:42] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:979704|[Zebra] Make .vector-column-start cache compatible (T347712 T351830)]], [[gerrit:980467|Fix nonzebra sticky container scrolling behavior and scrollable indicator (T352464)]] [21:27:49] T347712: [Zebra] remove feature flag & merge Zebra into default styles - https://phabricator.wikimedia.org/T347712 [21:27:50] T351830: [Zebra] Make vector-column-start element cache compatible - https://phabricator.wikimedia.org/T351830 [21:27:50] T352464: Non zebra scrollable indicators on sticky pinnable elements (toc, page tools, client prefs) are broken - https://phabricator.wikimedia.org/T352464 [21:27:58] kimberly_sarabia: Once this is done, can I do the two stream config changes together? [21:28:15] (03PS2) 10Bking: Revert "wdqs: Monitor LDF endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/980468 [21:28:39] (03CR) 10Bking: [V: 03+2 C: 03+2] Revert "wdqs: Monitor LDF endpoint" [puppet] - 10https://gerrit.wikimedia.org/r/980468 (owner: 10Bking) [21:28:59] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:29:12] James_F: Thanks for checking. We will need to rebase the second one [21:29:17] * James_F nods. 
[21:29:24] (03PS5) 10Jforrester: Define the corresponding stream for scroll [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: 10Kimberly Sarabia) [21:29:28] (03PS6) 10Jforrester: Add stream config for *webuiactions via Metrics Platform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298) (owner: 10Clare Ming) [21:29:44] Stacked them so now they should be good to co-deploy. [21:29:59] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on wdqs1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:30:00] !log jforrester@deploy2002 jdlrobson and jforrester and jdrewniak: Backport for [[gerrit:979704|[Zebra] Make .vector-column-start cache compatible (T347712 T351830)]], [[gerrit:980467|Fix nonzebra sticky container scrolling behavior and scrollable indicator (T352464)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:30:03] kimberly_sarabia: OK, Vector back-port ready to test. Please check! [21:30:16] Cool sounds good. [21:30:18] Will do [21:32:20] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/979169/833/" [puppet] - 10https://gerrit.wikimedia.org/r/979169 (https://phabricator.wikimedia.org/T348392) (owner: 10Dzahn) [21:33:59] (PuppetZeroResources) firing: (4) Puppet has failed generate resources on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:34:11] James_F: the Vector patches look good to sync 👍 [21:34:15] Ace. 
[21:34:17] !log jforrester@deploy2002 jdlrobson and jforrester and jdrewniak: Continuing with sync [21:34:55] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [21:34:59] (PuppetZeroResources) firing: (3) Puppet has failed generate resources on wdqs1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:35:10] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance [21:35:26] (03CR) 10Jdlrobson: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: 10Bernard Wang) [21:35:58] (03CR) 10Jforrester: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: 10Bernard Wang) [21:36:56] !log bking@prometheus1006 re-enable puppet T347355 [21:36:57] T347355: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355 [21:38:59] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:39:59] (PuppetZeroResources) resolved: (3) Puppet has failed generate resources on wdqs1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:40:32] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:979704|[Zebra] Make .vector-column-start cache compatible 
(T347712 T351830)]], [[gerrit:980467|Fix nonzebra sticky container scrolling behavior and scrollable indicator (T352464)]] (duration: 12m 50s) [21:40:38] OK! Finally. Now for the two event stream patches. [21:40:38] T347712: [Zebra] remove feature flag & merge Zebra into default styles - https://phabricator.wikimedia.org/T347712 [21:40:38] T351830: [Zebra] Make vector-column-start element cache compatible - https://phabricator.wikimedia.org/T351830 [21:40:39] T352464: Non zebra scrollable indicators on sticky pinnable elements (toc, page tools, client prefs) are broken - https://phabricator.wikimedia.org/T352464 [21:41:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: 10Kimberly Sarabia) [21:41:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298) (owner: 10Clare Ming) [21:41:52] (03Merged) 10jenkins-bot: Define the corresponding stream for scroll [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: 10Kimberly Sarabia) [21:41:55] (03Merged) 10jenkins-bot: Add stream config for *webuiactions via Metrics Platform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298) (owner: 10Clare Ming) [21:42:12] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:977785|Define the corresponding stream for scroll (T350883)]], [[gerrit:978947|Add stream config for *webuiactions via Metrics Platform (T351298)]] [21:42:17] T350883: Port WebUIScroll schema to the new metrics platform - https://phabricator.wikimedia.org/T350883 [21:42:17] T351298: [User Story] Partial migration of *UIActions instrument to the Core Interaction API - 
https://phabricator.wikimedia.org/T351298 [21:43:34] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [21:43:48] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance [21:43:49] !log jforrester@deploy2002 ksarabia and jforrester and cjming: Backport for [[gerrit:977785|Define the corresponding stream for scroll (T350883)]], [[gerrit:978947|Add stream config for *webuiactions via Metrics Platform (T351298)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:43:59] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:44:42] RECOVERY - cassandra-b CQL 10.192.16.238:9042 on restbase2028 is OK: TCP OK - 0.033 second response time on 10.192.16.238 port 9042 https://phabricator.wikimedia.org/T93886 [21:47:47] kimberly_sarabia: Please test and confirm. [21:48:24] James_F: Thanks, one moment [21:48:31] Of course. 
:-) [21:48:59] (PuppetZeroResources) firing: (5) Puppet has failed generate resources on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:50:40] RECOVERY - MD RAID on aqs1013 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [21:51:15] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [21:51:29] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance [21:51:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T348183)', diff saved to https://phabricator.wikimedia.org/P54209 and previous config saved to /var/cache/conftool/dbconfig/20231205-215135-arnaudb.json [21:51:44] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [21:53:48] James_F: LGTM [21:53:51] Cool. 
[21:53:53] !log jforrester@deploy2002 ksarabia and jforrester and cjming: Continuing with sync [21:53:59] (PuppetZeroResources) resolved: (3) Puppet has failed generate resources on wdqs1016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [21:57:06] RECOVERY - cassandra-c service on restbase2028 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:57:32] RECOVERY - cassandra-c SSL 10.192.16.239:7000 on restbase2028 is OK: SSL OK - Certificate restbase2028-c valid until 2025-12-03 21:33:03 +0000 (expires in 728 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:01:14] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:977785|Define the corresponding stream for scroll (T350883)]], [[gerrit:978947|Add stream config for *webuiactions via Metrics Platform (T351298)]] (duration: 19m 01s) [22:01:27] T350883: Port WebUIScroll schema to the new metrics platform - https://phabricator.wikimedia.org/T350883 [22:01:27] T351298: [User Story] Partial migration of *UIActions instrument to the Core Interaction API - https://phabricator.wikimedia.org/T351298 [22:02:13] OK, deployment window done, finally. 
[22:02:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T348183)', diff saved to https://phabricator.wikimedia.org/P54210 and previous config saved to /var/cache/conftool/dbconfig/20231205-220256-arnaudb.json [22:03:01] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [22:03:09] (03PS1) 10Ryan Kemper: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) [22:03:43] (03CR) 10CI reject: [V: 04-1] wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [22:03:46] (03CR) 10Ryan Kemper: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [22:04:04] (ProbeDown) resolved: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:04:39] James_F: Thank you! [22:04:50] (03PS2) 10Ryan Kemper: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) [22:05:06] Of course. 
:-) [22:14:26] (03PS3) 10Ryan Kemper: wdqs: monitor ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) [22:15:06] (03CR) 10Ryan Kemper: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/834/console" [puppet] - 10https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: 10Ryan Kemper) [22:18:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P54211 and previous config saved to /var/cache/conftool/dbconfig/20231205-221803-arnaudb.json [22:19:53] (03PS3) 10JHathaway: apt_repo: move hiera data into module, to allow for validation [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) [22:21:56] (03CR) 10JHathaway: apt_repo: move hiera data into module, to allow for validation (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [22:22:33] (03CR) 10CI reject: [V: 04-1] apt_repo: move hiera data into module, to allow for validation [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway) [22:26:06] (03PS4) 10JHathaway: apt_repo: move hiera data into module, to allow for validation [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) [22:33:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P54212 and previous config saved to /var/cache/conftool/dbconfig/20231205-223309-arnaudb.json [22:34:06] (03PS1) 10Bking: trafficserver: revert to using hostname for wdqs ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/980503 (https://phabricator.wikimedia.org/T352111) [22:34:58] (03PS2) 10RhinosF1: test [puppet] - 
10https://gerrit.wikimedia.org/r/980470 [22:35:00] (03CR) 10RhinosF1: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980470 (owner: 10RhinosF1) [22:41:16] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980503 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [22:41:47] (03PS2) 10Bking: trafficserver: revert to using hostname for wdqs ldf endpoint [puppet] - 10https://gerrit.wikimedia.org/r/980503 (https://phabricator.wikimedia.org/T352111) [22:43:09] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980503 (https://phabricator.wikimedia.org/T352111) (owner: 10Bking) [22:48:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T348183)', diff saved to https://phabricator.wikimedia.org/P54213 and previous config saved to /var/cache/conftool/dbconfig/20231205-224816-arnaudb.json [22:48:19] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [22:48:21] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [22:48:32] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance [22:48:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T348183)', diff saved to https://phabricator.wikimedia.org/P54214 and previous config saved to /var/cache/conftool/dbconfig/20231205-224838-arnaudb.json [22:54:05] (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:54:27] (03PS2) 10Ladsgroup: [WIP] Add compare tables periodic job [puppet] - 10https://gerrit.wikimedia.org/r/979390 
(https://phabricator.wikimedia.org/T207253) [22:57:02] (CR) CI reject: [V: -1] [WIP] Add compare tables periodic job [puppet] - https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: Ladsgroup) [22:58:22] (CR) Ladsgroup: [WIP] Add compare tables periodic job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: Ladsgroup) [22:59:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T348183)', diff saved to https://phabricator.wikimedia.org/P54215 and previous config saved to /var/cache/conftool/dbconfig/20231205-225905-arnaudb.json [22:59:10] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [23:00:22] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:14:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P54216 and previous config saved to /var/cache/conftool/dbconfig/20231205-231412-arnaudb.json [23:29:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P54217 and previous config saved to /var/cache/conftool/dbconfig/20231205-232918-arnaudb.json [23:42:43] Hey James_F are you still around? It seems like there is a problem with the last deployment.
[23:44:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T348183)', diff saved to https://phabricator.wikimedia.org/P54218 and previous config saved to /var/cache/conftool/dbconfig/20231205-234425-arnaudb.json [23:44:28] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [23:44:30] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [23:44:34] Specifically https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/980028 doesn't seem to have had any effect on French Wikipedia [23:44:42] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [23:44:49] * James_F looks. [23:45:39] I can't replicate this locally. I'd be curious what the value of $wgVectorClientPreferences['beta'] is in the context of French Wikipedia [23:45:49] $wgVectorClientPreferences is set correctly in reality (per `mwscript --wiki=frwiki`). [23:46:22] Hmm.. I wonder if there's any clues in the logs [23:46:39] Whereas on dewiki it's set to false. [23:48:24] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:48:29] No log entries about VectorClientPreferences (or the indirection of CONFIG_KEY_CLIENT_PREFERENCES) that I can see. [23:49:02] What is the value of $wgVectorZebraDesign for those wikis (just curious)? [23:49:28] dewiki is false/false/false; frwiki is true/true/false [23:49:51] (So both beta keys false.) [23:50:50] hmm.. It's almost as if Vector's GetBetaFeaturePreferences hook is not running for some reason [23:50:59] Oh wait. [23:51:02] This is a Beta Feature? [23:51:09] Did you go through the new Beta Feature process? 
[23:51:34] (I know you didn't speak to me about this, but maybe you spoke to Greg?) [23:51:51] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [23:52:00] New Beta Features have to be approved and added to the allowlist (to avoid over-loading our users with too many Beta Features at once). [23:52:07] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [23:52:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T348183)', diff saved to https://phabricator.wikimedia.org/P54219 and previous config saved to /var/cache/conftool/dbconfig/20231205-235213-arnaudb.json [23:52:17] ahhh okay that would explain it. [23:52:17] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [23:53:33] No I don't believe we went through a process (I vaguely recall there is one now you point it out to me, but I don't think I've built a beta feature since 2014 :)). [23:53:50] With that in mind would it make sense to revert the patch or is it safe in the current state? [23:54:11] FFS. [23:54:18] VECTOR_2022_BETA_KEY indirection is very unhelpful. [23:54:31] It's entirely safe, it's just non-operational. [23:54:48] Okay. Is this the specific process you are talking about: https://www.mediawiki.org/wiki/Beta_Features#Creating_your_own [23:55:38] Yup. [23:57:10] Also your landing page is just https://www.mediawiki.org/wiki/Skin:Vector/2022 rather than specifically about this Beta Feature and why people should think this particular feature is worth opting into, and the talk page is just https://www.mediawiki.org/wiki/Talk:Reading/Web/Desktop_Improvements which isn't ideal either (but as long as it's monitored it's fine). 
[23:57:45] The preference is 'vector-2022-beta-feature' which suggests it's very general, but this is the text accessibility work, right? [23:58:42] Oooh, you're trying to register multiple different features under one Beta Feature? Tut, that's going to confuse/upset a bunch of people.