[00:00:34] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:03:46] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:05:18] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[00:06:00] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[00:06:31] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host restbase2028.codfw.wmnet
[00:06:34] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:12:25] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group for ArthurTaylor - https://phabricator.wikimedia.org/T352653 (10KFrancis) Hi @ArthurTaylor, please send your email address to kfrancis@wikimedia.org and I will put the NDA together and send to you for signing.  Thanks!
[00:13:55] <wikibugs>	 (03PS2) 10Kimberly Sarabia: Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337)
[00:23:07] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "LGTM to deploy." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[00:38:24] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/979961
[00:38:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/979961 (owner: 10TrainBranchBot)
[00:56:20] <icinga-wm>	 PROBLEM - cassandra-c service on restbase2028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:56:46] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/979961 (owner: 10TrainBranchBot)
[01:03:02] <icinga-wm>	 RECOVERY - Restbase root url on restbase2028 is OK: HTTP OK: HTTP/1.1 200 - 17816 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/RESTBase
[01:03:49] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T352731 (10phaultfinder)
[01:11:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[01:15:47] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "has been added to google doc by KFrancis" [puppet] - 10https://gerrit.wikimedia.org/r/980013 (https://phabricator.wikimedia.org/T348520) (owner: 10Dzahn)
[01:16:22] <icinga-wm>	 RECOVERY - cassandra-a SSL 10.192.16.237:7000 on restbase2028 is OK: SSL OK - Certificate restbase2028-a valid until 2025-12-03 21:32:59 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[01:17:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on restbase2028:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[01:18:49] <mutante>	 !log LDAP - added user xqt to group nda (T348520)
[01:18:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:53] <stashbot>	 T348520: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520
[01:20:42] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Patch-For-Review: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10Dzahn) 05Open→03Resolved >>! In T348520#9381464, @KFrancis wrote: > Done, thanks!  Thank you as well!  Also done on our side.  @Xqt You have been a...
[01:41:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:39:04] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:58:12] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[02:59:40] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T0300)
[03:00:20] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:09:04] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:52:58] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:53:32] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:53:48] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:57:24] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[03:57:58] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:58:16] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T0400)
[04:01:56] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:02:28] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:02:44] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:07:12] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:07:48] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:08:24] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:37:58] <wikibugs>	 (03PS2) 10Andrew Bogott: Horizon: allow image uploading via horizon for users with glance admin [puppet] - 10https://gerrit.wikimedia.org/r/980021 (https://phabricator.wikimedia.org/T326818)
[04:38:00] <wikibugs>	 (03PS1) 10Andrew Bogott: cloud-init: make puppet optional [puppet] - 10https://gerrit.wikimedia.org/r/980079 (https://phabricator.wikimedia.org/T326818)
[04:47:34] <wikibugs>	 (03PS2) 10Andrew Bogott: cloud-init: make puppet optional [puppet] - 10https://gerrit.wikimedia.org/r/980079 (https://phabricator.wikimedia.org/T326818)
[06:03:03] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1199 - https://phabricator.wikimedia.org/T352238 (10Marostegui) Thank you! Icinga looks good now
[06:16:38] <wikibugs>	 (03PS1) 10Marostegui: Revert "mariadb: Promote db1119 to m5 master" [puppet] - 10https://gerrit.wikimedia.org/r/979695
[06:17:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2135,2160].codfw.wmnet,db[1119,1176,1217].eqiad.wmnet with reason: m5 master switch T352631
[06:17:16] <stashbot>	 T352631: Switchover m5 master db1119 -> db1176 - https://phabricator.wikimedia.org/T352631
[06:17:22] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2135,2160].codfw.wmnet,db[1119,1176,1217].eqiad.wmnet with reason: m5 master switch T352631
[06:20:28] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "mariadb: Promote db1119 to m5 master" [puppet] - 10https://gerrit.wikimedia.org/r/979695 (owner: 10Marostegui)
[06:23:50] <marostegui>	 !log Failover m5 from db1119 to db1176 - T352631
[06:23:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:23:54] <stashbot>	 T352631: Switchover m5 master db1119 -> db1176 - https://phabricator.wikimedia.org/T352631
[06:27:28] <wikibugs>	 (03PS1) 10Marostegui: db1119: To be decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/980082
[06:28:33] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1119: To be decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/980082 (owner: 10Marostegui)
[06:35:54] <wikibugs>	 (03PS1) 10Marostegui: db1176: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/980083
[06:36:27] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1176: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/980083 (owner: 10Marostegui)
[06:47:11] <wikibugs>	 (03PS1) 10Marostegui: site.pp: db1119 will be decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/980084 (https://phabricator.wikimedia.org/T337206)
[06:47:46] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] site.pp: db1119 will be decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/980084 (https://phabricator.wikimedia.org/T337206) (owner: 10Marostegui)
[06:51:58] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable IPIP on eqsin text|secondary LVS [puppet] - 10https://gerrit.wikimedia.org/r/979994 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[06:54:04] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:55:11] <vgutierrez>	 !log rolling restart of text|secondary LVS on eqsin effectively enabling IPIP encapsulation for ncredir@eqsin - T351069
[06:55:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:15] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[06:55:43] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Failover m2-master [dns] - 10https://gerrit.wikimedia.org/r/980087 (https://phabricator.wikimedia.org/T351864)
[06:58:20] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez)
[06:59:05] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T0700)
[07:00:04] <jouncebot>	 kormat, marostegui, and Amir1: (Dis)respected human, time to deploy Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T0700). Please do the needful.
[07:00:20] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:11:00] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:11:58] <icinga-wm>	 PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active - HE, AS6939/IPv4: Active - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:12:28] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 211, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:30] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:13:28] <icinga-wm>	 PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 185.15.59.128, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:20:58] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 400 probes of 731 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:22:46] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:25:38] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Disable rp_filter for ncredir@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/980272 (https://phabricator.wikimedia.org/T351069)
[07:25:40] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/980273 (https://phabricator.wikimedia.org/T351069)
[07:25:42] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Enable IPIP on text|secondary LVS in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/980274 (https://phabricator.wikimedia.org/T351069)
[07:27:16] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/817/con" [puppet] - 10https://gerrit.wikimedia.org/r/980272 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[07:29:25] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/818/con" [puppet] - 10https://gerrit.wikimedia.org/r/980273 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[07:29:40] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Orchestrator: Add database host removal from Orchestrator to sre.hosts.decommission cookbook - https://phabricator.wikimedia.org/T287954 (10Marostegui) 05Open→03Declined I think we can probably decline this. Orchestrator removes the host itself after 14 days,...
[07:30:10] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:30:57] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/980274 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[07:31:38] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:36:04] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:37:20] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 46 probes of 731 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:40:32] <icinga-wm>	 PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:43:28] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:49:22] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:52:50] <wikibugs>	 (03PS4) 10Awight: [beta] Enable FileImporter Codex mode on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979988 (https://phabricator.wikimedia.org/T347453)
[07:56:42] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:00:05] <jouncebot>	 Amir1 and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T0800).
[08:00:05] <jouncebot>	 bwang and awight: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:01:10] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:05:52] <icinga-wm>	 PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:10:04] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:10:48] <wikibugs>	 (03CR) 10Muehlenhoff: "Ready for review. Not sure why it claims that PCC failed, the actual report is all fine." [puppet] - 10https://gerrit.wikimedia.org/r/979897 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff)
[08:11:42] <jinxer-wm>	 (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:17:07] <wikibugs>	 (03CR) 10Muehlenhoff: firewall::service: spelling fixes, add missing parameter comments (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/980022 (owner: 10Dzahn)
[08:22:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by awight@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979988 (https://phabricator.wikimedia.org/T347453) (owner: 10Awight)
[08:23:28] <wikibugs>	 (03Merged) 10jenkins-bot: [beta] Enable FileImporter Codex mode on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979988 (https://phabricator.wikimedia.org/T347453) (owner: 10Awight)
[08:24:15] <wikibugs>	 10SRE, 10SRE-Unowned: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492 (10fgiunchedi)
[08:24:27] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Set nofail for raid0 recipes - https://phabricator.wikimedia.org/T350461 (10fgiunchedi)
[08:25:34] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] "matthew@tsk:~/puppet$ md5sum hieradata/hosts/dbproxy102{3,5}.yaml" [dns] - 10https://gerrit.wikimedia.org/r/980087 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui)
[08:25:55] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m2-master [dns] - 10https://gerrit.wikimedia.org/r/980087 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui)
[08:26:12] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[08:26:29] <marostegui>	 !log Failover m2-master dbproxy1023.eqiad.wmnet -> dbproxy1025.eqiad.wmnet T351864
[08:26:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:33] <stashbot>	 T351864: Migrate dbproxy hosts to Bookworm - https://phabricator.wikimedia.org/T351864
[08:30:08] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:09] <wikibugs>	 (03CR) 10Muehlenhoff: "This is because these were installed with insetup::serviceops and that role already defaults to Puppet 7. @Eric Ping me when you are aroun" [puppet] - 10https://gerrit.wikimedia.org/r/980049 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans)
[08:31:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:32:50] <wikibugs>	 (03PS1) 10Muehlenhoff: Fix Hadoop Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/980342 (https://phabricator.wikimedia.org/T352193)
[08:36:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Fix Hadoop Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/980342 (https://phabricator.wikimedia.org/T352193) (owner: 10Muehlenhoff)
[08:43:02] <wikibugs>	 (03CR) 10Muehlenhoff: restbase: migrate restbase2028 to puppet 7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980049 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans)
[08:52:09] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/820/console" [puppet] - 10https://gerrit.wikimedia.org/r/979888 (https://phabricator.wikimedia.org/T350008) (owner: 10David Caro)
[08:53:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:56:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:57:14] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:57:59] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: fix rest gateway endpoint creation in article descriptions [deployment-charts] - 10https://gerrit.wikimedia.org/r/980002 (https://phabricator.wikimedia.org/T351940) (owner: 10Ilias Sarantopoulos)
[08:58:52] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: fix rest gateway endpoint creation in article descriptions [deployment-charts] - 10https://gerrit.wikimedia.org/r/980002 (https://phabricator.wikimedia.org/T351940) (owner: 10Ilias Sarantopoulos)
[08:59:39] <logmsgbot>	 !log isaranto@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:01:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:03:57] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[09:03:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[09:04:23] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[09:05:48] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 58952
[09:06:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:06:44] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 58952
[09:10:31] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+1] "Look good, much simpler." [puppet] - 10https://gerrit.wikimedia.org/r/979897 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff)
[09:12:11] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[09:12:26] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1128.eqiad.wmnet with reason: Maintenance
[09:12:32] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1128 (T348183)', diff saved to https://phabricator.wikimedia.org/P54147 and previous config saved to /var/cache/conftool/dbconfig/20231205-091232-arnaudb.json
[09:12:36] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[09:12:49] <wikibugs>	 (03PS5) 10Brouberol: Define a DNS A record for the dse k8s ingress gateway [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639)
[09:16:41] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM DNS and Netbox wise. I'll leave to the service owners to review the procedure to setup the service and naming :)" [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[09:19:06] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Define a DNS A record for the dse k8s ingress gateway [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[09:19:38] <wikibugs>	 (03CR) 10Phuedx: Add stream config for *webuiactions via Metrics Platform (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298) (owner: 10Clare Ming)
[09:20:24] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:20:54] <wikibugs>	 (03CR) 10Phuedx: [C: 03+1] Define the corresponding stream for scroll (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: 10Kimberly Sarabia)
[09:20:56] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] Use zap for structured logs on tile pregeneration [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/978030 (https://phabricator.wikimedia.org/T347717) (owner: 10Jgiannelos)
[09:21:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:22:03] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T348183)', diff saved to https://phabricator.wikimedia.org/P54148 and previous config saved to /var/cache/conftool/dbconfig/20231205-092202-arnaudb.json
[09:22:06] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[09:22:15] <wikibugs>	 (03Merged) 10jenkins-bot: Use zap for structured logs on tile pregeneration [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/978030 (https://phabricator.wikimedia.org/T347717) (owner: 10Jgiannelos)
[09:26:43] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:28:22] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:30:46] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:31:26] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Define a DNS A record for the dse k8s ingress gateway [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[09:32:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:34:01] <wikibugs>	 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations, 10serviceops, and 2 others: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (10Volans) Another datapoint for the mw*/parse* clusters, they will be migrated to be k8s hosts, that are su...
[09:35:42] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:37:09] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P54149 and previous config saved to /var/cache/conftool/dbconfig/20231205-093709-arnaudb.json
[09:37:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:37:31] <brouberol>	 !log running authdns-update on dns1004.wikimedia.org - T352639
[09:37:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:34] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[09:42:23] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bookworm
[09:42:30] <wikibugs>	 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be1003.eqiad.wmnet with OS bookworm
[09:43:10] <icinga-wm>	 RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:47:52] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: spicerack.dnsdisc.Discovery should expose TTL - https://phabricator.wikimedia.org/T259875 (10JMeybohm) I don't exactly recall thb. but I would imagine I wanted something like this in one of the pool/depool/service-route cookbooks to store the TTL, lower it, change wha...
[09:48:05] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1 C: 03+2] Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[09:48:44] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: spicerack.dnsdisc.Discovery should expose TTL - https://phabricator.wikimedia.org/T259875 (10Volans) Ack, thanks for the info
[09:51:02] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 63927
[09:52:16] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128', diff saved to https://phabricator.wikimedia.org/P54150 and previous config saved to /var/cache/conftool/dbconfig/20231205-095215-arnaudb.json
[09:54:30] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 63927
[09:55:14] <icinga-wm>	 RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:57:20] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage
[10:02:07] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 15305
[10:02:24] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage
[10:05:21] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15305
[10:05:49] <jinxer-wm>	 (ConfdResourceFailed) firing: confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[10:07:22] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1128 (T348183)', diff saved to https://phabricator.wikimedia.org/P54151 and previous config saved to /var/cache/conftool/dbconfig/20231205-100722-arnaudb.json
[10:07:24] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[10:07:27] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[10:07:34] <icinga-wm>	 RECOVERY - Check systemd state on ml-serve1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:07:38] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[10:07:45] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1132 (T348183)', diff saved to https://phabricator.wikimedia.org/P54152 and previous config saved to /var/cache/conftool/dbconfig/20231205-100744-arnaudb.json
[10:08:30] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update revert-risk images [deployment-charts] - 10https://gerrit.wikimedia.org/r/980345 (https://phabricator.wikimedia.org/T352181)
[10:10:46] <jinxer-wm>	 (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[10:12:12] <icinga-wm>	 RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 199, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:12:57] <wikibugs>	 sre-alert-triage, Infrastructure-Foundations, netops: Alert in need of triage: BGP status (instance cr2-eqdfw) - https://phabricator.wikimedia.org/T351083 (ayounsi) Open→Resolved Deleted.
[10:15:26] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ml-serve1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[10:15:47] <jinxer-wm>	 (ConfdResourceFailed) firing: (3) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[10:17:17] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: remove mlstaging ingress settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/979907 (owner: 10Elukey)
[10:19:06] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T348183)', diff saved to https://phabricator.wikimedia.org/P54153 and previous config saved to /var/cache/conftool/dbconfig/20231205-101906-arnaudb.json
[10:19:10] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[10:20:09] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be1003.eqiad.wmnet with OS bookworm
[10:20:18] <wikibugs>	 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be1003.eqiad.wmnet with OS bookworm completed: - moss-be1003 (**PASS**)   - Downtimed on Icinga/Alertma...
[10:20:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/979888 (https://phabricator.wikimedia.org/T350008) (owner: 10David Caro)
[10:20:47] <jinxer-wm>	 (ConfdResourceFailed) firing: (4) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[10:21:00] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1002.eqiad.wmnet with OS bookworm
[10:21:06] <wikibugs>	 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm
[10:22:19] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] ml-services: update revert-risk images [deployment-charts] - 10https://gerrit.wikimedia.org/r/980345 (https://phabricator.wikimedia.org/T352181) (owner: 10AikoChou)
[10:28:02] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "Looks good compared to I396792f6676415844ea29f3c3f656e8d2a77df1e" [puppet] - 10https://gerrit.wikimedia.org/r/980274 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:28:31] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "Looks good compared to I2aecbdfb71e1c51a61058d7eed66145899945600" [puppet] - 10https://gerrit.wikimedia.org/r/980273 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:28:53] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "Looks good compared to I5fb3fa081326475c355ecf251c712ae477bc2da1" [puppet] - 10https://gerrit.wikimedia.org/r/980272 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:31:24] <wikibugs>	 (03PS1) 10Brouberol: Include the profile::lvs::realserver profile on the dse-k8s-roles [puppet] - 10https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639)
[10:32:40] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[10:33:53] <wikibugs>	 (03PS23) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[10:34:13] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P54154 and previous config saved to /var/cache/conftool/dbconfig/20231205-103413-arnaudb.json
[10:35:04] <wikibugs>	 (03CR) 10AikoChou: [C: 03+2] ml-services: update revert-risk images [deployment-charts] - 10https://gerrit.wikimedia.org/r/980345 (https://phabricator.wikimedia.org/T352181) (owner: 10AikoChou)
[10:35:59] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update revert-risk images [deployment-charts] - 10https://gerrit.wikimedia.org/r/980345 (https://phabricator.wikimedia.org/T352181) (owner: 10AikoChou)
[10:36:35] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] ml-services: update revert-risk images [deployment-charts] - 10https://gerrit.wikimedia.org/r/980345 (https://phabricator.wikimedia.org/T352181) (owner: 10AikoChou)
[10:45:17] <logmsgbot>	 !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[10:45:51] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[10:49:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132', diff saved to https://phabricator.wikimedia.org/P54155 and previous config saved to /var/cache/conftool/dbconfig/20231205-104919-arnaudb.json
[10:53:04] <wikibugs>	 (03PS1) 10Marostegui: dbproxy1023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/980350 (https://phabricator.wikimedia.org/T351864)
[10:54:01] <wikibugs>	 (03PS1) 10JMeybohm: CI: Update validate_envoy_config to use entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/980351
[10:54:17] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] dbproxy1023: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/980350 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui)
[10:54:51] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1023.eqiad.wmnet with OS bookworm
[10:55:05] <wikibugs>	 (03PS1) 10Effie Mouzeli: scaffold: fix annoying tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/980352
[10:55:08] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] mw-jobrunner: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/980011 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[10:56:05] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] jobqueue: migrate a moderately weighty job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979396 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[10:59:28] <wikibugs>	 (03PS3) 10Slyngshede: C:netbox switch Netbox-Next to use plain OIDC [puppet] - 10https://gerrit.wikimedia.org/r/978479 (https://phabricator.wikimedia.org/T351950)
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1100)
[11:00:06] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:00:21] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:01:32] <wikibugs>	 (CR) Slyngshede: C:netbox switch Netbox-Next to use plain OIDC (2 comments) [puppet] - https://gerrit.wikimedia.org/r/978479 (https://phabricator.wikimedia.org/T351950) (owner: Slyngshede)
[11:02:20] <wikibugs>	 (03PS2) 10Effie Mouzeli: scaffold: minor fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/980352
[11:02:33] <wikibugs>	 (CR) Hnowlan: CI: Update validate_envoy_config to use entrypoint (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/980351 (owner: JMeybohm)
[11:02:48] <wikibugs>	 (03PS3) 10Effie Mouzeli: scaffold: minor fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/980352
[11:02:52] <logmsgbot>	 !log mvernon@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host moss-be1002.eqiad.wmnet with OS bookworm
[11:03:04] <wikibugs>	 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm executed with errors: - moss-be1002 (**FAIL**)   - Downtimed on Ici...
[11:03:18] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/821/con" [puppet] - 10https://gerrit.wikimedia.org/r/978479 (https://phabricator.wikimedia.org/T351950) (owner: 10Slyngshede)
[11:03:26] <wikibugs>	 (03PS2) 10Brouberol: Include the profile::lvs::realserver profile on the dse-k8s-roles [puppet] - 10https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639)
[11:03:58] <wikibugs>	 (03PS1) 10Brouberol: Remove the inference realserver pool from the dse cluster [puppet] - 10https://gerrit.wikimedia.org/r/980353 (https://phabricator.wikimedia.org/T352639)
[11:04:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1132 (T348183)', diff saved to https://phabricator.wikimedia.org/P54156 and previous config saved to /var/cache/conftool/dbconfig/20231205-110426-arnaudb.json
[11:04:28] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[11:04:30] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[11:04:30] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] mw-jobrunner: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/980011 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[11:04:42] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1134.eqiad.wmnet with reason: Maintenance
[11:04:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1134 (T348183)', diff saved to https://phabricator.wikimedia.org/P54157 and previous config saved to /var/cache/conftool/dbconfig/20231205-110448-arnaudb.json
[11:05:28] <wikibugs>	 (03PS2) 10Brouberol: Remove the inference realserver pool from the dse cluster [puppet] - 10https://gerrit.wikimedia.org/r/980353 (https://phabricator.wikimedia.org/T352639)
[11:05:31] <wikibugs>	 (03Merged) 10jenkins-bot: mw-jobrunner: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/980011 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[11:06:21] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 224, down: 2, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:06:31] <wikibugs>	 (03PS2) 10Hnowlan: jobqueue: migrate a moderately weighty job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979396 (https://phabricator.wikimedia.org/T349796)
[11:06:58] <wikibugs>	 (03PS3) 10Brouberol: Include the profile::lvs::realserver profile on the dse-k8s-roles [puppet] - 10https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639)
[11:07:06] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/980353 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[11:07:21] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP on text|secondary LVS in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/980274 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[11:07:35] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] hiera: Enable IPIP on text|secondary LVS in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/980274 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[11:07:37] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync
[11:07:41] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Disable rp_filter for ncredir@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/980272 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[11:07:43] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Remove the inference realserver pool from the dse cluster [puppet] - 10https://gerrit.wikimedia.org/r/980353 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[11:07:50] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync
[11:08:00] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[11:08:10] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[11:08:54] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1023.eqiad.wmnet with reason: host reimage
[11:08:56] <vgutierrez>	 brouberol: ok to merge Brouberol: Remove the inference realserver pool from the dse cluster (84d1f24443) :?
[11:09:14] <wikibugs>	 (CR) JMeybohm: CI: Update validate_envoy_config to use entrypoint (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/980351 (owner: JMeybohm)
[11:10:02] <brouberol>	 yes vgutierrez
[11:10:41] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] jobqueue: migrate a moderately weighty job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979396 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[11:10:58] <vgutierrez>	 brouberol: done
[11:11:28] <wikibugs>	 (03Merged) 10jenkins-bot: jobqueue: migrate a moderately weighty job [deployment-charts] - 10https://gerrit.wikimedia.org/r/979396 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[11:11:36] <wikibugs>	 (CR) Jelto: [C: -1] "one comment in-line" [puppet] - https://gerrit.wikimedia.org/r/979169 (https://phabricator.wikimedia.org/T348392) (owner: Dzahn)
[11:11:43] <wikibugs>	 (03PS2) 10JMeybohm: CI: Update validate_envoy_config to use entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/980351
[11:12:10] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1023.eqiad.wmnet with reason: host reimage
[11:12:34] <wikibugs>	 (03PS1) 10Peter Fischer: Bump version for cirrus-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/980354
[11:14:49] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] CI: Update validate_envoy_config to use entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/980351 (owner: 10JMeybohm)
[11:15:31] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[11:15:48] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[11:16:13] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[11:16:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T348183)', diff saved to https://phabricator.wikimedia.org/P54158 and previous config saved to /var/cache/conftool/dbconfig/20231205-111625-arnaudb.json
[11:16:29] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[11:16:32] <wikibugs>	 (03PS1) 10Urbanecm: User impact: update quantizeViews to process small series of view data [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979698 (https://phabricator.wikimedia.org/T352349)
[11:16:37] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[11:16:38] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[11:17:03] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[11:17:34] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] Bump version for cirrus-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/980354 (owner: 10Peter Fischer)
[11:20:26] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:20:58] <wikibugs>	 (03CR) 10DCausse: [C: 03+2] Bump version for cirrus-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/980354 (owner: 10Peter Fischer)
[11:21:40] <wikibugs>	 (CR) CI reject: [V: -1] CI: Update validate_envoy_config to use entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/980351 (owner: JMeybohm)
[11:21:47] <wikibugs>	 (03Merged) 10jenkins-bot: Bump version for cirrus-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/980354 (owner: 10Peter Fischer)
[11:23:15] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] codfw1dev: add smart_hosts [puppet] - 10https://gerrit.wikimedia.org/r/979888 (https://phabricator.wikimedia.org/T350008) (owner: 10David Caro)
[11:24:07] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] mw-api-int: increase replicas by 30% [deployment-charts] - 10https://gerrit.wikimedia.org/r/980032 (owner: 10Kamila Součková)
[11:24:45] <wikibugs>	 (03PS1) 10Urbanecm: Growth: Enable Welcome survey user research for ar/en/es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980357 (https://phabricator.wikimedia.org/T351266)
[11:27:56] <wikibugs>	 (03PS3) 10EoghanGaffney: [admin] Add user account for xiaoxiao to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/979389 (https://phabricator.wikimedia.org/T352098)
[11:27:58] <wikibugs>	 (03PS4) 10EoghanGaffney: [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918)
[11:28:00] <wikibugs>	 (03PS1) 10EoghanGaffney: [admin] Add ehughes ldap account [puppet] - 10https://gerrit.wikimedia.org/r/980358 (https://phabricator.wikimedia.org/T351387)
[11:28:09] <Amir1>	 jouncebot: nowandnext
[11:28:09] <jouncebot>	 For the next 0 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1100)
[11:28:09] <jouncebot>	 In 1 hour(s) and 31 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1300)
[11:28:39] <wikibugs>	 (03PS3) 10JMeybohm: CI: Update validate_envoy_config to use entrypoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/980351
[11:29:48] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Bump ParserCache TTL back to 30 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: 10Ladsgroup)
[11:30:01] <wikibugs>	 (03PS1) 10Gmodena: mw-page-content-enrich: version bump. [deployment-charts] - 10https://gerrit.wikimedia.org/r/980359 (https://phabricator.wikimedia.org/T345806)
[11:30:26] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: 10Ladsgroup)
[11:30:26] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1023.eqiad.wmnet with OS bookworm
[11:30:31] <wikibugs>	 (03Merged) 10jenkins-bot: Bump ParserCache TTL back to 30 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: 10Ladsgroup)
[11:30:43] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:979920|Bump ParserCache TTL back to 30 days (T280604)]]
[11:30:47] <stashbot>	 T280604: Post-deployment: (partly) ramp parser cache retention back up  - https://phabricator.wikimedia.org/T280604
[11:31:23] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Move mw api servers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/980040 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková)
[11:31:32] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P54159 and previous config saved to /var/cache/conftool/dbconfig/20231205-113132-arnaudb.json
[11:32:01] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:979920|Bump ParserCache TTL back to 30 days (T280604)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:32:35] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[11:32:44] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[11:33:24] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[11:36:32] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/980039 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková)
[11:37:07] <wikibugs>	 (03PS1) 10Marostegui: Revert "dbproxy1023: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/979699
[11:37:24] <wikibugs>	 (03Abandoned) 10Elukey: ml-services: remove mlstaging ingress settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/979907 (owner: 10Elukey)
[11:38:12] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1023: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/979699 (owner: 10Marostegui)
[11:38:17] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] mw-api-int: increase replicas by 30% [deployment-charts] - 10https://gerrit.wikimedia.org/r/980032 (owner: 10Kamila Součková)
[11:38:30] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:979920|Bump ParserCache TTL back to 30 days (T280604)]] (duration: 07m 47s)
[11:38:34] <stashbot>	 T280604: Post-deployment: (partly) ramp parser cache retention back up  - https://phabricator.wikimedia.org/T280604
[11:39:13] <wikibugs>	 (03Merged) 10jenkins-bot: mw-api-int: increase replicas by 30% [deployment-charts] - 10https://gerrit.wikimedia.org/r/980032 (owner: 10Kamila Součková)
[11:40:17] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[11:40:18] <wikibugs>	 (03Abandoned) 10Effie Mouzeli: scaffold: minor fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/980352 (owner: 10Effie Mouzeli)
[11:40:30] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[11:40:41] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[11:40:54] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[11:42:20] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] mobileapps: 45% to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/976221 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto)
[11:42:22] <wikibugs>	 (03PS1) 10Peter Fischer: Add page-rerender-stream config to cirrus-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/980360
[11:44:36] <wikibugs>	 SRE, SRE-Access-Requests, Structured-Data-Backlog, UploadWizard: Access request to deleted image files in the production Swift cluster - https://phabricator.wikimedia.org/T350020 (jcrespo) a:jcrespo→None
[11:44:56] <wikibugs>	 (03PS1) 10Jcrespo: Prepare for 0.3.0 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/980362 (https://phabricator.wikimedia.org/T352655)
[11:45:38] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Implement batch deletion, restoration and query of files [software/mediabackups] - 10https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655) (owner: 10Jcrespo)
[11:45:45] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Prepare for 0.3.0 release [software/mediabackups] - 10https://gerrit.wikimedia.org/r/980362 (https://phabricator.wikimedia.org/T352655) (owner: 10Jcrespo)
[11:46:39] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134', diff saved to https://phabricator.wikimedia.org/P54160 and previous config saved to /var/cache/conftool/dbconfig/20231205-114638-arnaudb.json
[11:48:44] <wikibugs>	 (CR) Clément Goubert: mcrouter: add chart (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[11:50:14] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:netbox switch Netbox-Next to use plain OIDC [puppet] - 10https://gerrit.wikimedia.org/r/978479 (https://phabricator.wikimedia.org/T351950) (owner: 10Slyngshede)
[11:50:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4049.ulsfo.wmnet
[11:51:09] <wikibugs>	 (CR) Kamila Součková: [C: +2] Move mw api servers to kubernetes workers [puppet] - https://gerrit.wikimedia.org/r/980039 (https://phabricator.wikimedia.org/T351074) (owner: Kamila Součková)
[11:51:19] <wikibugs>	 (CR) Kamila Součková: [C: +2] Move mw api servers to kubernetes workers [homer/public] - https://gerrit.wikimedia.org/r/980040 (https://phabricator.wikimedia.org/T351074) (owner: Kamila Součková)
[11:51:36] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply
[11:51:39] <wikibugs>	 (CR) DCausse: [C: +2] Add page-rerender-stream config to cirrus-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/980360 (owner: Peter Fischer)
[11:51:55] <logmsgbot>	 !log kamila@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[11:51:58] <wikibugs>	 (Merged) jenkins-bot: Move mw api servers to kubernetes workers [homer/public] - https://gerrit.wikimedia.org/r/980040 (https://phabricator.wikimedia.org/T351074) (owner: Kamila Součková)
[11:52:15] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[11:52:24] <wikibugs>	 (Merged) jenkins-bot: Add page-rerender-stream config to cirrus-streaming-updater [deployment-charts] - https://gerrit.wikimedia.org/r/980360 (owner: Peter Fischer)
[11:52:40] <wikibugs>	 (PS1) Muehlenhoff: Switch cp4049 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980364 (https://phabricator.wikimedia.org/T349619)
[11:53:22] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[11:54:45] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch cp4049 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980364 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[12:01:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4049.ulsfo.wmnet
[12:01:45] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1134 (T348183)', diff saved to https://phabricator.wikimedia.org/P54161 and previous config saved to /var/cache/conftool/dbconfig/20231205-120145-arnaudb.json
[12:01:47] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[12:01:50] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[12:02:00] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1135.eqiad.wmnet with reason: Maintenance
[12:02:07] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1135 (T348183)', diff saved to https://phabricator.wikimedia.org/P54162 and previous config saved to /var/cache/conftool/dbconfig/20231205-120206-arnaudb.json
[12:02:17] <wikibugs>	 (PS8) Vgutierrez: traffic: Alert on configured and observed MSS mismatch [alerts] - https://gerrit.wikimedia.org/r/980280 (https://phabricator.wikimedia.org/T351069)
[12:02:45] <wikibugs>	 SRE, Traffic, Patch-For-Review: purged issues while kafka brokers are restarted - https://phabricator.wikimedia.org/T334078 (CodeReviewBot) fabfur merged https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/1  Basic retry mechanism for specific kafka errors
[12:04:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4039.ulsfo.wmnet
[12:05:32] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes2060 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:06:49] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw2423.codfw.wmnet with OS bullseye
[12:07:00] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:07:04] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[12:07:15] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[12:09:11] <wikibugs>	 (PS1) Muehlenhoff: Switch cp4039 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980365 (https://phabricator.wikimedia.org/T349619)
[12:10:32] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw2424.codfw.wmnet with OS bullseye
[12:10:44] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (cmooney)
[12:11:05] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (cmooney) p:Triage→Medium
[12:11:22] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T348183)', diff saved to https://phabricator.wikimedia.org/P54163 and previous config saved to /var/cache/conftool/dbconfig/20231205-121121-arnaudb.json
[12:11:25] <wikibugs>	 (CR) JMeybohm: [C: +2] CI: Update validate_envoy_config to use entrypoint (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/980351 (owner: JMeybohm)
[12:11:28] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[12:13:12] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch cp4039 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980365 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[12:16:17] <wikibugs>	 (PS1) Kosta Harlan: Add maintenance script to import existing files to scan table [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979700 (https://phabricator.wikimedia.org/T350863)
[12:16:28] <wikibugs>	 (PS1) Kosta Harlan: Only allow drawing and bitmap media types to be scanned [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979701 (https://phabricator.wikimedia.org/T352234)
[12:16:51] <wikibugs>	 (PS4) Brouberol: Include the profile::lvs::realserver profile on the dse-k8s-roles [puppet] - https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639)
[12:17:02] <wikibugs>	 (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639) (owner: Brouberol)
[12:18:02] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw2434.codfw.wmnet with OS bullseye
[12:18:28] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4039.ulsfo.wmnet
[12:19:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[12:20:39] <wikibugs>	 (CR) Brouberol: [C: +2] Include the profile::lvs::realserver profile on the dse-k8s-roles [puppet] - https://gerrit.wikimedia.org/r/980347 (https://phabricator.wikimedia.org/T352639) (owner: Brouberol)
[12:21:14] <wikibugs>	 (Merged) jenkins-bot: CI: Update validate_envoy_config to use entrypoint [deployment-charts] - https://gerrit.wikimedia.org/r/980351 (owner: JMeybohm)
[12:22:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4051.ulsfo.wmnet
[12:23:23] <wikibugs>	 (PS1) Muehlenhoff: Switch cp4051 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980367 (https://phabricator.wikimedia.org/T349619)
[12:23:50] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2423.codfw.wmnet with reason: host reimage
[12:24:38] <moritzm>	 !log installing unbound bugfix updates from Bookworm point release
[12:24:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:24:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[12:25:23] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw2435.codfw.wmnet with OS bullseye
[12:26:19] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[12:26:20] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch cp4051 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980367 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[12:26:28] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P54164 and previous config saved to /var/cache/conftool/dbconfig/20231205-122628-arnaudb.json
[12:26:44] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1039 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:26:45] <wikibugs>	 (PS1) Brouberol: Switch the k8s-ingress-dse LVS service in lvs_setup state [puppet] - https://gerrit.wikimedia.org/r/980368 (https://phabricator.wikimedia.org/T352639)
[12:26:51] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2423.codfw.wmnet with reason: host reimage
[12:27:03] <logmsgbot>	 !log kamila@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[12:27:21] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (MoritzMuehlenhoff)
[12:27:47] <wikibugs>	 (PS2) Brouberol: Switch the k8s-ingress-dse LVS service in lvs_setup state [puppet] - https://gerrit.wikimedia.org/r/980368 (https://phabricator.wikimedia.org/T352639)
[12:28:24] <wikibugs>	 (PS1) Hnowlan: mw-jobrunner: bump replicas further [deployment-charts] - https://gerrit.wikimedia.org/r/980369 (https://phabricator.wikimedia.org/T349796)
[12:28:33] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2424.codfw.wmnet with reason: host reimage
[12:28:34] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw1463.eqiad.wmnet with OS bullseye
[12:29:20] <wikibugs>	 (CR) Clément Goubert: [C: +1] mw-jobrunner: bump replicas further [deployment-charts] - https://gerrit.wikimedia.org/r/980369 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[12:30:15] <wikibugs>	 (CR) Btullis: [C: +1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/980368 (https://phabricator.wikimedia.org/T352639) (owner: Brouberol)
[12:30:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4051.ulsfo.wmnet
[12:30:56] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:30:56] <icinga-wm>	 RECOVERY - Check systemd state on mw2261 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:31:55] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2424.codfw.wmnet with reason: host reimage
[12:32:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4042.ulsfo.wmnet
[12:32:47] <jinxer-wm>	 (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got worse   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[12:32:59] <wikibugs>	 (PS1) Ladsgroup: Set migration of pagelinks on large wikis of s5 to read new [mediawiki-config] - https://gerrit.wikimedia.org/r/980370 (https://phabricator.wikimedia.org/T351237)
[12:33:35] <Amir1>	 jouncebot: nowandnext
[12:33:35] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 26 minute(s)
[12:33:35] <jouncebot>	 In 0 hour(s) and 26 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1300)
[12:33:43] <wikibugs>	 (CR) Ladsgroup: [C: +2] Set migration of pagelinks on large wikis of s5 to read new [mediawiki-config] - https://gerrit.wikimedia.org/r/980370 (https://phabricator.wikimedia.org/T351237) (owner: Ladsgroup)
[12:34:00] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/980370 (https://phabricator.wikimedia.org/T351237) (owner: Ladsgroup)
[12:34:26] <wikibugs>	 (Merged) jenkins-bot: Set migration of pagelinks on large wikis of s5 to read new [mediawiki-config] - https://gerrit.wikimedia.org/r/980370 (https://phabricator.wikimedia.org/T351237) (owner: Ladsgroup)
[12:34:42] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:980370|Set migration of pagelinks on large wikis of s5 to read new (T351237)]]
[12:34:50] <stashbot>	 T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[12:36:15] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2434.codfw.wmnet with reason: host reimage
[12:37:13] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:980370|Set migration of pagelinks on large wikis of s5 to read new (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:37:47] <jinxer-wm>	 (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[12:38:45] <wikibugs>	 (PS1) Muehlenhoff: Switch cp4042 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980371 (https://phabricator.wikimedia.org/T349619)
[12:39:41] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2434.codfw.wmnet with reason: host reimage
[12:40:16] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[12:40:33] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch cp4042 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980371 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[12:41:09] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1463.eqiad.wmnet with reason: host reimage
[12:41:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135', diff saved to https://phabricator.wikimedia.org/P54165 and previous config saved to /var/cache/conftool/dbconfig/20231205-124134-arnaudb.json
[12:42:34] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2435.codfw.wmnet with reason: host reimage
[12:44:30] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1463.eqiad.wmnet with reason: host reimage
[12:45:02] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[12:45:55] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2423.codfw.wmnet with OS bullseye
[12:45:56] <logmsgbot>	 !log kamila@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[12:47:13] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:980370|Set migration of pagelinks on large wikis of s5 to read new (T351237)]] (duration: 12m 30s)
[12:47:16] <stashbot>	 T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237
[12:47:30] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2435.codfw.wmnet with reason: host reimage
[12:50:17] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1076.eqiad.wmnet with OS bullseye
[12:50:24] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1076.eqiad.wmnet with OS bullseye
[12:50:34] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2424.codfw.wmnet with OS bullseye
[12:52:47] <jinxer-wm>	 (Traffic bill over quota) firing: (4) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[12:53:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4042.ulsfo.wmnet
[12:54:28] <wikibugs>	 (PS1) Btullis: Bring an-coord1004 into service as a hadoop coordinator [puppet] - https://gerrit.wikimedia.org/r/980372 (https://phabricator.wikimedia.org/T336045)
[12:55:29] <wikibugs>	 (PS2) Btullis: Bring an-coord1004 into service as a hadoop coordinator [puppet] - https://gerrit.wikimedia.org/r/980372 (https://phabricator.wikimedia.org/T336045)
[12:56:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1135 (T348183)', diff saved to https://phabricator.wikimedia.org/P54167 and previous config saved to /var/cache/conftool/dbconfig/20231205-125641-arnaudb.json
[12:56:43] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:56:45] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[12:56:58] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[12:57:47] <jinxer-wm>	 (Traffic bill over quota) resolved: (3) Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[12:57:54] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: redis::misc::slave
[12:58:23] <wikibugs>	 (CR) Jgiannelos: [C: +2] wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/978665 (owner: PipelineBot)
[12:58:31] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2434.codfw.wmnet with OS bullseye
[12:59:18] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.dns.netbox
[12:59:32] <wikibugs>	 (Merged) jenkins-bot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/978665 (owner: PipelineBot)
[12:59:37] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/823/con" [puppet] - https://gerrit.wikimedia.org/r/980372 (https://phabricator.wikimedia.org/T336045) (owner: Btullis)
[13:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1300)
[13:00:40] <icinga-wm>	 PROBLEM - Host sretest2003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:00:52] <wikibugs>	 (PS1) Jbond: installserver: add spec test for role [puppet] - https://gerrit.wikimedia.org/r/980375
[13:00:54] <wikibugs>	 (PS1) Jbond: installserver: test spec test fires [puppet] - https://gerrit.wikimedia.org/r/980376
[13:02:39] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1463.eqiad.wmnet with OS bullseye
[13:03:37] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[13:04:08] <logmsgbot>	 !log cmooney@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update entry for sretest2003. - cmooney@cumin2002"
[13:04:12] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[13:04:16] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1076.eqiad.wmnet with reason: host reimage
[13:04:34] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[13:04:49] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[13:04:55] <logmsgbot>	 !log cmooney@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update entry for sretest2003. - cmooney@cumin2002"
[13:04:55] <logmsgbot>	 !log cmooney@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:05:03] <wikibugs>	 (CR) Jbond: "instead if this i think you just need to create a spec file in the modules/role/spec/classes[1] or move the yaml file so that it is in mod" [puppet] - https://gerrit.wikimedia.org/r/979119 (owner: Brouberol)
[13:05:55] <wikibugs>	 (PS1) EoghanGaffney: [gitlab] Update BroadcastMessage class to new namespace [puppet] - https://gerrit.wikimedia.org/r/980377 (https://phabricator.wikimedia.org/T352433)
[13:06:06] <wikibugs>	 (CR) Jbond: "not necessarily against this but just wanted to point out we can do this in rspec (copying below the same comment i added to https://gerri" [puppet] - https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604) (owner: JHathaway)
[13:06:07] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw1464.eqiad.wmnet with OS bullseye
[13:06:34] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2435.codfw.wmnet with OS bullseye
[13:07:35] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1076.eqiad.wmnet with reason: host reimage
[13:07:41] <wikibugs>	 (PS1) Muehlenhoff: Switch redis::misc::slave to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980385 (https://phabricator.wikimedia.org/T349619)
[13:07:42] <icinga-wm>	 RECOVERY - Host sretest2003 is UP: PING OK - Packet loss = 0%, RTA = 51.12 ms
[13:08:53] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw1465.eqiad.wmnet with OS bullseye
[13:08:58] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:09:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1079.eqiad.wmnet with OS bullseye
[13:09:54] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1079.eqiad.wmnet with OS bullseye
[13:10:11] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch redis::misc::slave to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980385 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[13:10:36] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1078.eqiad.wmnet with OS bullseye
[13:10:46] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1078.eqiad.wmnet with OS bullseye
[13:11:00] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:12:20] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[13:12:34] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance
[13:12:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1163 (T348183)', diff saved to https://phabricator.wikimedia.org/P54168 and previous config saved to /var/cache/conftool/dbconfig/20231205-131240-arnaudb.json
[13:12:56] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[13:14:11] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw1470.eqiad.wmnet with OS bullseye
[13:14:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: redis::misc::slave
[13:16:30] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[13:16:40] <wikibugs>	 SRE, SRE-tools, Infrastructure-Foundations, Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[13:18:34] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1464.eqiad.wmnet with reason: host reimage
[13:21:28] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1464.eqiad.wmnet with reason: host reimage
[13:21:34] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1465.eqiad.wmnet with reason: host reimage
[13:22:01] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T348183)', diff saved to https://phabricator.wikimedia.org/P54169 and previous config saved to /var/cache/conftool/dbconfig/20231205-132200-arnaudb.json
[13:22:15] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[13:23:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1079.eqiad.wmnet with reason: host reimage
[13:24:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1078.eqiad.wmnet with reason: host reimage
[13:24:44] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on ms-be1079.eqiad.wmnet with reason: host reimage
[13:24:59] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1465.eqiad.wmnet with reason: host reimage
[13:26:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[13:26:42] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[13:26:45] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1470.eqiad.wmnet with reason: host reimage
[13:27:07] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[13:27:12] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1076.eqiad.wmnet with OS bullseye
[13:27:19] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1076.eqiad.wmnet with OS bullseye completed: - ms-be...
[13:27:46] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1078.eqiad.wmnet with reason: host reimage
[13:30:39] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1470.eqiad.wmnet with reason: host reimage
[13:32:34] <wikibugs>	 SRE, ops-codfw, Infrastructure-Foundations, Traffic, netops: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (cmooney)
[13:34:12] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1079 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:36:50] <icinga-wm>	 PROBLEM - Host ms-be1079 is DOWN: PING CRITICAL - Packet loss = 100%
[13:37:07] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P54171 and previous config saved to /var/cache/conftool/dbconfig/20231205-133706-arnaudb.json
[13:38:16] <icinga-wm>	 RECOVERY - Host ms-be1079 is UP: PING OK - Packet loss = 0%, RTA = 5.51 ms
[13:38:28] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1079 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:38:56] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1464.eqiad.wmnet with OS bullseye
[13:41:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4048.ulsfo.wmnet
[13:43:08] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:43:42] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1465.eqiad.wmnet with OS bullseye
[13:44:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[13:44:12] <wikibugs>	 (PS1) Muehlenhoff: Switch cp4048 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980386 (https://phabricator.wikimedia.org/T349619)
[13:44:53] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch cp4048 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/980386 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[13:48:03] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1470.eqiad.wmnet with OS bullseye
[13:48:07] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[13:48:12] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1079.eqiad.wmnet with OS bullseye
[13:48:17] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1079.eqiad.wmnet with OS bullseye completed: - ms-be...
[13:48:52] <wikibugs>	 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder)
[13:48:58] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[13:50:17] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1001"
[13:50:22] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1078.eqiad.wmnet with OS bullseye
[13:50:26] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:50:28] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host ms-be1078.eqiad.wmnet with OS bullseye completed: - ms-be...
[13:51:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4048.ulsfo.wmnet
[13:52:14] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P54172 and previous config saved to /var/cache/conftool/dbconfig/20231205-135213-arnaudb.json
[13:53:46] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm, thanks for the quick fix!" [puppet] - 10https://gerrit.wikimedia.org/r/980377 (https://phabricator.wikimedia.org/T352433) (owner: 10EoghanGaffney)
[13:54:29] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] [gitlab] Update BroadcastMessage class to new namespace [puppet] - 10https://gerrit.wikimedia.org/r/980377 (https://phabricator.wikimedia.org/T352433) (owner: 10EoghanGaffney)
[13:55:22] <wikibugs>	 (03CR) 10D3r1ck01: "Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[13:55:47] <wikibugs>	 (03PS1) 10Elukey: services: upgrade recommendation-api's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/980391 (https://phabricator.wikimedia.org/T349118)
[13:58:01] <wikibugs>	 (03PS1) 10Filippo Giunchedi: pontoon: enable remote syslog in o11y [puppet] - 10https://gerrit.wikimedia.org/r/980392
[13:59:00] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] "LGTM! Is there a runbook for deploying this and knowing for sure the service is still working? If so, happy to deploy it for you." [deployment-charts] - 10https://gerrit.wikimedia.org/r/980391 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey)
[14:00:08] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1400).
[14:00:08] <jouncebot>	 Urbanecm and kostajh: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:36] <urbanecm>	 I can deploy (unless kostajh wants to lead the window?)
[14:00:39] <wikibugs>	 (03CR) 10Elukey: services: upgrade recommendation-api's Docker image (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980391 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey)
[14:00:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: enable remote syslog in o11y [puppet] - 10https://gerrit.wikimedia.org/r/980392 (owner: 10Filippo Giunchedi)
[14:01:02] <kostajh>	 hi
[14:01:04] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:01:09] <kostajh>	 urbanecm: would be happy for you to deploy them
[14:01:10] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] User impact: update quantizeViews to process small series of view data [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979698 (https://phabricator.wikimedia.org/T352349) (owner: 10Urbanecm)
[14:01:11] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] services: upgrade recommendation-api's Docker image (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980391 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey)
[14:01:22] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string http://www.w3.org/2001/XML... not found on https://query.wikidata.org:443/bigdata/namespace/wdq/sparql?query=SELECT%20*%20WHERE%20%7Bwikibase%3ADump%20schema%3AdateModified%20%3Fy%7D%20LIMIT%201 - 428 bytes in 0.046 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:01:24] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Bring an-coord1004 into service as a hadoop coordinator [puppet] - 10https://gerrit.wikimedia.org/r/980372 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis)
[14:01:28] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:01:32] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1014.eqiad.wmnet, wdqs1015.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1007.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs10
[14:01:32] <icinga-wm>	 .wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:01:52] <urbanecm>	 kostajh: i am a bit confused by your message. do you want me to deploy, or do you want to?
[14:02:00] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on mw1471:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[14:02:07] <kostajh>	 urbanecm: if you can deploy, that would be great
[14:02:11] <urbanecm>	 sure, no problem
[14:02:18] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add maintenance script to import existing files to scan table [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979700 (https://phabricator.wikimedia.org/T350863) (owner: 10Kosta Harlan)
[14:02:20] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Only allow drawing and bitmap media types to be scanned [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979701 (https://phabricator.wikimedia.org/T352234) (owner: 10Kosta Harlan)
[14:02:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs1015.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs1015.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1013.eqiad.wmnet, wdqs1014.eqiad.wmnet, wdqs1007.eqiad.wmnet, wdqs1012.eqiad.wmnet are marked down but pooled: wdqs_80: Servers wdqs1014.eqiad.wmnet, wdqs10
[14:02:28] <icinga-wm>	 .wmnet, wdqs1007.eqiad.wmnet, wdqs1006.eqiad.wmnet, wdqs1013.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:03:00] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] services: upgrade recommendation-api's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/980391 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey)
[14:03:00] <urbanecm>	 i assume you'll handle running the script after the deployment, right?
[14:03:09] <wikibugs>	 (03PS2) 10Urbanecm: Growth: Enable Welcome survey user research for ar/en/es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980357 (https://phabricator.wikimedia.org/T351266)
[14:03:14] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Growth: Enable Welcome survey user research for ar/en/es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980357 (https://phabricator.wikimedia.org/T351266) (owner: 10Urbanecm)
[14:03:28] <kostajh>	 urbanecm: not immediately, probably tomorrow
[14:03:29] <moritzm>	 !log installing cups security updates
[14:03:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980357 (https://phabricator.wikimedia.org/T351266) (owner: 10Urbanecm)
[14:03:48] <urbanecm>	 fine with me, just double checking i'm not supposed to run it :)
[14:03:55] <wikibugs>	 (03Merged) 10jenkins-bot: Growth: Enable Welcome survey user research for ar/en/es [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980357 (https://phabricator.wikimedia.org/T351266) (owner: 10Urbanecm)
[14:04:10] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:980357|Growth: Enable Welcome survey user research for ar/en/es (T351266)]]
[14:04:12] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:04:16] <stashbot>	 T351266: enable the T342353 checkbox on the Welcome Survey allowing new account holders to consent to being contacted for design research - https://phabricator.wikimedia.org/T351266
[14:04:30] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:04:40] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:05:48] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.137 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:05:59] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: sync
[14:06:14] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: sync
[14:06:27] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:980357|Growth: Enable Welcome survey user research for ar/en/es (T351266)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:07:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T348183)', diff saved to https://phabricator.wikimedia.org/P54173 and previous config saved to /var/cache/conftool/dbconfig/20231205-140720-arnaudb.json
[14:07:22] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[14:07:24] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[14:07:24] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[14:07:36] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1169.eqiad.wmnet with reason: Maintenance
[14:07:43] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1169 (T348183)', diff saved to https://phabricator.wikimedia.org/P54174 and previous config saved to /var/cache/conftool/dbconfig/20231205-140742-arnaudb.json
[14:08:28] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1015 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 0.084 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:09:14] <wikibugs>	 10SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T352731 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact.
[14:10:20] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:10:22] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 692 bytes in 0.358 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:11:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:12:44] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1014 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:13:43] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:980357|Growth: Enable Welcome survey user research for ar/en/es (T351266)]] (duration: 09m 33s)
[14:13:47] <stashbot>	 T351266: enable the T342353 checkbox on the Welcome Survey allowing new account holders to consent to being contacted for design research - https://phabricator.wikimedia.org/T351266
[14:14:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979698 (https://phabricator.wikimedia.org/T352349) (owner: 10Urbanecm)
[14:14:21] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979700 (https://phabricator.wikimedia.org/T350863) (owner: 10Kosta Harlan)
[14:14:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979701 (https://phabricator.wikimedia.org/T352234) (owner: 10Kosta Harlan)
[14:14:27] <urbanecm>	 Growth config change done. waiting on the backports to merge
[14:14:36] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.072 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:15:12] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:15:43] <wikibugs>	 10SRE, 10serviceops, 10API Platform (RESTbase Deprecation Roadmap): Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Jdforrester-WMF)
[14:15:56] <wikibugs>	 10SRE, 10ChangeProp, 10EventStreams, 10Image-Suggestion-API, and 4 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF)
[14:15:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:16:04] <wikibugs>	 (03PS1) 10Slyngshede: Package for Debian [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397
[14:16:08] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1013 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.156 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[14:17:01] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T348183)', diff saved to https://phabricator.wikimedia.org/P54175 and previous config saved to /var/cache/conftool/dbconfig/20231205-141701-arnaudb.json
[14:17:06] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[14:17:19] <wikibugs>	 (03PS2) 10Slyngshede: Package for Debian [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397
[14:18:11] <wikibugs>	 (03PS3) 10Slyngshede: Package for Debian [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397
[14:19:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[14:20:27] <wikibugs>	 (03CR) 10Volans: "Wouldn't be easier to have a debian branch for the debian/ directory like we do for some of our projects?" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 (owner: 10Slyngshede)
[14:20:48] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:21:02] <jinxer-wm>	 (ConfdResourceFailed) firing: (4) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[14:21:24] <wikibugs>	 (03CR) 10Muehlenhoff: Package for Debian (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 (owner: 10Slyngshede)
[14:21:37] <urbanecm>	 almost there...
[14:22:49] <wikibugs>	 (03PS1) 10Jgiannelos: wikifeeds: Configure http client to use service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/980398
[14:22:52] <icinga-wm>	 PROBLEM - cassandra-c service on restbase2028 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:23:28] <wikibugs>	 (03Merged) 10jenkins-bot: User impact: update quantizeViews to process small series of view data [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979698 (https://phabricator.wikimedia.org/T352349) (owner: 10Urbanecm)
[14:23:28] <wikibugs>	 (03Merged) 10jenkins-bot: Add maintenance script to import existing files to scan table [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979700 (https://phabricator.wikimedia.org/T350863) (owner: 10Kosta Harlan)
[14:23:34] <urbanecm>	 here we go :)
[14:23:37] <wikibugs>	 (03Merged) 10jenkins-bot: Only allow drawing and bitmap media types to be scanned [extensions/MediaModeration] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979701 (https://phabricator.wikimedia.org/T352234) (owner: 10Kosta Harlan)
[14:23:51] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:979698|User impact: update quantizeViews to process small series of view data (T352349)]], [[gerrit:979700|Add maintenance script to import existing files to scan table (T350863)]], [[gerrit:979701|Only allow drawing and bitmap media types to be scanned (T352234)]]
[14:23:57] <stashbot>	 T352349: Impact Module: Views on articles you've edited graph - https://phabricator.wikimedia.org/T352349
[14:23:57] <stashbot>	 T350863: Create maintenance script to import all existing images to mediamoderation_scan table - https://phabricator.wikimedia.org/T350863
[14:23:58] <stashbot>	 T352234: MediaModerationFileProcessor::canScanFile incorrectly registers audio and video files as scannable when TimedMediaHandler extension is installed - https://phabricator.wikimedia.org/T352234
[14:24:30] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host ceph2002.mgmt.codfw.wmnet with reboot policy FORCED
[14:25:08] <logmsgbot>	 !log urbanecm@deploy2002 kharlan and urbanecm: Backport for [[gerrit:979698|User impact: update quantizeViews to process small series of view data (T352349)]], [[gerrit:979700|Add maintenance script to import existing files to scan table (T350863)]], [[gerrit:979701|Only allow drawing and bitmap media types to be scanned (T352234)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:25:27] <urbanecm>	 kostajh: fyi ^^, not sure if the other one is testable.
[14:25:52] <kostajh>	 urbanecm: not testable, really, until we run them for real
[14:25:58] <urbanecm>	 makes sense.
[14:25:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[14:25:58] <kostajh>	 so just syncing would be great, thank you.
[14:26:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ceph2002.mgmt.codfw.wmnet with reboot policy FORCED
[14:26:03] <urbanecm>	 Growth's fix works, so syncing
[14:26:05] <logmsgbot>	 !log urbanecm@deploy2002 kharlan and urbanecm: Continuing with sync
[14:27:22] <wikibugs>	 (03CR) 10Volans: Package for Debian (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 (owner: 10Slyngshede)
[14:27:46] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ceph2002.mgmt.codfw.wmnet with reboot policy FORCED
[14:29:37] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ceph2002']
[14:30:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: redis::misc::master
[14:30:54] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm)
[14:30:59] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) fixed ceph2002
[14:31:57] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch redis::misc::master to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980400 (https://phabricator.wikimedia.org/T349619)
[14:32:08] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P54176 and previous config saved to /var/cache/conftool/dbconfig/20231205-143207-arnaudb.json
[14:32:47] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:979698|User impact: update quantizeViews to process small series of view data (T352349)]], [[gerrit:979700|Add maintenance script to import existing files to scan table (T350863)]], [[gerrit:979701|Only allow drawing and bitmap media types to be scanned (T352234)]] (duration: 08m 55s)
[14:32:52] <urbanecm>	 and done
[14:32:55] <urbanecm>	 anything else? :)
[14:32:59] <kostajh>	 thanks!
[14:33:02] <stashbot>	 T352349: Impact Module: Views on articles you've edited graph - https://phabricator.wikimedia.org/T352349
[14:33:03] <stashbot>	 T350863: Create maintenance script to import all existing images to mediamoderation_scan table - https://phabricator.wikimedia.org/T350863
[14:33:03] <stashbot>	 T352234: MediaModerationFileProcessor::canScanFile incorrectly registers audio and video files as scannable when TimedMediaHandler extension is installed - https://phabricator.wikimedia.org/T352234
[14:33:12] <urbanecm>	 np
[14:34:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch redis::misc::master to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980400 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[14:34:58] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Jhancock.wm)
[14:35:10] <wikibugs>	 (03PS1) 10Brouberol: Add discovery records for the k8s-ingress-dse LVS service [dns] - 10https://gerrit.wikimedia.org/r/980404 (https://phabricator.wikimedia.org/T352639)
[14:35:17] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[14:35:20] <wikibugs>	 (03CR) 10Jelto: [C: 04-1] "looks mostly good, but some typo and nitpick comments in-line 😊" [puppet] - 10https://gerrit.wikimedia.org/r/979912 (owner: 10EoghanGaffney)
[14:36:43] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ceph2002']
[14:38:46] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:39:00] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: redis::misc::master
[14:39:05] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:40:09] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[14:40:11] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:40:26] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[14:40:33] <wikibugs>	 (03PS1) 10Elukey: services: use 127.0.0.1 instead of localhost for rec-api's mw host [deployment-charts] - 10https://gerrit.wikimedia.org/r/980407 (https://phabricator.wikimedia.org/T349118)
[14:41:18] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[14:41:32] <logmsgbot>	 !log pfischer@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[14:41:55] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[14:42:37] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] services: use 127.0.0.1 instead of localhost for rec-api's mw host [deployment-charts] - 10https://gerrit.wikimedia.org/r/980407 (https://phabricator.wikimedia.org/T349118) (owner: 10Elukey)
[14:43:59] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sessionstore2004-6 to codfw - jhancock@cumin2002"
[14:44:51] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sessionstore2004-6 to codfw - jhancock@cumin2002"
[14:44:52] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:45:13] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: sync
[14:45:22] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: sync
[14:46:57] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] wikifeeds: Configure http client to use service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/980398 (owner: 10Jgiannelos)
[14:47:14] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169', diff saved to https://phabricator.wikimedia.org/P54177 and previous config saved to /var/cache/conftool/dbconfig/20231205-144714-arnaudb.json
[14:48:33] <wikibugs>	 (03PS1) 10Cathal Mooney: Add new codfw per-rack vlans to lvs2014 and move row B vlans [puppet] - 10https://gerrit.wikimedia.org/r/980409 (https://phabricator.wikimedia.org/T352758)
[14:50:07] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, and 2 others: Move lvs2014 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352758 (10cmooney)
[14:51:58] <wikibugs>	 (03CR) 10Slyngshede: Package for Debian (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 (owner: 10Slyngshede)
[14:52:18] <claime>	 jouncebot: nowandnext
[14:52:18] <jouncebot>	 For the next 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1400)
[14:52:18] <jouncebot>	 In 1 hour(s) and 7 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1600)
[14:52:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4043.ulsfo.wmnet
[14:52:58] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Switch the k8s-ingress-dse LVS service in lvs_setup state [puppet] - 10https://gerrit.wikimedia.org/r/980368 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[14:54:05] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:45] <brouberol>	 !log adding k8s-ingress-dse backend to LVS - T352639
[14:54:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:49] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[14:55:25] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch cp4043 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980410 (https://phabricator.wikimedia.org/T349619)
[14:55:43] <wikibugs>	 (03CR) 10Volans: Package for Debian (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 (owner: 10Slyngshede)
[14:55:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sessionstore2004.mgmt.codfw.wmnet with reboot policy FORCED
[14:55:47] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sessionstore2005.mgmt.codfw.wmnet with reboot policy FORCED
[14:55:49] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sessionstore2006.mgmt.codfw.wmnet with reboot policy FORCED
[14:57:06] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync
[14:57:33] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync
[14:57:56] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: sync
[14:58:05] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw1471.eqiad.wmnet with OS bullseye
[14:58:21] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: sync
[14:58:30] <wikibugs>	 (03PS2) 10Cathal Mooney: Add new codfw per-rack vlans to lvs2014 and move row B vlans [puppet] - 10https://gerrit.wikimedia.org/r/980409 (https://phabricator.wikimedia.org/T352758)
[14:59:39] <wikibugs>	 (03CR) 10Muehlenhoff: Package for Debian (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 (owner: 10Slyngshede)
[14:59:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch cp4043 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980410 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[15:00:21] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:01:21] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on P{lvs[1018,1020].eqiad.wmnet} and A:lvs (T352639)
[15:01:29] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[15:01:35] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:01:54] <claime>	 ^that's me and brouberol 
[15:02:21] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1169 (T348183)', diff saved to https://phabricator.wikimedia.org/P54178 and previous config saved to /var/cache/conftool/dbconfig/20231205-150220-arnaudb.json
[15:02:23] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[15:02:25] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs1020 is CRITICAL: CRITICAL: 129 connections established with conf1007.eqiad.wmnet:4001 (min=130) https://wikitech.wikimedia.org/wiki/PyBal
[15:02:27] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:02:33] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[15:02:37] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1186.eqiad.wmnet with reason: Maintenance
[15:02:44] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1186 (T348183)', diff saved to https://phabricator.wikimedia.org/P54179 and previous config saved to /var/cache/conftool/dbconfig/20231205-150243-arnaudb.json
[15:02:59] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: Configure http client to use service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/980398 (owner: 10Jgiannelos)
[15:03:54] <wikibugs>	 (03Merged) 10jenkins-bot: wikifeeds: Configure http client to use service mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/980398 (owner: 10Jgiannelos)
[15:04:31] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: disable expensive throttling [deployment-charts] - 10https://gerrit.wikimedia.org/r/980413 (https://phabricator.wikimedia.org/T337649)
[15:04:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4043.ulsfo.wmnet
[15:05:21] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Papaul) @BTullis were you able to add those nodes to partman-early-command.sh ?
[15:05:28] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sessionstore2006.mgmt.codfw.wmnet with reboot policy FORCED
[15:06:00] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[15:06:06] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[15:06:32] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10BTullis) >>! In T349934#9383342, @Papaul wrote: > @BTullis were you able to add those nodes to partman-early-command.sh ?  Oh sorry, I missed the ping. I'll add t...
[15:06:56] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sessionstore2004.mgmt.codfw.wmnet with reboot policy FORCED
[15:07:25] <wikibugs>	 (03PS1) 10Jgiannelos: wikifeeds: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/980414
[15:07:27] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Add version flag to purged - https://phabricator.wikimedia.org/T347839 (10CodeReviewBot) fabfur closed https://gitlab.wikimedia.org/repos/sre/purged/-/merge_requests/2  Draft: Add version print option
[15:07:35] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sessionstore2005.mgmt.codfw.wmnet with reboot policy FORCED
[15:07:47] <wikibugs>	 (03PS2) 10Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/980273 (https://phabricator.wikimedia.org/T351069)
[15:08:06] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] wikifeeds: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/980414 (owner: 10Jgiannelos)
[15:08:30] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/980414 (owner: 10Jgiannelos)
[15:09:26] <wikibugs>	 (03Merged) 10jenkins-bot: wikifeeds: Bump chart version [deployment-charts] - 10https://gerrit.wikimedia.org/r/980414 (owner: 10Jgiannelos)
[15:10:31] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1471.eqiad.wmnet with reason: host reimage
[15:11:07] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply
[15:11:14] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] thumbor: disable expensive throttling [deployment-charts] - 10https://gerrit.wikimedia.org/r/980413 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan)
[15:11:57] <logmsgbot>	 !log cgoubert@cumin1001 END (FAIL) - Cookbook sre.loadbalancer.restart-pybal (exit_code=1) rolling-restart of pybal on P{lvs[1018,1020].eqiad.wmnet} and A:lvs (T352639)
[15:12:03] <logmsgbot>	 !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply
[15:12:06] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[15:12:07] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] hiera: Enable IPIP encapsulation on ncredir@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/980273 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[15:12:21] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] thumbor: disable expensive throttling [deployment-charts] - 10https://gerrit.wikimedia.org/r/980413 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan)
[15:12:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T348183)', diff saved to https://phabricator.wikimedia.org/P54180 and previous config saved to /var/cache/conftool/dbconfig/20231205-151255-arnaudb.json
[15:13:05] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[15:13:17] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: disable expensive throttling [deployment-charts] - 10https://gerrit.wikimedia.org/r/980413 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan)
[15:13:36] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1471.eqiad.wmnet with reason: host reimage
[15:14:08] <wikibugs>	 10SRE, 10Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701 (10lmata) p:05Triage→03Low
[15:15:04] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10BTullis) >>! In T349934#9383342, @Papaul wrote: > @BTullis were you able to add those nodes to partman-early-command.sh ?  Oh, I'm so sorry. I've made a mistake w...
[15:15:42] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/thumbor: apply
[15:15:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host aqs2001.codfw.wmnet
[15:15:51] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[15:16:05] <claime>	 !log Manually restarting pybal on lvs1020 - T352639
[15:16:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:17:18] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[15:18:25] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/828/console" [puppet] - 10https://gerrit.wikimedia.org/r/952488 (https://phabricator.wikimedia.org/T321099) (owner: 10Jforrester)
[15:18:43] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[15:18:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch aqs2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980416 (https://phabricator.wikimedia.org/T349619)
[15:20:05] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[15:21:14] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[15:21:16] <wikibugs>	 (03CR) 10JMeybohm: mcrouter: add helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[15:21:36] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch aqs2001 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980416 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[15:22:14] <claime>	 !log Manually restarting pybal on lvs1019 - T352639
[15:22:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:22] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[15:25:05] <wikibugs>	 10SRE, 10Incident Tooling: Bridge wikimediastatus.net to Mastodon - https://phabricator.wikimedia.org/T336701 (10TheresNoTime) Atlassian statuspage has [[ https://support.atlassian.com/statuspage/docs/enable-webhook-notifications/ | webhook support ]].. that might be easier than RSS?
[15:26:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host aqs2001.codfw.wmnet
[15:27:52] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Jhancock.wm) attempted to provision sessionstore2004 on the new lsw switch. needs further attention.
[15:28:02] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P54181 and previous config saved to /var/cache/conftool/dbconfig/20231205-152801-arnaudb.json
[15:28:09] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] wikifunctions: Drop beta monitoring [puppet] - 10https://gerrit.wikimedia.org/r/952488 (https://phabricator.wikimedia.org/T321099) (owner: 10Jforrester)
[15:28:20] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host sessionstore2005.mgmt.codfw.wmnet with reboot policy FORCED
[15:28:35] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Add discovery records for the k8s-ingress-dse LVS service [dns] - 10https://gerrit.wikimedia.org/r/980404 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[15:28:46] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2006']
[15:29:01] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['sessionstore2006']
[15:29:05] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sessionstore2005.mgmt.codfw.wmnet with reboot policy FORCED
[15:29:33] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sessionstore2005']
[15:29:46] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['sessionstore2005']
[15:29:48] <wikibugs>	 (03PS1) 10Brouberol: Fix cluster conftool selector for the k8s-ingress-dse LVS service [puppet] - 10https://gerrit.wikimedia.org/r/980417 (https://phabricator.wikimedia.org/T352639)
[15:29:57] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:30:09] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Fix cluster conftool selector for the k8s-ingress-dse LVS service [puppet] - 10https://gerrit.wikimedia.org/r/980417 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[15:31:17] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Fix cluster conftool selector for the k8s-ingress-dse LVS service [puppet] - 10https://gerrit.wikimedia.org/r/980417 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[15:31:35] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1471.eqiad.wmnet with OS bullseye
[15:31:59] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (10Jhancock.wm)
[15:32:20] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff)
[15:35:47] <jinxer-wm>	 (ConfdResourceFailed) firing: (4) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[15:39:01] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.reimage for host moss-be1002.eqiad.wmnet with OS bookworm
[15:39:07] <wikibugs>	 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm
[15:40:47] <jinxer-wm>	 (ConfdResourceFailed) firing: (4) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[15:41:04] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.2 point update - https://phabricator.wikimedia.org/T348326 (10MoritzMuehlenhoff)
[15:42:31] <claime>	 !log Manually restarting pybal on lvs1020 - T352639
[15:42:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:35] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[15:42:47] <moritzm>	 !log installing monitoring-plugins bugfix updates from Bookworm point release
[15:42:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:52] <wikibugs>	 (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/979118 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu)
[15:43:08] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P54182 and previous config saved to /var/cache/conftool/dbconfig/20231205-154308-arnaudb.json
[15:44:13] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs1020 is OK: OK: 130 connections established with conf1007.eqiad.wmnet:4001 (min=130) https://wikitech.wikimedia.org/wiki/PyBal
[15:44:15] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[15:45:03] <claime>	 Obviously now it's alerting because the backend isn't responding
[15:45:06] <claime>	 joy
[15:45:47] <jinxer-wm>	 (ConfdResourceFailed) resolved: (4) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[15:45:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4040.ulsfo.wmnet
[15:46:33] <sukhe>	 claime: \m/
[15:47:11] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker1006.eqiad.wmnet, dse-k8s-worker1007.eqiad.wmnet, dse-k8s-worker1005.eqiad.wmnet, dse-k8s-worker1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:47:49] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch cp4040 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980420 (https://phabricator.wikimedia.org/T349619)
[15:48:10] <wikibugs>	 (03PS2) 10Aqu: Rewrite metrics sent by Airflow [puppet] - 10https://gerrit.wikimedia.org/r/979118 (https://phabricator.wikimedia.org/T349532)
[15:48:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch cp4040 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980420 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[15:49:44] <claime>	 !log sudo confctl select "service=kubesvc,cluster=dse-k8s" set/pooled=inactive - T352639
[15:49:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:48] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[15:50:33] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] mw-jobrunner: bump replicas further [deployment-charts] - 10https://gerrit.wikimedia.org/r/980369 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[15:50:54] <wikibugs>	 (03PS3) 10Aqu: Rewrite metrics sent by Airflow [puppet] - 10https://gerrit.wikimedia.org/r/979118 (https://phabricator.wikimedia.org/T349532)
[15:51:07] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - k8s-ingress-dse_30443: Servers dse-k8s-worker1006.eqiad.wmnet, dse-k8s-worker1007.eqiad.wmnet, dse-k8s-worker1005.eqiad.wmnet, dse-k8s-worker1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:51:46] <wikibugs>	 (03Merged) 10jenkins-bot: mw-jobrunner: bump replicas further [deployment-charts] - 10https://gerrit.wikimedia.org/r/980369 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[15:52:46] <wikibugs>	 (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/979118 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu)
[15:53:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4040.ulsfo.wmnet
[15:54:12] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, 10netops: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784 (10cmooney)
[15:55:47] <jinxer-wm>	 (ConfdResourceFailed) firing: (2) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[15:56:16] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync
[15:56:30] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync
[15:56:36] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync
[15:56:47] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync
[15:57:55] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[15:58:15] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T348183)', diff saved to https://phabricator.wikimedia.org/P54183 and previous config saved to /var/cache/conftool/dbconfig/20231205-155814-arnaudb.json
[15:58:18] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[15:58:19] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[15:58:33] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance
[15:58:34] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[15:58:52] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[15:58:59] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1196 (T348183)', diff saved to https://phabricator.wikimedia.org/P54184 and previous config saved to /var/cache/conftool/dbconfig/20231205-155858-arnaudb.json
[15:59:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding testhost2001 to codfw - jhancock@cumin2002"
[16:00:04] <jouncebot>	 eoghan, jelto, and arnoldokoth: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1600).
[16:00:11] <wikibugs>	 (03PS4) 10Samtar: .well-known: Add F-Droid signature to assetlinks.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959327 (https://phabricator.wikimedia.org/T346951)
[16:00:27] <wikibugs>	 (03CR) 10Slyngshede: Package for Debian (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/980397 (owner: 10Slyngshede)
[16:00:47] <jinxer-wm>	 (ConfdResourceFailed) resolved: (3) confd resource _srv_config-master_pybal_eqiad_k8s-ingress-dse.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[16:00:51] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding testhost2001 to codfw - jhancock@cumin2002"
[16:00:51] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:01:07] <TheresNoTime>	 jouncebot: nowandnext
[16:01:07] <jouncebot>	 For the next 0 hour(s) and 58 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1600)
[16:01:07] <jouncebot>	 In 0 hour(s) and 58 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1700)
[16:01:51] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host testhost2001.mgmt.codfw.wmnet with reboot policy FORCED
[16:03:22] <wikibugs>	 (03PS1) 10Elukey: Revert "services: upgrade recommendation-api's Docker image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979703
[16:03:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "services: upgrade recommendation-api's Docker image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979703 (owner: 10Elukey)
[16:04:41] <wikibugs>	 (03PS2) 10Elukey: Revert "services: upgrade recommendation-api's Docker image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979703
[16:06:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959327 (https://phabricator.wikimedia.org/T346951) (owner: 10Samtar)
[16:06:26] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] Revert "services: upgrade recommendation-api's Docker image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979703 (owner: 10Elukey)
[16:06:49] <wikibugs>	 (03Merged) 10jenkins-bot: .well-known: Add F-Droid signature to assetlinks.json [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959327 (https://phabricator.wikimedia.org/T346951) (owner: 10Samtar)
[16:07:04] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:959327|.well-known: Add F-Droid signature to assetlinks.json (T346951)]]
[16:07:08] <stashbot>	 T346951: assetlinks.json is missing F-Droid build signature - https://phabricator.wikimedia.org/T346951
[16:07:30] <wikibugs>	 (03PS3) 10JMeybohm: Add new mesh module versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/979094 (https://phabricator.wikimedia.org/T300033)
[16:07:32] <wikibugs>	 (03PS3) 10JMeybohm: Remove cergen certificate support from mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/979095 (https://phabricator.wikimedia.org/T300033)
[16:07:34] <wikibugs>	 (03PS1) 10JMeybohm: function-orchestrator: Update to latest mesh and ingress module [deployment-charts] - 10https://gerrit.wikimedia.org/r/980425 (https://phabricator.wikimedia.org/T300033)
[16:07:43] <wikibugs>	 (03PS1) 10Cathal Mooney: Add new codfw per-rack vlans to lvs2013 and move row B vlans [puppet] - 10https://gerrit.wikimedia.org/r/980426 (https://phabricator.wikimedia.org/T352784)
[16:08:21] <logmsgbot>	 !log samtar@deploy2002 samtar: Backport for [[gerrit:959327|.well-known: Add F-Droid signature to assetlinks.json (T346951)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:08:42] <logmsgbot>	 !log samtar@deploy2002 samtar: Continuing with sync
[16:08:44] <wikibugs>	 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10Traffic, and 2 others: Move lvs2013 link to row A and connect to new row A/B vlans - https://phabricator.wikimedia.org/T352784 (10cmooney)
[16:08:53] <wikibugs>	 (03PS1) 10Ssingh: P:dns::auth: add support for depooling recdns via confd [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054)
[16:09:10] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync
[16:09:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T348183)', diff saved to https://phabricator.wikimedia.org/P54185 and previous config saved to /var/cache/conftool/dbconfig/20231205-160920-arnaudb.json
[16:09:24] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[16:09:35] <logmsgbot>	 !log elukey@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync
[16:09:52] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: sync
[16:10:05] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] jobqueue: migrate a heavyweight job to jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/979397 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[16:10:15] <wikibugs>	 (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh)
[16:10:58] <wikibugs>	 (03PS2) 10Hnowlan: jobqueue: migrate a heavyweight job to jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/979397 (https://phabricator.wikimedia.org/T349796)
[16:11:48] <logmsgbot>	 !log elukey@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: sync
[16:13:00] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] jobqueue: migrate a heavyweight job to jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/979397 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[16:13:26] <wikibugs>	 (03PS1) 10Brouberol: Rollback state of LVS k8s-ingress-dse to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/980428 (https://phabricator.wikimedia.org/T352639)
[16:14:01] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Rollback state of LVS k8s-ingress-dse to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/980428 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[16:14:03] <wikibugs>	 (03Merged) 10jenkins-bot: jobqueue: migrate a heavyweight job to jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/979397 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[16:14:39] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Rollback state of LVS k8s-ingress-dse to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/980428 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[16:14:58] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:959327|.well-known: Add F-Droid signature to assetlinks.json (T346951)]] (duration: 07m 53s)
[16:15:03] <stashbot>	 T346951: assetlinks.json is missing F-Droid build signature - https://phabricator.wikimedia.org/T346951
[16:17:45] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[16:18:00] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Nicely done" [alerts] - 10https://gerrit.wikimedia.org/r/980280 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[16:18:13] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[16:18:20] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[16:18:24] <claime>	 !log Rolling back k8s-ingress-dse - restarting pybal on lvs1020 - T352639
[16:18:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:18:28] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[16:18:29] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[16:18:35] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:18:42] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[16:24:02] <claime>	 !log Rolling back k8s-ingress-dse - restarting pybal on lvs1019 - T352639
[16:24:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:06] <stashbot>	 T352639: Configure ingress to the spark history servers - https://phabricator.wikimedia.org/T352639
[16:24:25] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:24:27] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P54186 and previous config saved to /var/cache/conftool/dbconfig/20231205-162426-arnaudb.json
[16:24:56] <wikibugs>	 (03PS8) 10Bernard Wang: Deploy VectorClientPreferences to beta and pl,fr,ca,fa,tr wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339)
[16:25:49] <wikibugs>	 (03PS9) 10Bernard Wang: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339)
[16:26:23] <wikibugs>	 (03CR) 10Bernard Wang: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: 10Bernard Wang)
[16:34:10] <logmsgbot>	 !log mvernon@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1002.eqiad.wmnet with reason: host reimage
[16:36:55] <icinga-wm>	 RECOVERY - cassandra-a CQL 10.192.16.237:9042 on restbase2028 is OK: TCP OK - 0.032 second response time on 10.192.16.237 port 9042 https://phabricator.wikimedia.org/T93886
[16:37:38] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1002.eqiad.wmnet with reason: host reimage
[16:39:34] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P54187 and previous config saved to /var/cache/conftool/dbconfig/20231205-163933-arnaudb.json
[16:39:44] <wikibugs>	 (03PS1) 10Hnowlan: jobqueue: migrate a larger job (and one smaller one) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980433 (https://phabricator.wikimedia.org/T349796)
[16:42:23] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host testhost2001.mgmt.codfw.wmnet with reboot policy FORCED
[16:42:46] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['testhost2001']
[16:42:55] <wikibugs>	 (03PS1) 10Filippo Giunchedi: rsyslog: update receiver config for version 8.2302 [puppet] - 10https://gerrit.wikimedia.org/r/980434 (https://phabricator.wikimedia.org/T351710)
[16:44:49] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm) I can do this @BTullis. np!
[16:46:57] <wikibugs>	 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Engineering: Q2:rack/setup/install ceph200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T349934 (10Jhancock.wm)
[16:47:20] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye
[16:50:31] <wikibugs>	 (03CR) 10Herron: [C: 03+1] rsyslog: update receiver config for version 8.2302 [puppet] - 10https://gerrit.wikimedia.org/r/980434 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi)
[16:52:09] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bullseye
[16:52:35] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye
[16:54:40] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T348183)', diff saved to https://phabricator.wikimedia.org/P54188 and previous config saved to /var/cache/conftool/dbconfig/20231205-165439-arnaudb.json
[16:54:42] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[16:54:44] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[16:54:57] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance
[16:55:03] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1206 (T348183)', diff saved to https://phabricator.wikimedia.org/P54189 and previous config saved to /var/cache/conftool/dbconfig/20231205-165503-arnaudb.json
[17:00:04] <jouncebot>	 jhathaway and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1700).
[17:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[17:00:36] <logmsgbot>	 !log mvernon@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be1002.eqiad.wmnet with OS bookworm
[17:00:42] <wikibugs>	 10SRE-swift-storage: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1001 for host moss-be1002.eqiad.wmnet with OS bookworm completed: - moss-be1002 (**PASS**)   - Removed from Puppet and Pup...
[17:08:00] <icinga-wm>	 RECOVERY - cassandra-b SSL 10.192.16.238:7000 on restbase2028 is OK: SSL OK - Certificate restbase2028-b valid until 2025-12-03 21:33:01 +0000 (expires in 729 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[17:11:05] <logmsgbot>	 !log sukhe@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4052.ulsfo.wmnet with OS bullseye
[17:12:18] <wikibugs>	 10SRE-swift-storage, 10Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (10MatthewVernon)
[17:13:46] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['testhost2001']
[17:14:18] <wikibugs>	 10SRE-swift-storage, 10Traffic: OpenSSL 3.x performance issues - https://phabricator.wikimedia.org/T352744 (10hnowlan) >>! In T352744#9383998, @MatthewVernon wrote: > I think `ms-*` swift will fall foul of this too, via the wmf-rewrite middleware (which is using python's `urllib.request.build_opener` to talk t...
[17:15:01] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['testhost2001']
[17:15:47] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['testhost2001']
[17:18:30] <wikibugs>	 (03CR) 10JHathaway: apt_repo: validate preseed data with a JSON Schema (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway)
[17:22:42] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/980434 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi)
[17:24:00] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] jobqueue: migrate a larger job (and one smaller one) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980433 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[17:26:08] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] jobqueue: migrate a larger job (and one smaller one) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980433 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[17:26:58] <wikibugs>	 (03Merged) 10jenkins-bot: jobqueue: migrate a larger job (and one smaller one) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980433 (https://phabricator.wikimedia.org/T349796) (owner: 10Hnowlan)
[17:28:59] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[17:29:03] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host cp4052.ulsfo.wmnet with OS bullseye
[17:29:25] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[17:29:26] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[17:29:47] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[17:42:21] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP on text|secondary LVS in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/980274 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[17:45:59] <vgutierrez>	 !log rolling restart of text|secondary LVS on drmrs effectively enabling IPIP encapsulation for ncredir@drmrs - T351069
[17:46:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:46:02] <stashbot>	 T351069: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069
[17:47:15] <wikibugs>	 (03PS1) 10Jdlrobson: [Zebra] Make .vector-column-start cache compatible [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979704 (https://phabricator.wikimedia.org/T347712)
[17:49:25] <logmsgbot>	 !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage
[17:50:01] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez)
[17:52:53] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4052.ulsfo.wmnet with reason: host reimage
[17:55:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T348183)', diff saved to https://phabricator.wikimedia.org/P54190 and previous config saved to /var/cache/conftool/dbconfig/20231205-175526-arnaudb.json
[17:55:30] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[17:59:41] <wikibugs>	 (03PS1) 10Btullis: Update the refinery version used by the refine jobs [puppet] - 10https://gerrit.wikimedia.org/r/980445 (https://phabricator.wikimedia.org/T349121)
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T1800)
[18:00:40] <wikibugs>	 (03PS1) 10Samtar: IS: Set Phonos to Inline Audio Player mode on test.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980446
[18:01:13] <wikibugs>	 (03CR) 10JHathaway: "thanks for the patch @jbond, this didn't work for me with `bundle exec rspec modules/role` even after adding `include profile::installserv" [puppet] - 10https://gerrit.wikimedia.org/r/980375 (owner: 10Jbond)
[18:01:39] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on cp4052 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.128.0.12. Check system logs on 10.128.0.12 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T352795 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[18:02:03] <wikibugs>	 10SRE, 10ops-ulsfo: Degraded RAID on cp4052 - https://phabricator.wikimedia.org/T352795 (10ops-monitoring-bot)
[18:02:23] <sukhe>	 huh
[18:03:42] <wikibugs>	 (03CR) 10Jdlrobson: [C: 04-1] Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: 10Bernard Wang)
[18:05:35] <sukhe>	 cp4052 is being reimaged, we will see once it finishes
[18:05:40] <sukhe>	 (depooled, nothing to worry about)
[18:07:44] <wikibugs>	 (03PS10) 10Bernard Wang: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339)
[18:10:33] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P54191 and previous config saved to /var/cache/conftool/dbconfig/20231205-181032-arnaudb.json
[18:13:17] <logmsgbot>	 !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4052.ulsfo.wmnet with OS bullseye
[18:13:51] <wikibugs>	 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder)
[18:25:39] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P54192 and previous config saved to /var/cache/conftool/dbconfig/20231205-182539-arnaudb.json
[18:40:46] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T348183)', diff saved to https://phabricator.wikimedia.org/P54193 and previous config saved to /var/cache/conftool/dbconfig/20231205-184045-arnaudb.json
[18:40:48] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[18:40:50] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[18:41:02] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance
[18:41:08] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1207 (T348183)', diff saved to https://phabricator.wikimedia.org/P54194 and previous config saved to /var/cache/conftool/dbconfig/20231205-184108-arnaudb.json
[18:41:54] <wikibugs>	 10SRE, 10ops-ulsfo: Degraded RAID on cp4052 - https://phabricator.wikimedia.org/T352795 (10ssingh) 05Open→03Invalid Resolved after cookbook finished.
[18:50:45] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T348183)', diff saved to https://phabricator.wikimedia.org/P54195 and previous config saved to /var/cache/conftool/dbconfig/20231205-185044-arnaudb.json
[18:50:49] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[18:51:33] <wikibugs>	 (03PS2) 10Jbond: installserver: add spec test for role [puppet] - 10https://gerrit.wikimedia.org/r/980375
[18:51:54] <wikibugs>	 (03PS2) 10Jbond: installserver: test spec test fires [puppet] - 10https://gerrit.wikimedia.org/r/980376
[18:54:05] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:56:37] <wikibugs>	 (03CR) 10Jbond: "thanks for taking a look see inline" [puppet] - 10https://gerrit.wikimedia.org/r/980375 (owner: 10Jbond)
[19:00:21] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:01:04] <wikibugs>	 (03CR) 10Dzahn: planet: add ensure parameter allowing to disable update jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979169 (https://phabricator.wikimedia.org/T348392) (owner: 10Dzahn)
[19:05:51] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P54196 and previous config saved to /var/cache/conftool/dbconfig/20231205-190551-arnaudb.json
[19:13:20] <wikibugs>	 (03PS3) 10Dzahn: planet: add ensure parameter allowing to disable update jobs [puppet] - 10https://gerrit.wikimedia.org/r/979169 (https://phabricator.wikimedia.org/T348392)
[19:20:58] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P54197 and previous config saved to /var/cache/conftool/dbconfig/20231205-192057-arnaudb.json
[19:36:04] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T348183)', diff saved to https://phabricator.wikimedia.org/P54198 and previous config saved to /var/cache/conftool/dbconfig/20231205-193604-arnaudb.json
[19:36:07] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[19:36:20] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[19:36:21] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance
[19:36:28] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1218 (T348183)', diff saved to https://phabricator.wikimedia.org/P54199 and previous config saved to /var/cache/conftool/dbconfig/20231205-193627-arnaudb.json
[19:40:49] <wikibugs>	 (03PS4) 10Dzahn: firewall::service: spelling fixes, add missing parameter comments [puppet] - 10https://gerrit.wikimedia.org/r/980022
[19:41:18] <wikibugs>	 (03CR) 10Dzahn: firewall::service: spelling fixes, add missing parameter comments (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/980022 (owner: 10Dzahn)
[19:41:39] <wikibugs>	 (03CR) 10Dzahn: firewall::service: spelling fixes, add missing parameter comments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980022 (owner: 10Dzahn)
[19:46:17] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T348183)', diff saved to https://phabricator.wikimedia.org/P54200 and previous config saved to /var/cache/conftool/dbconfig/20231205-194616-arnaudb.json
[19:46:21] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[19:47:54] <wikibugs>	 (03CR) 10Bking: wdqs: Monitor LDF endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[19:48:00] <wikibugs>	 (03CR) 10Bking: [C: 03+2] wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[19:52:25] <wikibugs>	 (03CR) 10Jbond: apt_repo: validate preseed data with a JSON Schema (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway)
[19:57:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/980022 (owner: 10Dzahn)
[19:58:53] <wikibugs>	 (03PS1) 10Bking: wdqs: Improve regex for ldf check [puppet] - 10https://gerrit.wikimedia.org/r/980460 (https://phabricator.wikimedia.org/T347355)
[20:01:23] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P54201 and previous config saved to /var/cache/conftool/dbconfig/20231205-200123-arnaudb.json
[20:01:57] <wikibugs>	 (03PS2) 10Ryan Kemper: wdqs: Improve regex for ldf check [puppet] - 10https://gerrit.wikimedia.org/r/980460 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[20:07:48] <wikibugs>	 (03PS1) 10Jdrewniak: Fix nonzebra sticky container scrolling behavior and scrollable indicator [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/980467 (https://phabricator.wikimedia.org/T352464)
[20:11:57] <wikibugs>	 (03Abandoned) 10JHathaway: apt_repo: validate preseed data with a JSON Schema [puppet] - 10https://gerrit.wikimedia.org/r/979469 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway)
[20:14:04] <jinxer-wm>	 (ProbeDown) firing: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:14:40] <wikibugs>	 (03CR) 10JHathaway: installserver: add spec test for role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980375 (owner: 10Jbond)
[20:16:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P54202 and previous config saved to /var/cache/conftool/dbconfig/20231205-201629-arnaudb.json
[20:22:51] <wikibugs>	 (03CR) 10Bking: [C: 03+2] wdqs: Improve regex for ldf check [puppet] - 10https://gerrit.wikimedia.org/r/980460 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[20:22:52] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:23:14] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 212, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:23:26] <icinga-wm>	 RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:24:39] <wikibugs>	 (03PS2) 10JHathaway: apt_repo: move hiera data into module, to allow for validation [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604)
[20:27:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] apt_repo: move hiera data into module, to allow for validation [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway)
[20:31:36] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T348183)', diff saved to https://phabricator.wikimedia.org/P54203 and previous config saved to /var/cache/conftool/dbconfig/20231205-203136-arnaudb.json
[20:31:38] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[20:31:41] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[20:31:52] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1219.eqiad.wmnet with reason: Maintenance
[20:31:58] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1219 (T348183)', diff saved to https://phabricator.wikimedia.org/P54204 and previous config saved to /var/cache/conftool/dbconfig/20231205-203158-arnaudb.json
[20:41:48] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T348183)', diff saved to https://phabricator.wikimedia.org/P54205 and previous config saved to /var/cache/conftool/dbconfig/20231205-204147-arnaudb.json
[20:41:50] <wikibugs>	 (03PS1) 10Andrew Bogott: Keystone: add logic to manage credential keys [puppet] - 10https://gerrit.wikimedia.org/r/980465
[20:41:52] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[20:41:52] <wikibugs>	 (03PS1) 10Andrew Bogott: Keystone: turn on credential_key managementin eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/980486
[20:44:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Keystone: add logic to manage credential keys [puppet] - 10https://gerrit.wikimedia.org/r/980465 (owner: 10Andrew Bogott)
[20:45:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Keystone: turn on credential_key managementin eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/980486 (owner: 10Andrew Bogott)
[20:52:23] <wikibugs>	 (03CR) 10Jbond: "i definitely prefer this approach, see comments inline" [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway)
[20:53:50] <inflatador>	 !log bking@prometheus1006 reload prometheus-blackbox service T347355
[20:53:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:53:55] <stashbot>	 T347355: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355
[20:55:40] <wikibugs>	 (03PS2) 10Andrew Bogott: Keystone: add logic to manage credential keys [puppet] - 10https://gerrit.wikimedia.org/r/980465
[20:55:42] <wikibugs>	 (03PS2) 10Andrew Bogott: Keystone: turn on credential_key managementin eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/980486
[20:56:50] <wikibugs>	 (03CR) 10Jbond: apt_repo: move hiera data into module, to allow for validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: 10JHathaway)
[20:56:54] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P54206 and previous config saved to /var/cache/conftool/dbconfig/20231205-205654-arnaudb.json
[20:58:36] <inflatador>	 !log bking@prometheus1006 disable puppet for troubleshooting T347355
[20:58:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time is upon us again. Time for a UTC late backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231205T2100).
[21:00:04] <jouncebot>	 tgr, James_F, and kimberly_sarabia: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:09] * James_F waves.
[21:00:35] <wikibugs>	 (03PS2) 10Jforrester: Revert "Do not try to use Thumbor on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972263 (https://phabricator.wikimedia.org/T344605) (owner: 10Gergő Tisza)
[21:00:38] <tgr>	 o/
[21:00:43] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] Revert "Do not try to use Thumbor on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972263 (https://phabricator.wikimedia.org/T344605) (owner: 10Gergő Tisza)
[21:00:51] <James_F>	 tgr: Yours is easy. :-)
[21:00:59] <wikibugs>	 (03PS2) 10Jforrester: nlwikivoyage: Drop Listings extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980009 (https://phabricator.wikimedia.org/T352696)
[21:01:05] <kimberly_sarabia>	 Hello.
[21:01:10] <wikibugs>	 (03PS2) 10Jforrester: Drop Listings extension from Wikivoyages where unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980047 (https://phabricator.wikimedia.org/T352719)
[21:01:26] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Do not try to use Thumbor on beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/972263 (https://phabricator.wikimedia.org/T344605) (owner: 10Gergő Tisza)
[21:01:33] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980009 (https://phabricator.wikimedia.org/T352696) (owner: 10Jforrester)
[21:01:35] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980047 (https://phabricator.wikimedia.org/T352719) (owner: 10Jforrester)
[21:02:23] <wikibugs>	 (03Merged) 10jenkins-bot: nlwikivoyage: Drop Listings extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980009 (https://phabricator.wikimedia.org/T352696) (owner: 10Jforrester)
[21:02:27] <wikibugs>	 (03Merged) 10jenkins-bot: Drop Listings extension from Wikivoyages where unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980047 (https://phabricator.wikimedia.org/T352719) (owner: 10Jforrester)
[21:02:44] <logmsgbot>	 !log jforrester@deploy2002 Started scap: Backport for [[gerrit:972263|Revert "Do not try to use Thumbor on beta" (T344605)]], [[gerrit:980009|nlwikivoyage: Drop Listings extension (T352696)]], [[gerrit:980047|Drop Listings extension from Wikivoyages where unused (T352719)]]
[21:02:50] <stashbot>	 T344605: deployment-prep needs a Thumbor instance - https://phabricator.wikimedia.org/T344605
[21:02:50] <stashbot>	 T352696: Undeploy Listings extension from Dutch Wikivoyage - https://phabricator.wikimedia.org/T352696
[21:02:51] <stashbot>	 T352719: Undeploy the Listings extension on 7 Wikivoyages on which it's entirely unused - https://phabricator.wikimedia.org/T352719
[21:02:54] <James_F>	 kimberly_sarabia: I'll start the merge of your wmf.7 backports if that's OK?
[21:03:35] <kimberly_sarabia>	 James_F: Thanks
[21:03:48] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] [Zebra] Make .vector-column-start cache compatible [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979704 (https://phabricator.wikimedia.org/T347712) (owner: 10Jdlrobson)
[21:03:56] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] Fix nonzebra sticky container scrolling behavior and scrollable indicator [skins/Vector] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/980467 (https://phabricator.wikimedia.org/T352464) (owner: 10Jdrewniak)
[21:04:11] <logmsgbot>	 !log jforrester@deploy2002 tgr and jforrester: Backport for [[gerrit:972263|Revert "Do not try to use Thumbor on beta" (T344605)]], [[gerrit:980009|nlwikivoyage: Drop Listings extension (T352696)]], [[gerrit:980047|Drop Listings extension from Wikivoyages where unused (T352719)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:04:31] <logmsgbot>	 !log jforrester@deploy2002 tgr and jforrester: Continuing with sync
[21:06:53] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] firewall::service: spelling fixes, add missing parameter comments [puppet] - 10https://gerrit.wikimedia.org/r/980022 (owner: 10Dzahn)
[21:07:42] <wikibugs>	 (03CR) 10Jforrester: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: 10Bernard Wang)
[21:07:56] <wikibugs>	 (03PS11) 10Jforrester: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: 10Bernard Wang)
[21:09:53] <mutante>	 touching the firewall::service class which would be mildly scary but it's only comments
[21:10:34] <James_F>	 mutante: What could possibly go wrong? :-)
[21:11:11] <mutante>	 ;) https://en.wiktionary.org/wiki/jinx#Verb
[21:11:22] <mutante>	 https://en.wiktionary.org/wiki/reverse_jinx#English
[21:11:29] <logmsgbot>	 !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:972263|Revert "Do not try to use Thumbor on beta" (T344605)]], [[gerrit:980009|nlwikivoyage: Drop Listings extension (T352696)]], [[gerrit:980047|Drop Listings extension from Wikivoyages where unused (T352719)]] (duration: 08m 45s)
[21:11:32] * James_F grins.
[21:11:37] <stashbot>	 T344605: deployment-prep needs a Thumbor instance - https://phabricator.wikimedia.org/T344605
[21:11:37] <stashbot>	 T352696: Undeploy Listings extension from Dutch Wikivoyage - https://phabricator.wikimedia.org/T352696
[21:11:38] <stashbot>	 T352719: Undeploy the Listings extension on 7 Wikivoyages on which it's entirely unused - https://phabricator.wikimedia.org/T352719
[21:11:43] <James_F>	 kimberly_sarabia: OK for me to deploy the 'VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis' change first?
[21:12:01] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P54207 and previous config saved to /var/cache/conftool/dbconfig/20231205-211200-arnaudb.json
[21:12:17] <kimberly_sarabia>	 James_F: Yup. That's the most important one for us today
[21:12:21] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: Bernard Wang)
[21:12:24] <James_F>	 Cool.
[21:13:03] <wikibugs>	 (Merged) jenkins-bot: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: Bernard Wang)
[21:13:19] <logmsgbot>	 !log jforrester@deploy2002 Started scap: Backport for [[gerrit:980028|Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis (T351339)]]
[21:13:22] <stashbot>	 T351339: Deploy client preferences to production beta features - https://phabricator.wikimedia.org/T351339
[21:13:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on wdqs1008:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:14:41] <James_F>	 kimberly_sarabia: OK, can you please test on an mwdebug server and confirm we're OK to deploy?
[21:14:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on wdqs1009:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:15:09] <kimberly_sarabia>	 James_F: yes one moment
[21:15:13] <James_F>	 Sure.
[21:15:37] <wikibugs>	 (PS1) Bking: Revert "wdqs: Monitor LDF endpoint" [puppet] - https://gerrit.wikimedia.org/r/980468
[21:15:54] <wikibugs>	 (CR) CI reject: [V: -1] Revert "wdqs: Monitor LDF endpoint" [puppet] - https://gerrit.wikimedia.org/r/980468 (owner: Bking)
[21:16:06] <inflatador>	 ^^ fixing puppet errors now
[21:18:09] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to releasers-wikibase for darthmon_wmde - https://phabricator.wikimedia.org/T342968 (Aklapper) @darthmon_wmde ping
[21:18:34] <mutante>	 thanks inflatador, I figured it's related
[21:18:59] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:19:52] <kimberly_sarabia>	 James_F: LGTM
[21:19:57] <James_F>	 Great.
[21:19:58] <logmsgbot>	 !log jforrester@deploy2002 bwang and jforrester: Continuing with sync
[21:21:13] <wikibugs>	 (Merged) jenkins-bot: [Zebra] Make .vector-column-start cache compatible [skins/Vector] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979704 (https://phabricator.wikimedia.org/T347712) (owner: Jdlrobson)
[21:21:29] <wikibugs>	 (Merged) jenkins-bot: Fix nonzebra sticky container scrolling behavior and scrollable indicator [skins/Vector] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/980467 (https://phabricator.wikimedia.org/T352464) (owner: Jdrewniak)
[21:21:56] <James_F>	 (Once this config push finishes, I'll do those two now they've landed.)
[21:24:59] <jinxer-wm>	 (PuppetZeroResources) firing: (2) Puppet has failed generate resources on wdqs1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:25:54] <wikibugs>	 (PS1) Bking: Revert "wdqs: Improve regex for ldf check" [puppet] - https://gerrit.wikimedia.org/r/980469
[21:26:15] <wikibugs>	 (CR) Bking: [C: +2] Revert "wdqs: Improve regex for ldf check" [puppet] - https://gerrit.wikimedia.org/r/980469 (owner: Bking)
[21:26:17] <wikibugs>	 (CR) Bking: [V: +2 C: +2] Revert "wdqs: Improve regex for ldf check" [puppet] - https://gerrit.wikimedia.org/r/980469 (owner: Bking)
[21:27:03] <logmsgbot>	 !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:980028|Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis (T351339)]] (duration: 13m 44s)
[21:27:07] <stashbot>	 T351339: Deploy client preferences to production beta features - https://phabricator.wikimedia.org/T351339
[21:27:07] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T348183)', diff saved to https://phabricator.wikimedia.org/P54208 and previous config saved to /var/cache/conftool/dbconfig/20231205-212707-arnaudb.json
[21:27:10] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[21:27:11] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[21:27:13] <James_F>	 Finally!
[21:27:24] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance
[21:27:42] <logmsgbot>	 !log jforrester@deploy2002 Started scap: Backport for [[gerrit:979704|[Zebra] Make .vector-column-start cache compatible (T347712 T351830)]], [[gerrit:980467|Fix nonzebra sticky container scrolling behavior and scrollable indicator (T352464)]]
[21:27:49] <stashbot>	 T347712: [Zebra] remove feature flag & merge Zebra into default styles - https://phabricator.wikimedia.org/T347712
[21:27:50] <stashbot>	 T351830: [Zebra] Make vector-column-start element cache compatible - https://phabricator.wikimedia.org/T351830
[21:27:50] <stashbot>	 T352464: Non zebra scrollable indicators on sticky pinnable elements (toc, page tools, client prefs) are broken - https://phabricator.wikimedia.org/T352464
[21:27:58] <James_F>	 kimberly_sarabia: Once this is done, can I do the two stream config changes together?
[21:28:15] <wikibugs>	 (PS2) Bking: Revert "wdqs: Monitor LDF endpoint" [puppet] - https://gerrit.wikimedia.org/r/980468
[21:28:39] <wikibugs>	 (CR) Bking: [V: +2 C: +2] Revert "wdqs: Monitor LDF endpoint" [puppet] - https://gerrit.wikimedia.org/r/980468 (owner: Bking)
[21:28:59] <jinxer-wm>	 (PuppetZeroResources) firing: (3) Puppet has failed generate resources on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:29:12] <kimberly_sarabia>	 James_F: Thanks for checking. We will need to rebase the second one
[21:29:17] * James_F nods.
[21:29:24] <wikibugs>	 (PS5) Jforrester: Define the corresponding stream for scroll [mediawiki-config] - https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: Kimberly Sarabia)
[21:29:28] <wikibugs>	 (PS6) Jforrester: Add stream config for *webuiactions via Metrics Platform [mediawiki-config] - https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298) (owner: Clare Ming)
[21:29:44] <James_F>	 Stacked them so now they should be good to co-deploy.
[21:29:59] <jinxer-wm>	 (PuppetZeroResources) firing: (3) Puppet has failed generate resources on wdqs1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:30:00] <logmsgbot>	 !log jforrester@deploy2002 jdlrobson and jforrester and jdrewniak: Backport for [[gerrit:979704|[Zebra] Make .vector-column-start cache compatible (T347712 T351830)]], [[gerrit:980467|Fix nonzebra sticky container scrolling behavior and scrollable indicator (T352464)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:30:03] <James_F>	 kimberly_sarabia: OK, Vector back-port ready to test. Please check!
[21:30:16] <kimberly_sarabia>	 Cool sounds good. 
[21:30:18] <kimberly_sarabia>	 Will do
[21:32:20] <wikibugs>	 (CR) Dzahn: [V: +1] "https://puppet-compiler.wmflabs.org/output/979169/833/" [puppet] - https://gerrit.wikimedia.org/r/979169 (https://phabricator.wikimedia.org/T348392) (owner: Dzahn)
[21:33:59] <jinxer-wm>	 (PuppetZeroResources) firing: (4) Puppet has failed generate resources on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:34:11] <jan_drewniak>	 James_F: the Vector patches look good to sync 👍
[21:34:15] <James_F>	 Ace.
[21:34:17] <logmsgbot>	 !log jforrester@deploy2002 jdlrobson and jforrester and jdrewniak: Continuing with sync
[21:34:55] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance
[21:34:59] <jinxer-wm>	 (PuppetZeroResources) firing: (3) Puppet has failed generate resources on wdqs1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:35:10] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2102.codfw.wmnet with reason: Maintenance
[21:35:26] <wikibugs>	 (CR) Jdlrobson: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: Bernard Wang)
[21:35:58] <wikibugs>	 (CR) Jforrester: Deploy VectorClientPreferences to beta on pl,fr,ca,fa,tr wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: Bernard Wang)
[21:36:56] <inflatador>	  !log bking@prometheus1006 re-enable puppet T347355
[21:36:57] <stashbot>	 T347355: Create alerts for https://query.wikidata.org/bigdata/ldf - https://phabricator.wikimedia.org/T347355
[21:38:59] <jinxer-wm>	 (PuppetZeroResources) firing: (5) Puppet has failed generate resources on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:39:59] <jinxer-wm>	 (PuppetZeroResources) resolved: (3) Puppet has failed generate resources on wdqs1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:40:32] <logmsgbot>	 !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:979704|[Zebra] Make .vector-column-start cache compatible (T347712 T351830)]], [[gerrit:980467|Fix nonzebra sticky container scrolling behavior and scrollable indicator (T352464)]] (duration: 12m 50s)
[21:40:38] <James_F>	 OK! Finally. Now for the two event stream patches.
[21:40:38] <stashbot>	 T347712: [Zebra] remove feature flag & merge Zebra into default styles - https://phabricator.wikimedia.org/T347712
[21:40:38] <stashbot>	 T351830: [Zebra] Make vector-column-start element cache compatible - https://phabricator.wikimedia.org/T351830
[21:40:39] <stashbot>	 T352464: Non zebra scrollable indicators on sticky pinnable elements (toc, page tools, client prefs) are broken - https://phabricator.wikimedia.org/T352464
[21:41:08] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: Kimberly Sarabia)
[21:41:10] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298) (owner: Clare Ming)
[21:41:52] <wikibugs>	 (Merged) jenkins-bot: Define the corresponding stream for scroll [mediawiki-config] - https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: Kimberly Sarabia)
[21:41:55] <wikibugs>	 (Merged) jenkins-bot: Add stream config for *webuiactions via Metrics Platform [mediawiki-config] - https://gerrit.wikimedia.org/r/978947 (https://phabricator.wikimedia.org/T351298) (owner: Clare Ming)
[21:42:12] <logmsgbot>	 !log jforrester@deploy2002 Started scap: Backport for [[gerrit:977785|Define the corresponding stream for scroll (T350883)]], [[gerrit:978947|Add stream config for *webuiactions via Metrics Platform (T351298)]]
[21:42:17] <stashbot>	 T350883:  Port WebUIScroll schema to the new metrics platform - https://phabricator.wikimedia.org/T350883
[21:42:17] <stashbot>	 T351298: [User Story] Partial migration of *UIActions instrument to the Core Interaction API - https://phabricator.wikimedia.org/T351298
[21:43:34] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance
[21:43:48] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2112.codfw.wmnet with reason: Maintenance
[21:43:49] <logmsgbot>	 !log jforrester@deploy2002 ksarabia and jforrester and cjming: Backport for [[gerrit:977785|Define the corresponding stream for scroll (T350883)]], [[gerrit:978947|Add stream config for *webuiactions via Metrics Platform (T351298)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:43:59] <jinxer-wm>	 (PuppetZeroResources) firing: (5) Puppet has failed generate resources on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:44:42] <icinga-wm>	 RECOVERY - cassandra-b CQL 10.192.16.238:9042 on restbase2028 is OK: TCP OK - 0.033 second response time on 10.192.16.238 port 9042 https://phabricator.wikimedia.org/T93886
[21:47:47] <James_F>	 kimberly_sarabia: Please test and confirm.
[21:48:24] <kimberly_sarabia>	 James_F: Thanks, one moment
[21:48:31] <James_F>	 Of course. :-)
[21:48:59] <jinxer-wm>	 (PuppetZeroResources) firing: (5) Puppet has failed generate resources on wdqs1011:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:50:40] <icinga-wm>	 RECOVERY - MD RAID on aqs1013 is OK: OK: Active: 12, Working: 12, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[21:51:15] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[21:51:29] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2116.codfw.wmnet with reason: Maintenance
[21:51:36] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2116 (T348183)', diff saved to https://phabricator.wikimedia.org/P54209 and previous config saved to /var/cache/conftool/dbconfig/20231205-215135-arnaudb.json
[21:51:44] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[21:53:48] <kimberly_sarabia>	 James_F: LGTM
[21:53:51] <James_F>	 Cool.
[21:53:53] <logmsgbot>	 !log jforrester@deploy2002 ksarabia and jforrester and cjming: Continuing with sync
[21:53:59] <jinxer-wm>	 (PuppetZeroResources) resolved: (3) Puppet has failed generate resources on wdqs1016:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[21:57:06] <icinga-wm>	 RECOVERY - cassandra-c service on restbase2028 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:57:32] <icinga-wm>	 RECOVERY - cassandra-c SSL 10.192.16.239:7000 on restbase2028 is OK: SSL OK - Certificate restbase2028-c valid until 2025-12-03 21:33:03 +0000 (expires in 728 days) https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:01:14] <logmsgbot>	 !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:977785|Define the corresponding stream for scroll (T350883)]], [[gerrit:978947|Add stream config for *webuiactions via Metrics Platform (T351298)]] (duration: 19m 01s)
[22:01:27] <stashbot>	 T350883:  Port WebUIScroll schema to the new metrics platform - https://phabricator.wikimedia.org/T350883
[22:01:27] <stashbot>	 T351298: [User Story] Partial migration of *UIActions instrument to the Core Interaction API - https://phabricator.wikimedia.org/T351298
[22:02:13] <James_F>	 OK, deployment window done, finally.
[22:02:57] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T348183)', diff saved to https://phabricator.wikimedia.org/P54210 and previous config saved to /var/cache/conftool/dbconfig/20231205-220256-arnaudb.json
[22:03:01] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[22:03:09] <wikibugs>	 (PS1) Ryan Kemper: wdqs: monitor ldf endpoint [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355)
[22:03:43] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: monitor ldf endpoint [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: Ryan Kemper)
[22:03:46] <wikibugs>	 (CR) Ryan Kemper: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: Ryan Kemper)
[22:04:04] <jinxer-wm>	 (ProbeDown) resolved: Service wdqs1015:443 has failed probes (http_query_wikidata_org_ldf_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:04:39] <kimberly_sarabia>	 James_F: Thank you!
[22:04:50] <wikibugs>	 (PS2) Ryan Kemper: wdqs: monitor ldf endpoint [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355)
[22:05:06] <James_F>	 Of course. :-)
[22:14:26] <wikibugs>	 (PS3) Ryan Kemper: wdqs: monitor ldf endpoint [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355)
[22:15:06] <wikibugs>	 (CR) Ryan Kemper: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/834/console" [puppet] - https://gerrit.wikimedia.org/r/980499 (https://phabricator.wikimedia.org/T347355) (owner: Ryan Kemper)
[22:18:04] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P54211 and previous config saved to /var/cache/conftool/dbconfig/20231205-221803-arnaudb.json
[22:19:53] <wikibugs>	 (PS3) JHathaway: apt_repo: move hiera data into module, to allow for validation [puppet] - https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604)
[22:21:56] <wikibugs>	 (CR) JHathaway: apt_repo: move hiera data into module, to allow for validation (3 comments) [puppet] - https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: JHathaway)
[22:22:33] <wikibugs>	 (CR) CI reject: [V: -1] apt_repo: move hiera data into module, to allow for validation [puppet] - https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604) (owner: JHathaway)
[22:26:06] <wikibugs>	 (PS4) JHathaway: apt_repo: move hiera data into module, to allow for validation [puppet] - https://gerrit.wikimedia.org/r/979470 (https://phabricator.wikimedia.org/T352604)
[22:33:10] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116', diff saved to https://phabricator.wikimedia.org/P54212 and previous config saved to /var/cache/conftool/dbconfig/20231205-223309-arnaudb.json
[22:34:06] <wikibugs>	 (PS1) Bking: trafficserver: revert to using hostname for wdqs ldf endpoint [puppet] - https://gerrit.wikimedia.org/r/980503 (https://phabricator.wikimedia.org/T352111)
[22:34:58] <wikibugs>	 (PS2) RhinosF1: test [puppet] - https://gerrit.wikimedia.org/r/980470
[22:35:00] <wikibugs>	 (CR) RhinosF1: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/980470 (owner: RhinosF1)
[22:41:16] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/980503 (https://phabricator.wikimedia.org/T352111) (owner: Bking)
[22:41:47] <wikibugs>	 (PS2) Bking: trafficserver: revert to using hostname for wdqs ldf endpoint [puppet] - https://gerrit.wikimedia.org/r/980503 (https://phabricator.wikimedia.org/T352111)
[22:43:09] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/980503 (https://phabricator.wikimedia.org/T352111) (owner: Bking)
[22:48:17] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2116 (T348183)', diff saved to https://phabricator.wikimedia.org/P54213 and previous config saved to /var/cache/conftool/dbconfig/20231205-224816-arnaudb.json
[22:48:19] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance
[22:48:21] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[22:48:32] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2130.codfw.wmnet with reason: Maintenance
[22:48:39] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2130 (T348183)', diff saved to https://phabricator.wikimedia.org/P54214 and previous config saved to /var/cache/conftool/dbconfig/20231205-224838-arnaudb.json
[22:54:05] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:54:27] <wikibugs>	 (PS2) Ladsgroup: [WIP] Add compare tables periodic job [puppet] - https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253)
[22:57:02] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] Add compare tables periodic job [puppet] - https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: Ladsgroup)
[22:58:22] <wikibugs>	 (CR) Ladsgroup: [WIP] Add compare tables periodic job (1 comment) [puppet] - https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: Ladsgroup)
[22:59:06] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T348183)', diff saved to https://phabricator.wikimedia.org/P54215 and previous config saved to /var/cache/conftool/dbconfig/20231205-225905-arnaudb.json
[22:59:10] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[23:00:22] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:14:13] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P54216 and previous config saved to /var/cache/conftool/dbconfig/20231205-231412-arnaudb.json
[23:29:19] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P54217 and previous config saved to /var/cache/conftool/dbconfig/20231205-232918-arnaudb.json
[23:42:43] <Jdlrobson>	 Hey James_F are you still around? It seems like there is a problem with the last deployment.
[23:44:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T348183)', diff saved to https://phabricator.wikimedia.org/P54218 and previous config saved to /var/cache/conftool/dbconfig/20231205-234425-arnaudb.json
[23:44:28] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[23:44:30] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[23:44:34] <Jdlrobson>	 Specifically https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/980028 doesn't seem to have had any effect on French Wikipedia
[23:44:42] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[23:44:49] * James_F looks.
[23:45:39] <Jdlrobson>	 I can't replicate this locally. I'd be curious what the value of $wgVectorClientPreferences['beta'] is in the context of French Wikipedia
[23:45:49] <James_F>	 $wgVectorClientPreferences is set correctly in reality (per `mwscript --wiki=frwiki`).
[23:46:22] <Jdlrobson>	 Hmm.. I wonder if there are any clues in the logs
[23:46:39] <James_F>	 Whereas on dewiki it's set to false.
[23:48:24] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:48:29] <James_F>	 No log entries about VectorClientPreferences (or the indirection of CONFIG_KEY_CLIENT_PREFERENCES) that I can see.
[23:49:02] <Jdlrobson>	 What is the value of $wgVectorZebraDesign for those wikis (just curious)?
[23:49:28] <James_F>	 dewiki is false/false/false; frwiki is true/true/false
[23:49:51] <James_F>	 (So both beta keys false.)
[23:50:50] <Jdlrobson>	 hmm.. It's almost as if Vector's GetBetaFeaturePreferences hook is not running for some reason
[23:50:59] <James_F>	 Oh wait.
[23:51:02] <James_F>	 This is a Beta Feature?
[23:51:09] <James_F>	 Did you go through the new Beta Feature process?
[23:51:34] <James_F>	 (I know you didn't speak to me about this, but maybe you spoke to Greg?)
[23:51:51] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance
[23:52:00] <James_F>	 New Beta Features have to be approved and added to the allowlist (to avoid over-loading our users with too many Beta Features at once).
[23:52:07] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance
[23:52:13] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2145 (T348183)', diff saved to https://phabricator.wikimedia.org/P54219 and previous config saved to /var/cache/conftool/dbconfig/20231205-235213-arnaudb.json
[23:52:17] <Jdlrobson>	 ahhh okay that would explain it.
[23:52:17] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[23:53:33] <Jdlrobson>	 No I don't believe we went through a process (I vaguely recall there is one now you point it out to me, but I don't think I've built a beta feature since 2014 :)).
[23:53:50] <Jdlrobson>	 With that in mind would it make sense to revert the patch or is it safe in the current state?
[23:54:11] <James_F>	 FFS.
[23:54:18] <James_F>	 VECTOR_2022_BETA_KEY indirection is very unhelpful.
[23:54:31] <James_F>	 It's entirely safe, it's just non-operational.
[23:54:48] <Jdlrobson>	 Okay. Is this the specific process you are talking about: https://www.mediawiki.org/wiki/Beta_Features#Creating_your_own
[23:55:38] <James_F>	 Yup.
[23:57:10] <James_F>	 Also your landing page is just https://www.mediawiki.org/wiki/Skin:Vector/2022 rather than specifically about this Beta Feature and why people should think this particular feature is worth opting into, and the talk page is just https://www.mediawiki.org/wiki/Talk:Reading/Web/Desktop_Improvements which isn't ideal either (but as long as it's monitored it's fine).
[23:57:45] <James_F>	 The preference is 'vector-2022-beta-feature' which suggests it's very general, but this is the text accessibility work, right?
[23:58:42] <James_F>	 Oooh, you're trying to register multiple different features under one Beta Feature? Tut, that's going to confuse/upset a bunch of people.