[00:01:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T391056)', diff saved to https://phabricator.wikimedia.org/P74663 and previous config saved to /var/cache/conftool/dbconfig/20250408-000130-fceratto.json
[00:01:33] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[00:09:53] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1134773
[00:09:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1134773 (owner: 10TrainBranchBot)
[00:12:15] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002"
[00:12:16] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1202.eqiad.wmnet with OS bullseye
[00:16:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P74664 and previous config saved to /var/cache/conftool/dbconfig/20250408-001637-fceratto.json
[00:17:13] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:21:03] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1202.eqiad.wmnet
[00:22:52] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1202.eqiad.wmnet
[00:24:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10719822 (phaultfinder)
[00:24:39] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[00:24:55] <wikibugs>	 (03PS1) 10Superpes15: [ptwiktionary] Create a Wikisaurus namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134776 (https://phabricator.wikimedia.org/T391299)
[00:26:17] <wikibugs>	 (03PS1) 10Btullis: Bring an-worker1202 into service [puppet] - 10https://gerrit.wikimedia.org/r/1134777 (https://phabricator.wikimedia.org/T390048)
[00:26:49] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10719827 (10BTullis) a:05Jclark-ctr→03BTullis
[00:27:22] <wikibugs>	 (03CR) 10Btullis: [C:03+2] Bring an-worker1202 into service [puppet] - 10https://gerrit.wikimedia.org/r/1134777 (https://phabricator.wikimedia.org/T390048) (owner: 10Btullis)
[00:27:50] <wikibugs>	 (03PS2) 10Superpes15: [ptwiktionary] Create a Wikisaurus namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134776 (https://phabricator.wikimedia.org/T391299)
[00:31:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P74665 and previous config saved to /var/cache/conftool/dbconfig/20250408-003144-fceratto.json
[00:37:04] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[00:37:13] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[00:40:12] <wikibugs>	 (03PS1) 10Dzahn: phabricator: apply a staging role/profile to host phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134778 (https://phabricator.wikimedia.org/T377889)
[00:40:36] <wikibugs>	 (03CR) 10CI reject: [V:04-1] phabricator: apply a staging role/profile to host phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134778 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn)
[00:41:43] <wikibugs>	 (03PS1) 10Dzahn: phabricator: apply phabricator::migration role on host phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134779 (https://phabricator.wikimedia.org/T377889)
[00:43:03] <logmsgbot>	 !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1202.eqiad.wmnet
[00:43:16] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10719844 (ops-monitoring-bot) Host rebooted by btullis@cumin1002 with reason: Reboot after moving vlan and commission...
[00:44:46] <wikibugs>	 (PS2) Dzahn: phabricator: apply phabricator::migration role on host phab1005 [puppet] - https://gerrit.wikimedia.org/r/1134779 (https://phabricator.wikimedia.org/T377889)
[00:46:24] <wikibugs>	 (CR) Dzahn: "well, I need to fix the nftables/ferm change to minimize the diff.. but: https://puppet-compiler.wmflabs.org/output/1134779/5226/phab1005." [puppet] - https://gerrit.wikimedia.org/r/1134779 (https://phabricator.wikimedia.org/T377889) (owner: Dzahn)
[00:46:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T391056)', diff saved to https://phabricator.wikimedia.org/P74666 and previous config saved to /var/cache/conftool/dbconfig/20250408-004652-fceratto.json
[00:46:56] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[00:47:08] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2180.codfw.wmnet with reason: Maintenance
[00:47:13] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:47:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T391056)', diff saved to https://phabricator.wikimedia.org/P74667 and previous config saved to /var/cache/conftool/dbconfig/20250408-004715-fceratto.json
[00:48:00] <wikibugs>	 (03CR) 10Dzahn: "This is an alternative to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134779 which does not do the scap setup at all and would o" [puppet] - 10https://gerrit.wikimedia.org/r/1134778 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn)
[00:48:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T391056)', diff saved to https://phabricator.wikimedia.org/P74668 and previous config saved to /var/cache/conftool/dbconfig/20250408-004827-fceratto.json
[00:48:33] <logmsgbot>	 !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1202.eqiad.wmnet
[00:49:18] <wikibugs>	 (03PS2) 10Dzahn: phabricator: apply a staging role/profile to host phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134778 (https://phabricator.wikimedia.org/T377889)
[00:49:41] <wikibugs>	 (03CR) 10CI reject: [V:04-1] phabricator: apply a staging role/profile to host phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134778 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn)
[00:54:10] <wikibugs>	 (03PS1) 10Dzahn: phabricator::migration: use nftables as firewall provider [puppet] - 10https://gerrit.wikimedia.org/r/1134781 (https://phabricator.wikimedia.org/T370677)
[00:56:00] <wikibugs>	 (03PS3) 10Dzahn: phabricator: apply a staging role/profile to host phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134778 (https://phabricator.wikimedia.org/T377889)
[00:57:04] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[00:58:44] <wikibugs>	 (03CR) 10Dzahn: [V:03+1 C:03+2] "currently no server is using this role (verified with cumin), so self merging this. the real diff will be when the role is applied to phab" [puppet] - 10https://gerrit.wikimedia.org/r/1134781 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn)
[01:01:06] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1134773 (owner: 10TrainBranchBot)
[01:03:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P74669 and previous config saved to /var/cache/conftool/dbconfig/20250408-010334-fceratto.json
[01:04:51] <wikibugs>	 (03CR) 10Dzahn: "reduced diff https://puppet-compiler.wmflabs.org/output/1134779/5227/phab1005.eqiad.wmnet/index.html  after https://gerrit.wikimedia.org/r" [puppet] - 10https://gerrit.wikimedia.org/r/1134779 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn)
[01:09:53] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.24 [core] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1134782 (https://phabricator.wikimedia.org/T386219)
[01:09:55] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.24 [core] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1134782 (https://phabricator.wikimedia.org/T386219) (owner: 10TrainBranchBot)
[01:18:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P74670 and previous config saved to /var/cache/conftool/dbconfig/20250408-011841-fceratto.json
[01:19:17] <wikibugs>	 (03PS1) 10Scott French: scap: Use PHP 8.1 when executing maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/1134758 (https://phabricator.wikimedia.org/T390225)
[01:20:53] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.24 [core] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1134782 (https://phabricator.wikimedia.org/T386219) (owner: 10TrainBranchBot)
[01:24:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10719947 (10phaultfinder)
[01:29:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[01:33:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T391056)', diff saved to https://phabricator.wikimedia.org/P74671 and previous config saved to /var/cache/conftool/dbconfig/20250408-013348-fceratto.json
[01:33:51] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[01:34:04] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2193.codfw.wmnet with reason: Maintenance
[01:34:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T391056)', diff saved to https://phabricator.wikimedia.org/P74672 and previous config saved to /var/cache/conftool/dbconfig/20250408-013412-fceratto.json
[01:36:25] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T391056)', diff saved to https://phabricator.wikimedia.org/P74673 and previous config saved to /var/cache/conftool/dbconfig/20250408-013625-fceratto.json
[01:51:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P74674 and previous config saved to /var/cache/conftool/dbconfig/20250408-015132-fceratto.json
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T0200)
[02:06:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P74675 and previous config saved to /var/cache/conftool/dbconfig/20250408-020639-fceratto.json
[02:21:47] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T391056)', diff saved to https://phabricator.wikimedia.org/P74676 and previous config saved to /var/cache/conftool/dbconfig/20250408-022146-fceratto.json
[02:21:50] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[02:22:02] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2197.codfw.wmnet with reason: Maintenance
[02:25:31] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2214.codfw.wmnet with reason: Maintenance
[02:25:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T391056)', diff saved to https://phabricator.wikimedia.org/P74677 and previous config saved to /var/cache/conftool/dbconfig/20250408-022538-fceratto.json
[02:30:48] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T391056)', diff saved to https://phabricator.wikimedia.org/P74678 and previous config saved to /var/cache/conftool/dbconfig/20250408-023047-fceratto.json
[02:30:51] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[02:45:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P74679 and previous config saved to /var/cache/conftool/dbconfig/20250408-024555-fceratto.json
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T0300)
[03:01:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P74680 and previous config saved to /var/cache/conftool/dbconfig/20250408-030102-fceratto.json
[03:01:45] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis to 1.44.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134788 (https://phabricator.wikimedia.org/T386219)
[03:01:47] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.44.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134788 (https://phabricator.wikimedia.org/T386219) (owner: 10TrainBranchBot)
[03:02:37] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis to 1.44.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134788 (https://phabricator.wikimedia.org/T386219) (owner: 10TrainBranchBot)
[03:02:59] <logmsgbot>	 !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.44.0-wmf.24  refs T386219
[03:03:02] <stashbot>	 T386219: 1.44.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T386219
[03:16:10] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T391056)', diff saved to https://phabricator.wikimedia.org/P74681 and previous config saved to /var/cache/conftool/dbconfig/20250408-031609-fceratto.json
[03:16:13] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[03:16:25] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2217.codfw.wmnet with reason: Maintenance
[03:16:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T391056)', diff saved to https://phabricator.wikimedia.org/P74682 and previous config saved to /var/cache/conftool/dbconfig/20250408-031632-fceratto.json
[03:21:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T391056)', diff saved to https://phabricator.wikimedia.org/P74683 and previous config saved to /var/cache/conftool/dbconfig/20250408-032145-fceratto.json
[03:21:49] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[03:35:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:36:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P74684 and previous config saved to /var/cache/conftool/dbconfig/20250408-033652-fceratto.json
[03:42:13] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:52:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P74685 and previous config saved to /var/cache/conftool/dbconfig/20250408-035159-fceratto.json
[04:00:05] <jouncebot>	 Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T0400)
[04:06:42] <logmsgbot>	 !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.44.0-wmf.24  refs T386219 (duration: 63m 43s)
[04:06:45] <stashbot>	 T386219: 1.44.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T386219
[04:07:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T391056)', diff saved to https://phabricator.wikimedia.org/P74686 and previous config saved to /var/cache/conftool/dbconfig/20250408-040706-fceratto.json
[04:07:09] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[04:07:22] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2224.codfw.wmnet with reason: Maintenance
[04:07:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T391056)', diff saved to https://phabricator.wikimedia.org/P74687 and previous config saved to /var/cache/conftool/dbconfig/20250408-040728-fceratto.json
[04:09:28] <logmsgbot>	 !log mwpresync@deploy1003 Pruned MediaWiki: 1.44.0-wmf.21 (duration: 09m 26s)
[04:12:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T391056)', diff saved to https://phabricator.wikimedia.org/P74688 and previous config saved to /var/cache/conftool/dbconfig/20250408-041241-fceratto.json
[04:12:44] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[04:17:13] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:24:39] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[04:27:48] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P74689 and previous config saved to /var/cache/conftool/dbconfig/20250408-042748-fceratto.json
[04:37:13] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[04:42:04] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[04:42:55] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P74690 and previous config saved to /var/cache/conftool/dbconfig/20250408-044254-fceratto.json
[04:47:13] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:58:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T391056)', diff saved to https://phabricator.wikimedia.org/P74691 and previous config saved to /var/cache/conftool/dbconfig/20250408-045801-fceratto.json
[04:58:05] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[05:02:04] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[05:10:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:19:51] <jinxer-wm>	 FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:22:27] <wikibugs>	 (03PS3) 10Jelto: Ceph: add types for S3 credential and account [puppet] - 10https://gerrit.wikimedia.org/r/1133916 (https://phabricator.wikimedia.org/T378922)
[05:23:38] <wikibugs>	 (03CR) 10Jelto: Ceph: add types for S3 credential and account (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1133916 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[05:24:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:29:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[05:59:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T0600)
[06:00:05] <jouncebot>	 marostegui, Amir1, and federico3: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T0600).
[06:00:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:04:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:19:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:24:51] <jinxer-wm>	 RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:35:47] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] trafficserver: switch querybuilder scholarly to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1134697 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto)
[06:36:27] <wikibugs>	 06SRE, 06DBA, 10vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#10720225 (10Marostegui) What is pending here @Ladsgroup?
[06:41:24] <wikibugs>	 (03PS1) 10Marostegui: db1151,db2144: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1134926 (https://phabricator.wikimedia.org/T391317)
[06:42:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool ms2 T391317', diff saved to https://phabricator.wikimedia.org/P74692 and previous config saved to /var/cache/conftool/dbconfig/20250408-064250-marostegui.json
[06:42:53] <stashbot>	 T391317: Migrate msX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391317
[06:43:31] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2144.codfw.wmnet,db1151.eqiad.wmnet with reason: Maintenance
[06:43:45] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] db1151,db2144: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1134926 (https://phabricator.wikimedia.org/T391317) (owner: 10Marostegui)
[06:45:36] <marostegui>	 !log Upgrade ms2 to MariaDB 10.11 codfw eqiad dbmaint T391317
[06:45:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool ms2 T391317', diff saved to https://phabricator.wikimedia.org/P74693 and previous config saved to /var/cache/conftool/dbconfig/20250408-064813-marostegui.json
[06:48:16] <stashbot>	 T391317: Migrate msX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391317
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T0700).
[07:00:04] <jouncebot>	 abijeet and kevinbazira: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:36] <kevinbazira>	 o/
[07:02:36] <kart_>	 here for abijeet's change deployment + testing.
[07:02:48] <kart_>	 I can start with first two changes.
[07:03:05] <abijeet>	 o/
[07:03:32] <abijeet>	 kart_, we should do this one first: [config] 1130963 (deploy commands) AX: Enable Quick Surveys extension on Tswana and Venetian wiki - task T390023
[07:03:33] <stashbot>	 T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023
[07:03:40] <kart_>	 abijeet: Starting with the first change
[07:03:43] <kart_>	 yes
[07:04:02] <abijeet>	 kart_, either is fine but preferable to do that one first.
[07:04:34] <wikibugs>	 (03PS6) 10Abijeet Patro: AX: Enable Quick Surveys extension on Tswana and Venetian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130963 (https://phabricator.wikimedia.org/T390023)
[07:04:52] <kart_>	 rebasing it to make sure we don't get merge conflict in 2nd as well.
[07:06:41] <kart_>	 abijeet: should it be 'vecwiki' and 'tnwiki' instead of 'vec' and 'tn' in the patch?
[07:08:07] <abijeet>	 kart_, checking
[07:08:31] <kart_>	 Otherwise it will enable extension in wiktionary/wikisource as well!
[07:08:46] <abijeet>	 yes, you are right. Fixing.
[07:08:49] <kart_>	 See: https://integration.wikimedia.org/ci/job/operations-mw-config-php81-composer-diffConfig/273/console and https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig/4289/console
[07:10:07] <wikibugs>	 (03PS7) 10Abijeet Patro: AX: Enable Quick Surveys extension on Tswana and Venetian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130963 (https://phabricator.wikimedia.org/T390023)
[07:10:26] <abijeet>	 kart_, thanks for spotting that!
[07:10:41] <abijeet>	 kart_, fixed.
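The review between 07:06 and 07:10 turned on how a settings key selects wikis. The sketch below is a simplified, hypothetical Python model, not the real mediawiki-config or SiteConfiguration code. It assumes a key can name one wiki by its database name (e.g. `vecwiki`) or a whole group by a broader tag such as a language code. That is the behaviour kart_ describes at 07:08:31 and links to in the CI diffConfig output. The `Wiki` class, `is_enabled`, `enabled_on` and the sample wiki list are illustrative assumptions, not an authoritative list of existing wikis.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Wiki:
    """Hypothetical, simplified description of one wiki."""
    dbname: str   # database name, e.g. "vecwiki"
    lang: str     # language code, e.g. "vec"
    family: str   # project family, e.g. "wikipedia", "wiktionary"


def is_enabled(wiki: Wiki, enabled_keys: set[str]) -> bool:
    """Return True if the wiki's dbname or any of its broader tags is listed.

    Assumption for illustration: a key matches either one wiki (its dbname)
    or every wiki sharing a tag (language code or project family).
    """
    return bool({wiki.dbname, wiki.lang, wiki.family} & enabled_keys)


# Illustrative sample only; not a statement of which wikis exist.
WIKIS = [
    Wiki("vecwiki", "vec", "wikipedia"),
    Wiki("vecwiktionary", "vec", "wiktionary"),
    Wiki("vecwikisource", "vec", "wikisource"),
    Wiki("tnwiki", "tn", "wikipedia"),
    Wiki("tnwiktionary", "tn", "wiktionary"),
]


def enabled_on(keys: set[str]) -> list[str]:
    """List the dbnames in WIKIS that the given keys would enable."""
    return [w.dbname for w in WIKIS if is_enabled(w, keys)]


if __name__ == "__main__":
    # Language-code keys (as in PS6): every sample project in vec/tn matches.
    print(enabled_on({"vec", "tn"}))
    # dbname keys (as in PS7): only the two Wikipedias match.
    print(enabled_on({"vecwiki", "tnwiki"}))
```

Under this model, keying on dbnames is the narrower choice when an extension should reach only specific wikis, which matches the fix abijeet pushed as PS7.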
[07:11:53] <kart_>	 cool. Starting.
[07:12:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130963 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro)
[07:12:56] <wikibugs>	 (03Merged) 10jenkins-bot: AX: Enable Quick Surveys extension on Tswana and Venetian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130963 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro)
[07:13:50] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1130963|AX: Enable Quick Surveys extension on Tswana and Venetian wiki (T390023)]]
[07:13:53] <stashbot>	 T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023
[07:15:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:16:44] <wikibugs>	 (03PS1) 10Marostegui: production-ms.sql.erb: Add file [puppet] - 10https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332)
[07:16:45] <kevinbazira>	 kart_: o/ if you don't mind, please deploy my change too? it's the one following abijeet's 2 changes. thanks in advance!
[07:18:19] <wikibugs>	 (03CR) 10Marostegui: "This is a noop, it is just for tracking" [puppet] - 10https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui)
[07:18:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] production-ms.sql.erb: Add file [puppet] - 10https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui)
[07:19:01] <wikibugs>	 (03PS2) 10Marostegui: production-ms.sql.erb: Add file [puppet] - 10https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332)
[07:20:44] <jinxer-wm>	 RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:21:16] <wikibugs>	 (03CR) 10CI reject: [V:04-1] production-ms.sql.erb: Add file [puppet] - 10https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui)
[07:21:22] <logmsgbot>	 !log kartik@deploy1003 abi, kartik: Backport for [[gerrit:1130963|AX: Enable Quick Surveys extension on Tswana and Venetian wiki (T390023)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:21:24] <stashbot>	 T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023
[07:21:26] <wikibugs>	 (03CR) 10MVernon: [C:03+1] "Thanks for your work on this, this change LGTM :)" [puppet] - 10https://gerrit.wikimedia.org/r/1133916 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[07:21:27] <kart_>	 kevinbazira: sure!
[07:21:41] <kart_>	 abijeet: testing please :)
[07:22:20] <abijeet>	 kart_, on it
[07:25:07] <abijeet>	 kart_, looks good.
[07:25:08] <wikibugs>	 (03PS3) 10Marostegui: production-ms.sql.erb: Add file [puppet] - 10https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332)
[07:25:24] <kart_>	 cool.
[07:25:26] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] Add apus-fe2003 to hiera and conftool [puppet] - 10https://gerrit.wikimedia.org/r/1134208 (https://phabricator.wikimedia.org/T390578) (owner: 10MVernon)
[07:25:26] <logmsgbot>	 !log kartik@deploy1003 abi, kartik: Continuing with sync
[07:25:51] <wikibugs>	 (03CR) 10Marostegui: [C:03+1] Thanos: add new thanos-fe200[5-7] nodes [puppet] - 10https://gerrit.wikimedia.org/r/1134221 (https://phabricator.wikimedia.org/T389634) (owner: 10MVernon)
[07:26:54] <wikibugs>	 (03CR) 10Marostegui: production-ms.sql.erb: Add file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui)
[07:27:46] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] production-ms.sql.erb: Add file [puppet] - 10https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui)
[07:29:51] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] scap: Use PHP 8.1 when executing maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/1134758 (https://phabricator.wikimedia.org/T390225) (owner: 10Scott French)
[07:29:57] <wikibugs>	 (03PS1) 10DCausse: cirrus: disable completion indices in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1134969 (https://phabricator.wikimedia.org/T388610)
[07:30:03] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] hiera: cleanup TLS on volatile storage custom files [puppet] - 10https://gerrit.wikimedia.org/r/1134698 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur)
[07:30:22] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cirrus: disable completion indices in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1134969 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse)
[07:30:30] <wikibugs>	 (03PS1) 10Slyngshede: IDP: Failover to updated host [dns] - 10https://gerrit.wikimedia.org/r/1134970
[07:31:55] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1185 - https://phabricator.wikimedia.org/T391049#10720341 (10Marostegui) p:05Triage→03Medium
[07:32:48] <wikibugs>	 (03CR) 10Slyngshede: [C:03+2] IDP: Failover to updated host [dns] - 10https://gerrit.wikimedia.org/r/1134970 (owner: 10Slyngshede)
[07:33:05] <logmsgbot>	 !log slyngshede@dns1004 START - running authdns-update
[07:33:55] <wikibugs>	 (03PS2) 10DCausse: cirrus: disable completion indices in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1134969 (https://phabricator.wikimedia.org/T388610)
[07:34:18] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130963|AX: Enable Quick Surveys extension on Tswana and Venetian wiki (T390023)]] (duration: 20m 27s)
[07:34:21] <stashbot>	 T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023
[07:35:29] <logmsgbot>	 !log slyngshede@dns1004 END - running authdns-update
[07:36:42] <kart_>	 abijeet: going with second change now.
[07:36:55] <abijeet>	 kart_, okie
[07:37:05] <abijeet>	 kart_, we should keep an eye on MinT
[07:37:33] <kart_>	 Yes. Check Grafana dashboard.
[07:37:42] <abijeet>	 kart_, ok
[07:38:11] <wikibugs>	 (03PS5) 10Abijeet Patro: AX: Enable entry-points on Tswana and Venetian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130942 (https://phabricator.wikimedia.org/T390023)
[07:39:51] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130942 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro)
[07:40:41] <wikibugs>	 (03Merged) 10jenkins-bot: AX: Enable entry-points on Tswana and Venetian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130942 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro)
[07:41:02] <dcausse>	 jouncebot: nowandnext
[07:41:02] <jouncebot>	 For the next 0 hour(s) and 18 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T0700)
[07:41:02] <jouncebot>	 In 2 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1000)
[07:41:03] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1130942|AX: Enable entry-points on Tswana and Venetian wiki (T390023)]]
[07:41:06] <stashbot>	 T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023
[07:42:13] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:48:20] <logmsgbot>	 !log kartik@deploy1003 abi, kartik: Backport for [[gerrit:1130942|AX: Enable entry-points on Tswana and Venetian wiki (T390023)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:48:23] <stashbot>	 T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023
[07:49:00] <kart_>	 abijeet: testing time.
[07:49:21] <abijeet>	 kart_, ok
[07:53:40] <abijeet>	 kart_, looks ok. I can see at least one entrypoint working as expected. I can test the others later.
[07:55:35] <kart_>	 OK. Let's go ahead.
[07:55:37] <logmsgbot>	 !log kartik@deploy1003 abi, kartik: Continuing with sync
[07:56:38] <wikibugs>	 (03PS3) 10Kevin Bazira: EventStreamConfig: Add RRLA prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133603 (https://phabricator.wikimedia.org/T326179)
[07:56:40] <abijeet>	 kart_, sorry. one sec
[07:57:33] <abijeet>	 I'm seeing this error on wikis: Error: Cannot require undefined file ../codex.js ArticleFooterEntrypointCard.vue:2
[07:57:34] <abijeet>	     require startup.js:1016
[07:57:34] <abijeet>	     js ext.ax.articlefooter.entrypoint.js:3
[07:58:00] <kart_>	 abijeet: oops I started the deployment. Do you want to revert it?
[07:58:21] <kart_>	 Or we can do followup fix later?
[07:58:53] <abijeet>	 I can submit a patch to fix it immediately. Maybe we want to fix it and backport later?
[07:59:42] <kart_>	 OK. Please submit it, we can backport it and deploy later today.
[08:00:44] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] external_cloud_vendors: Added Google SpecialCaseCrawlers list [puppet] - 10https://gerrit.wikimedia.org/r/1134243 (https://phabricator.wikimedia.org/T391108) (owner: 10Fabfur)
[08:02:37] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130942|AX: Enable entry-points on Tswana and Venetian wiki (T390023)]] (duration: 21m 33s)
[08:02:40] <stashbot>	 T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023
[08:03:07] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133317 (https://phabricator.wikimedia.org/T384455) (owner: 10Seanleong-wmde)
[08:03:55] <kart_>	 kevinbazira: going with your change. Around?
[08:04:04] <kevinbazira>	 yes, I am. tx
[08:04:37] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133603 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira)
[08:05:21] <wikibugs>	 (03CR) 10MVernon: [C:03+2] Add apus-fe2003 to hiera and conftool [puppet] - 10https://gerrit.wikimedia.org/r/1134208 (https://phabricator.wikimedia.org/T390578) (owner: 10MVernon)
[08:05:27] <wikibugs>	 (03Merged) 10jenkins-bot: EventStreamConfig: Add RRLA prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133603 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira)
[08:05:52] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1133603|EventStreamConfig: Add RRLA prediction_change stream (T326179)]]
[08:05:55] <stashbot>	 T326179: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179
[08:12:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool ms1 T391317', diff saved to https://phabricator.wikimedia.org/P74694 and previous config saved to /var/cache/conftool/dbconfig/20250408-081224-marostegui.json
[08:12:28] <stashbot>	 T391317: Migrate msX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391317
[08:12:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool ms1 T391317', diff saved to https://phabricator.wikimedia.org/P74695 and previous config saved to /var/cache/conftool/dbconfig/20250408-081248-marostegui.json
[08:12:58] <logmsgbot>	 !log kartik@deploy1003 kartik, kevinbazira: Backport for [[gerrit:1133603|EventStreamConfig: Add RRLA prediction_change stream (T326179)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:13:01] <stashbot>	 T326179: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179
[08:13:24] <kart_>	 kevinbazira: possible to test on the testservers?
[08:13:50] <kevinbazira>	 kart_: we should be able to see `mediawiki.page_revert_risk_prediction_change.v1` listed on: 
[08:13:50] <kevinbazira>	 https://meta.wikimedia.org/w/api.php?action=streamconfigs and
[08:13:50] <kevinbazira>	 https://meta.wikimedia.beta.wmflabs.org/w/api.php?action=streamconfigs 
[08:13:50] <kevinbazira>	 but I don't see it yet. could be a cache issue on my end.
[08:16:09] <wikibugs>	 (03PS1) 10Abijeet Patro: ArticleFooterEntrypointCard: Change the way codex is loaded [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134976 (https://phabricator.wikimedia.org/T389176)
[08:16:29] <wikibugs>	 (03PS1) 10Abijeet Patro: ArticleFooterEntrypointCard: Change the way codex is loaded [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1134977 (https://phabricator.wikimedia.org/T389176)
[08:16:44] <kart_>	 kevinbazira: I can see it. Did you select k8s-mwdebug to test?
[08:17:12] <kevinbazira>	 great!
[08:17:13] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:17:25] <kart_>	 I can see in meta.w.o but not in beta yet.
[08:19:29] <kart_>	 kevinbazira: should we go ahead for deployment?
[08:19:38] <kevinbazira>	 yes please
[08:22:13] <logmsgbot>	 !log kartik@deploy1003 kartik, kevinbazira: Continuing with sync
[08:23:42] <wikibugs>	 (03CR) 10Volans: [C:04-1] "I think there is still a small problem, the rest looks ok." [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb)
[08:24:39] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[08:25:07] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134976 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro)
[08:25:15] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1134977 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro)
[08:28:42] <Emperor>	 !log pool apus-fe2003 T390578
[08:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:44] <stashbot>	 T390578: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578
[08:29:04] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/weight=40; selector: service=apus,name=apus-fe2003.codfw.wmnet
[08:29:04] <kevinbazira>	 kart_: now I can see `mediawiki.page_revert_risk_prediction_change.v1` listed on both: 
[08:29:05] <kevinbazira>	 https://meta.wikimedia.org/w/api.php?action=streamconfigs and
[08:29:05] <kevinbazira>	 https://meta.wikimedia.beta.wmflabs.org/w/api.php?action=streamconfigs 
[08:29:05] <kevinbazira>	 thanks a lot for your help. :)
[08:29:09] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=apus,name=apus-fe2003.codfw.wmnet
[08:29:14] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133603|EventStreamConfig: Add RRLA prediction_change stream (T326179)]] (duration: 23m 21s)
[08:29:17] <stashbot>	 T326179: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179
[08:29:54] <kart_>	 Cool
[08:37:13] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[08:39:44] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134064 (https://phabricator.wikimedia.org/T389429) (owner: 10Ebernhardson)
[08:46:29] <wikibugs>	 (03PS1) 10Brouberol: airflow: set saner performance-related configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134985 (https://phabricator.wikimedia.org/T390945)
[08:47:01] <wikibugs>	 (03PS2) 10Wargo: search-redirect: fix case-sensitivity of project name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297)
[08:47:04] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[08:47:13] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:47:51] <wikibugs>	 (03PS2) 10Brouberol: airflow: set saner performance-related configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134985 (https://phabricator.wikimedia.org/T390945)
[08:58:10] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mw-wikifunctions: Remove the temporary -ingress DNS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134987
[09:02:45] <wikibugs>	 (03PS2) 10Alexandros Kosiaris: Remove mw-wikifunctions-ingress RRs [dns] - 10https://gerrit.wikimedia.org/r/1134282
[09:05:25] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] Remove mw-wikifunctions-ingress RRs [dns] - 10https://gerrit.wikimedia.org/r/1134282 (owner: 10Alexandros Kosiaris)
[09:05:30] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+2] mw-wikifunctions: Remove the temporary -ingress DNS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134987 (owner: 10Alexandros Kosiaris)
[09:05:46] <logmsgbot>	 !log akosiaris@dns1004 START - running authdns-update
[09:07:04] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[09:08:11] <logmsgbot>	 !log akosiaris@dns1004 END - running authdns-update
[09:11:08] <wikibugs>	 (03Merged) 10jenkins-bot: mw-wikifunctions: Remove the temporary -ingress DNS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134987 (owner: 10Alexandros Kosiaris)
[09:24:20] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720641 (10Superpes15)
[09:25:45] <wikibugs>	 (03PS1) 10Jelto: trafficserver: switch all querybuilder backends to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1134988 (https://phabricator.wikimedia.org/T350793)
[09:27:07] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "querybuilder in query-scholarly is working fine from wikikube:" [puppet] - 10https://gerrit.wikimedia.org/r/1134988 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto)
[09:28:46] <wikibugs>	 (03PS1) 10Ozge: ml-services: update edit-check image with pydantic. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134989
[09:29:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:30:45] <wikibugs>	 (03CR) 10AikoChou: [C:03+1] ml-services: update edit-check image with pydantic. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134989 (owner: 10Ozge)
[09:31:45] <wikibugs>	 (03CR) 10Ozge: "ml-services: update edit-check image with pydantic." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134989 (owner: 10Ozge)
[09:31:53] <wikibugs>	 (03PS2) 10Ozge: ml-services: update edit-check image with pydantic. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134989
[09:32:15] <wikibugs>	 (03CR) 10Ozge: [V:03+2 C:03+2] ml-services: update edit-check image with pydantic. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134989 (owner: 10Ozge)
[09:33:41] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update edit-check image with pydantic. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134989 (owner: 10Ozge)
[09:33:58] <wikibugs>	 (03CR) 10Jelto: [C:03+2] Ceph: add types for S3 credential and account [puppet] - 10https://gerrit.wikimedia.org/r/1133916 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[09:37:13] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:40:37] <wikibugs>	 (03PS3) 10Brouberol: airflow: set saner performance-related configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134985 (https://phabricator.wikimedia.org/T390945)
[09:42:13] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:42:47] <logmsgbot>	 !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[09:52:00] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface cr2-eqiad:xe-3/1/7 (Core: pfw1-eqiad:xe-7/2/0 {#4027}) - https://phabricator.wikimedia.org/T390869#10720716 (10cmooney) 05Open→03Resolved a:03cmooney Gonna close this one, seems we had a burst of errors when we had the problem last week and...
[09:57:17] <logmsgbot>	 !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1000)
[10:00:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:04:03] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] trafficserver: switch all querybuilder backends to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1134988 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto)
[10:05:39] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10720747 (10cmooney) It seems the work yesterday has not stopped the carrier transitions reported, although the number has decreased:  {F59013584 wid...
[10:06:05] <wikibugs>	 (03PS1) 10Klausman: ml-services/experimental: clean up a few GPU-using services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134991
[10:06:50] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+1] "Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134991 (owner: 10Klausman)
[10:14:50] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134691 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE))
[10:15:04] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720765 (10Jelto)
[10:16:05] <wikibugs>	 (03CR) 10MVernon: [C:03+2] Thanos: add new thanos-fe200[5-7] nodes [puppet] - 10https://gerrit.wikimedia.org/r/1134221 (https://phabricator.wikimedia.org/T389634) (owner: 10MVernon)
[10:18:09] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132674 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert)
[10:18:55] <wikibugs>	 (03PS1) 10Ladsgroup: Bump thumbnail steps to 75% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134999 (https://phabricator.wikimedia.org/T360589)
[10:18:56] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720807 (10Jelto) It looks like bouncing started today at 01:00 https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&from=now-24h&to=now&viewPanel=2  I'll chec...
[10:21:29] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720816 (10LSobanski) Fixed time dashboard for reference: https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&viewPanel=2&from=1744089600000&to=1744107600000
[10:21:59] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mw::periodic_jobs: Migrate deleteOldSurveys [puppet] - 10https://gerrit.wikimedia.org/r/1132674 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert)
[10:23:49] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[10:24:06] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[10:24:13] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T391056)', diff saved to https://phabricator.wikimedia.org/P74698 and previous config saved to /var/cache/conftool/dbconfig/20250408-102412-fceratto.json
[10:24:16] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[10:24:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10720821 (10phaultfinder)
[10:24:40] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "LGTM!  A veritable lesson in well-writted code tbh :)  Perhaps get Luca to give a once over on the Python side but all looks good to me ni" [software/homer] - 10https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans)
[10:25:58] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-fe2005.codfw.wmnet
[10:26:01] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720826 (10Jelto) `mailman3.service` prints a lot of Python stacktraces starting Apr 07 09:06 UTC  ` Apr 07 09:06:41 lists1004 mailman3[2696297]: Apr 07 09:06:41 202...
[10:26:12] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10720827 (10ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: reboot before bringing into service
[10:26:22] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720828 (10Superpes15) >>! In T391330#10720807, @Jelto wrote: > It looks like bouncing started today at 01:00 https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgI...
[10:30:29] <hnowlan>	 jouncebot: nowandnext
[10:30:29] <jouncebot>	 For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1000)
[10:30:29] <jouncebot>	 In 1 hour(s) and 29 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1200)
[10:31:50] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2005.codfw.wmnet
[10:32:06] <wikibugs>	 (03PS1) 10Brouberol: airflow: scrape additional metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332)
[10:32:10] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: fasw2-c1[a|b]-eqiad:ge-0/0/27 flapping while admin down - https://phabricator.wikimedia.org/T391257#10720840 (10cmooney) >>! In T391257#10718036, @VRiley-WMF wrote: > It looks like pay-1b1001 is currently connected to these ports. Would you like us to remove the SFPs?  I believe...
[10:32:12] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-fe2006.codfw.wmnet
[10:32:35] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10720844 (10ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: reboot before bringing into service
[10:33:11] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[10:33:28] <jelto>	 !log restart mailman3.service on lists1004 - T391330
[10:33:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:30] <stashbot>	 T391330: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330
[10:33:59] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720846 (10Jelto) I restarted `mailman3.service` on `lists1004` because the service stopped logging any activity right before bouncing increased (Apr 07 23:56:28)....
[10:34:25] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[10:36:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T391056)', diff saved to https://phabricator.wikimedia.org/P74699 and previous config saved to /var/cache/conftool/dbconfig/20250408-103604-fceratto.json
[10:36:07] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[10:36:33] <wikibugs>	 (03PS3) 10Hnowlan: wmnet: remove jobrunner and videoscaler records [dns] - 10https://gerrit.wikimedia.org/r/1133931 (https://phabricator.wikimedia.org/T354791)
[10:38:16] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2006.codfw.wmnet
[10:38:20] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] wmnet: remove jobrunner and videoscaler records [dns] - 10https://gerrit.wikimedia.org/r/1133931 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan)
[10:40:42] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720855 (10Jelto) The metrics are back to baseline. So from the system level this issue looks resolved.  I'm lacking a bit of mailman knowledge to verify it processe...
[10:41:27] <wikibugs>	 (03PS1) 10Clément Goubert: kubernetes_periodic_job: Lowercase job name [puppet] - 10https://gerrit.wikimedia.org/r/1135002 (https://phabricator.wikimedia.org/T341555)
[10:41:36] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-fe2007.codfw.wmnet
[10:41:58] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10720858 (ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: reboot before bringing into service
[10:42:01] <wikibugs>	 (CR) Klausman: [C:+2] ml-services/experimental: clean up a few GPU-using services [deployment-charts] - https://gerrit.wikimedia.org/r/1134991 (owner: Klausman)
[10:42:13] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135002 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert)
[10:43:29] <wikibugs>	 (Merged) jenkins-bot: ml-services/experimental: clean up a few GPU-using services [deployment-charts] - https://gerrit.wikimedia.org/r/1134991 (owner: Klausman)
[10:44:05] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "In general looks ok.  I'm not 100% sure what the switch side should look like to support this, or if its necessarily the way we want to do" [puppet] - https://gerrit.wikimedia.org/r/1134700 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[10:44:29] <wikibugs>	 (CR) Hnowlan: [C:+2] wmnet: remove jobrunner and videoscaler records [dns] - https://gerrit.wikimedia.org/r/1133931 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[10:44:51] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "I'm broadly ok with this approach but we may need to review the resulting Bird config and amend this role to support something different i" [puppet] - https://gerrit.wikimedia.org/r/1134699 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[10:45:46] <logmsgbot>	 !log hnowlan@dns1004 START - running authdns-update
[10:46:33] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] P:bird: Allow enabling IPv6 without enabling all services on it [puppet] - https://gerrit.wikimedia.org/r/1134699 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[10:47:46] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2007.codfw.wmnet
[10:48:22] <logmsgbot>	 !log hnowlan@dns1004 END - running authdns-update
[10:48:29] <wikibugs>	 (CR) Clément Goubert: [C:+1] service: remove videoscaler, jobrunner monitoring [puppet] - https://gerrit.wikimedia.org/r/1133934 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[10:49:24] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720870 (Superpes15) >>! In T391330#10720854, @Jelto wrote: > I'm lacking a bit of mailman knowledge to verify it processes fresh mails. @Superpes15 you mentioned...
[10:49:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:49:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10720871 (phaultfinder)
[10:49:38] <wikibugs>	 (CR) Btullis: [C:+1] airflow: scrape additional metrics [deployment-charts] - https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) (owner: Brouberol)
[10:50:18] <wikibugs>	 (CR) Hnowlan: [C:+1] kubernetes_periodic_job: Lowercase job name [puppet] - https://gerrit.wikimedia.org/r/1135002 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert)
[10:50:44] <wikibugs>	 (CR) Btullis: [C:+1] airflow: set saner performance-related configs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1134985 (https://phabricator.wikimedia.org/T390945) (owner: Brouberol)
[10:51:10] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe
[10:51:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P74700 and previous config saved to /var/cache/conftool/dbconfig/20250408-105111-fceratto.json
[10:51:56] <wikibugs>	 (CR) Clément Goubert: [C:+2] kubernetes_periodic_job: Lowercase job name [puppet] - https://gerrit.wikimedia.org/r/1135002 (https://phabricator.wikimedia.org/T341555) (owner: Clément Goubert)
[10:52:25] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10720873 (BTullis) Open→Resolved Thanks @Jclark-ctr - This host is back in the cluster now.
[10:54:07] <wikibugs>	 (CR) Hnowlan: [C:+2] service: remove videoscaler, jobrunner monitoring [puppet] - https://gerrit.wikimedia.org/r/1133934 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[10:55:17] <Amir1>	 jouncebot: nowandnext
[10:55:17] <jouncebot>	 For the next 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1000)
[10:55:17] <jouncebot>	 In 1 hour(s) and 4 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1200)
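jouncebot's countdowns in this log are consistent with the remaining time being truncated to whole minutes: at 10:55:17 the 12:00 window is reported as "1 hour(s) and 4 minute(s)" away (the exact gap is 1h 4m 43s), and at 11:37:40 it is reported as 22 minutes away (exact gap 22m 20s). A minimal sketch of that formatting, assuming floor truncation inferred from these timestamps rather than from jouncebot's source; `format_remaining` is a hypothetical name.

```python
from datetime import datetime


def format_remaining(now: datetime, target: datetime) -> str:
    """Render a gap in jouncebot's 'H hour(s) and M minute(s)' style.

    Assumption: leftover seconds are dropped (floored), which matches the
    replies in this log; it is not taken from jouncebot's implementation.
    """
    seconds = int((target - now).total_seconds())
    if seconds < 0:
        raise ValueError("target is earlier than now")
    hours, rest = divmod(seconds, 3600)
    return f"{hours} hour(s) and {rest // 60} minute(s)"


day = datetime(2025, 4, 8)
print(format_remaining(day.replace(hour=10, minute=55, second=17), day.replace(hour=12)))
# → 1 hour(s) and 4 minute(s)
```

The same rule also reproduces the 12:55:08 reply (4m 52s before the 13:00 backport window, shown as 4 minutes).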
[10:56:09] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1134999 (https://phabricator.wikimedia.org/T360589) (owner: Ladsgroup)
[10:56:44] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe
[10:56:59] <wikibugs>	 (Merged) jenkins-bot: Bump thumbnail steps to 75% [mediawiki-config] - https://gerrit.wikimedia.org/r/1134999 (https://phabricator.wikimedia.org/T360589) (owner: Ladsgroup)
[10:57:22] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1134999|Bump thumbnail steps to 75% (T360589)]]
[10:57:25] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[10:59:44] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting P{lvs3008.esams.wmnet} and A:liberica
[11:00:19] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P{lvs3008.esams.wmnet} and A:liberica
[11:00:47] <Emperor>	 !log pool thanos-fe200[5-7] T389634
[11:00:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:00:49] <stashbot>	 T389634: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634
[11:00:57] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/weight=100; selector: name=thanos-fe2005.codfw.wmnet
[11:01:10] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: name=thanos-fe2005.codfw.wmnet
[11:01:32] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/weight=100; selector: name=thanos-fe2006.codfw.wmnet
[11:01:39] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/weight=100; selector: name=thanos-fe2007.codfw.wmnet
[11:01:48] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: name=thanos-fe2006.codfw.wmnet
[11:01:54] <logmsgbot>	 !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: name=thanos-fe2007.codfw.wmnet
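After Emperor's `!log pool thanos-fe200[5-7]` at 11:00:47, mvernon@cumin2002 runs six conftool actions: each host gets `set/weight=100` before `set/pooled=yes`, and every `name=` selector targets exactly one host. The sketch below replays that sequence against a hypothetical in-memory model; the `Backend` class, the `apply_action` helper and the starting values (weight 0, pooled `no`) are assumptions for illustration and do not describe conftool's API or the hosts' actual prior state.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical stand-in for a conftool-managed host; not conftool's data model."""
    name: str
    weight: int = 0     # assumed starting weight for a new host
    pooled: str = "no"  # assumed starting state; the log only shows set/pooled=yes


def apply_action(backends: list[Backend], selector_name: str, key: str, value) -> int:
    """Set one attribute on every backend whose name matches exactly; return the match count."""
    matched = [b for b in backends if b.name == selector_name]
    for backend in matched:
        setattr(backend, key, value)
    return len(matched)


hosts = [Backend(f"thanos-fe200{i}.codfw.wmnet") for i in (5, 6, 7)]

# The six actions in the order they were logged between 11:00:57 and 11:01:54.
actions = [
    ("thanos-fe2005.codfw.wmnet", "weight", 100),
    ("thanos-fe2005.codfw.wmnet", "pooled", "yes"),
    ("thanos-fe2006.codfw.wmnet", "weight", 100),
    ("thanos-fe2007.codfw.wmnet", "weight", 100),
    ("thanos-fe2006.codfw.wmnet", "pooled", "yes"),
    ("thanos-fe2007.codfw.wmnet", "pooled", "yes"),
]
for name, key, value in actions:
    assert apply_action(hosts, name, key, value) == 1  # each selector names exactly one host

for host in hosts:
    print(host)
```

Because the selector matches by exact name, a typo in a hostname would silently match nothing; returning the match count makes that visible.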
[11:02:09] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[11:02:14] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[11:04:36] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1134999|Bump thumbnail steps to 75% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[11:04:39] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[11:05:15] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10720903 (cmooney) >>! In T387145#10713076, @Vgutierrez wrote: > reimaging them is fine by me  Ok cool.  So what we should do is run the 'decom' workflow against the existing servers, b...
[11:06:10] <wikibugs>	 (PS1) Kamila Součková: alertmanager: route T&S tasks to their Slack [puppet] - https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T385782)
[11:06:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P74701 and previous config saved to /var/cache/conftool/dbconfig/20250408-110618-fceratto.json
[11:06:48] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[11:10:39] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "Looks good to me... I gather this is just an interim patch and we'll apply the other one on top of it to add the functionality for multi-d" [software/homer] - https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: Volans)
[11:11:04] <wikibugs>	 (CR) Kamila Součková: "We will send failed job alerts as part of the migration to k8s. Let me know if you'd prefer a separate receiver that creates Phab tasks ra" [puppet] - https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T385782) (owner: Kamila Součková)
[11:11:10] <wikibugs>	 (PS1) Hnowlan: jobrunner, videoscaler: remove from lvs, backends [puppet] - https://gerrit.wikimedia.org/r/1135008 (https://phabricator.wikimedia.org/T354791)
[11:12:13] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "Seems to make sense thanks." [software/homer] - https://gerrit.wikimedia.org/r/1134714 (https://phabricator.wikimedia.org/T250415) (owner: Volans)
[11:13:33] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "Makes sense!" [software/homer] - https://gerrit.wikimedia.org/r/1134713 (https://phabricator.wikimedia.org/T250415) (owner: Volans)
[11:13:58] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134999|Bump thumbnail steps to 75% (T360589)]] (duration: 16m 35s)
[11:14:00] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[11:14:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10720933 (phaultfinder)
[11:17:25] <wikibugs>	 (CR) Clément Goubert: [C:-1] "Indentation issue" [puppet] - https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T385782) (owner: Kamila Součková)
[11:17:33] <wikibugs>	 (CR) Cathal Mooney: [C:+2] Add prepend-as-out variable for each site always [homer/public] - https://gerrit.wikimedia.org/r/1130095 (https://phabricator.wikimedia.org/T389606) (owner: Cathal Mooney)
[11:18:13] <wikibugs>	 (Merged) jenkins-bot: Add prepend-as-out variable for each site always [homer/public] - https://gerrit.wikimedia.org/r/1130095 (https://phabricator.wikimedia.org/T389606) (owner: Cathal Mooney)
[11:19:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10720963 (phaultfinder)
[11:21:25] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T391056)', diff saved to https://phabricator.wikimedia.org/P74702 and previous config saved to /var/cache/conftool/dbconfig/20250408-112124-fceratto.json
[11:21:28] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[11:21:40] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1171.eqiad.wmnet with reason: Maintenance
[11:23:08] <wikibugs>	 (CR) Tacsipacsi: "I’d rather fix the portals to send lower-case family name. If they send _Wiktionary_, they’re likely to send _Wikisłownik_ or _维基词典_ depen" [mediawiki-config] - https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: Wargo)
[11:24:30] <wikibugs>	 (PS1) Peter Fischer: CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) [mediawiki-config] - https://gerrit.wikimedia.org/r/1135010 (https://phabricator.wikimedia.org/T389053)
[11:24:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10720976 (phaultfinder)
[11:25:16] <wikibugs>	 (CR) CI reject: [V:-1] CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) [mediawiki-config] - https://gerrit.wikimedia.org/r/1135010 (https://phabricator.wikimedia.org/T389053) (owner: Peter Fischer)
[11:26:19] <wikibugs>	 (CR) Tacsipacsi: "(“Portals” is [wikimedia/portals](https://gerrit.wikimedia.org/r/q/project:wikimedia/portals).)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: Wargo)
[11:26:43] <wikibugs>	 (CR) Kamila Součková: [C:+1] jobrunner, videoscaler: remove from lvs, backends [puppet] - https://gerrit.wikimedia.org/r/1135008 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[11:27:19] <wikibugs>	 ops-codfw, DC-Ops, serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341 (Clement_Goubert) NEW
[11:27:54] <logmsgbot>	 !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2142.codfw.wmnet
[11:28:05] <wikibugs>	 ops-codfw, DC-Ops, serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10720990 (ops-monitoring-bot) depool host wikikube-worker2142.codfw.wmnet by cgoubert@cumin1002 with reason: Hardware failure
[11:28:56] <wikibugs>	 ops-codfw, DC-Ops, serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10721004 (Clement_Goubert)
[11:29:20] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10721007 (Clement_Goubert) a:Papaul
[11:29:38] <wikibugs>	 (PS2) Kamila Součková: alertmanager: route T&S tasks to their Slack [puppet] - https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T385782)
[11:30:12] <wikibugs>	 (CR) Kamila Součková: alertmanager: route T&S tasks to their Slack (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T385782) (owner: Kamila Součková)
[11:30:18] <wikibugs>	 (CR) Clément Goubert: [C:+1] alertmanager: route T&S tasks to their Slack (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T385782) (owner: Kamila Součková)
[11:30:32] <logmsgbot>	 !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host wikikube-worker2142.codfw.wmnet
[11:31:48] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1172.eqiad.wmnet with reason: Maintenance
[11:31:55] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T391056)', diff saved to https://phabricator.wikimedia.org/P74703 and previous config saved to /var/cache/conftool/dbconfig/20250408-113154-fceratto.json
[11:31:58] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[11:33:04] <wikibugs>	 ops-codfw, SRE, DC-Ops, serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10721015 (Clement_Goubert) Host drained forcefully and depooled.
[11:37:13] <jinxer-wm>	 FIRING: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:37:40] <elukey>	 jouncebot: now
[11:37:40] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 22 minute(s)
[11:37:43] <elukey>	 jouncebot: next
[11:37:43] <jouncebot>	 In 0 hour(s) and 22 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1200)
[11:37:58] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:38:44] <jinxer-wm>	 FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[11:39:25] <tappof>	 !incidents
[11:39:26] <sirenbot>	 6024 (ACKED)  [2x] ProbeDown sre (upload-https:443 probes/service eqsin)
[11:39:26] <sirenbot>	 6025 (ACKED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[11:39:45] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'.
[11:39:59] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[11:41:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:42:13] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:43:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T391056)', diff saved to https://phabricator.wikimedia.org/P74704 and previous config saved to /var/cache/conftool/dbconfig/20250408-114338-fceratto.json
[11:43:41] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[11:45:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:46:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[11:47:13] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:47:58] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:48:44] <jinxer-wm>	 RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[11:56:50] <wikibugs>	 (PS3) Clément Goubert: alertmanager: route T&S tasks to their Slack [puppet] - https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T388542) (owner: Kamila Součková)
[11:57:35] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Wikimedia-Incident: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10721144 (Peachey88)
[11:58:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P74705 and previous config saved to /var/cache/conftool/dbconfig/20250408-115845-fceratto.json
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1200)
[12:02:14] <wikibugs>	 (CR) Brouberol: [C:+2] airflow: set saner performance-related configs [deployment-charts] - https://gerrit.wikimedia.org/r/1134985 (https://phabricator.wikimedia.org/T390945) (owner: Brouberol)
[12:07:00] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[12:07:38] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[12:10:37] <wikibugs>	 (PS2) Stang: Add main page on non-English privatewiki to wgWhitelistRead [mediawiki-config] - https://gerrit.wikimedia.org/r/850266 (https://phabricator.wikimedia.org/T321796)
[12:12:06] <wikibugs>	 SRE, SRE-OnFire, Patch-Needs-Improvement: klaxon CLI tool for seeding an oncall handoff - https://phabricator.wikimedia.org/T317159#10721183 (Aklapper)
[12:12:55] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'.
[12:13:35] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[12:13:47] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[12:13:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P74706 and previous config saved to /var/cache/conftool/dbconfig/20250408-121352-fceratto.json
[12:14:24] <logmsgbot>	 !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[12:15:13] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529#10721205 (Aklapper) https://gerrit.wikimedia.org/r/c/operations/puppet/+/993068 is the only linked open patch left here.
[12:17:13] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:20:44] <wikibugs>	 (CR) Tiziano Fogli: [C:+2] snmp-exporter: adding pro4x module (pdu) [puppet] - https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[12:23:11] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] hieradata: Announce OpenStack API over v6 from cloudlb2002-dev [puppet] - https://gerrit.wikimedia.org/r/1134700 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[12:24:39] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[12:28:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T391056)', diff saved to https://phabricator.wikimedia.org/P74707 and previous config saved to /var/cache/conftool/dbconfig/20250408-122859-fceratto.json
[12:29:02] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
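After the depool at 11:31:55, db1172 comes back through four dbctl commits (11:43:39 through 12:28:59), and db1167 followed a similar cadence earlier. A small sketch that measures the spacing from the log timestamps shows the steps landing roughly 15 minutes apart; it reads nothing from dbctl, and the pooling percentage applied at each step is not visible in this log. `step_gaps` is a hypothetical helper.

```python
from datetime import datetime


def step_gaps(timestamps: list[str]) -> list[int]:
    """Seconds between consecutive same-day HH:MM:SS log timestamps."""
    times = [datetime.strptime(t, "%H:%M:%S") for t in timestamps]
    return [int((later - earlier).total_seconds()) for earlier, later in zip(times, times[1:])]


# Timestamps of the four "Repooling after maintenance db1172" dbctl commits above.
db1172_repools = ["11:43:39", "11:58:46", "12:13:53", "12:28:59"]
print(step_gaps(db1172_repools))  # → [907, 907, 906]
```

Parsing only the time of day is enough here because every step falls on 2025-04-08; a sequence crossing midnight would need the full date.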
[12:29:14] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1177.eqiad.wmnet with reason: Maintenance
[12:29:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T391056)', diff saved to https://phabricator.wikimedia.org/P74708 and previous config saved to /var/cache/conftool/dbconfig/20250408-122919-fceratto.json
[12:29:58] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] scap: Use PHP 8.1 when executing maintenance scripts [puppet] - https://gerrit.wikimedia.org/r/1134758 (https://phabricator.wikimedia.org/T390225) (owner: Scott French)
[12:33:04] <wikibugs>	 (PS1) Majavah: bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282)
[12:33:38] <wikibugs>	 (PS1) Peter Fischer: Search update pipeline: 504 handling, weighted tags rename [deployment-charts] - https://gerrit.wikimedia.org/r/1135019 (https://phabricator.wikimedia.org/T389053)
[12:35:04] <elukey>	 !log started the rollout of xz-utils' security upgrades (gradual during the next days) 
[12:35:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:36:48] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5228/co" [puppet] - https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[12:37:13] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[12:40:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T391056)', diff saved to https://phabricator.wikimedia.org/P74709 and previous config saved to /var/cache/conftool/dbconfig/20250408-124042-fceratto.json
[12:40:45] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[12:47:13] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:48:13] <wikibugs>	 (PS1) Effie Mouzeli: logging: add support for php 8.1 [puppet] - https://gerrit.wikimedia.org/r/1135020
[12:49:42] <wikibugs>	 (PS1) Effie Mouzeli: switch mwdebug2002 to php8.1 [puppet] - https://gerrit.wikimedia.org/r/1135021
[12:50:30] <wikibugs>	 (CR) CI reject: [V:-1] logging: add support for php 8.1 [puppet] - https://gerrit.wikimedia.org/r/1135020 (owner: Effie Mouzeli)
[12:50:38] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1048.eqiad.wmnet
[12:50:48] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2048.codfw.wmnet
[12:52:04] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[12:52:32] <wikibugs>	 (CR) Jelto: [C:-1] "one comment in-line" [puppet] - https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) (owner: AOkoth)
[12:55:04] <elukey>	 jouncebot: now
[12:55:04] <jouncebot>	 For the next 0 hour(s) and 4 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1200)
[12:55:08] <elukey>	 jouncebot: next
[12:55:08] <jouncebot>	 In 0 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1300)
[12:55:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P74711 and previous config saved to /var/cache/conftool/dbconfig/20250408-125549-fceratto.json
[12:56:22] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1048.eqiad.wmnet
[12:57:14] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2048.codfw.wmnet
[12:57:50] <wikibugs>	 (CR) Effie Mouzeli: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135021 (owner: Effie Mouzeli)
[13:00:04] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1300).
[13:00:05] <jouncebot>	 Superpes, seanleong-wmde, abijeet, dcausse, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:11] <Superpes>	 Hi :)
[13:00:15] <Lucas_WMDE>	 o/
[13:00:31] <abijeet>	 o/
[13:00:37] <Lucas_WMDE>	 I can deploy today :)
[13:00:44] <kart_>	 I'm here as well. Lucas_WMDE go ahead! :)
[13:00:53] <elukey>	 I am here if needed folks
[13:00:56] <Lucas_WMDE>	 let’s start with the ptwiktionary change
[13:01:02] <elukey>	 if you see anything that appears to be stuck, ping me
[13:01:04] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1134776 (https://phabricator.wikimedia.org/T391299) (owner: Superpes15)
[13:01:11] <Lucas_WMDE>	 ok
[13:01:40] <Lucas_WMDE>	 there’s a circuit breaker error at the top of logspam-watch that’s about to fall out of the 60min window, looks quiet otherwise
[13:01:54] <wikibugs>	 (Merged) jenkins-bot: [ptwiktionary] Create a Wikisaurus namespace [mediawiki-config] - https://gerrit.wikimedia.org/r/1134776 (https://phabricator.wikimedia.org/T391299) (owner: Superpes15)
[13:02:16] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1134776|[ptwiktionary] Create a Wikisaurus namespace (T391299)]]
[13:02:19] <stashbot>	 T391299: Add Wikisaurus namespace to Portuguese Wiktionary - https://phabricator.wikimedia.org/T391299
[13:02:21] <dcausse>	 o/
[13:02:33] <Superpes>	 Lucas_WMDE Please remember that after deploy NamespaceDupes.php needs to be run
[13:02:48] <seanleong-wmde>	 Hi, I'm here as well o/
[13:04:06] * Lucas_WMDE idly wonders if mwscript-k8s is considered stable enough to warrant updating https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#namespaceDupes by now
[13:04:16] <Lucas_WMDE>	 I’ll give it a shot later
[13:04:56] <wikibugs>	 (CR) Alexandros Kosiaris: "❤️" [puppet] - https://gerrit.wikimedia.org/r/1127150 (https://phabricator.wikimedia.org/T385995) (owner: JHathaway)
[13:06:19] <wikibugs>	 (PS2) Jelto: ceph: add gitlab dummy credentials [labs/private] - https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922)
[13:08:29] <marostegui>	 !log TEST maintenance s1 eqiad dbmaint T391346
[13:08:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:32] <stashbot>	 T391346: Database maintenance map not working - https://phabricator.wikimedia.org/T391346
[13:08:43] <wikibugs>	 (CR) Jelto: "I uploaded a new patchset which uses the new `Ceph::S3::Credential` structure from Id8979165b96d737addc676f3abf3f088a48eda48." [labs/private] - https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[13:09:04] <Lucas_WMDE>	 sync-testservers-k8s took 4m22s, that feels unusually slow I think (cc elukey)
[13:09:08] <Lucas_WMDE>	 but not critical yet
[13:09:18] <Superpes>	 :O
[13:09:22] <wikibugs>	 (CR) Elukey: [C:+1] tox.ini: remove optimization for tox <4 [software/homer] - https://gerrit.wikimedia.org/r/1134712 (owner: Volans)
[13:09:23] <Lucas_WMDE>	 we can see how long the full deploy takes
[13:09:36] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 superpes, lucaswerkmeister-wmde: Backport for [[gerrit:1134776|[ptwiktionary] Create a Wikisaurus namespace (T391299)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:09:38] <stashbot>	 T391299: Add Wikisaurus namespace to Portuguese Wiktionary - https://phabricator.wikimedia.org/T391299
[13:09:42] <Lucas_WMDE>	 Superpes: please test :)
[13:10:24] <Superpes>	 Looks fine! Thanks Lucas_WMDE
[13:10:30] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 superpes, lucaswerkmeister-wmde: Continuing with sync
[13:10:32] <Lucas_WMDE>	 ok, thanks!
[13:10:57] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P74712 and previous config saved to /var/cache/conftool/dbconfig/20250408-131056-fceratto.json
[13:11:16] <Lucas_WMDE>	 ok sync-canaries-k8s only took 34s, so that was fine
[13:12:04] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[13:12:07] <wikibugs>	 (PS11) Tiziano Fogli: netbox-hiera: adding pdu type [puppet] - https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231)
[13:12:07] <wikibugs>	 (PS41) Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231)
[13:12:07] <wikibugs>	 (PS1) Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231)
[13:12:39] <wikibugs>	 (CR) CI reject: [V:-1] pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[13:12:55] <wikibugs>	 (CR) CI reject: [V:-1] pdu_config_netbox: also fetch older PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[13:13:13] <wikibugs>	 (CR) Elukey: [C:+1] capirca: optimization refactor [software/homer] - https://gerrit.wikimedia.org/r/1134713 (https://phabricator.wikimedia.org/T250415) (owner: Volans)
[13:13:24] <wikibugs>	 (CR) MVernon: [C:+1] "LGTM, thanks! I added a suggested comment." [labs/private] - https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[13:13:28] <Lucas_WMDE>	 I think once this change is done (and I’ve run namespaceDupes), we could probably deploy the changes by seanleong-wmde, dcausse and myself all together
[13:13:30] <Lucas_WMDE>	 they look harmless enough
[13:13:43] <seanleong-wmde>	 Okie
[13:13:44] <wikibugs>	 (CR) MVernon: [C:+1] "Done" [labs/private] - https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[13:13:48] <dcausse>	 yes mine is a noop
[13:14:17] <wikibugs>	 (CR) CI reject: [V:-1] netbox-hiera: adding pdu type [puppet] - https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[13:14:24] <wikibugs>	 (PS3) Jelto: ceph: add gitlab dummy credentials [labs/private] - https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922)
[13:14:28] <wikibugs>	 SRE, Dumps-Generation, Wikidata: various weekly and daily dumps run from systemd timers are broken - https://phabricator.wikimedia.org/T281267#10721388 (fgiunchedi) I'm untagging o11y for now, please reach out as needed
[13:14:30] <wikibugs>	 (CR) Jelto: ceph: add gitlab dummy credentials (1 comment) [labs/private] - https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[13:14:30] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C:+1] "Can confirm that the code is gone from wmf.23+:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1134064 (https://phabricator.wikimedia.org/T389429) (owner: Ebernhardson)
[13:15:15] <wikibugs>	 (CR) Elukey: [C:+1] homer: move NetboxData initialization [software/homer] - https://gerrit.wikimedia.org/r/1134714 (https://phabricator.wikimedia.org/T250415) (owner: Volans)
[13:17:41] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134776|[ptwiktionary] Create a Wikisaurus namespace (T391299)]] (duration: 15m 24s)
[13:17:44] <stashbot>	 T391299: Add Wikisaurus namespace to Portuguese Wiktionary - https://phabricator.wikimedia.org/T391299
[13:18:00] <Lucas_WMDE>	 woof, that’s a lot of links to fix
[13:18:14] <Lucas_WMDE>	 but no issues apparently
[13:18:53] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@deploy1003 ~ $ mwscript-k8s --comment=T391299 --follow -- namespaceDupes ptwiktionary --fix
[13:18:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:18:59] <wikibugs>	 (CR) MVernon: [C:+1] "I feel gerrit shouldn't remove the +1 when you apply my suggestion, but there we are :-)" [labs/private] - https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[13:19:35] <Lucas_WMDE>	 sync-prod-k8s finished in 5m58s btw, which feels like a normal duration
[13:19:39] <wikibugs>	 (PS2) Majavah: bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282)
[13:19:39] <wikibugs>	 (PS1) Majavah: P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - https://gerrit.wikimedia.org/r/1135023
[13:19:41] <wikibugs>	 (CR) Bking: [C:+2] cirrus: disable completion indices in codfw [puppet] - https://gerrit.wikimedia.org/r/1134969 (https://phabricator.wikimedia.org/T388610) (owner: DCausse)
[13:19:42] <Superpes>	 Yep, they used the prefix without having a namespace lmao
[13:19:59] <wikibugs>	 (PS2) Majavah: P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282)
[13:20:01] <wikibugs>	 (PS3) Majavah: bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282)
[13:21:13] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1133317 (https://phabricator.wikimedia.org/T384455) (owner: Seanleong-wmde)
[13:21:13] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1134064 (https://phabricator.wikimedia.org/T389429) (owner: Ebernhardson)
[13:21:14] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1134691 (https://phabricator.wikimedia.org/T371196) (owner: Lucas Werkmeister (WMDE))
[13:21:28] <Superpes>	 Thanks for your assistance Lucas_WMDE :3
[13:21:34] <Lucas_WMDE>	 np :)
[13:22:17] <wikibugs>	 (Merged) jenkins-bot: Increase entityAccessLimit from 400 to 500 for all wikis except commons. [mediawiki-config] - https://gerrit.wikimedia.org/r/1133317 (https://phabricator.wikimedia.org/T384455) (owner: Seanleong-wmde)
[13:22:21] <wikibugs>	 (Merged) jenkins-bot: Remove unused config vars [mediawiki-config] - https://gerrit.wikimedia.org/r/1134064 (https://phabricator.wikimedia.org/T389429) (owner: Ebernhardson)
[13:22:24] <wikibugs>	 (Merged) jenkins-bot: Fix EntitySchema propertyType on Test Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1134691 (https://phabricator.wikimedia.org/T371196) (owner: Lucas Werkmeister (WMDE))
[13:22:34] <wikibugs>	 (CR) CI reject: [V:-1] P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[13:22:49] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1133317|Increase entityAccessLimit from 400 to 500 for all wikis except commons. (T384455)]], [[gerrit:1134064|Remove unused config vars (T389429)]], [[gerrit:1134691|Fix EntitySchema propertyType on Test Wikidata (T371196)]]
[13:22:54] <wikibugs>	 (CR) CI reject: [V:-1] P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[13:22:55] <stashbot>	 T384455: Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455
[13:22:55] <stashbot>	 T389429: Investigate whether it’s intentional / correct that default CirrusSearch setups run cirrusSearchElasticaWrite as separate jobs - https://phabricator.wikimedia.org/T389429
[13:22:55] <stashbot>	 T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196
[13:24:22] <wikibugs>	 (PS3) Majavah: P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282)
[13:24:22] <wikibugs>	 (PS4) Majavah: bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282)
[13:25:11] <wikibugs>	 (CR) Bking: "We're OK with temporarily adding these flags. We should review after the maintenance...which reminds me, I need to start a task for undoin" [cookbooks] - https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T383811) (owner: Bking)
[13:25:59] <wikibugs>	 (PS1) Stevemunene: zookeeper: onboard an-conf1004 to the cluster [puppet] - https://gerrit.wikimedia.org/r/1135025 (https://phabricator.wikimedia.org/T374922)
[13:26:01] <wikibugs>	 (PS1) Stevemunene: zookeeper: onboard an-conf1005 to the cluster [puppet] - https://gerrit.wikimedia.org/r/1135026 (https://phabricator.wikimedia.org/T374922)
[13:26:02] <wikibugs>	 (PS1) Stevemunene: zookeeper: onboard an-conf1006 to the cluster [puppet] - https://gerrit.wikimedia.org/r/1135027 (https://phabricator.wikimedia.org/T374922)
[13:26:04] <wikibugs>	 (PS1) Stevemunene: zookeeper: remove an-conf100[1-3] from the cluster [puppet] - https://gerrit.wikimedia.org/r/1135028 (https://phabricator.wikimedia.org/T374922)
[13:26:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T391056)', diff saved to https://phabricator.wikimedia.org/P74714 and previous config saved to /var/cache/conftool/dbconfig/20250408-132603-fceratto.json
[13:26:07] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[13:26:19] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1178.eqiad.wmnet with reason: Maintenance
[13:26:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T391056)', diff saved to https://phabricator.wikimedia.org/P74715 and previous config saved to /var/cache/conftool/dbconfig/20250408-132626-fceratto.json
[13:26:38] <wikibugs>	 (CR) CI reject: [V:-1] P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[13:26:52] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (DIFF 1 NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compile" [puppet] - https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[13:29:30] <Lucas_WMDE>	 sync-testservers-k8s feels fairly slow again o_O
[13:29:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[13:29:34] <Lucas_WMDE>	 yeah, just finished after 4m23s
[13:30:03] <wikibugs>	 (PS3) AOkoth: site: revert releases2003 to insetup [puppet] - https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595)
[13:30:16] <wikibugs>	 (CR) AOkoth: site: revert releases2003 to insetup (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) (owner: AOkoth)
[13:30:20] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, ebernhardson, seanleong-wmde: Backport for [[gerrit:1133317|Increase entityAccessLimit from 400 to 500 for all wikis except commons. (T384455)]], [[gerrit:1134064|Remove unused config vars (T389429)]], [[gerrit:1134691|Fix EntitySchema propertyType on Test Wikidata (T371196)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:30:25] <stashbot>	 T384455: Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455
[13:30:26] <stashbot>	 T389429: Investigate whether it’s intentional / correct that default CirrusSearch setups run cirrusSearchElasticaWrite as separate jobs - https://phabricator.wikimedia.org/T389429
[13:30:26] <stashbot>	 T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196
[13:30:26] <wikibugs>	 (PS4) Majavah: P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282)
[13:30:26] <wikibugs>	 (PS5) Majavah: bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282)
[13:30:36] <Lucas_WMDE>	 my test is working as expected on testwikidatawiki
[13:30:52] <Lucas_WMDE>	 and I realized I can’t 100% test it on wikidatawiki because the code hasn’t rolled out there
[13:30:52] <dcausse>	 Lucas_WMDE: I can't test mine
[13:31:04] <seanleong-wmde>	 mine is working correctly
[13:31:10] <Lucas_WMDE>	 it’s supposed to have no difference, but at the moment I can’t be sure if it has no difference because the config is doing the right thing or because the code isn’t there yet
[13:31:15] <Lucas_WMDE>	 but I’ll just hope that it’s fine
[13:31:17] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, ebernhardson, seanleong-wmde: Continuing with sync
[13:32:45] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (NOOP 2 DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compile" [puppet] - https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[13:34:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10721500 (phaultfinder)
[13:34:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10721501 (phaultfinder)
[13:35:24] <wikibugs>	 (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1135029
[13:36:27] <wikibugs>	 (PS42) Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231)
[13:36:27] <wikibugs>	 (PS2) Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231)
[13:36:53] <wikibugs>	 (CR) CI reject: [V:-1] pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[13:37:09] <wikibugs>	 (CR) Btullis: [C:+1] elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T383811) (owner: Bking)
[13:37:44] <wikibugs>	 (CR) Cathal Mooney: [C:+1] bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[13:37:56] <wikibugs>	 (CR) Arnaudb: [C:+1] ceph: add gitlab dummy credentials [labs/private] - https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[13:38:15] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T391056)', diff saved to https://phabricator.wikimedia.org/P74716 and previous config saved to /var/cache/conftool/dbconfig/20250408-133814-fceratto.json
[13:38:18] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[13:38:19] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133317|Increase entityAccessLimit from 400 to 500 for all wikis except commons. (T384455)]], [[gerrit:1134064|Remove unused config vars (T389429)]], [[gerrit:1134691|Fix EntitySchema propertyType on Test Wikidata (T371196)]] (duration: 15m 30s)
[13:38:24] <stashbot>	 T384455: Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455
[13:38:24] <stashbot>	 T389429: Investigate whether it’s intentional / correct that default CirrusSearch setups run cirrusSearchElasticaWrite as separate jobs - https://phabricator.wikimedia.org/T389429
[13:38:25] <stashbot>	 T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196
[13:38:31] <Lucas_WMDE>	 right, time for abijeet :)
[13:38:34] <dcausse>	 \o/
[13:38:39] <dcausse>	 Lucas_WMDE: thanks!
[13:38:40] <Lucas_WMDE>	 can the backports for the two branches be deployed at the same time?
[13:38:53] <wikibugs>	 (CR) CI reject: [V:-1] pdu_config_netbox: also fetch older PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[13:38:57] <Lucas_WMDE>	 dcausse: np :)
[13:38:58] <seanleong-wmde>	 Lucas_WMDE Thanks!
[13:39:10] <abijeet>	 Lucas_WMDE, sounds good
[13:39:20] <abijeet>	 Lucas_WMDE, we can deploy both at the same time, sure
[13:39:22] * lucaswerkmeister is also amused by the course T389429 has taken ;)
[13:39:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10721515 (phaultfinder)
[13:39:44] <Lucas_WMDE>	 ok
[13:39:46] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1134976 (https://phabricator.wikimedia.org/T389176) (owner: Abijeet Patro)
[13:39:47] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1134977 (https://phabricator.wikimedia.org/T389176) (owner: Abijeet Patro)
[13:39:56] <Lucas_WMDE>	 I should’ve remembered to +2 them in advance, meh
[13:40:13] <wikibugs>	 (CR) Cathal Mooney: [C:+1] "Looks good, should sort out the source IP anyway." [puppet] - https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[13:40:29] <wikibugs>	 (CR) Ssingh: "How does this tie in to:" [puppet] - https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[13:41:24] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[13:41:48] <wikibugs>	 (CR) Elukey: [C:+2] services: use the kafka svc endpoint for Tegola [deployment-charts] - https://gerrit.wikimedia.org/r/1133142 (https://phabricator.wikimedia.org/T373115) (owner: Elukey)
[13:41:48] <wikibugs>	 (Merged) jenkins-bot: ArticleFooterEntrypointCard: Change the way codex is loaded [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1134976 (https://phabricator.wikimedia.org/T389176) (owner: Abijeet Patro)
[13:41:51] <wikibugs>	 (Merged) jenkins-bot: ArticleFooterEntrypointCard: Change the way codex is loaded [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1134977 (https://phabricator.wikimedia.org/T389176) (owner: Abijeet Patro)
[13:41:56] <kart_>	 Lucas_WMDE: CI is super fast now :)
[13:42:07] <wikibugs>	 (CR) Volans: [C:+2] tox.ini: remove optimization for tox <4 [software/homer] - https://gerrit.wikimedia.org/r/1134712 (owner: Volans)
[13:42:08] <Lucas_WMDE>	 nice!
[13:42:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1134976|ArticleFooterEntrypointCard: Change the way codex is loaded (T389176)]], [[gerrit:1134977|ArticleFooterEntrypointCard: Change the way codex is loaded (T389176)]]
[13:42:18] <stashbot>	 T389176: Re-enable footer entry point to MinT for Wiki Readers - https://phabricator.wikimedia.org/T389176
[13:44:12] <wikibugs>	 (CR) Majavah: "My understanding is that configuration ensures that when the system boots up, `anycast-healthchecker.service` is started before `bird.serv" [puppet] - https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[13:45:18] <wikibugs>	 (CR) Ssingh: [C:+1] jobrunner, videoscaler: remove from lvs, backends [puppet] - https://gerrit.wikimedia.org/r/1135008 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[13:45:25] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10721540 (Jhancock.wm) @elukey all good! yesterday was rack unpacking day and i did almost nothing else =#  i replaced a random drive...
[13:45:43] <logmsgbot>	 !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on releases2003.codfw.wmnet with reason: Bookworm Re-image
[13:47:02] <wikibugs>	 (CR) AOkoth: "https://puppet-compiler.wmflabs.org/output/1134740/5233/" [puppet] - https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) (owner: AOkoth)
[13:48:24] <wikibugs>	 (PS1) Stevemunene: hdfs: replace an-conf100[1-3] with an-conf100[4-6] [puppet] - https://gerrit.wikimedia.org/r/1135031 (https://phabricator.wikimedia.org/T374922)
[13:49:12] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 abi, lucaswerkmeister-wmde: Backport for [[gerrit:1134976|ArticleFooterEntrypointCard: Change the way codex is loaded (T389176)]], [[gerrit:1134977|ArticleFooterEntrypointCard: Change the way codex is loaded (T389176)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:49:15] <stashbot>	 T389176: Re-enable footer entry point to MinT for Wiki Readers - https://phabricator.wikimedia.org/T389176
[13:49:20] <abijeet>	 Lucas_WMDE, testing
[13:49:51] <Lucas_WMDE>	 elukey: I don’t know if it counts as stuck but sync-testservers-k8s took 4m22s, 4m23s and 4m01s, which seems unusually slow
[13:50:00] <Lucas_WMDE>	 abijeet: thanks!
[13:50:41] <Lucas_WMDE>	 hm, that’s suspiciously close to the “4 minutes” mentioned in T374907 🤔
[13:50:41] <stashbot>	 T374907: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change - https://phabricator.wikimedia.org/T374907
[13:51:06] <elukey>	 Lucas_WMDE: that I don't know, but if it didn't block I am happy
[13:51:25] <Lucas_WMDE>	 ok
[13:51:58] <wikibugs>	 (CR) Btullis: cirrussearch: Add regex data for cirrussearch hosts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1134765 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[13:53:08] <wikibugs>	 (CR) Btullis: [C:+1] cirrussearch: Add row A hosts to new cirrussearch role [puppet] - https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[13:53:19] <wikibugs>	 (Merged) jenkins-bot: tox.ini: remove optimization for tox <4 [software/homer] - https://gerrit.wikimedia.org/r/1134712 (owner: Volans)
[13:53:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P74717 and previous config saved to /var/cache/conftool/dbconfig/20250408-135321-fceratto.json
[13:54:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:55:27] <wikibugs>	 (CR) AOkoth: [C:+2] site: revert releases2003 to insetup [puppet] - https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) (owner: AOkoth)
[13:56:24] <abijeet>	 Lucas_WMDE, we can keep this patch, but there is a separate issue :-( bit silly and messy: 1135032: ArticleFooterEntrypointCard: Add @wikimedia/codex as a dependency | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/1135032
[13:56:42] <Lucas_WMDE>	 I see…
[13:56:59] <Lucas_WMDE>	 I was going to ask why that change was made in the first place, I didn’t understand it
[13:57:06] <Lucas_WMDE>	 (I guess I still don’t understand it)
[13:57:09] <Lucas_WMDE>	 but let’s roll it out then…
[13:57:15] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 abi, lucaswerkmeister-wmde: Continuing with sync
[13:57:33] <Lucas_WMDE>	 abijeet: do you have a reviewer? or do you want to try to explain it to me until I’m confident to +2 the change for backporting? :D
[13:57:44] <Lucas_WMDE>	 well, I suppose we have very little time left in the window, meh
[13:57:45] <Lucas_WMDE>	 jouncebot: next
[13:57:45] <jouncebot>	 In 1 hour(s) and 2 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1500)
[13:57:48] <Lucas_WMDE>	 ok we have some time afterwards
[13:57:49] <wikibugs>	 (CR) Hnowlan: [C:+2] jobrunner, videoscaler: remove from lvs, backends [puppet] - https://gerrit.wikimedia.org/r/1135008 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[13:58:21] <Lucas_WMDE>	 I thought codex.js was the recommended way to load codex per https://www.mediawiki.org/wiki/Codex#Loading_a_subset_of_Codex_components_(recommended_for_skins_and_extensions)
[13:59:39] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync
[13:59:46] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync
[14:00:19] <Lucas_WMDE>	 abijeet: did you try other paths to codex.js? I wonder if it maybe needed to be ./codex.js or ../../codex.js instead of ../codex.js
[14:00:52] <Lucas_WMDE>	 e.g. https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikibase/+/7a17e84550f9d3adaefa175363774fb98e3ebb80/repo/resources/wikibase.vector.scopedtypeaheadsearch/ScopedTypeaheadSearch.vue#44 has ../../codex.js
[14:01:23] <wikibugs>	 (CR) Bking: [C:+2] elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T383811) (owner: Bking)
[14:01:31] <Lucas_WMDE>	 based on "localBasePath": "minT/entrypoints" and minT/entrypoints/ArticleFooterEntrypointCard.vue, I would suspect you need ./codex.js
[14:01:35] <Lucas_WMDE>	 rather than ..
[14:01:47] <Lucas_WMDE>	 since it should end up in the same directory (minT/entrypoints/)
[14:01:48] <abijeet>	 Lucas_WMDE, the rest of the codebase uses require( '@wikimedia/codex' ); - https://gerrit.wikimedia.org/g/mediawiki/extensions/ContentTranslation/+/0fda23770042887d2530018092844de2ee5b6913/minT/src/ConfirmTopicPage.vue#135
[14:02:08] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.reimage for host releases2003.codfw.wmnet with OS bookworm
[14:02:37] <wikibugs>	 SRE, SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352 (MatthewVernon) NEW
[14:03:21] <abijeet>	 It just made sense to use the same approach in this file as the rest of the extension.
[14:04:39] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134976|ArticleFooterEntrypointCard: Change the way codex is loaded (T389176)]], [[gerrit:1134977|ArticleFooterEntrypointCard: Change the way codex is loaded (T389176)]] (duration: 22m 23s)
[14:04:41] <stashbot>	 T389176: Re-enable footer entry point to MinT for Wiki Readers - https://phabricator.wikimedia.org/T389176
[14:04:54] <Lucas_WMDE>	 ok…
[14:05:10] <Lucas_WMDE>	 but then why do several RL modules still use the CodexModule class?
[14:06:23] <abijeet>	 Lucas_WMDE, thanks though. I'll try to get this reviewed and tested. It's not possible to test this locally, hence the back and forth.
[14:06:39] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync
[14:06:49] <Lucas_WMDE>	 ok
[14:06:56] <Lucas_WMDE>	 then I guess we’re done with the window for now?
[14:07:10] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync
[14:07:24] <wikibugs>	 (PS1) Anzx: madwiktionary: add logo, icon, wordmark and tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1135035 (https://phabricator.wikimedia.org/T391318)
[14:07:50] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync
[14:08:10] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync
[14:08:22] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135035 (https://phabricator.wikimedia.org/T391318) (owner: Anzx)
[14:08:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P74718 and previous config saved to /var/cache/conftool/dbconfig/20250408-140828-fceratto.json
[14:10:36] <hnowlan>	 !log setting jobrunner and videoscaler to service_setup in puppet 
[14:10:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:10:59] <hnowlan>	 some IPVS alerts expected 
[14:11:26] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:11:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:11:41] <wikibugs>	 SRE, SRE-swift-storage, Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354 (MatthewVernon) NEW
[14:12:48] <hnowlan>	 !log restarting pybal on A:lvs-secondary-eqiad to pick up removal of jobrunner and videoscaler 
[14:12:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:43] <wikibugs>	 (PS2) Anzx: arywiki: enable wgMinervaEnableSiteNotice [mediawiki-config] - https://gerrit.wikimedia.org/r/1135036
[14:14:26] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135036 (owner: Anzx)
[14:17:10] <wikibugs>	 (CR) Aklapper: [V:+2 C:+2] Penalize on nonsensical large story point values [phabricator/antivandalism] (wmf/stable) - https://gerrit.wikimedia.org/r/1134379 (https://phabricator.wikimedia.org/T391204) (owner: Aklapper)
[14:19:25] <logmsgbot>	 !log fnegri@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddumps1001.wikimedia.org with reason: down for maintenance
[14:19:29] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10721810 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=591bcb32-8025-4bce-af2c-49d023d1b4ca) set by fnegri@cumin1002 for 1 da...
[14:19:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10721812 (phaultfinder)
[14:21:34] <wikibugs>	 SRE, SRE-swift-storage, Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10721827 (Aklapper)
[14:22:16] <hnowlan>	 !log restarting pybal on lvs1019 (low-traffic primary) to pick up removal of jobrunner and videoscaler 
[14:22:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:44] <wikibugs>	 (CR) Superpes15: [C:+1] madwiktionary: add logo, icon, wordmark and tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1135035 (https://phabricator.wikimedia.org/T391318) (owner: Anzx)
[14:23:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T391056)', diff saved to https://phabricator.wikimedia.org/P74720 and previous config saved to /var/cache/conftool/dbconfig/20250408-142335-fceratto.json
[14:23:39] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[14:23:41] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1192.eqiad.wmnet with reason: Maintenance
[14:23:47] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T391056)', diff saved to https://phabricator.wikimedia.org/P74721 and previous config saved to /var/cache/conftool/dbconfig/20250408-142347-fceratto.json
[14:24:29] <wikibugs>	 (CR) Scott French: "Thank you both for the review! Ahmon, any concerns about giving this a try today?" [puppet] - https://gerrit.wikimedia.org/r/1134758 (https://phabricator.wikimedia.org/T390225) (owner: Scott French)
[14:26:19] <wikibugs>	 (PS1) Xcollazo: Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556)
[14:26:48] <wikibugs>	 (CR) CI reject: [V:-1] Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: Xcollazo)
[14:27:12] <abijeet>	 Lucas_WMDE, I changed the code to use './codex.js'; thanks for that recommendation. I think we need to review some of the other RL modules in the extension and how we are using Codex there. Patch: 1135032: ArticleFooterEntrypointCard: Fix path to codex.js | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/1135032
[14:27:36] <wikibugs>	 (PS1) Stevemunene: airflow: cleanup deployment charts [deployment-charts] - https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359)
[14:28:26] <wikibugs>	 (PS2) Xcollazo: Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556)
[14:28:27] <wikibugs>	 (PS1) Kamila Součková: Revert^2 "k8s::client: Allow for install of all kubectl versions" [puppet] - https://gerrit.wikimedia.org/r/1135046
[14:28:33] * Lucas_WMDE looks
[14:28:44] <wikibugs>	 (CR) CI reject: [V:-1] Revert^2 "k8s::client: Allow for install of all kubectl versions" [puppet] - https://gerrit.wikimedia.org/r/1135046 (owner: Kamila Součková)
[14:28:48] <wikibugs>	 (CR) CI reject: [V:-1] Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: Xcollazo)
[14:28:59] <Lucas_WMDE>	 worth a try imho… can it be tested on beta?
[14:29:32] <wikibugs>	 (CR) Btullis: Absent systemd timers to stop attempting to generate enterprise HTML dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: Xcollazo)
[14:31:17] <wikibugs>	 (CR) Jelto: [V:+2 C:+2] ceph: add gitlab dummy credentials [labs/private] - https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[14:31:19] <hnowlan>	 !log restarting pybal on A:lvs-secondary-codfw
[14:31:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:48] <abijeet>	 Testing on beta would require a config change. I was able to test it locally by changing some code. Did not see any errors in the console. I should have not been lazy and done that in the first place.
[14:32:26] <wikibugs>	 SRE, SRE-swift-storage, Ceph, collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10721936 (Jelto)
[14:33:07] <Lucas_WMDE>	 ok
[14:36:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T391056)', diff saved to https://phabricator.wikimedia.org/P74722 and previous config saved to /var/cache/conftool/dbconfig/20250408-143628-fceratto.json
[14:36:32] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[14:36:46] <hnowlan>	 !log restarting pybal on A:lvs-low-traffic-codfw to remove jobrunner and videoscaler
[14:36:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:39:31] <wikibugs>	 (CR) Ssingh: [C:+1] bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[14:42:16] <wikibugs>	 (PS1) Stevemunene: replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - https://gerrit.wikimedia.org/r/1135049 (https://phabricator.wikimedia.org/T374922)
[14:45:33] <wikibugs>	 (PS1) Vgutierrez: sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340)
[14:49:06] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] "One minor nitpick, otherwise LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/1133389 (owner: Elukey)
[14:50:00] <wikibugs>	 (CR) Ahmon Dancy: [C:+1] "Looks reasonable to me.  No concerns about giving it a try today." [puppet] - https://gerrit.wikimedia.org/r/1134758 (https://phabricator.wikimedia.org/T390225) (owner: Scott French)
[14:51:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P74723 and previous config saved to /var/cache/conftool/dbconfig/20250408-145136-fceratto.json
[14:54:13] <logmsgbot>	 !log aokoth@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host releases2003.codfw.wmnet with OS bookworm
[14:54:24] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10722024 (Gehel)
[14:57:44] <wikibugs>	 (CR) Elukey: [C:+1] "Left a question but everything looks really good, I like the refactoring." [software/homer] - https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: Volans)
[14:57:59] <wikibugs>	 SRE, fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947#10722040 (Jgreen) a:Jgreen→None
[14:58:10] <wikibugs>	 SRE, fundraising-tech-ops: Q1:rack/setup/install fran1002 - https://phabricator.wikimedia.org/T369940#10722043 (Jgreen) a:Jgreen→None
[14:58:31] <wikibugs>	 SRE, fundraising-tech-ops: Q1:rack/setup/install frdb1007 - https://phabricator.wikimedia.org/T369922#10722047 (Jgreen) a:Jgreen→None
[14:58:44] <wikibugs>	 (CR) Elukey: [C:+1] "This is my take as well yes, the callback was split into ask_approval() and print(device_diff), I assume we'll see why in the next set of " [software/homer] - https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: Volans)
[14:58:54] <wikibugs>	 SRE, fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#10722055 (Jgreen) a:Jgreen→None
[14:59:05] <wikibugs>	 SRE, fundraising-tech-ops: Q1:rack/setup/install fransc1001 - https://phabricator.wikimedia.org/T367814#10722059 (Jgreen) a:Jgreen→None
[14:59:21] <wikibugs>	 SRE, fundraising-tech-ops: Q1:rack/setup/install fransw1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T367801#10722060 (Jgreen) a:Jgreen→None
[15:00:05] <jouncebot>	 jelto, arnoldokoth, and mutante: It is that lovely time of the day again! You are hereby commanded to deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1500).
[15:01:48] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: T391357
[15:01:51] <stashbot>	 T391357: Deploy Phabricator/Phorge 2025-04-08 - https://phabricator.wikimedia.org/T391357
[15:02:06] <logmsgbot>	 !log brennen@deploy1003 Started deploy [phabricator/deployment@99aa712]: test deploy phab2002 for T391357
[15:02:08] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2002.codfw.wmnet with reason: T391357
[15:02:11] <wikibugs>	 (PS1) Clément Goubert: CampaignEvents: Migrate updateutcts-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867)
[15:02:25] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: Clément Goubert)
[15:02:35] <wikibugs>	 (CR) CI reject: [V:-1] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: Clément Goubert)
[15:02:48] <logmsgbot>	 !log brennen@deploy1003 Finished deploy [phabricator/deployment@99aa712]: test deploy phab2002 for T391357 (duration: 00m 42s)
[15:03:07] <logmsgbot>	 !log brennen@deploy1003 Started deploy [phabricator/deployment@99aa712]: deploy phab1004 for T391357
[15:03:45] <logmsgbot>	 !log brennen@deploy1003 Finished deploy [phabricator/deployment@99aa712]: deploy phab1004 for T391357 (duration: 00m 38s)
[15:03:53] <wikibugs>	 (PS13) Elukey: services: enable ingress for Kartotherian [deployment-charts] - https://gerrit.wikimedia.org/r/1133389
[15:04:04] <wikibugs>	 (CR) Elukey: services: enable ingress for Kartotherian (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1133389 (owner: Elukey)
[15:05:03] <wikibugs>	 (PS2) Clément Goubert: CampaignEvents: Migrate updateutcts-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867)
[15:06:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P74724 and previous config saved to /var/cache/conftool/dbconfig/20250408-150643-fceratto.json
[15:07:15] <wikibugs>	 (CR) CI reject: [V:-1] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: Clément Goubert)
[15:07:52] <wikibugs>	 (PS1) Ssingh: sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[15:10:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:13:18] <wikibugs>	 (PS2) Vgutierrez: sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340)
[15:13:25] <wikibugs>	 (CR) Vgutierrez: sre: Add LibericaEtcdErrors alert (1 comment) [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[15:14:35] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10722178 (phaultfinder)
[15:14:49] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update RRLA output stream name [deployment-charts] - https://gerrit.wikimedia.org/r/1135054 (https://phabricator.wikimedia.org/T326179)
[15:15:50] <wikibugs>	 (CR) CI reject: [V:-1] sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[15:16:12] <wikibugs>	 (CR) Ssingh: "Ok sorry, that didn't work. Let's look at it again." [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[15:19:23] <wikibugs>	 (CR) Clément Goubert: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: Clément Goubert)
[15:20:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10722265 (phaultfinder)
[15:21:51] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T391056)', diff saved to https://phabricator.wikimedia.org/P74725 and previous config saved to /var/cache/conftool/dbconfig/20250408-152150-fceratto.json
[15:21:54] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[15:22:05] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1203.eqiad.wmnet with reason: Maintenance
[15:22:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T391056)', diff saved to https://phabricator.wikimedia.org/P74726 and previous config saved to /var/cache/conftool/dbconfig/20250408-152212-fceratto.json
[15:22:20] <wikibugs>	 (PS1) Hnowlan: spec: update tests to account for jobrunner service being removed [puppet] - https://gerrit.wikimedia.org/r/1135056 (https://phabricator.wikimedia.org/T354791)
[15:24:47] <wikibugs>	 (CR) CI reject: [V:-1] spec: update tests to account for jobrunner service being removed [puppet] - https://gerrit.wikimedia.org/r/1135056 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[15:25:21] <wikibugs>	 (CR) Elukey: [C:+1] "Left some questions/comments to better clarify my understanding, but it looks really good, feel free to proceed :)" [software/homer] - https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) (owner: Volans)
[15:26:26] <wikibugs>	 (CR) Elukey: [C:+1] "I trust that it does what you advertised, I don't have a lot of knowledge about sphinx but it looks consistent :)" [software/homer] - https://gerrit.wikimedia.org/r/1134717 (owner: Volans)
[15:27:17] <wikibugs>	 (PS1) Jgiannelos: proton: Bump to latest image [deployment-charts] - https://gerrit.wikimedia.org/r/1135057
[15:28:20] <wikibugs>	 (CR) Clément Goubert: [V:+2] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: Clément Goubert)
[15:28:28] <wikibugs>	 (CR) Kamila Součková: [C:+1] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: Clément Goubert)
[15:28:38] <wikibugs>	 (CR) Hnowlan: [C:+1] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: Clément Goubert)
[15:28:51] <wikibugs>	 (CR) Clément Goubert: [V:+2 C:+2] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: Clément Goubert)
[15:29:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10722312 (phaultfinder)
[15:29:50] <wikibugs>	 (CR) Scott French: [C:+1] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: Clément Goubert)
[15:30:07] <wikibugs>	 (PS2) Hnowlan: spec: update tests to account for jobrunner service being removed [puppet] - https://gerrit.wikimedia.org/r/1135056 (https://phabricator.wikimedia.org/T354791)
[15:30:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10722319 (phaultfinder)
[15:31:31] <wikibugs>	 (CR) Ssingh: [C:+1] spec: update tests to account for jobrunner service being removed [puppet] - https://gerrit.wikimedia.org/r/1135056 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[15:31:32] <wikibugs>	 (CR) Clément Goubert: [C:+1] "Sink it!" [puppet] - https://gerrit.wikimedia.org/r/1135056 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[15:31:47] <wikibugs>	 (CR) Hnowlan: [C:+2] spec: update tests to account for jobrunner service being removed [puppet] - https://gerrit.wikimedia.org/r/1135056 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[15:31:51] <elukey>	 jouncebot: now
[15:31:51] <jouncebot>	 For the next 0 hour(s) and 28 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1500)
[15:32:19] <wikibugs>	 (CR) Jgiannelos: [C:+2] proton: Bump to latest image [deployment-charts] - https://gerrit.wikimedia.org/r/1135057 (owner: Jgiannelos)
[15:32:38] <wikibugs>	 (PS3) Clément Goubert: CampaignEvents: Migrate updateutcts-test2wiki [puppet] - https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867)
[15:32:51] <wikibugs>	 (CR) Ahmon Dancy: [C:+1] "Btw, you can test before this gets merged by running something like `scap sync-world -Dmediawiki_runtime_image:docker-registry.wikimedia.o" [puppet] - https://gerrit.wikimedia.org/r/1134758 (https://phabricator.wikimedia.org/T390225) (owner: Scott French)
[15:33:36] <wikibugs>	 (PS4) Elukey: services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[15:33:47] <wikibugs>	 (Merged) jenkins-bot: proton: Bump to latest image [deployment-charts] - https://gerrit.wikimedia.org/r/1135057 (owner: Jgiannelos)
[15:34:47] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T391056)', diff saved to https://phabricator.wikimedia.org/P74727 and previous config saved to /var/cache/conftool/dbconfig/20250408-153446-fceratto.json
[15:34:52] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[15:35:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:36:23] <wikibugs>	 (CR) Elukey: [C:+2] services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: Aaron Schulz)
[15:37:38] <wikibugs>	 (PS3) Xcollazo: Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556)
[15:37:49] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync
[15:38:09] <wikibugs>	 (CR) CI reject: [V:-1] Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: Xcollazo)
[15:39:11] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync
[15:40:52] <herron>	 !incidents
[15:40:53] <sirenbot>	 6026 (UNACKED)  Host db1246 (paged) - PING  - Packet loss = 100%
[15:40:53] <sirenbot>	 6025 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[15:40:53] <sirenbot>	 6024 (RESOLVED)  [2x] ProbeDown sre (upload-https:443 probes/service eqsin)
[15:41:47] <herron>	 !ack 6026
[15:41:48] <sirenbot>	 6026 (ACKED)  Host db1246 (paged) - PING  - Packet loss = 100%
[15:42:13] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:43:35] <wikibugs>	 (PS1) Ladsgroup: Revert "Temporarily enable mobile sitenotice for fawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135066
[15:43:59] <herron>	 Amir1 marostegui I'll plan to dbctl depool db1246 in a moment unless I hear otherwise
[15:44:11] <Amir1>	 one sec
[15:44:24] <herron>	 Amir1: ok
[15:44:25] <Amir1>	 yes please
[15:44:30] <Amir1>	 please depool
[15:44:30] <herron>	 ok doing
[15:44:41] <Amir1>	 this is the same host that goes down constantly
[15:45:10] <logmsgbot>	 !log herron@cumin1002 dbctl commit (dc=all): 'depooling db1246', diff saved to https://phabricator.wikimedia.org/P74728 and previous config saved to /var/cache/conftool/dbconfig/20250408-154509-herron.json
[15:47:19] <volans>	 herron: is it me or I didn't see the page in here?
[15:48:02] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.hosts.dhcp for host nokiatest2001.codfw.wmnet
[15:48:10] <herron>	 volans: same happened to me, I'm checking on the bot
[15:49:15] <wikibugs>	 (CR) Superpes15: [C:+1] arywiki: enable wgMinervaEnableSiteNotice [mediawiki-config] - https://gerrit.wikimedia.org/r/1135036 (owner: Anzx)
[15:49:23] <marostegui>	 Thanks herron 
[15:49:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P74729 and previous config saved to /var/cache/conftool/dbconfig/20250408-154954-fceratto.json
[15:50:09] <wikibugs>	 ops-eqiad, DC-Ops, cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10722436 (fnegri) > I'm gonna shut down the server tomorrow for about 1 hour, to check if there's any unexpected impact, then take it back online...
[15:54:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10722472 (phaultfinder)
[15:56:06] <wikibugs>	 ops-eqiad, DBA, DC-Ops: db1246 went down - https://phabricator.wikimedia.org/T391372 (Marostegui) NEW
[15:56:13] <wikibugs>	 ops-eqiad, DBA, DC-Ops: db1246 went down - https://phabricator.wikimedia.org/T391372#10722498 (Marostegui) p:Triage→Medium
[15:58:17] <wikibugs>	 (PS1) Marostegui: db1246: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1135070 (https://phabricator.wikimedia.org/T391372)
[15:58:57] <wikibugs>	 (CR) Marostegui: [C:+2] db1246: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1135070 (https://phabricator.wikimedia.org/T391372) (owner: Marostegui)
[16:00:05] <jouncebot>	 jhathaway and rzl: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:17] <wikibugs>	 (CR) Majavah: [C:+2] bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) (owner: Majavah)
[16:03:09] <wikibugs>	 (CR) Volans: "reply inline" [software/homer] - https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: Volans)
[16:04:51] <wikibugs>	 ops-eqiad, DBA, DC-Ops, Patch-For-Review: db1246 went down - https://phabricator.wikimedia.org/T391372#10722543 (Marostegui) File system is corrupted so it was a hard crash (presumably storage?): ` [ 1261.563104] XFS (dm-0): Metadata corruption detected at xfs_agi_verify+0x11a/0x170 [xfs], xfs_ag...
[16:05:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P74730 and previous config saved to /var/cache/conftool/dbconfig/20250408-160501-fceratto.json
[16:05:34] <wikibugs>	 (PS3) Ahmon Dancy: scap.cfg.erb: Allow users in spiderpig-access LDAP group [puppet] - https://gerrit.wikimedia.org/r/1134291 (https://phabricator.wikimedia.org/T383947)
[16:05:49] <wikibugs>	 (PS3) Ahmon Dancy: idp: spiderpig: Add spiderpig-access to required_groups [puppet] - https://gerrit.wikimedia.org/r/1134292 (https://phabricator.wikimedia.org/T383947)
[16:06:27] <wikibugs>	 (CR) Volans: "thanks for the reviews, replies inline" [software/homer] - https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) (owner: Volans)
[16:06:56] <marostegui>	 !incidents
[16:06:56] <sirenbot>	 6026 (ACKED)  Host db1246 (paged) - PING  - Packet loss = 100%
[16:06:57] <sirenbot>	 6025 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[16:06:57] <sirenbot>	 6024 (RESOLVED)  [2x] ProbeDown sre (upload-https:443 probes/service eqsin)
[16:07:15] <marostegui>	 I am going to resolve db1246 because this will take long to fix, so it doesn't keep paging everyday
[16:07:22] <marostegui>	 !resolve 6026
[16:07:23] <sirenbot>	 6026 (RESOLVED)  Host db1246 (paged) - PING  - Packet loss = 100%
[16:07:36] <herron>	 !incidents
[16:07:36] <sirenbot>	 6026 (RESOLVED)  Host db1246 (paged) - PING  - Packet loss = 100%
[16:07:36] <sirenbot>	 6025 (RESOLVED)  HaproxyUnavailable cache_upload global sre (thanos-rule)
[16:07:37] <sirenbot>	 6024 (RESOLVED)  [2x] ProbeDown sre (upload-https:443 probes/service eqsin)
[16:07:51] <herron>	 thanks marostegui 
[16:12:46] <wikibugs>	 (PS1) DDesouza: miscweb(research & design/strategy): bump versions [deployment-charts] - https://gerrit.wikimedia.org/r/1135071 (https://phabricator.wikimedia.org/T344471)
[16:14:00] <Amir1>	 herron: thank you for depooling!
[16:14:10] <herron>	 np!
[16:15:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10722563 (phaultfinder)
[16:16:54] <wikibugs>	 (CR) DDesouza: [C:+2] miscweb(research & design/strategy): bump versions [deployment-charts] - https://gerrit.wikimedia.org/r/1135071 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[16:17:13] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[16:18:39] <wikibugs>	 (Merged) jenkins-bot: miscweb(research & design/strategy): bump versions [deployment-charts] - https://gerrit.wikimedia.org/r/1135071 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[16:20:08] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T391056)', diff saved to https://phabricator.wikimedia.org/P74731 and previous config saved to /var/cache/conftool/dbconfig/20250408-162007-fceratto.json
[16:20:11] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[16:20:23] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1209.eqiad.wmnet with reason: Maintenance
[16:20:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T391056)', diff saved to https://phabricator.wikimedia.org/P74732 and previous config saved to /var/cache/conftool/dbconfig/20250408-162029-fceratto.json
[16:20:50] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[16:21:07] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[16:21:08] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[16:21:28] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[16:21:29] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[16:21:46] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
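The miscweb rollout above runs staging first, then eqiad, then codfw. Each stage is bracketed by a START line and a DONE line. The sketch below measures each stage from those lines, assuming the logmsgbot format shown here. `helmfile_durations` is a hypothetical helper, not a helmfile feature.

```python
import re
from datetime import datetime

# "[HH:MM:SS] <logmsgbot>\t !log user@host helmfile [env] START|DONE target: apply"
HELMFILE_RE = re.compile(
    r"^\[(?P<ts>\d\d:\d\d:\d\d)\].*helmfile \[(?P<env>[^\]]+)\] "
    r"(?P<phase>START|DONE) (?P<target>\S+): apply"
)


def helmfile_durations(lines):
    """Pair START/DONE lines per (environment, target).

    Returns (environment, target, seconds) tuples in completion order.
    Timestamps carry no date, so a stage that crosses midnight is not handled.
    """
    started = {}
    results = []
    for line in lines:
        m = HELMFILE_RE.search(line)
        if m is None:
            continue
        key = (m["env"], m["target"])
        ts = datetime.strptime(m["ts"], "%H:%M:%S")
        if m["phase"] == "START":
            started[key] = ts
        elif key in started:
            seconds = int((ts - started.pop(key)).total_seconds())
            results.append((m["env"], m["target"], seconds))
    return results
```

For the run above, this gives 17 s for staging, 20 s for eqiad and 17 s for codfw.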
[16:22:41] <hnowlan>	 !log running 'ipvsadm --delete-service --tcp-service 10.2.2.26:443 && ipvsadm --delete-service --tcp-service 10.2.2.5:443' on codfw lvs to remove videoscaler and jobrunner services
[16:22:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:23:44] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[16:23:56] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[16:24:07] <wikibugs>	 (PS43) Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231)
[16:24:07] <wikibugs>	 (PS3) Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231)
[16:24:13] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[16:24:14] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[16:24:33] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[16:24:35] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[16:24:39] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[16:24:41] <hnowlan>	 !log running 'ipvsadm --delete-service --tcp-service 10.2.2.26:443 && ipvsadm --delete-service --tcp-service 10.2.2.5:443' on eqiad lvs to remove videoscaler and jobrunner services
[16:24:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:24:51] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[16:25:08] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[16:26:06] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[16:26:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10722637 (phaultfinder)
[16:26:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10722638 (phaultfinder)
[16:28:26] <wikibugs>	 (PS1) Hnowlan: service, conftool: remove videoscaler and jobrunner services [puppet] - https://gerrit.wikimedia.org/r/1135072 (https://phabricator.wikimedia.org/T354791)
[16:29:01] <Amir1>	 jouncebot: nowandnext
[16:29:01] <jouncebot>	 For the next 0 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1600)
[16:29:01] <jouncebot>	 In 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1700)
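jouncebot gives window boundaries as offsets from the moment it replies. The small sketch below converts an offset back to a clock time. `window_boundary` is a hypothetical helper. The bot appears to count whole minutes only, so the result can land slightly before the scheduled boundary. Here it gives 16:59:01, while the listed window starts at 17:00.

```python
import re
from datetime import datetime, timedelta

# jouncebot phrases offsets as "N hour(s) and M minute(s)".
OFFSET_RE = re.compile(r"(?P<h>\d+) hour\(s\) and (?P<m>\d+) minute\(s\)")


def window_boundary(posted_at, message):
    """Return posted_at plus the offset as 'HH:MM:SS', or None if no offset is found."""
    m = OFFSET_RE.search(message)
    if m is None:
        return None
    posted = datetime.strptime(posted_at, "%H:%M:%S")
    boundary = posted + timedelta(hours=int(m["h"]), minutes=int(m["m"]))
    return boundary.strftime("%H:%M:%S")
```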
[16:29:08] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[16:29:15] <Amir1>	 nothing is being merged for puppet, deploying stuff now
[16:29:53] <wikibugs>	 (CR) Ladsgroup: [C:+2] Revert "Temporarily enable mobile sitenotice for fawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135066 (owner: Ladsgroup)
[16:30:24] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135066 (owner: Ladsgroup)
[16:30:39] <wikibugs>	 (Merged) jenkins-bot: Revert "Temporarily enable mobile sitenotice for fawiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135066 (owner: Ladsgroup)
[16:31:04] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1135066|Revert "Temporarily enable mobile sitenotice for fawiki"]]
[16:31:07] <wikibugs>	 (CR) Tiziano Fogli: "I fixed the inline comments and also split this patch into two separate ones:" [puppet] - https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: Tiziano Fogli)
[16:32:10] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T391056)', diff saved to https://phabricator.wikimedia.org/P74733 and previous config saved to /var/cache/conftool/dbconfig/20250408-163210-fceratto.json
[16:32:13] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[16:32:39] <wikibugs>	 (PS1) DDesouza: miscweb(design-strategy): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1135073 (https://phabricator.wikimedia.org/T344471)
[16:33:29] <wikibugs>	 (PS1) Abijeet Patro: ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1135074 (https://phabricator.wikimedia.org/T389176)
[16:33:46] <wikibugs>	 (PS1) Abijeet Patro: ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1135075 (https://phabricator.wikimedia.org/T389176)
[16:34:02] <wikibugs>	 (CR) Kamila Součková: [C:+1] service, conftool: remove videoscaler and jobrunner services [puppet] - https://gerrit.wikimedia.org/r/1135072 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[16:34:12] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1135074 (https://phabricator.wikimedia.org/T389176) (owner: Abijeet Patro)
[16:34:14] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1135075 (https://phabricator.wikimedia.org/T389176) (owner: Abijeet Patro)
[16:35:00] <wikibugs>	 (PS1) Cwhite: statsd: remove ferm rule for statsd port 8125 [puppet] - https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380)
[16:37:08] <wikibugs>	 (CR) CI reject: [V:-1] statsd: remove ferm rule for statsd port 8125 [puppet] - https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: Cwhite)
[16:37:13] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[16:37:59] <wikibugs>	 (PS2) Cwhite: statsd: remove ferm rule for statsd port 8125 [puppet] - https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380)
[16:38:10] <wikibugs>	 (CR) Scott French: "Ah, that's a great idea! Yeah, I'll do that first." [puppet] - https://gerrit.wikimedia.org/r/1134758 (https://phabricator.wikimedia.org/T390225) (owner: Scott French)
[16:38:11] <wikibugs>	 (CR) DDesouza: [C:+2] miscweb(design-strategy): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1135073 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[16:38:12] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1135066|Revert "Temporarily enable mobile sitenotice for fawiki"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:38:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10722678 (phaultfinder)
[16:39:47] <wikibugs>	 (Merged) jenkins-bot: miscweb(design-strategy): bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1135073 (https://phabricator.wikimedia.org/T344471) (owner: DDesouza)
[16:40:27] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[16:40:29] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[16:40:30] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[16:40:32] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[16:40:33] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[16:40:36] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[16:40:46] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[16:40:59] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[16:41:01] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[16:41:17] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[16:41:18] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[16:41:37] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[16:41:48] <wikibugs>	 (CR) Scott French: [C:+1] service, conftool: remove videoscaler and jobrunner services [puppet] - https://gerrit.wikimedia.org/r/1135072 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[16:44:57] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply
[16:45:11] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup: Continuing with sync
[16:45:52] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply
[16:47:13] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:47:17] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P74734 and previous config saved to /var/cache/conftool/dbconfig/20250408-164717-fceratto.json
[16:50:19] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[16:50:26] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[16:50:37] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[16:50:42] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[16:50:47] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[16:50:57] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[16:51:54] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135066|Revert "Temporarily enable mobile sitenotice for fawiki"]] (duration: 20m 49s)
[16:55:27] <wikibugs>	 (CR) Slyngshede: [C:+2] idp: spiderpig: Add spiderpig-access to required_groups [puppet] - https://gerrit.wikimedia.org/r/1134292 (https://phabricator.wikimedia.org/T383947) (owner: Ahmon Dancy)
[16:56:55] <wikibugs>	 (PS3) Ssingh: sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[16:57:04] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[16:59:29] <wikibugs>	 (CR) CI reject: [V:-1] sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[17:00:05] <jouncebot>	 swfrench-wmf: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1700).
[17:02:21] <swfrench-wmf>	 o/
[17:02:25] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P74735 and previous config saved to /var/cache/conftool/dbconfig/20250408-170224-fceratto.json
[17:05:27] <swfrench-wmf>	 just wrapping up a couple of checks. should be starting in the next 10m or so.
[17:14:24] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Pilot stop-before-sync scap run using PHP 8.1 container image for maintenance scripts - T390225
[17:14:28] <stashbot>	 T390225: Migrate scap's maintenance script invocations to PHP 8.1 - https://phabricator.wikimedia.org/T390225
[17:15:25] <logmsgbot>	 !log swfrench@deploy1003 Stopping before sync operations
[17:17:01] <wikibugs>	 (PS4) Ssingh: sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[17:17:04] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[17:17:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T391056)', diff saved to https://phabricator.wikimedia.org/P74736 and previous config saved to /var/cache/conftool/dbconfig/20250408-171731-fceratto.json
[17:17:35] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
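db1209 was depooled at 16:20:30. It then returned through four "Repooling after maintenance" commits, at 16:32:10, 16:47:17, 17:02:25 and 17:17:31. That pattern suggests a staged repool at roughly 15-minute intervals. The weight applied at each stage is not visible in this log. The sketch below recovers those intervals for one host, assuming the commit message format shown here. `repool_intervals` is a hypothetical helper.

```python
import re
from datetime import datetime

# Timestamp and host from "'Repooling after maintenance dbNNNN[ (Txxxx)]'" commits.
REPOOL_RE = re.compile(r"^\[(\d\d:\d\d:\d\d)\].*'Repooling after maintenance (db\d+)")


def repool_intervals(lines, host):
    """Seconds between successive repool commits for one host (same day assumed)."""
    stamps = []
    for line in lines:
        m = REPOOL_RE.search(line)
        if m is not None and m.group(2) == host:
            stamps.append(datetime.strptime(m.group(1), "%H:%M:%S"))
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(stamps, stamps[1:])]
```

For db1209 this gives gaps of 907, 908 and 906 seconds.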
[17:17:47] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1211.eqiad.wmnet with reason: Maintenance
[17:17:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T391056)', diff saved to https://phabricator.wikimedia.org/P74737 and previous config saved to /var/cache/conftool/dbconfig/20250408-171753-fceratto.json
[17:19:34] <wikibugs>	 (CR) CI reject: [V:-1] sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[17:20:21] <wikibugs>	 (CR) Dzahn: "The best reviewers would be the people involved in creating this group and subscribed to the linked tickets." [puppet] - https://gerrit.wikimedia.org/r/1134291 (https://phabricator.wikimedia.org/T383947) (owner: Ahmon Dancy)
[17:20:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10722768 (phaultfinder)
[17:21:19] <wikibugs>	 (PS5) Ssingh: sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[17:22:20] <logmsgbot>	 !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host nokiatest2001.codfw.wmnet
[17:23:28] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:23:28] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:23:37] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] "Assuming the group exists, LGTM" [puppet] - https://gerrit.wikimedia.org/r/1134291 (https://phabricator.wikimedia.org/T383947) (owner: Ahmon Dancy)
[17:23:51] <wikibugs>	 (CR) CI reject: [V:-1] sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[17:24:28] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:25:18] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:26:18] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53800 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:26:18] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
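The three mailman checks on lists1004 went CRITICAL with socket timeouts and recovered within a few minutes. The sketch below pairs icinga-wm PROBLEM and RECOVERY lines per (check, host) to measure how long each check stayed in the problem state. It assumes the message format shown here. `alert_windows` is a hypothetical helper, not an Icinga API.

```python
import re
from datetime import datetime

# "[HH:MM:SS] <icinga-wm>\t PROBLEM - <check> on <host> is ..." (RECOVERY likewise)
ICINGA_RE = re.compile(
    r"^\[(?P<ts>\d\d:\d\d:\d\d)\].*?(?P<state>PROBLEM|RECOVERY) - "
    r"(?P<check>.+?) on (?P<host>\S+) is "
)


def alert_windows(lines):
    """Map (check, host) to seconds spent between PROBLEM and the next RECOVERY."""
    opened = {}
    windows = {}
    for line in lines:
        m = ICINGA_RE.search(line)
        if m is None:
            continue
        key = (m["check"], m["host"])
        ts = datetime.strptime(m["ts"], "%H:%M:%S")
        if m["state"] == "PROBLEM":
            opened.setdefault(key, ts)  # keep the first PROBLEM if it repeats
        elif key in opened:
            windows[key] = int((ts - opened.pop(key)).total_seconds())
    return windows
```

On the lines above, list info and archives each stayed in the problem state for 170 s, and the ssl expiry check for 50 s.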
[17:26:56] <wikibugs>	 ops-ulsfo, SRE, DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10722814 (RobH) Summary of updates:  * Engineer went to pickup the shipment from a FedEx point and was told it was dispatched to the office. * Engineer provided me with the FedEx tracking numbers. * IT...
[17:29:13] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply
[17:29:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T391056)', diff saved to https://phabricator.wikimedia.org/P74738 and previous config saved to /var/cache/conftool/dbconfig/20250408-172929-fceratto.json
[17:29:33] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[17:29:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[17:30:36] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[17:30:45] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply
[17:32:02] <logmsgbot>	 !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply
[17:35:02] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Pilot scap run using PHP 8.1 container image for maintenance scripts - T390225
[17:35:06] <stashbot>	 T390225: Migrate scap's maintenance script invocations to PHP 8.1 - https://phabricator.wikimedia.org/T390225
[17:37:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10722857 (phaultfinder)
[17:38:22] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: Pilot scap run using PHP 8.1 container image for maintenance scripts - T390225 (duration: 03m 19s)
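When sync-world finishes, scap prints its own duration. Comparing that figure with the bot timestamps is a quick sanity check. For the backport, 16:31:04 to 16:51:54 is 20m 50s, against a reported 20m 49s. For the pilot run, 17:35:02 to 17:38:22 is 3m 20s, against a reported 03m 19s. The one-second gap presumably reflects when the bot relays each message. This minimal sketch assumes only the `duration: NNm NNs` form seen here. `reported_seconds` and `elapsed_seconds` are hypothetical helpers.

```python
import re
from datetime import datetime, timedelta

# Only the "duration: NNm NNs" form seen in this log is handled.
DURATION_RE = re.compile(r"duration: (\d+)m (\d+)s")


def reported_seconds(line):
    """Seconds from scap's own 'duration: NNm NNs' suffix, or None."""
    m = DURATION_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)) * 60 + int(m.group(2))


def elapsed_seconds(start, end):
    """Seconds between two HH:MM:SS bot timestamps, allowing one midnight rollover."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        delta += timedelta(days=1)
    return int(delta.total_seconds())
```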
[17:39:24] <dancy>	 jouncebot nowandnext
[17:39:24] <jouncebot>	 For the next 0 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1700)
[17:39:24] <jouncebot>	 In 0 hour(s) and 20 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1800)
[17:42:27] <wikibugs>	 (PS1) BCornwall: Remove varnish-staging, add varnish6 components [puppet] - https://gerrit.wikimedia.org/r/1135080 (https://phabricator.wikimedia.org/T391334)
[17:44:10] <wikibugs>	 (CR) Cwhite: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: Cwhite)
[17:44:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P74739 and previous config saved to /var/cache/conftool/dbconfig/20250408-174436-fceratto.json
[17:44:54] <wikibugs>	 (CR) BCornwall: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5234/console" [puppet] - https://gerrit.wikimedia.org/r/1135080 (https://phabricator.wikimedia.org/T391334) (owner: BCornwall)
[17:44:59] <brennen>	 dancy: fwiw i was planning to roll train 10 or 15 minutes after the hour.  need to stretch my legs a bit.
[17:45:25] <dancy>	 ack.
[17:47:39] <swfrench-wmf>	 FYI, I'm out of the way for today
[17:50:27] <wikibugs>	 (PS6) Ssingh: sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[17:51:21] <wikibugs>	 (CR) Ssingh: [C:+1] Remove varnish-staging, add varnish6 components [puppet] - https://gerrit.wikimedia.org/r/1135080 (https://phabricator.wikimedia.org/T391334) (owner: BCornwall)
[17:53:00] <wikibugs>	 (CR) CI reject: [V:-1] sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[17:56:54] <wikibugs>	 (CR) BCornwall: [V:+1 C:+2] Remove varnish-staging, add varnish6 components [puppet] - https://gerrit.wikimedia.org/r/1135080 (https://phabricator.wikimedia.org/T391334) (owner: BCornwall)
[17:59:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P74740 and previous config saved to /var/cache/conftool/dbconfig/20250408-175944-fceratto.json
[17:59:47] <wikibugs>	 Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10722979 (bking) @jhathaway in addition to site.pp (which everyone uses),  we are also using it to add row/rack awareness to our Elastic ([[ https://phabricator.wikimedia...
[18:00:05] <jouncebot>	 brennen and dancy: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1800).
[18:03:51] <brett>	 !log import varnish 6.0.13-1wm1 to component/varnish6 bullseye-wikimedia (T391334)
[18:03:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:03:54] <stashbot>	 T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334
[18:06:02] <brennen>	 o/
[18:08:11] <brennen>	 !log 1.44.0-wmf.24 train status: no current blockers, moving to group0
[18:08:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:14:51] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T391056)', diff saved to https://phabricator.wikimedia.org/P74741 and previous config saved to /var/cache/conftool/dbconfig/20250408-181450-fceratto.json
[18:14:54] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[18:15:06] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1214.eqiad.wmnet with reason: Maintenance
[18:15:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T391056)', diff saved to https://phabricator.wikimedia.org/P74742 and previous config saved to /var/cache/conftool/dbconfig/20250408-181513-fceratto.json
[18:16:21] <wikibugs>	 (PS1) TrainBranchBot: group0 to 1.44.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1135083 (https://phabricator.wikimedia.org/T386219)
[18:16:22] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group0 to 1.44.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1135083 (https://phabricator.wikimedia.org/T386219) (owner: TrainBranchBot)
[18:17:10] <wikibugs>	 (Merged) jenkins-bot: group0 to 1.44.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1135083 (https://phabricator.wikimedia.org/T386219) (owner: TrainBranchBot)
[18:22:14] <wikibugs>	 (PS1) Jforrester: Move to new async Parsoid fragment provision [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1135084 (https://phabricator.wikimedia.org/T373253)
[18:23:47] <wikibugs>	 SRE, DBA, vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#10723150 (Ladsgroup) We need to do some more work on this. I'll get there.
[18:26:29] <brennen>	 hrm: Check 'check_testservers_baremetal-1_of_1' failed: Sending to 4 hosts...
[18:26:42] <brennen>	 having a look at mwdebug1001
[18:26:55] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T391056)', diff saved to https://phabricator.wikimedia.org/P74743 and previous config saved to /var/cache/conftool/dbconfig/20250408-182654-fceratto.json
[18:26:58] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[18:27:54] <brennen>	 succeeded on a retry
[18:28:16] <brennen>	 (and was unable to reproduce any errors)
[18:28:37] <wikibugs>	 (CR) Wargo: "Anyway, this change still can be accepted. It will work both if we modify portals or not. It fixes the main issue. And to prevent situatio" [mediawiki-config] - https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: Wargo)
[18:29:34] <dancy>	 brennen: Was it a 500 error?
[18:29:38] <brennen>	 yeah
[18:30:06] <dancy>	 Sadly https://phabricator.wikimedia.org/T380958
[18:30:22] <brennen>	 https://phabricator.wikimedia.org/P74744
[18:30:30] <brennen>	 ah, right.
[18:34:23] <logmsgbot>	 !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.24  refs T386219
[18:34:26] <stashbot>	 T386219: 1.44.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T386219
[18:42:02] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P74745 and previous config saved to /var/cache/conftool/dbconfig/20250408-184201-fceratto.json
[18:44:13] <logmsgbot>	 !log dancy@deploy1003 Installing scap version "4.152.0" for 2 host(s)
[18:46:02] <logmsgbot>	 !log dancy@deploy1003 Installation of scap version "4.152.0" completed for 2 hosts
[18:47:32] <wikibugs>	 SRE-OnFire, Release-Engineering-Team, Scap, serviceops, and 2 others: Should scap be able to update helmfile-defaults when -Dbuild_mw_container_image:False ? - https://phabricator.wikimedia.org/T390531#10723243 (dancy) scap 4.152.0 has been deployed to address the `update_helmfile_files()` issue.
[18:55:29] <wikibugs>	 (PS1) AOkoth: releases: add force puppet 7 hiera [puppet] - https://gerrit.wikimedia.org/r/1135089 (https://phabricator.wikimedia.org/T384595)
[18:57:08] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P74746 and previous config saved to /var/cache/conftool/dbconfig/20250408-185708-fceratto.json
[19:12:15] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T391056)', diff saved to https://phabricator.wikimedia.org/P74747 and previous config saved to /var/cache/conftool/dbconfig/20250408-191215-fceratto.json
[19:12:18] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[19:12:30] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1216.eqiad.wmnet with reason: Maintenance
[19:21:40] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1226.eqiad.wmnet with reason: Maintenance
[19:21:47] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T391056)', diff saved to https://phabricator.wikimedia.org/P74748 and previous config saved to /var/cache/conftool/dbconfig/20250408-192147-fceratto.json
[19:21:50] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[19:27:00] <wikibugs>	 (CR) Dzahn: [C:-1] "You don't need it because it's already in hieradata/role/common/insetup/collaboration_services_nftables.yaml on the role level" [puppet] - https://gerrit.wikimedia.org/r/1135089 (https://phabricator.wikimedia.org/T384595) (owner: AOkoth)
[19:31:38] <wikibugs>	 (CR) Eamedina: [C:+1] ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1135075 (https://phabricator.wikimedia.org/T389176) (owner: Abijeet Patro)
[19:31:46] <wikibugs>	 (CR) Eamedina: [C:+1] ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1135074 (https://phabricator.wikimedia.org/T389176) (owner: Abijeet Patro)
[19:33:10] <wikibugs>	 (PS3) Bking: cirrussearch: Add regex data for cirrussearch hosts [puppet] - https://gerrit.wikimedia.org/r/1134765 (https://phabricator.wikimedia.org/T388610)
[19:33:24] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T391056)', diff saved to https://phabricator.wikimedia.org/P74749 and previous config saved to /var/cache/conftool/dbconfig/20250408-193324-fceratto.json
[19:33:27] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[19:33:34] <wikibugs>	 (CR) CI reject: [V:-1] cirrussearch: Add regex data for cirrussearch hosts [puppet] - https://gerrit.wikimedia.org/r/1134765 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[19:33:58] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.reimage for host releases2003.codfw.wmnet with OS bookworm
[19:35:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:36:06] <wikibugs>	 (PS4) Bking: cirrussearch: Add regex data for cirrussearch hosts [puppet] - https://gerrit.wikimedia.org/r/1134765 (https://phabricator.wikimedia.org/T388610)
[19:42:13] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:44:12] <wikibugs>	 (CR) Bking: cirrussearch: Add regex data for cirrussearch hosts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1134765 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[19:45:44] <wikibugs>	 (CR) Bking: [C:+2] cirrussearch: Add regex data for cirrussearch hosts [puppet] - https://gerrit.wikimedia.org/r/1134765 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[19:48:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P74750 and previous config saved to /var/cache/conftool/dbconfig/20250408-194831-fceratto.json
[19:53:15] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on releases2003.codfw.wmnet with reason: host reimage
[19:56:09] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases2003.codfw.wmnet with reason: host reimage
[19:57:04] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Wikimedia-Incident: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10723434 (Quiddity) Here's a representative example of 2 emails that I noticed are missing from my inbox, but included in the [[https://list...
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T2000).
[20:00:05] <jouncebot>	 anzx and abijeet: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:30] <abijeet>	 hello o/
[20:03:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P74751 and previous config saved to /var/cache/conftool/dbconfig/20250408-200338-fceratto.json
[20:03:53] <abijeet>	 hello, is anyone around to help with the deployment?
[20:06:24] <Amir1>	 what's up
[20:06:34] <Amir1>	 I can take care of it
[20:07:35] <wikibugs>	 (CR) Ladsgroup: [C:+2] ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1135074 (https://phabricator.wikimedia.org/T389176) (owner: Abijeet Patro)
[20:07:39] <wikibugs>	 (CR) Ladsgroup: [C:+2] ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1135075 (https://phabricator.wikimedia.org/T389176) (owner: Abijeet Patro)
[20:09:46] <wikibugs>	 (Merged) jenkins-bot: ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1135074 (https://phabricator.wikimedia.org/T389176) (owner: Abijeet Patro)
[20:09:48] <wikibugs>	 (Merged) jenkins-bot: ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1135075 (https://phabricator.wikimedia.org/T389176) (owner: Abijeet Patro)
[20:09:51] <Amir1>	 that was fast
[20:10:09] <abijeet>	 woah
[20:10:15] <abijeet>	 thanks Amir1 
[20:10:36] <Amir1>	 We skip browser tests in branches I think
[20:11:22] <abijeet>	 sorry, I did not understand. I can still verify it on testservers with the wmf.23 branch right?
[20:11:44] <Amir1>	 yeah
[20:11:49] <Amir1>	 I meant CI tests
[20:12:02] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host releases2003.codfw.wmnet with OS bookworm
[20:12:08] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1135074|ArticleFooterEntrypointCard: Fix display of entrypoint (T389176)]], [[gerrit:1135075|ArticleFooterEntrypointCard: Fix display of entrypoint (T389176)]]
[20:12:10] <stashbot>	 T389176: Re-enable footer entry point to MinT for Wiki Readers - https://phabricator.wikimedia.org/T389176
[20:12:32] <abijeet>	 ah understood
[20:14:37] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10723507 (phaultfinder)
[20:16:09] <bd808>	 Amir1, abijeet: That's "success caching" at work -- https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/KTP34HIR5D66QLGHC3ZAIZKQWE46O5F4/
[20:17:13] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[20:17:19] <logmsgbot>	 !log ladsgroup@deploy1003 abi, ladsgroup: Backport for [[gerrit:1135074|ArticleFooterEntrypointCard: Fix display of entrypoint (T389176)]], [[gerrit:1135075|ArticleFooterEntrypointCard: Fix display of entrypoint (T389176)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:17:21] <stashbot>	 T389176: Re-enable footer entry point to MinT for Wiki Readers - https://phabricator.wikimedia.org/T389176
[20:17:29] <Amir1>	 abijeet: it's in test servers
[20:17:31] <abijeet>	 testing
[20:18:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T391056)', diff saved to https://phabricator.wikimedia.org/P74752 and previous config saved to /var/cache/conftool/dbconfig/20250408-201845-fceratto.json
[20:18:50] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[20:19:01] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1255.eqiad.wmnet with reason: Maintenance
[20:19:42] <abijeet>	 Amir1, looks good. 
[20:19:47] <logmsgbot>	 !log ladsgroup@deploy1003 abi, ladsgroup: Continuing with sync
[20:21:48] <Amir1>	 I'm not seeing the second person who scheduled patches
[20:21:51] <abijeet>	 bd808, that's a big QOL improvement. Thanks
[20:22:07] <brett>	 !log import libvmod-re2 1.5.3-4 to component/varnish6 bullseye-wikimedia (T391334)
[20:22:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:22:10] <stashbot>	 T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334
[20:22:24] <dancy>	 mutante: Would you be willing to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134291 now that Alexandros has approved?
[20:24:39] <jinxer-wm>	 FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[20:25:49] <brett>	 !log import varnishkafka 1.1.0-4 to component/varnish6 bullseye-wikimedia (T391334)
[20:25:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:25] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135074|ArticleFooterEntrypointCard: Fix display of entrypoint (T389176)]], [[gerrit:1135075|ArticleFooterEntrypointCard: Fix display of entrypoint (T389176)]] (duration: 14m 16s)
[20:26:27] <stashbot>	 T389176: Re-enable footer entry point to MinT for Wiki Readers - https://phabricator.wikimedia.org/T389176
[20:27:35] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1256.eqiad.wmnet with reason: Maintenance
[20:27:54] <abijeet>	 Amir1, thanks for your help!
[20:28:01] <Amir1>	 \o/
[20:31:52] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 went down - https://phabricator.wikimedia.org/T391372#10723635 (Jclark-ctr) a:VRiley-WMF
[20:34:29] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Wikimedia-Incident: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10723653 (bd808) >>! In T391330#10720826, @Jelto wrote: > ` > Apr 07 09:06:41 lists1004 mailman3[2696297]: (pymysql.err.OperationalError) (1...
[20:35:26] <wikibugs>	 (PS2) Bking: search: allow any cirrussearch host to join cluster [puppet] - https://gerrit.wikimedia.org/r/1134764 (owner: Ryan Kemper)
[20:35:30] <wikibugs>	 (CR) Bking: [C:+1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1134764 (owner: Ryan Kemper)
[20:35:43] <wikibugs>	 (PS3) Bking: search: allow any cirrussearch host to join cluster [puppet] - https://gerrit.wikimedia.org/r/1134764 (owner: Ryan Kemper)
[20:35:46] <wikibugs>	 (CR) Bking: [C:+1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1134764 (owner: Ryan Kemper)
[20:36:12] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1257.eqiad.wmnet with reason: Maintenance
[20:36:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1257 (T391056)', diff saved to https://phabricator.wikimedia.org/P74753 and previous config saved to /var/cache/conftool/dbconfig/20250408-203618-fceratto.json
[20:36:21] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[20:37:07] <brett>	 !log import varnish-modules 0.15.0-3 to component/varnish6 bullseye-wikimedia (T391334)
[20:37:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:09] <stashbot>	 T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334
[20:37:13] <jinxer-wm>	 FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing  - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[20:39:40] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10723669 (phaultfinder)
[20:39:51] <wikibugs>	 (PS1) Jforrester: [BETA CLUSTER] Decommission Beta Wikifunctions [mediawiki-config] - https://gerrit.wikimedia.org/r/1135100 (https://phabricator.wikimedia.org/T362200)
[20:44:13] <wikibugs>	 (CR) Bking: [C:+2] search: allow any cirrussearch host to join cluster [puppet] - https://gerrit.wikimedia.org/r/1134764 (owner: Ryan Kemper)
[20:44:44] <wikibugs>	 (PS1) Ladsgroup: mariadb: Add cn_notice_projects to the table catalog [puppet] - https://gerrit.wikimedia.org/r/1135101 (https://phabricator.wikimedia.org/T363581)
[20:46:15] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1257 (T391056)', diff saved to https://phabricator.wikimedia.org/P74754 and previous config saved to /var/cache/conftool/dbconfig/20250408-204615-fceratto.json
[20:46:18] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[20:47:13] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:49:44] <wikibugs>	 (PS2) Ladsgroup: mariadb: Add cn_notice_projects to the table catalog [puppet] - https://gerrit.wikimedia.org/r/1135101 (https://phabricator.wikimedia.org/T363581)
[20:49:50] <wikibugs>	 (CR) Ladsgroup: [V:+2 C:+2] mariadb: Add cn_notice_projects to the table catalog [puppet] - https://gerrit.wikimedia.org/r/1135101 (https://phabricator.wikimedia.org/T363581) (owner: Ladsgroup)
[20:51:33] <brett>	 !log import libvmod-querysort 0.4-2 to component/varnish6 bullseye-wikimedia (T391334)
[20:51:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:36] <stashbot>	 T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334
[20:53:36] <wikibugs>	 (PS3) Bking: cirrussearch: Add row A hosts to new cirrussearch role [puppet] - https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610)
[20:53:39] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[20:54:20] <wikibugs>	 (PS1) Ladsgroup: openstack: wikireplica_dns: Add termstore aliases for s8 [puppet] - https://gerrit.wikimedia.org/r/1135107 (https://phabricator.wikimedia.org/T390954)
[20:55:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10723743 (phaultfinder)
[20:55:55] <wikibugs>	 (CR) Bking: [C:-1] "The role for cirrussearch is incorrect. Fixing..." [puppet] - https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:00:05] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T2100)
[21:01:08] <wikibugs>	 (PS1) Ladsgroup: LoginSignupSpecialPage: Get a login token before persisting the session [core] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1135109 (https://phabricator.wikimedia.org/T390514)
[21:01:20] <wikibugs>	 (PS1) Ladsgroup: LoginSignupSpecialPage: Get a login token before persisting the session [core] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1135110 (https://phabricator.wikimedia.org/T390514)
[21:01:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1257', diff saved to https://phabricator.wikimedia.org/P74755 and previous config saved to /var/cache/conftool/dbconfig/20250408-210121-fceratto.json
[21:01:25] <Amir1>	 jouncebot: nowandnext
[21:01:25] <jouncebot>	 For the next 0 hour(s) and 58 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T2100)
[21:01:25] <jouncebot>	 In 8 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T0600)
[21:01:35] <wikibugs>	 (CR) Ladsgroup: [C:+2] LoginSignupSpecialPage: Get a login token before persisting the session [core] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1135109 (https://phabricator.wikimedia.org/T390514) (owner: Ladsgroup)
[21:01:35] <wikibugs>	 (PS4) Bking: cirrussearch: Add row A hosts to new cirrussearch role [puppet] - https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610)
[21:01:38] <wikibugs>	 (CR) Ladsgroup: [C:+2] LoginSignupSpecialPage: Get a login token before persisting the session [core] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1135110 (https://phabricator.wikimedia.org/T390514) (owner: Ladsgroup)
[21:02:04] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[21:02:50] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:03:24] <wikibugs>	 (PS7) Andrea Denisse: sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[21:03:35] <wikibugs>	 SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10723774 (Jdforrester-WMF) >>! In T355914#10717142, @Ladsgroup wrote: > It'd be nice to add this to next week's tech news. Worth mentioning this has bee...
[21:05:59] <wikibugs>	 (CR) CI reject: [V:-1] sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[21:06:42] <wikibugs>	 (PS5) Bking: cirrussearch: Add row A hosts to new cirrussearch role [puppet] - https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610)
[21:06:45] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:08:00] <James_F>	 Amir1: I was going to sling out https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1135100 but I don't want to step on your deployment toes. ;-)
[21:08:57] <Amir1>	 I can deploy it
[21:09:03] <James_F>	 <3
[21:09:06] <Amir1>	 the backport patches take a while
[21:09:11] <James_F>	 Yeah.
[21:09:13] <wikibugs>	 (CR) Ladsgroup: [C:+2] [BETA CLUSTER] Decommission Beta Wikifunctions [mediawiki-config] - https://gerrit.wikimedia.org/r/1135100 (https://phabricator.wikimedia.org/T362200) (owner: Jforrester)
[21:09:17] <James_F>	 Whee.
[21:09:28] <James_F>	 Now I need to drop the servers from horizon.
[21:10:17] <wikibugs>	 (Merged) jenkins-bot: [BETA CLUSTER] Decommission Beta Wikifunctions [mediawiki-config] - https://gerrit.wikimedia.org/r/1135100 (https://phabricator.wikimedia.org/T362200) (owner: Jforrester)
[21:10:24] <wikibugs>	 (PS8) Andrea Denisse: sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[21:12:23] <wikibugs>	 (Merged) jenkins-bot: LoginSignupSpecialPage: Get a login token before persisting the session [core] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1135109 (https://phabricator.wikimedia.org/T390514) (owner: Ladsgroup)
[21:12:57] <wikibugs>	 (CR) CI reject: [V:-1] sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[21:14:03] <wikibugs>	 (Merged) jenkins-bot: LoginSignupSpecialPage: Get a login token before persisting the session [core] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1135110 (https://phabricator.wikimedia.org/T390514) (owner: Ladsgroup)
[21:16:29] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1257', diff saved to https://phabricator.wikimedia.org/P74756 and previous config saved to /var/cache/conftool/dbconfig/20250408-211629-fceratto.json
[21:18:04] <Amir1>	 urandom: deploying right now
[21:18:21] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1135110|LoginSignupSpecialPage: Get a login token before persisting the session (T390514)]], [[gerrit:1135109|LoginSignupSpecialPage: Get a login token before persisting the session (T390514)]], [[gerrit:1135100|[BETA CLUSTER] Decommission Beta Wikifunctions (T362200 T363397 T368161 T373464 T389274)]]
[21:18:31] <stashbot>	 T362200: [QA task] wikifunction betacluster failures  - https://phabricator.wikimedia.org/T362200
[21:18:32] <stashbot>	 T363397: wasmedge CLI Resource Limits Break Beta Cluster - https://phabricator.wikimedia.org/T363397
[21:18:32] <stashbot>	 T368161: Creation of object fails in betacluster with Unspecified error - https://phabricator.wikimedia.org/T368161
[21:18:33] <stashbot>	 T373464: Port routing on deployment-docker-wikifunctions01 port routing (?) seems broken, making Beta Cluster Wikifunctions orchestrator unable to talk to its evaluator - https://phabricator.wikimedia.org/T373464
[21:18:33] <stashbot>	 T389274: "Exec error in changeprop" for wikifunctions.beta.wmflabs.org - https://phabricator.wikimedia.org/T389274
[21:19:24] <brett>	 !log import libvmod-netmapper 1.9-4 to component/varnish6 bullseye-wikimedia (T391334)
[21:19:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:19:26] <stashbot>	 T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334
[21:19:53] <wikibugs>	 (PS1) Dzahn: cloud: re-add gitlab runner docker_gc Hiera settings in cloud.yaml [puppet] - https://gerrit.wikimedia.org/r/1135114 (https://phabricator.wikimedia.org/T390948)
[21:19:55] <wikibugs>	 (PS1) JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - https://gerrit.wikimedia.org/r/1135115
[21:20:05] <urandom>	 Amir1: 👍
[21:21:31] <wikibugs>	 (CR) Dzahn: [C:+2] "partial revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135114" [puppet] - https://gerrit.wikimedia.org/r/1133992 (https://phabricator.wikimedia.org/T390948) (owner: Dzahn)
[21:21:31] <wikibugs>	 (PS9) Andrea Denisse: sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[21:21:37] <wikibugs>	 (PS2) JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - https://gerrit.wikimedia.org/r/1135115
[21:21:42] <wikibugs>	 (CR) Dzahn: [C:+2] cloud: re-add gitlab runner docker_gc Hiera settings in cloud.yaml [puppet] - https://gerrit.wikimedia.org/r/1135114 (https://phabricator.wikimedia.org/T390948) (owner: Dzahn)
[21:22:04] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[21:22:36] <wikibugs>	 (CR) Aleksandar Mastilovic: Absent systemd timers to stop attempting to generate enterprise HTML dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: Xcollazo)
[21:23:50] <wikibugs>	 (CR) Aleksandar Mastilovic: Absent systemd timers to stop attempting to generate enterprise HTML dumps (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: Xcollazo)
[21:24:03] <wikibugs>	 (CR) CI reject: [V:-1] sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[21:25:39] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup, jforrester: Backport for [[gerrit:1135110|LoginSignupSpecialPage: Get a login token before persisting the session (T390514)]], [[gerrit:1135109|LoginSignupSpecialPage: Get a login token before persisting the session (T390514)]], [[gerrit:1135100|[BETA CLUSTER] Decommission Beta Wikifunctions (T362200 T363397 T368161 T373464 T389274)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:25:47] <stashbot>	 T362200: [QA task] wikifunction betacluster failures  - https://phabricator.wikimedia.org/T362200
[21:25:47] <stashbot>	 T363397: wasmedge CLI Resource Limits Break Beta Cluster - https://phabricator.wikimedia.org/T363397
[21:25:47] <stashbot>	 T368161: Creation of object fails in betacluster with Unspecified error - https://phabricator.wikimedia.org/T368161
[21:25:48] <stashbot>	 T373464: Port routing on deployment-docker-wikifunctions01 port routing (?) seems broken, making Beta Cluster Wikifunctions orchestrator unable to talk to its evaluator - https://phabricator.wikimedia.org/T373464
[21:25:48] <stashbot>	 T389274: "Exec error in changeprop" for wikifunctions.beta.wmflabs.org - https://phabricator.wikimedia.org/T389274
[21:26:58] <wikibugs>	 (PS3) JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - https://gerrit.wikimedia.org/r/1135115
[21:27:20] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup, jforrester: Continuing with sync
[21:27:35] <wikibugs>	 (PS10) Andrea Denisse: sre: Add LibericaEtcdErrors alert [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[21:27:37] <wikibugs>	 (PS1) Andrew Bogott: Add cloudcontrol1011 as an eqiad1 cloudcontrol node [puppet] - https://gerrit.wikimedia.org/r/1135117 (https://phabricator.wikimedia.org/T391300)
[21:27:39] <wikibugs>	 (PS1) Andrew Bogott: Replace cloudcontrol1005 with cloudcontrol1011 [puppet] - https://gerrit.wikimedia.org/r/1135118 (https://phabricator.wikimedia.org/T391300)
[21:28:35] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135117 (https://phabricator.wikimedia.org/T391300) (owner: Andrew Bogott)
[21:29:33] <jinxer-wm>	 FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[21:29:59] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[21:30:36] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[21:30:40] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[21:31:27] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Add cloudcontrol1011 as an eqiad1 cloudcontrol node [puppet] - https://gerrit.wikimedia.org/r/1135117 (https://phabricator.wikimedia.org/T391300) (owner: Andrew Bogott)
[21:31:36] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1257 (T391056)', diff saved to https://phabricator.wikimedia.org/P74757 and previous config saved to /var/cache/conftool/dbconfig/20250408-213136-fceratto.json
[21:31:39] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[21:31:41] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[21:31:52] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[21:32:27] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply
[21:33:46] <wikibugs>	 (CR) Ryan Kemper: [C:+1] cirrussearch: Add row A hosts to new cirrussearch role [puppet] - https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:33:49] <wikibugs>	 (CR) Bking: [C:+2] "I fixed the role designation...merging" [puppet] - https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:34:03] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135110|LoginSignupSpecialPage: Get a login token before persisting the session (T390514)]], [[gerrit:1135109|LoginSignupSpecialPage: Get a login token before persisting the session (T390514)]], [[gerrit:1135100|[BETA CLUSTER] Decommission Beta Wikifunctions (T362200 T363397 T368161 T373464 T389274)]] (duration: 15m 42s)
[21:34:10] <stashbot>	 T362200: [QA task] wikifunction betacluster failures  - https://phabricator.wikimedia.org/T362200
[21:34:11] <stashbot>	 T363397: wasmedge CLI Resource Limits Break Beta Cluster - https://phabricator.wikimedia.org/T363397
[21:34:11] <stashbot>	 T368161: Creation of object fails in betacluster with Unspecified error - https://phabricator.wikimedia.org/T368161
[21:34:11] <stashbot>	 T373464: Port routing on deployment-docker-wikifunctions01 port routing (?) seems broken, making Beta Cluster Wikifunctions orchestrator unable to talk to its evaluator - https://phabricator.wikimedia.org/T373464
[21:34:11] <stashbot>	 T389274: "Exec error in changeprop" for wikifunctions.beta.wmflabs.org - https://phabricator.wikimedia.org/T389274
[21:35:30] <Amir1>	 the deployment just finished
[21:37:57] <wikibugs>	 (CR) Andrea Denisse: "Hey! Quick heads-up on the `LibericaEtcdErrors` test, I had to make a few changes to get CI to pass." [alerts] - https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: Vgutierrez)
[21:37:59] <wikibugs>	 (PS2) Andrew Bogott: Replace cloudcontrol1005 with cloudcontrol1011 [puppet] - https://gerrit.wikimedia.org/r/1135118 (https://phabricator.wikimedia.org/T391300)
[21:38:00] <wikibugs>	 (PS1) Andrew Bogott: Add cloudcontrol role to cloudcontrol1011 [puppet] - https://gerrit.wikimedia.org/r/1135120 (https://phabricator.wikimedia.org/T391300)
[21:38:24] <Amir1>	 https://usercontent.irccloud-cdn.com/file/SBGPrPqR/grafik.png
[21:38:54] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135120 (https://phabricator.wikimedia.org/T391300) (owner: Andrew Bogott)
[21:39:01] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135118 (https://phabricator.wikimedia.org/T391300) (owner: Andrew Bogott)
[21:39:43] <swfrench-wmf>	 Amir1: nice! is that the rate of "persisting for unknown reason" events?
[21:40:05] <Amir1>	 swfrench-wmf: POST to sessionstore altogether 
[21:40:06] <Amir1>	 https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&from=now-3h&to=now&viewPanel=11
[21:40:41] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Add cloudcontrol role to cloudcontrol1011 [puppet] - https://gerrit.wikimedia.org/r/1135120 (https://phabricator.wikimedia.org/T391300) (owner: Andrew Bogott)
[21:40:43] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2152.codfw.wmnet with reason: Maintenance
[21:40:45] <swfrench-wmf>	 that looks quite promising :)
[21:40:50] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T391056)', diff saved to https://phabricator.wikimedia.org/P74758 and previous config saved to /var/cache/conftool/dbconfig/20250408-214049-fceratto.json
[21:40:53] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[21:43:42] <wikibugs>	 (PS1) Ryan Kemper: elastic: remove row A worker hosts [puppet] - https://gerrit.wikimedia.org/r/1135121 (https://phabricator.wikimedia.org/T388610)
[21:44:15] <urandom>	 Amir1: that's a pretty significant drop
[21:44:30] <wikibugs>	 (PS2) Ryan Kemper: elastic: remove row A worker hosts [puppet] - https://gerrit.wikimedia.org/r/1135121 (https://phabricator.wikimedia.org/T388610)
[21:45:07] <Amir1>	 it's not back to the pre-SUL3 era but much much better
[21:45:42] <urandom>	 yeah, what is that...about 20%
[21:46:22] <swfrench-wmf>	 Amir1: I've not had a chance today to get up to speed on the details, but is this SUL3-specific code path that was duplicating? or was this an existing inefficiency?
[21:46:29] <urandom>	 hrmm, maybe more like 16%?
[21:46:36] <urandom>	 but for a one line code change, I like it!
[21:47:09] <wikibugs>	 (CR) Bking: [C:+1] elastic: remove row A worker hosts [puppet] - https://gerrit.wikimedia.org/r/1135121 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper)
[21:47:45] <Amir1>	 swfrench-wmf: I think it was existing already. But most importantly it might have been masked by something somewhere calling the login token 
[21:48:00] <Amir1>	 and some improvements unmasked it
[21:48:08] <Amir1>	 regardless. It's nice to have for sure
[21:48:25] <swfrench-wmf>	 got it, thanks! and yeah, very nice to have either way :)
[21:48:41] <urandom>	 Amir1: yeah, and these fell into the senseless overwrite bucket, right?
[21:48:52] <Amir1>	 yup
[21:48:56] <wikibugs>	 (CR) Ryan Kemper: [C:+2] elastic: remove row A worker hosts [puppet] - https://gerrit.wikimedia.org/r/1135121 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper)
[21:49:03] <urandom>	 yeah, eliminating these will help
[21:49:49] <Amir1>	 got slightly even lower https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&from=now-3h&to=now&viewPanel=11
[21:51:05] <urandom>	 I think we're at the time of day when request volume is declining
[21:51:31] <urandom>	 partway between the peak and trough 
[21:51:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T391056)', diff saved to https://phabricator.wikimedia.org/P74759 and previous config saved to /var/cache/conftool/dbconfig/20250408-215159-fceratto.json
[21:52:02] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[21:56:16] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3615 MB (3% inode=98%): /tmp 3615 MB (3% inode=98%): /var/tmp 3615 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[21:57:04] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610
[21:57:07] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[21:58:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2087 to cirrussearch2087
[21:58:35] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[21:58:38] <wikibugs>	 ops-eqiad, SRE, Ceph, Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10724011 (Jclark-ctr)
[21:58:58] <urandom>	 Amir1: here it is from the other end — https://grafana-rw.wikimedia.org/d/4plhqSPGk/bagostuff-stats-by-key-group?orgId=1&var-kClass=MWSession&from=1744138716070&to=1744149516070&forceLogin=&viewPanel=40
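The dashboard link above pins an absolute window: Grafana's `from`/`to` query parameters hold epoch milliseconds. A minimal Python sketch to decode them (the helper name is hypothetical, and it only handles absolute values, not relative ones like the `now-3h` in Amir1's earlier link) shows the window is the three hours ending 21:58:36 UTC. That lines up with the log's own clock, which suggests the log timestamps are UTC as well.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def grafana_time_range(url):
    """Decode absolute Grafana from/to parameters (epoch ms) into UTC datetimes.

    Hypothetical helper; relative values such as "now-3h" would raise ValueError.
    """
    query = parse_qs(urlparse(url).query)
    # timedelta(milliseconds=int) keeps the arithmetic exact (no float rounding).
    start = EPOCH + timedelta(milliseconds=int(query["from"][0]))
    end = EPOCH + timedelta(milliseconds=int(query["to"][0]))
    return start, end


url = ("https://grafana-rw.wikimedia.org/d/4plhqSPGk/bagostuff-stats-by-key-group"
       "?orgId=1&var-kClass=MWSession&from=1744138716070&to=1744149516070"
       "&forceLogin=&viewPanel=40")
start, end = grafana_time_range(url)
print(start.isoformat(), end.isoformat(), end - start)
```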
[21:59:06] <wikibugs>	 ops-eqiad, SRE, Ceph, Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10724027 (Jclark-ctr) @Andrew @dcaro installed 8tb ssd drive
[22:00:02] <urandom>	 yeah, I'd call that a solid 15%.  Nothing to sneeze at for a one-liner!
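The 20%/16%/15% estimates are relative drops in the sessionstore write rate, measured against the pre-change rate. A minimal sketch of that arithmetic follows; the rates in it are hypothetical and were not read from the dashboards linked in the log.

```python
def relative_drop(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after` (0.15 means a 15% drop)."""
    if before <= 0:
        raise ValueError("baseline rate must be positive")
    return (before - after) / before


# Hypothetical request rates, NOT taken from the linked dashboards.
print(f"{relative_drop(2000.0, 1700.0):.0%}")
```

The baseline matters: dividing the same 300-unit change by the lower, post-change rate (300/1700) would report roughly 18% and overstate the improvement.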
[22:00:31] <Amir1>	 I'm trying to debug further
[22:02:04] <urandom>	 Amir1: don't forget to eat and get some sleep too!
[22:02:23] <Amir1>	 shit, I forgot to make dinner
[22:02:45] <Amir1>	 I go eat something, I will check again afterwards
[22:02:58] <ryankemper>	 !log T388610 Elasticsearch->Opensearch row a data node migration ongoing
[22:03:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:03:01] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[22:03:33] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2087 to cirrussearch2087 - bking@cumin2002"
[22:04:31] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2087 to cirrussearch2087 - bking@cumin2002"
[22:04:31] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:04:32] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2087
[22:04:51] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2087
[22:05:31] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2087 to cirrussearch2087
[22:07:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P74760 and previous config saved to /var/cache/conftool/dbconfig/20250408-220706-fceratto.json
[22:12:33] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610
[22:12:36] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[22:22:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P74761 and previous config saved to /var/cache/conftool/dbconfig/20250408-222213-fceratto.json
[22:28:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2087.codfw.wmnet on all recursors
[22:28:59] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2087.codfw.wmnet on all recursors
[22:29:44] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[22:30:26] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2087.codfw.wmnet with OS bullseye
[22:30:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2087
[22:32:52] <wikibugs>	 (PS3) Andrew Bogott: Replace cloudcontrol1005 with cloudcontrol1011 [puppet] - https://gerrit.wikimedia.org/r/1135118 (https://phabricator.wikimedia.org/T391300)
[22:33:04] <wikibugs>	 (PS4) Andrew Bogott: Replace cloudcontrol1005 with cloudcontrol1011 [puppet] - https://gerrit.wikimedia.org/r/1135118 (https://phabricator.wikimedia.org/T391300)
[22:33:34] <wikibugs>	 (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135118 (https://phabricator.wikimedia.org/T391300) (owner: Andrew Bogott)
[22:33:36] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Effie! This LGTM from first principles, but I'm also minimally familiar with the logstash configuration here." [puppet] - https://gerrit.wikimedia.org/r/1135020 (owner: Effie Mouzeli)
[22:33:58] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2047 to codfw - jhancock@cumin2002"
[22:34:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2047 to codfw - jhancock@cumin2002"
[22:34:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:34:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[22:37:05] <wikibugs>	 (PS1) Ryan Kemper: sre.elasticsearch.rolling-operation: handle negative caches between rename/reimage [cookbooks] - https://gerrit.wikimedia.org/r/1135133 (https://phabricator.wikimedia.org/T383811)
[22:37:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T391056)', diff saved to https://phabricator.wikimedia.org/P74762 and previous config saved to /var/cache/conftool/dbconfig/20250408-223721-fceratto.json
[22:37:24] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
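db2152 was depooled at 21:40:50 and then repooled through four separate dbctl commits. The pooling weight used at each step is not visible in the log, so this sketch only measures the spacing between those four commits, using the timestamps copied from the lines above. It assumes all timestamps fall on the same day, which holds here. It is an illustration, not dbctl's or the cookbook's own logic.

```python
from datetime import datetime

# Timestamps of the four "Repooling after maintenance db2152" dbctl commits
# in this log (same day, so only HH:MM:SS is needed).
REPOOL_TIMES = ["21:51:59", "22:07:07", "22:22:14", "22:37:21"]


def step_gaps(times):
    """Return the seconds elapsed between consecutive HH:MM:SS timestamps."""
    parsed = [datetime.strptime(t, "%H:%M:%S") for t in times]
    return [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]


gaps = step_gaps(REPOOL_TIMES)
for g in gaps:
    print(f"{g // 60}m{g % 60:02d}s")
```

The steps come out almost exactly 15 minutes apart. The db2154 repool later in the log (22:50:28 to 23:35:50) follows the same cadence.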
[22:37:37] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2154.codfw.wmnet with reason: Maintenance
[22:37:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T391056)', diff saved to https://phabricator.wikimedia.org/P74763 and previous config saved to /var/cache/conftool/dbconfig/20250408-223744-fceratto.json
[22:38:56] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2087 - bking@cumin2002"
[22:39:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2087 - bking@cumin2002"
[22:39:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:39:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2087.codfw.wmnet 90.0.192.10.in-addr.arpa 0.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[22:39:06] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2087.codfw.wmnet 90.0.192.10.in-addr.arpa 0.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
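Besides the forward name, the wipe-cache run above flushes the two reverse-DNS (PTR) owner names for cirrussearch2087 on all recursors. Python's standard `ipaddress` module builds those names from an address. IPv4 uses its octets in reverse order under in-addr.arpa; IPv6 uses all 32 hex nibbles in reverse order under ip6.arpa. The two addresses in this sketch were decoded from the PTR names in the log line itself and are not stated anywhere else in the log.

```python
import ipaddress

# Addresses recovered from the PTR names in the wipe-cache log line
# (not stated elsewhere in the log).
CIRRUSSEARCH2087_V4 = ipaddress.ip_address("10.192.0.90")
CIRRUSSEARCH2087_V6 = ipaddress.ip_address("2620:0:860:101:10:192:0:90")

for addr in (CIRRUSSEARCH2087_V4, CIRRUSSEARCH2087_V6):
    # reverse_pointer (Python 3.5+) returns the in-addr.arpa / ip6.arpa owner
    # name: reversed octets for IPv4, 32 reversed hex nibbles for IPv6.
    print(addr.reverse_pointer)
```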
[22:39:06] <wikibugs>	 (CR) Bking: [C:+1] sre.elasticsearch.rolling-operation: handle negative caches between rename/reimage [cookbooks] - https://gerrit.wikimedia.org/r/1135133 (https://phabricator.wikimedia.org/T383811) (owner: Ryan Kemper)
[22:39:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2087
[22:39:18] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2087
[22:39:19] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2087
[22:40:08] <wikibugs>	 (CR) Andrew Bogott: [C:+2] Replace cloudcontrol1005 with cloudcontrol1011 [puppet] - https://gerrit.wikimedia.org/r/1135118 (https://phabricator.wikimedia.org/T391300) (owner: Andrew Bogott)
[22:40:49] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[22:40:51] <logmsgbot>	 !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[22:40:52] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[22:40:54] <logmsgbot>	 !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[22:40:55] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply
[22:40:58] <logmsgbot>	 !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[22:41:18] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1134764 (owner: Ryan Kemper)
[22:47:42] <jinxer-wm>	 FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[22:48:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610
[22:48:03] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[22:49:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2069 to cirrussearch2069
[22:49:19] <wikibugs>	 (CR) Scott French: [C:+1] "Thank you, Effie!" [puppet] - https://gerrit.wikimedia.org/r/1135021 (owner: Effie Mouzeli)
[22:49:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[22:50:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T391056)', diff saved to https://phabricator.wikimedia.org/P74764 and previous config saved to /var/cache/conftool/dbconfig/20250408-225028-fceratto.json
[22:50:31] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[22:53:41] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2069 to cirrussearch2069 - bking@cumin2002"
[22:54:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2069 to cirrussearch2069 - bking@cumin2002"
[22:54:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:54:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2069
[22:54:59] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2069
[22:55:02] <wikibugs>	 (PS1) Andrew Bogott: Remove final traces of cloudcontrol1005.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/1135136 (https://phabricator.wikimedia.org/T391413)
[22:55:39] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2069 to cirrussearch2069
[22:55:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2069.codfw.wmnet on all recursors
[22:55:43] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2069.codfw.wmnet on all recursors
[22:56:32] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2069.codfw.wmnet with OS bullseye
[22:56:43] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2069
[22:56:52] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[22:56:54] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2087.codfw.wmnet with reason: host reimage
[23:02:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2069 - bking@cumin2002"
[23:02:23] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2069 - bking@cumin2002"
[23:02:23] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:02:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2069.codfw.wmnet 142.0.192.10.in-addr.arpa 2.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[23:02:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2069.codfw.wmnet 142.0.192.10.in-addr.arpa 2.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
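The inverse direction is useful when reading lines like the cirrussearch2069 wipe-cache above: the PTR names can be turned back into the addresses they describe. The helper below is a hypothetical sketch that handles only full (not partial-zone) reverse names. Decoding these names gives 10.192.0.142 and 2620:0:860:101:10:192:0:142. For both hosts in this log, the last four IPv6 groups repeat the IPv4 address's numbers; the log does not say whether that is a deliberate addressing convention.

```python
import ipaddress


def ptr_to_address(name: str):
    """Invert a full reverse-DNS owner name back to the IP address it names.

    Hypothetical helper: partial (zone-level) names are rejected.
    """
    name = name.rstrip(".").lower()
    if name.endswith(".in-addr.arpa"):
        octets = name[: -len(".in-addr.arpa")].split(".")
        return ipaddress.IPv4Address(".".join(reversed(octets)))
    if name.endswith(".ip6.arpa"):
        nibbles = name[: -len(".ip6.arpa")].split(".")
        if len(nibbles) != 32:
            raise ValueError("expected 32 nibbles in a full ip6.arpa name")
        # Nibbles are least-significant first; reverse them to get the hex value.
        return ipaddress.IPv6Address(int("".join(reversed(nibbles)), 16))
    raise ValueError(f"not a reverse-DNS name: {name}")


v4 = ptr_to_address("142.0.192.10.in-addr.arpa")
v6 = ptr_to_address(
    "2.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa")
print(v4, v6)
```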
[23:02:28] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2069
[23:02:38] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2087.codfw.wmnet with reason: host reimage
[23:02:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2069
[23:02:40] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2069
[23:05:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P74765 and previous config saved to /var/cache/conftool/dbconfig/20250408-230535-fceratto.json
[23:16:16] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3531 MB (3% inode=98%): /tmp 3531 MB (3% inode=98%): /var/tmp 3531 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[23:20:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P74766 and previous config saved to /var/cache/conftool/dbconfig/20250408-232042-fceratto.json
[23:22:42] <jinxer-wm>	 RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[23:24:52] <wikibugs>	 (PS2) Effie Mouzeli: logging: add support for php 8.1 [puppet] - https://gerrit.wikimedia.org/r/1135020
[23:28:10] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2087.codfw.wmnet with OS bullseye
[23:29:32] <wikibugs>	 (PS1) Bking: cirrussearch: fix rack a7 regex [puppet] - https://gerrit.wikimedia.org/r/1135140 (https://phabricator.wikimedia.org/T388610)
[23:31:42] <wikibugs>	 (CR) Bking: [C:+2] "Self-merging interest of time" [puppet] - https://gerrit.wikimedia.org/r/1135140 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[23:33:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1132:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1132 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[23:35:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2069.codfw.wmnet with reason: host reimage
[23:35:43] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:35:50] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T391056)', diff saved to https://phabricator.wikimedia.org/P74767 and previous config saved to /var/cache/conftool/dbconfig/20250408-233549-fceratto.json
[23:35:53] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[23:36:05] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2162.codfw.wmnet with reason: Maintenance
[23:36:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T391056)', diff saved to https://phabricator.wikimedia.org/P74768 and previous config saved to /var/cache/conftool/dbconfig/20250408-233611-fceratto.json
[23:38:57] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2069.codfw.wmnet with reason: host reimage
[23:39:44] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1135141
[23:39:44] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1135141 (owner: TrainBranchBot)
[23:42:13] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:47:42] <jinxer-wm>	 FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem
[23:48:51] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T391056)', diff saved to https://phabricator.wikimedia.org/P74769 and previous config saved to /var/cache/conftool/dbconfig/20250408-234850-fceratto.json
[23:48:54] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[23:51:27] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1135141 (owner: TrainBranchBot)
[23:59:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2069.codfw.wmnet with OS bullseye
[23:59:36] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610
[23:59:41] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610