[00:01:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T391056)', diff saved to https://phabricator.wikimedia.org/P74663 and previous config saved to /var/cache/conftool/dbconfig/20250408-000130-fceratto.json [00:01:33] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [00:09:53] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1134773 [00:09:53] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1134773 (owner: 10TrainBranchBot) [00:12:15] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002" [00:12:16] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1202.eqiad.wmnet with OS bullseye [00:16:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P74664 and previous config saved to /var/cache/conftool/dbconfig/20250408-001637-fceratto.json [00:17:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:21:03] !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1202.eqiad.wmnet [00:22:52] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1202.eqiad.wmnet [00:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10719822 (10phaultfinder) [00:24:39] FIRING: 
CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [00:24:55] (03PS1) 10Superpes15: [ptwiktionary] Create a Wikisaurus namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134776 (https://phabricator.wikimedia.org/T391299) [00:26:17] (03PS1) 10Btullis: Bring an-worker1202 into service [puppet] - 10https://gerrit.wikimedia.org/r/1134777 (https://phabricator.wikimedia.org/T390048) [00:26:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10719827 (10BTullis) a:05Jclark-ctr→03BTullis [00:27:22] (03CR) 10Btullis: [C:03+2] Bring an-worker1202 into service [puppet] - 10https://gerrit.wikimedia.org/r/1134777 (https://phabricator.wikimedia.org/T390048) (owner: 10Btullis) [00:27:50] (03PS2) 10Superpes15: [ptwiktionary] Create a Wikisaurus namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134776 (https://phabricator.wikimedia.org/T391299) [00:31:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P74665 and previous config saved to /var/cache/conftool/dbconfig/20250408-003144-fceratto.json [00:37:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [00:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [00:40:12] (03PS1) 10Dzahn: 
phabricator: apply a staging role/profile to host phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134778 (https://phabricator.wikimedia.org/T377889) [00:40:36] (03CR) 10CI reject: [V:04-1] phabricator: apply a staging role/profile to host phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134778 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [00:41:43] (03PS1) 10Dzahn: phabricator: apply phabricator::migration role on host phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134779 (https://phabricator.wikimedia.org/T377889) [00:43:03] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1202.eqiad.wmnet [00:43:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10719844 (10ops-monitoring-bot) Host rebooted by btullis@cumin1002 with reason: Reboot after moving vlan and commission... [00:44:46] (03PS2) 10Dzahn: phabricator: apply phabricator::migration role on host phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134779 (https://phabricator.wikimedia.org/T377889) [00:46:24] (03CR) 10Dzahn: "well, I need to fix the nftables/ferm change to minimize the diff.. but: https://puppet-compiler.wmflabs.org/output/1134779/5226/phab1005." 
[puppet] - 10https://gerrit.wikimedia.org/r/1134779 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [00:46:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T391056)', diff saved to https://phabricator.wikimedia.org/P74666 and previous config saved to /var/cache/conftool/dbconfig/20250408-004652-fceratto.json [00:46:56] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [00:47:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2180.codfw.wmnet with reason: Maintenance [00:47:13] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:47:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T391056)', diff saved to https://phabricator.wikimedia.org/P74667 and previous config saved to /var/cache/conftool/dbconfig/20250408-004715-fceratto.json [00:48:00] (03CR) 10Dzahn: "This is an alternative to https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134779 which does not do the scap setup at all and would o" [puppet] - 10https://gerrit.wikimedia.org/r/1134778 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [00:48:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T391056)', diff saved to https://phabricator.wikimedia.org/P74668 and previous config saved to /var/cache/conftool/dbconfig/20250408-004827-fceratto.json [00:48:33] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1202.eqiad.wmnet [00:49:18] (03PS2) 10Dzahn: phabricator: apply a staging role/profile to host phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134778 
(https://phabricator.wikimedia.org/T377889) [00:49:41] (03CR) 10CI reject: [V:04-1] phabricator: apply a staging role/profile to host phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134778 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [00:54:10] (03PS1) 10Dzahn: phabricator::migration: use nftables as firewall provider [puppet] - 10https://gerrit.wikimedia.org/r/1134781 (https://phabricator.wikimedia.org/T370677) [00:56:00] (03PS3) 10Dzahn: phabricator: apply a staging role/profile to host phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134778 (https://phabricator.wikimedia.org/T377889) [00:57:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [00:58:44] (03CR) 10Dzahn: [V:03+1 C:03+2] "currently no server is using this role (verified with cumin), so self merging this. the real diff will be when the role is applied to phab" [puppet] - 10https://gerrit.wikimedia.org/r/1134781 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [01:01:06] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1134773 (owner: 10TrainBranchBot) [01:03:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P74669 and previous config saved to /var/cache/conftool/dbconfig/20250408-010334-fceratto.json [01:04:51] (03CR) 10Dzahn: "reduced diff https://puppet-compiler.wmflabs.org/output/1134779/5227/phab1005.eqiad.wmnet/index.html after https://gerrit.wikimedia.org/r" [puppet] - 10https://gerrit.wikimedia.org/r/1134779 (https://phabricator.wikimedia.org/T377889) (owner: 10Dzahn) [01:09:53] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.44.0-wmf.24 [core] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1134782 (https://phabricator.wikimedia.org/T386219) [01:09:55] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.44.0-wmf.24 [core] (wmf/1.44.0-wmf.24) - 
10https://gerrit.wikimedia.org/r/1134782 (https://phabricator.wikimedia.org/T386219) (owner: 10TrainBranchBot) [01:18:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P74670 and previous config saved to /var/cache/conftool/dbconfig/20250408-011841-fceratto.json [01:19:17] (03PS1) 10Scott French: scap: Use PHP 8.1 when executing maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/1134758 (https://phabricator.wikimedia.org/T390225) [01:20:53] (03Merged) 10jenkins-bot: Branch commit for wmf/1.44.0-wmf.24 [core] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1134782 (https://phabricator.wikimedia.org/T386219) (owner: 10TrainBranchBot) [01:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10719947 (10phaultfinder) [01:29:33] FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:33:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T391056)', diff saved to https://phabricator.wikimedia.org/P74671 and previous config saved to /var/cache/conftool/dbconfig/20250408-013348-fceratto.json [01:33:51] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [01:34:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2193.codfw.wmnet with reason: Maintenance [01:34:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T391056)', diff saved to https://phabricator.wikimedia.org/P74672 and previous config saved to 
/var/cache/conftool/dbconfig/20250408-013412-fceratto.json [01:36:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T391056)', diff saved to https://phabricator.wikimedia.org/P74673 and previous config saved to /var/cache/conftool/dbconfig/20250408-013625-fceratto.json [01:51:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P74674 and previous config saved to /var/cache/conftool/dbconfig/20250408-015132-fceratto.json [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T0200) [02:06:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P74675 and previous config saved to /var/cache/conftool/dbconfig/20250408-020639-fceratto.json [02:21:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T391056)', diff saved to https://phabricator.wikimedia.org/P74676 and previous config saved to /var/cache/conftool/dbconfig/20250408-022146-fceratto.json [02:21:50] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [02:22:02] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2197.codfw.wmnet with reason: Maintenance [02:25:31] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2214.codfw.wmnet with reason: Maintenance [02:25:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T391056)', diff saved to https://phabricator.wikimedia.org/P74677 and previous config saved to /var/cache/conftool/dbconfig/20250408-022538-fceratto.json [02:30:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 
(T391056)', diff saved to https://phabricator.wikimedia.org/P74678 and previous config saved to /var/cache/conftool/dbconfig/20250408-023047-fceratto.json
[02:30:51] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[02:45:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P74679 and previous config saved to /var/cache/conftool/dbconfig/20250408-024555-fceratto.json
[03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T0300)
[03:01:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P74680 and previous config saved to /var/cache/conftool/dbconfig/20250408-030102-fceratto.json
[03:01:45] (PS1) TrainBranchBot: testwikis to 1.44.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1134788 (https://phabricator.wikimedia.org/T386219)
[03:01:47] (CR) TrainBranchBot: [C:+2] testwikis to 1.44.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1134788 (https://phabricator.wikimedia.org/T386219) (owner: TrainBranchBot)
[03:02:37] (Merged) jenkins-bot: testwikis to 1.44.0-wmf.24 [mediawiki-config] - https://gerrit.wikimedia.org/r/1134788 (https://phabricator.wikimedia.org/T386219) (owner: TrainBranchBot)
[03:02:59] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.44.0-wmf.24 refs T386219
[03:03:02] T386219: 1.44.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T386219
[03:16:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T391056)', diff saved to https://phabricator.wikimedia.org/P74681 and previous config saved to
/var/cache/conftool/dbconfig/20250408-031609-fceratto.json [03:16:13] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [03:16:25] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2217.codfw.wmnet with reason: Maintenance [03:16:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T391056)', diff saved to https://phabricator.wikimedia.org/P74682 and previous config saved to /var/cache/conftool/dbconfig/20250408-031632-fceratto.json [03:21:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T391056)', diff saved to https://phabricator.wikimedia.org/P74683 and previous config saved to /var/cache/conftool/dbconfig/20250408-032145-fceratto.json [03:21:49] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [03:35:42] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:36:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P74684 and previous config saved to /var/cache/conftool/dbconfig/20250408-033652-fceratto.json [03:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:52:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P74685 and previous config saved to /var/cache/conftool/dbconfig/20250408-035159-fceratto.json [04:00:05] Deploy window Automatic 
removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T0400) [04:06:42] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.44.0-wmf.24 refs T386219 (duration: 63m 43s) [04:06:45] T386219: 1.44.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T386219 [04:07:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T391056)', diff saved to https://phabricator.wikimedia.org/P74686 and previous config saved to /var/cache/conftool/dbconfig/20250408-040706-fceratto.json [04:07:09] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [04:07:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2224.codfw.wmnet with reason: Maintenance [04:07:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2224 (T391056)', diff saved to https://phabricator.wikimedia.org/P74687 and previous config saved to /var/cache/conftool/dbconfig/20250408-040728-fceratto.json [04:09:28] !log mwpresync@deploy1003 Pruned MediaWiki: 1.44.0-wmf.21 (duration: 09m 26s) [04:12:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T391056)', diff saved to https://phabricator.wikimedia.org/P74688 and previous config saved to /var/cache/conftool/dbconfig/20250408-041241-fceratto.json [04:12:44] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [04:17:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that 
power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:27:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P74689 and previous config saved to /var/cache/conftool/dbconfig/20250408-042748-fceratto.json [04:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [04:42:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [04:42:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P74690 and previous config saved to /var/cache/conftool/dbconfig/20250408-044254-fceratto.json [04:47:13] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:58:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T391056)', diff saved to https://phabricator.wikimedia.org/P74691 and previous config saved to /var/cache/conftool/dbconfig/20250408-045801-fceratto.json [04:58:05] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [05:02:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [05:10:42] FIRING: [2x] JobUnavailable: Reduced 
availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:19:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:22:27] (03PS3) 10Jelto: Ceph: add types for S3 credential and account [puppet] - 10https://gerrit.wikimedia.org/r/1133916 (https://phabricator.wikimedia.org/T378922) [05:23:38] (03CR) 10Jelto: Ceph: add types for S3 credential and account (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1133916 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [05:24:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [05:29:33] FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:59:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T0600) [06:00:05] marostegui, Amir1, and federico3: That opportune time for a Primary database switchover deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T0600). [06:00:42] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:04:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:19:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:24:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:et-1/1/2 (Transport: cr1-codfw:et-1/0/2 (Arelion, IC-374549) {#20231106}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [06:35:47] (03CR) 10Jelto: [V:03+1 C:03+2] trafficserver: switch querybuilder scholarly to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1134697 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [06:36:27] 06SRE, 
06DBA, 10vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#10720225 (10Marostegui) What is pending here @Ladsgroup? [06:41:24] (03PS1) 10Marostegui: db1151,db2144: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1134926 (https://phabricator.wikimedia.org/T391317) [06:42:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool ms2 T391317', diff saved to https://phabricator.wikimedia.org/P74692 and previous config saved to /var/cache/conftool/dbconfig/20250408-064250-marostegui.json [06:42:53] T391317: Migrate msX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391317 [06:43:31] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2144.codfw.wmnet,db1151.eqiad.wmnet with reason: Maintenance [06:43:45] (03CR) 10Marostegui: [C:03+2] db1151,db2144: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1134926 (https://phabricator.wikimedia.org/T391317) (owner: 10Marostegui) [06:45:36] !log Upgrade ms2 to MariaDB 10.11 codfw eqiad dbmaint T391317 [06:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool ms2 T391317', diff saved to https://phabricator.wikimedia.org/P74693 and previous config saved to /var/cache/conftool/dbconfig/20250408-064813-marostegui.json [06:48:16] T391317: Migrate msX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391317 [07:00:04] Amir1, Urbanecm, and awight: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T0700). [07:00:04] abijeet and kevinbazira: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[07:00:36] o/
[07:02:36] here for abijeet's change deployment + testing.
[07:02:48] I can start with first two changes.
[07:03:05] o/
[07:03:32] kart_, we should do this one first: [config] 1130963 (deploy commands) AX: Enable Quick Surveys extension on Tswana and Venetian wiki - task T390023
[07:03:33] T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023
[07:03:40] abijeet: Starting with the first change
[07:03:43] yes
[07:04:02] kart_, either is fine but preferable to do that one first.
[07:04:34] (PS6) Abijeet Patro: AX: Enable Quick Surveys extension on Tswana and Venetian wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1130963 (https://phabricator.wikimedia.org/T390023)
[07:04:52] rebasing it to make sure we don't get merge conflict in 2nd as well.
[07:06:41] abijeet: should it be 'vecwiki' and 'tnwiki' instead of 'vec' and 'tn' in the patch?
[07:08:07] kart_, checking
[07:08:31] Otherwise it will enable extension in wiktionary/wikisource as well!
[07:08:46] yes, you are right. Fixing.
[07:08:49] See: https://integration.wikimedia.org/ci/job/operations-mw-config-php81-composer-diffConfig/273/console and https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig/4289/console
[07:10:07] (PS7) Abijeet Patro: AX: Enable Quick Surveys extension on Tswana and Venetian wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1130963 (https://phabricator.wikimedia.org/T390023)
[07:10:26] kart_, thanks for spotting that!
[07:10:41] kart_, fixed.
[07:11:53] cool. Starting.
[07:12:07] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1130963 (https://phabricator.wikimedia.org/T390023) (owner: Abijeet Patro)
[07:12:56] (Merged) jenkins-bot: AX: Enable Quick Surveys extension on Tswana and Venetian wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1130963 (https://phabricator.wikimedia.org/T390023) (owner: Abijeet Patro)
[07:13:50] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1130963|AX: Enable Quick Surveys extension on Tswana and Venetian wiki (T390023)]]
[07:13:53] T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023
[07:15:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:16:44] (PS1) Marostegui: production-ms.sql.erb: Add file [puppet] - https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332)
[07:16:45] kart_: o/ if you don't mind, please deploy my change too? it's the one following abijeet's 2 changes. thanks in advance!
[07:18:19] (CR) Marostegui: "This is a noop, it is just for tracking" [puppet] - https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[07:18:58] (CR) CI reject: [V:-1] production-ms.sql.erb: Add file [puppet] - https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[07:19:01] (PS2) Marostegui: production-ms.sql.erb: Add file [puppet] - https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332)
[07:20:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[07:21:16] (CR) CI reject: [V:-1] production-ms.sql.erb: Add file [puppet] - https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332) (owner: Marostegui)
[07:21:22] !log kartik@deploy1003 abi, kartik: Backport for [[gerrit:1130963|AX: Enable Quick Surveys extension on Tswana and Venetian wiki (T390023)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:21:24] T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023
[07:21:26] (CR) MVernon: [C:+1] "Thanks for your work on this, this change LGTM :)" [puppet] - https://gerrit.wikimedia.org/r/1133916 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[07:21:27] kevinbazira: sure!
[07:21:41] abijeet: testing please :)
[07:22:20] kart_, on it
[07:25:07] kart_, looks good.
[07:25:08] (PS3) Marostegui: production-ms.sql.erb: Add file [puppet] - https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332)
[07:25:24] cool.
[07:25:26] (03CR) 10Marostegui: [C:03+1] Add apus-fe2003 to hiera and conftool [puppet] - 10https://gerrit.wikimedia.org/r/1134208 (https://phabricator.wikimedia.org/T390578) (owner: 10MVernon) [07:25:26] !log kartik@deploy1003 abi, kartik: Continuing with sync [07:25:51] (03CR) 10Marostegui: [C:03+1] Thanos: add new thanos-fe200[5-7] nodes [puppet] - 10https://gerrit.wikimedia.org/r/1134221 (https://phabricator.wikimedia.org/T389634) (owner: 10MVernon) [07:26:54] (03CR) 10Marostegui: production-ms.sql.erb: Add file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [07:27:46] (03CR) 10Marostegui: [C:03+2] production-ms.sql.erb: Add file [puppet] - 10https://gerrit.wikimedia.org/r/1134928 (https://phabricator.wikimedia.org/T387332) (owner: 10Marostegui) [07:29:51] (03CR) 10Effie Mouzeli: [C:03+1] scap: Use PHP 8.1 when executing maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/1134758 (https://phabricator.wikimedia.org/T390225) (owner: 10Scott French) [07:29:57] (03PS1) 10DCausse: cirrus: disable completion indices in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1134969 (https://phabricator.wikimedia.org/T388610) [07:30:03] (03CR) 10Fabfur: [C:03+2] hiera: cleanup TLS on volatile storage custom files [puppet] - 10https://gerrit.wikimedia.org/r/1134698 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [07:30:22] (03CR) 10CI reject: [V:04-1] cirrus: disable completion indices in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1134969 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse) [07:30:30] (03PS1) 10Slyngshede: IDP: Failover to updated host [dns] - 10https://gerrit.wikimedia.org/r/1134970 [07:31:55] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1185 - https://phabricator.wikimedia.org/T391049#10720341 (10Marostegui) p:05Triage→03Medium [07:32:48] (03CR) 10Slyngshede: [C:03+2] IDP: Failover to updated host [dns] - 
10https://gerrit.wikimedia.org/r/1134970 (owner: 10Slyngshede) [07:33:05] !log slyngshede@dns1004 START - running authdns-update [07:33:55] (03PS2) 10DCausse: cirrus: disable completion indices in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1134969 (https://phabricator.wikimedia.org/T388610) [07:34:18] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130963|AX: Enable Quick Surveys extension on Tswana and Venetian wiki (T390023)]] (duration: 20m 27s) [07:34:21] T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023 [07:35:29] !log slyngshede@dns1004 END - running authdns-update [07:36:42] abijeet: going with second change now. [07:36:55] kart_, okie [07:37:05] kart_, we should keep an eye on MinT [07:37:33] Yes. Check Grafana dashboard. [07:37:42] kart_, ok [07:38:11] (03PS5) 10Abijeet Patro: AX: Enable entry-points on Tswana and Venetian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130942 (https://phabricator.wikimedia.org/T390023) [07:39:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130942 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro) [07:40:41] (03Merged) 10jenkins-bot: AX: Enable entry-points on Tswana and Venetian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130942 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro) [07:41:02] jouncebot: nowandnext [07:41:02] For the next 0 hour(s) and 18 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T0700) [07:41:02] In 2 hour(s) and 18 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1000) [07:41:03] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1130942|AX: Enable entry-points on Tswana and Venetian wiki 
(T390023)]] [07:41:06] T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023 [07:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:48:20] !log kartik@deploy1003 abi, kartik: Backport for [[gerrit:1130942|AX: Enable entry-points on Tswana and Venetian wiki (T390023)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:48:23] T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023 [07:49:00] abijeet: testing time. [07:49:21] kart_, ok [07:53:40] kart_, looks ok. I can see at least one entrypoint working as expected. I can test the others later. [07:55:35] OK. Let's go ahead. [07:55:37] !log kartik@deploy1003 abi, kartik: Continuing with sync [07:56:38] (03PS3) 10Kevin Bazira: EventStreamConfig: Add RRLA prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133603 (https://phabricator.wikimedia.org/T326179) [07:56:40] kart_, sorry. one sec [07:57:33] I'm seeing this error on wikis: Error: Cannot require undefined file ../codex.js ArticleFooterEntrypointCard.vue:2 [07:57:34] require startup.js:1016 [07:57:34] js ext.ax.articlefooter.entrypoint.js:3 [07:58:00] abijeet: oops I started the deployment. Do you want to revert it? [07:58:21] Or we can do followup fix later? [07:58:53] I can submit a patch to fix immediately. Maybe we want to fix it and backport later? [07:59:42] OK. Please submit it, we can backport it and deploy later today. 
[08:00:44] (03CR) 10Fabfur: [C:03+2] external_cloud_vendors: Added Google SpecialCaseCrawlers list [puppet] - 10https://gerrit.wikimedia.org/r/1134243 (https://phabricator.wikimedia.org/T391108) (owner: 10Fabfur) [08:02:37] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130942|AX: Enable entry-points on Tswana and Venetian wiki (T390023)]] (duration: 21m 33s) [08:02:40] T390023: MinT for Wiki Readers MVP: Pre-Pilot enablement on 4 wikis - https://phabricator.wikimedia.org/T390023 [08:03:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133317 (https://phabricator.wikimedia.org/T384455) (owner: 10Seanleong-wmde) [08:03:55] kevinbazira: going with your change. Around? [08:04:04] yes, I am. tx [08:04:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133603 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [08:05:21] (03CR) 10MVernon: [C:03+2] Add apus-fe2003 to hiera and conftool [puppet] - 10https://gerrit.wikimedia.org/r/1134208 (https://phabricator.wikimedia.org/T390578) (owner: 10MVernon) [08:05:27] (03Merged) 10jenkins-bot: EventStreamConfig: Add RRLA prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133603 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [08:05:52] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1133603|EventStreamConfig: Add RRLA prediction_change stream (T326179)]] [08:05:55] T326179: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179 [08:12:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool ms1 T391317', diff saved to https://phabricator.wikimedia.org/P74694 and previous config 
saved to /var/cache/conftool/dbconfig/20250408-081224-marostegui.json [08:12:28] T391317: Migrate msX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391317 [08:12:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool ms1 T391317', diff saved to https://phabricator.wikimedia.org/P74695 and previous config saved to /var/cache/conftool/dbconfig/20250408-081248-marostegui.json [08:12:58] !log kartik@deploy1003 kartik, kevinbazira: Backport for [[gerrit:1133603|EventStreamConfig: Add RRLA prediction_change stream (T326179)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:13:01] T326179: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179 [08:13:24] kevinbazira: possible to test on the testservers? [08:13:50] kart-: we should be able to see `mediawiki.page_revert_risk_prediction_change.v1` listed on: [08:13:50] https://meta.wikimedia.org/w/api.php?action=streamconfigs and [08:13:50] https://meta.wikimedia.beta.wmflabs.org/w/api.php?action=streamconfigs [08:13:50] but I don't see it yet. could be a cache issue on my end. [08:16:09] (03PS1) 10Abijeet Patro: ArticleFooterEntrypointCard: Change the way codex is loaded [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134976 (https://phabricator.wikimedia.org/T389176) [08:16:29] (03PS1) 10Abijeet Patro: ArticleFooterEntrypointCard: Change the way codex is loaded [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1134977 (https://phabricator.wikimedia.org/T389176) [08:16:44] kevinbazira: I can see it. Did you select k8s-mwdebug to test? [08:17:12] great! 
[08:17:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:17:25] I can see in meta.w.o but not in beta yet. [08:19:29] kevinbazira: should we go ahead for deployment? [08:19:38] yes please [08:22:13] !log kartik@deploy1003 kartik, kevinbazira: Continuing with sync [08:23:42] (03CR) 10Volans: [C:04-1] "I think there is still a small problem, the rest looks ok." [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [08:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:25:07] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134976 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [08:25:15] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1134977 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [08:28:42] !log pool apus-fe2003 T390578 [08:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:44] T390578: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578 [08:29:04] !log mvernon@cumin2002 
conftool action : set/weight=40; selector: service=apus,name=apus-fe2003.codfw.wmnet [08:29:04] kart_: now I can see `mediawiki.page_revert_risk_prediction_change.v1` listed on both: [08:29:05] https://meta.wikimedia.org/w/api.php?action=streamconfigs and [08:29:05] https://meta.wikimedia.beta.wmflabs.org/w/api.php?action=streamconfigs [08:29:05] thanks a lot for your help. :) [08:29:09] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: service=apus,name=apus-fe2003.codfw.wmnet [08:29:14] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133603|EventStreamConfig: Add RRLA prediction_change stream (T326179)]] (duration: 23m 21s) [08:29:17] T326179: Emit revision revert risk scores as a stream and expose in EventStreams API - https://phabricator.wikimedia.org/T326179 [08:29:54] Cool [08:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [08:39:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134064 (https://phabricator.wikimedia.org/T389429) (owner: 10Ebernhardson) [08:46:29] (03PS1) 10Brouberol: airflow: set saner performance-related configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134985 (https://phabricator.wikimedia.org/T390945) [08:47:01] (03PS2) 10Wargo: search-redirect: fix case-sensitivity of project name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) [08:47:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [08:47:13] FIRING: [2x] SystemdUnitFailed: 
curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:47:51] (03PS2) 10Brouberol: airflow: set saner performance-related configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134985 (https://phabricator.wikimedia.org/T390945) [08:58:10] (03PS1) 10Alexandros Kosiaris: mw-wikifunctions: Remove the temporary -ingress DNS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134987 [09:02:45] (03PS2) 10Alexandros Kosiaris: Remove mw-wikifunctions-ingress RRs [dns] - 10https://gerrit.wikimedia.org/r/1134282 [09:05:25] (03CR) 10Alexandros Kosiaris: [C:03+2] Remove mw-wikifunctions-ingress RRs [dns] - 10https://gerrit.wikimedia.org/r/1134282 (owner: 10Alexandros Kosiaris) [09:05:30] (03CR) 10Alexandros Kosiaris: [C:03+2] mw-wikifunctions: Remove the temporary -ingress DNS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134987 (owner: 10Alexandros Kosiaris) [09:05:46] !log akosiaris@dns1004 START - running authdns-update [09:07:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [09:08:11] !log akosiaris@dns1004 END - running authdns-update [09:11:08] (03Merged) 10jenkins-bot: mw-wikifunctions: Remove the temporary -ingress DNS [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134987 (owner: 10Alexandros Kosiaris) [09:24:20] 06SRE, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720641 (10Superpes15) [09:25:45] (03PS1) 10Jelto: trafficserver: switch all querybuilder backends to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1134988 (https://phabricator.wikimedia.org/T350793) [09:27:07] (03CR) 10Jelto: [V:03+1] "querybuilder in query-scholarly is working fine from wikikube:" [puppet] - 
10https://gerrit.wikimedia.org/r/1134988 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [09:28:46] (03PS1) 10Ozge: ml-services: update edit-check image with pydantic. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134989 [09:29:33] FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:30:45] (03CR) 10AikoChou: [C:03+1] ml-services: update edit-check image with pydantic. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134989 (owner: 10Ozge) [09:31:45] (03CR) 10Ozge: "ml-services: update edit-check image with pydantic." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134989 (owner: 10Ozge) [09:31:53] (03PS2) 10Ozge: ml-services: update edit-check image with pydantic. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134989 [09:32:15] (03CR) 10Ozge: [V:03+2 C:03+2] ml-services: update edit-check image with pydantic. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134989 (owner: 10Ozge) [09:33:41] (03Merged) 10jenkins-bot: ml-services: update edit-check image with pydantic. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1134989 (owner: 10Ozge) [09:33:58] (03CR) 10Jelto: [C:03+2] Ceph: add types for S3 credential and account [puppet] - 10https://gerrit.wikimedia.org/r/1133916 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [09:37:13] FIRING: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:40:37] (03PS3) 10Brouberol: airflow: set saner performance-related configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134985 (https://phabricator.wikimedia.org/T390945) [09:42:13] RESOLVED: [2x] ProbeDown: Service ml-staging-ctrl2001:6443 has failed probes (http_ml_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#ml-staging-ctrl2001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:42:47] !log ozge@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [09:52:00] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface cr2-eqiad:xe-3/1/7 (Core: pfw1-eqiad:xe-7/2/0 {#4027}) - https://phabricator.wikimedia.org/T390869#10720716 (10cmooney) 05Open→03Resolved a:03cmooney Gonna close this one, seems we had a burst of errors when we had the problem last week and... [09:57:17] !log klausman@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1000) [10:00:42] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:04:03] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] trafficserver: switch all querybuilder backends to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1134988 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [10:05:39] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10720747 (10cmooney) It seems the work yesterday has not stopped the carrier transitions reported, although the number has decreased: {F59013584 wid... [10:06:05] (03PS1) 10Klausman: ml-services/experimental: clean up a few GPU-using services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134991 [10:06:50] (03CR) 10Ilias Sarantopoulos: [C:03+1] "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1134991 (owner: 10Klausman) [10:14:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134691 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [10:15:04] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720765 (10Jelto) [10:16:05] (03CR) 10MVernon: [C:03+2] Thanos: add new thanos-fe200[5-7] nodes [puppet] - 10https://gerrit.wikimedia.org/r/1134221 (https://phabricator.wikimedia.org/T389634) (owner: 10MVernon) [10:18:09] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132674 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [10:18:55] (03PS1) 10Ladsgroup: Bump thumbnail steps to 75% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134999 (https://phabricator.wikimedia.org/T360589) [10:18:56] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720807 (10Jelto) It looks like bouncing started today at 01:00 https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&from=now-24h&to=now&viewPanel=2 I'll chec... 
[10:21:29] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720816 (10LSobanski) Fixed time dashboard for reference: https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&viewPanel=2&from=1744089600000&to=1744107600000 [10:21:59] (03CR) 10Clément Goubert: [C:03+2] mw::periodic_jobs: Migrate deleteOldSurveys [puppet] - 10https://gerrit.wikimedia.org/r/1132674 (https://phabricator.wikimedia.org/T385782) (owner: 10Clément Goubert) [10:23:49] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1167.eqiad.wmnet with reason: Maintenance [10:24:06] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [10:24:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1167 (T391056)', diff saved to https://phabricator.wikimedia.org/P74698 and previous config saved to /var/cache/conftool/dbconfig/20250408-102412-fceratto.json [10:24:16] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [10:24:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10720821 (10phaultfinder) [10:24:40] (03CR) 10Cathal Mooney: [C:03+1] "LGTM! 
A veritable lesson in well-writted code tbh :) Perhaps get Luca to give a once over on the Python side but all looks good to me ni" [software/homer] - 10https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [10:25:58] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-fe2005.codfw.wmnet [10:26:01] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720826 (10Jelto) `mailman3.service` prints a lot of Python stacktraces starting Apr 07 09:06 UTC ` Apr 07 09:06:41 lists1004 mailman3[2696297]: Apr 07 09:06:41 202... [10:26:12] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, and 2 others: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10720827 (10ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: reboot before bringing into service [10:26:22] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720828 (10Superpes15) >>! In T391330#10720807, @Jelto wrote: > It looks like bouncing started today at 01:00 https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgI... 
[10:30:29] jouncebot: nowandnext [10:30:29] For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1000) [10:30:29] In 1 hour(s) and 29 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1200) [10:31:50] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2005.codfw.wmnet [10:32:06] (03PS1) 10Brouberol: airflow: scrape additional metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) [10:32:10] 10ops-eqiad, 06SRE, 06DC-Ops: fasw2-c1[a|b]-eqiad:ge-0/0/27 flapping while admin down - https://phabricator.wikimedia.org/T391257#10720840 (10cmooney) >>! In T391257#10718036, @VRiley-WMF wrote: > It looks like pay-1b1001 is currently connected to these ports. Would you like us to remove the SFPs? I believe... [10:32:12] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-fe2006.codfw.wmnet [10:32:35] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10720844 (10ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: reboot before bringing into service [10:33:11] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [10:33:28] !log restart mailman3.service on lists1004 - T391330 [10:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:30] T391330: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330 [10:33:59] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720846 (10Jelto) I restarted `mailman3.service` on `lists1004` because the service stopped logging any activity right before 
bouncing increased (Apr 07 23:56:28).... [10:34:25] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [10:36:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T391056)', diff saved to https://phabricator.wikimedia.org/P74699 and previous config saved to /var/cache/conftool/dbconfig/20250408-103604-fceratto.json [10:36:07] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [10:36:33] (03PS3) 10Hnowlan: wmnet: remove jobrunner and videoscaler records [dns] - 10https://gerrit.wikimedia.org/r/1133931 (https://phabricator.wikimedia.org/T354791) [10:38:16] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2006.codfw.wmnet [10:38:20] (03CR) 10Effie Mouzeli: [C:03+1] wmnet: remove jobrunner and videoscaler records [dns] - 10https://gerrit.wikimedia.org/r/1133931 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [10:40:42] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720855 (10Jelto) The metrics are back to baseline. So from the system level this issue looks resolved. I'm lacking a bit of mailman knowledge to verify it processe... 
[10:41:27] (03PS1) 10Clément Goubert: kubernetes_periodic_job: Lowercase job name [puppet] - 10https://gerrit.wikimedia.org/r/1135002 (https://phabricator.wikimedia.org/T341555) [10:41:36] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host thanos-fe2007.codfw.wmnet [10:41:58] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10720858 (10ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: reboot before bringing into service [10:42:01] (03CR) 10Klausman: [C:03+2] ml-services/experimental: clean up a few GPU-using services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134991 (owner: 10Klausman) [10:42:13] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135002 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [10:43:29] (03Merged) 10jenkins-bot: ml-services/experimental: clean up a few GPU-using services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134991 (owner: 10Klausman) [10:44:05] (03CR) 10Cathal Mooney: [C:03+1] "In general looks ok. 
I'm not 100% sure what the switch side should look like to support this, or if its necessarily the way we want to do" [puppet] - 10https://gerrit.wikimedia.org/r/1134700 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [10:44:29] (03CR) 10Hnowlan: [C:03+2] wmnet: remove jobrunner and videoscaler records [dns] - 10https://gerrit.wikimedia.org/r/1133931 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [10:44:51] (03CR) 10Cathal Mooney: [C:03+1] "I'm broadly ok with this approach but we may need to review the resulting Bird config and amend this role to support something different i" [puppet] - 10https://gerrit.wikimedia.org/r/1134699 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [10:45:46] !log hnowlan@dns1004 START - running authdns-update [10:46:33] (03CR) 10Majavah: [V:03+1 C:03+2] P:bird: Allow enabling IPv6 without enabling all services on it [puppet] - 10https://gerrit.wikimedia.org/r/1134699 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [10:47:46] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2007.codfw.wmnet [10:48:22] !log hnowlan@dns1004 END - running authdns-update [10:48:29] (03CR) 10Clément Goubert: [C:03+1] service: remove videoscaler, jobrunner monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1133934 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [10:49:24] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10720870 (10Superpes15) >>! In T391330#10720854, @Jelto wrote: > I'm lacking a bit of mailman knowledge to verify it processes fresh mails. @Superpes15 you mentioned... 
[10:49:25] FIRING: SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:49:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10720871 (10phaultfinder) [10:49:38] (03CR) 10Btullis: [C:03+1] airflow: scrape additional metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135001 (https://phabricator.wikimedia.org/T391332) (owner: 10Brouberol) [10:50:18] (03CR) 10Hnowlan: [C:03+1] kubernetes_periodic_job: Lowercase job name [puppet] - 10https://gerrit.wikimedia.org/r/1135002 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [10:50:44] (03CR) 10Btullis: [C:03+1] airflow: set saner performance-related configs (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134985 (https://phabricator.wikimedia.org/T390945) (owner: 10Brouberol) [10:51:10] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe [10:51:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P74700 and previous config saved to /var/cache/conftool/dbconfig/20250408-105111-fceratto.json [10:51:56] (03CR) 10Clément Goubert: [C:03+2] kubernetes_periodic_job: Lowercase job name [puppet] - 10https://gerrit.wikimedia.org/r/1135002 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [10:52:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10720873 (10BTullis) 05Open→03Resolved Thanks @Jclark-ctr - This host is back in the cluster now. 
[10:54:07] (03CR) 10Hnowlan: [C:03+2] service: remove videoscaler, jobrunner monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1133934 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [10:55:17] jouncebot: nowandnext [10:55:17] For the next 0 hour(s) and 4 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1000) [10:55:17] In 1 hour(s) and 4 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1200) [10:56:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134999 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:56:44] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe [10:56:59] (03Merged) 10jenkins-bot: Bump thumbnail steps to 75% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134999 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:57:22] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1134999|Bump thumbnail steps to 75% (T360589)]] [10:57:25] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:59:44] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.upgrade restarting P{lvs3008.esams.wmnet} and A:liberica [11:00:19] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.upgrade (exit_code=0) restarting P{lvs3008.esams.wmnet} and A:liberica [11:00:47] !log pool thanos-fe200[5-7] T389634 [11:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:49] T389634: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634 [11:00:57] !log mvernon@cumin2002 conftool action : set/weight=100; selector: name=thanos-fe2005.codfw.wmnet 
[11:01:10] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: name=thanos-fe2005.codfw.wmnet [11:01:32] !log mvernon@cumin2002 conftool action : set/weight=100; selector: name=thanos-fe2006.codfw.wmnet [11:01:39] !log mvernon@cumin2002 conftool action : set/weight=100; selector: name=thanos-fe2007.codfw.wmnet [11:01:48] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: name=thanos-fe2006.codfw.wmnet [11:01:54] !log mvernon@cumin2002 conftool action : set/pooled=yes; selector: name=thanos-fe2007.codfw.wmnet [11:02:09] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [11:02:14] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [11:04:36] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1134999|Bump thumbnail steps to 75% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:04:39] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:05:15] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10720903 (10cmooney) >>! In T387145#10713076, @Vgutierrez wrote: > reimaging them is fine by me Ok cool. So what we should do is run the 'decom' workflow against the existing servers, b... [11:06:10] (03PS1) 10Kamila Součková: alertmanager: route T&S tasks to their Slack [puppet] - 10https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T385782) [11:06:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P74701 and previous config saved to /var/cache/conftool/dbconfig/20250408-110618-fceratto.json [11:06:48] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [11:10:39] (03CR) 10Cathal Mooney: [C:03+1] "Looks good to me... 
I gather this is just an interim patch and we'll apply the other one on top of it to add the functionality for multi-d" [software/homer] - 10https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [11:11:04] (03CR) 10Kamila Součková: "We will send failed job alerts as part of the migration to k8s. Let me know if you'd prefer a separate receiver that creates Phab tasks ra" [puppet] - 10https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [11:11:10] (03PS1) 10Hnowlan: jobrunner, videoscaler: remove from lvs, backends [puppet] - 10https://gerrit.wikimedia.org/r/1135008 (https://phabricator.wikimedia.org/T354791) [11:12:13] (03CR) 10Cathal Mooney: [C:03+1] "Seems to make sense thanks." [software/homer] - 10https://gerrit.wikimedia.org/r/1134714 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [11:13:33] (03CR) 10Cathal Mooney: [C:03+1] "Makes sense!" [software/homer] - 10https://gerrit.wikimedia.org/r/1134713 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [11:13:58] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134999|Bump thumbnail steps to 75% (T360589)]] (duration: 16m 35s) [11:14:00] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [11:14:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10720933 (10phaultfinder) [11:17:25] (03CR) 10Clément Goubert: [C:04-1] "Indentation issue" [puppet] - 10https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [11:17:33] (03CR) 10Cathal Mooney: [C:03+2] Add prepend-as-out variable for each site always [homer/public] - 10https://gerrit.wikimedia.org/r/1130095 (https://phabricator.wikimedia.org/T389606) (owner: 10Cathal Mooney) [11:18:13] (03Merged) 10jenkins-bot: Add 
prepend-as-out variable for each site always [homer/public] - 10https://gerrit.wikimedia.org/r/1130095 (https://phabricator.wikimedia.org/T389606) (owner: 10Cathal Mooney) [11:19:40] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10720963 (10phaultfinder) [11:21:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T391056)', diff saved to https://phabricator.wikimedia.org/P74702 and previous config saved to /var/cache/conftool/dbconfig/20250408-112124-fceratto.json [11:21:28] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:21:40] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1171.eqiad.wmnet with reason: Maintenance [11:23:08] (03CR) 10Tacsipacsi: "I’d rather fix the portals to send lower-case family name. If they send _Wiktionary_, they’re likely to send _Wikisłownik_ or _维基词典_ depen" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: 10Wargo) [11:24:30] (03PS1) 10Peter Fischer: CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135010 (https://phabricator.wikimedia.org/T389053) [11:24:35] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10720976 (10phaultfinder) [11:25:16] (03CR) 10CI reject: [V:04-1] CirrusSearch: weighted tags mapping (during maintenance inflicted reindexing) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135010 (https://phabricator.wikimedia.org/T389053) (owner: 10Peter Fischer) [11:26:19] (03CR) 10Tacsipacsi: "(“Portals” is [wikimedia/portals](https://gerrit.wikimedia.org/r/q/project:wikimedia/portals).)" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: 10Wargo) [11:26:43] (03CR) 10Kamila Součková: [C:03+1] jobrunner, videoscaler: remove from lvs, backends [puppet] - 10https://gerrit.wikimedia.org/r/1135008 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [11:27:19] 10ops-codfw, 06DC-Ops, 06serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341 (10Clement_Goubert) 03NEW [11:27:54] !log cgoubert@cumin1002 START - Cookbook sre.k8s.pool-depool-node depool for host wikikube-worker2142.codfw.wmnet [11:28:05] 10ops-codfw, 06DC-Ops, 06serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10720990 (10ops-monitoring-bot) depool host wikikube-worker2142.codfw.wmnet by cgoubert@cumin1002 with reason: Hardware failure [11:28:56] 10ops-codfw, 06DC-Ops, 06serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10721004 (10Clement_Goubert) [11:29:20] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10721007 (10Clement_Goubert) a:03Papaul [11:29:38] (03PS2) 10Kamila Součková: alertmanager: route T&S tasks to their Slack [puppet] - 10https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T385782) [11:30:12] (03CR) 10Kamila Součková: alertmanager: route T&S tasks to their Slack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [11:30:18] (03CR) 10Clément Goubert: [C:03+1] alertmanager: route T&S tasks to their Slack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [11:30:32] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.k8s.pool-depool-node (exit_code=99) depool for host 
wikikube-worker2142.codfw.wmnet [11:31:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1172.eqiad.wmnet with reason: Maintenance [11:31:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1172 (T391056)', diff saved to https://phabricator.wikimedia.org/P74703 and previous config saved to /var/cache/conftool/dbconfig/20250408-113154-fceratto.json [11:31:58] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:33:04] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: hw troubleshooting: hard down for wikikube-worker2142 - https://phabricator.wikimedia.org/T391341#10721015 (10Clement_Goubert) Host drained forcefully and depooled. [11:37:13] FIRING: ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:40] jouncebot: now [11:37:40] No deployments scheduled for the next 0 hour(s) and 22 minute(s) [11:37:43] jouncebot: next [11:37:43] In 0 hour(s) and 22 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1200) [11:37:58] FIRING: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:38:44] FIRING: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable 
[11:39:25] !incidents [11:39:26] 6024 (ACKED) [2x] ProbeDown sre (upload-https:443 probes/service eqsin) [11:39:26] 6025 (ACKED) HaproxyUnavailable cache_upload global sre (thanos-rule) [11:39:45] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [11:39:59] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [11:41:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:43:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T391056)', diff saved to https://phabricator.wikimedia.org/P74704 and previous config saved to /var/cache/conftool/dbconfig/20250408-114338-fceratto.json [11:43:41] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:45:42] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:46:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://maps.wikimedia.org - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqsin - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [11:47:13] RESOLVED: [2x] ProbeDown: Service 
upload-https:443 has failed probes (http_upload-https_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:47:58] RESOLVED: [2x] ProbeDown: Service upload-https:443 has failed probes (http_upload-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#upload-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:48:44] RESOLVED: HaproxyUnavailable: HAProxy (cache_upload) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable [11:56:50] (03PS3) 10Clément Goubert: alertmanager: route T&S tasks to their Slack [puppet] - 10https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [11:57:35] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Wikimedia-Incident: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10721144 (10Peachey88) [11:58:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P74705 and previous config saved to /var/cache/conftool/dbconfig/20250408-115845-fceratto.json [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1200) [12:02:14] (03CR) 10Brouberol: [C:03+2] airflow: set saner performance-related configs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134985 (https://phabricator.wikimedia.org/T390945) (owner: 10Brouberol) [12:07:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:07:38] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [12:10:37] (03PS2) 10Stang: Add main page on non-English privatewiki to wgWhitelistRead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/850266 (https://phabricator.wikimedia.org/T321796) [12:12:06] 06SRE, 06SRE-OnFire, 13Patch-Needs-Improvement: klaxon CLI tool for seeding an oncall handoff - https://phabricator.wikimedia.org/T317159#10721183 (10Aklapper) [12:12:55] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [12:13:35] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [12:13:47] !log akosiaris@deploy1003 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [12:13:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P74706 and previous config saved to /var/cache/conftool/dbconfig/20250408-121352-fceratto.json [12:14:24] !log akosiaris@deploy1003 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [12:15:13] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Adapt profile::nginx to new packaging scheme introduced in Bookworm - https://phabricator.wikimedia.org/T329529#10721205 (10Aklapper) https://gerrit.wikimedia.org/r/c/operations/puppet/+/993068 is the only linked open patch left here. 
[12:17:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:20:44] (03CR) 10Tiziano Fogli: [C:03+2] snmp-exporter: adding pro4x module (pdu) [puppet] - 10https://gerrit.wikimedia.org/r/1123619 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [12:23:11] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: Announce OpenStack API over v6 from cloudlb2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1134700 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [12:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:28:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T391056)', diff saved to https://phabricator.wikimedia.org/P74707 and previous config saved to /var/cache/conftool/dbconfig/20250408-122859-fceratto.json [12:29:02] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [12:29:14] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1177.eqiad.wmnet with reason: Maintenance [12:29:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1177 (T391056)', diff saved to https://phabricator.wikimedia.org/P74708 and previous config saved to /var/cache/conftool/dbconfig/20250408-122919-fceratto.json [12:29:58] (03CR) 10Alexandros Kosiaris: [C:03+1] scap: Use PHP 8.1 when executing maintenance scripts [puppet] - 10https://gerrit.wikimedia.org/r/1134758 (https://phabricator.wikimedia.org/T390225) (owner: 
10Scott French) [12:33:04] (03PS1) 10Majavah: bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - 10https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) [12:33:38] (03PS1) 10Peter Fischer: Search update pipeline: 504 handling, weighted tags rename [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135019 (https://phabricator.wikimedia.org/T389053) [12:35:04] !log started the rollout of xz-utils' security upgrades (gradual during the next days) [12:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:48] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5228/co" [puppet] - 10https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [12:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [12:40:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T391056)', diff saved to https://phabricator.wikimedia.org/P74709 and previous config saved to /var/cache/conftool/dbconfig/20250408-124042-fceratto.json [12:40:45] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [12:47:13] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:48:13] (03PS1) 10Effie Mouzeli: logging: add support for php 8.1 [puppet] - 
10https://gerrit.wikimedia.org/r/1135020 [12:49:42] (03PS1) 10Effie Mouzeli: switch mwdebug2002 to php8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1135021 [12:50:30] (03CR) 10CI reject: [V:04-1] logging: add support for php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1135020 (owner: 10Effie Mouzeli) [12:50:38] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1048.eqiad.wmnet [12:50:48] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2048.codfw.wmnet [12:52:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [12:52:32] (03CR) 10Jelto: [C:04-1] "one comment in-line" [puppet] - 10https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [12:55:04] jouncebot: now [12:55:04] For the next 0 hour(s) and 4 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1200) [12:55:08] jouncebot: next [12:55:08] In 0 hour(s) and 4 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1300) [12:55:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P74711 and previous config saved to /var/cache/conftool/dbconfig/20250408-125549-fceratto.json [12:56:22] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1048.eqiad.wmnet [12:57:14] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2048.codfw.wmnet [12:57:50] (03CR) 10Effie Mouzeli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135021 (owner: 10Effie Mouzeli) [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1300). 
[13:00:05] Superpes, seanleong-wmde, abijeet, dcausse, and Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] Hi :) [13:00:15] o/ [13:00:31] o/ [13:00:37] I can deploy today :) [13:00:44] I'm here as well. Lucas_WMDE go ahead! :) [13:00:53] I am here if needed folks [13:00:56] let’s start with the ptwiktionary change [13:01:02] if you see anything that appears to be stuck, ping me [13:01:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134776 (https://phabricator.wikimedia.org/T391299) (owner: 10Superpes15) [13:01:11] ok [13:01:40] there’s a circuit breaker error at the top of logspam-watch that’s about to fall out of the 60min window, looks quiet otherwise [13:01:54] (03Merged) 10jenkins-bot: [ptwiktionary] Create a Wikisaurus namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134776 (https://phabricator.wikimedia.org/T391299) (owner: 10Superpes15) [13:02:16] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1134776|[ptwiktionary] Create a Wikisaurus namespace (T391299)]] [13:02:19] T391299: Add Wikisaurus namespace to Portuguese Wiktionary - https://phabricator.wikimedia.org/T391299 [13:02:21] o/ [13:02:33] Lucas_WMDE Please remember that after deploy NamespaceDupes.php needs to be run [13:02:48] Hi, I'm here as well o/ [13:04:06] * Lucas_WMDE idly wonders if mwscript-k8s is considered stable enough to warrant updating https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers#namespaceDupes by now [13:04:16] I’ll give it a shot later [13:04:56] (03CR) 10Alexandros Kosiaris: "❤️" [puppet] - 10https://gerrit.wikimedia.org/r/1127150 (https://phabricator.wikimedia.org/T385995) (owner: 10JHathaway) [13:06:19] (03PS2) 10Jelto: ceph: add gitlab 
dummy credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) [13:08:29] !log TEST maintenance s1 eqiad dbmaint T391346 [13:08:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:32] T391346: Database maintenance map not working - https://phabricator.wikimedia.org/T391346 [13:08:43] (03CR) 10Jelto: "I uploaded a new patchset which uses the new `Ceph::S3::Credential` structure from Id8979165b96d737addc676f3abf3f088a48eda48." [labs/private] - 10https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:09:04] sync-testservers-k8s took 4m22s, that feels unusually slow I think (cc elukey) [13:09:08] but not critical yet [13:09:18] :O [13:09:22] (03CR) 10Elukey: [C:03+1] tox.ini: remove optimization for tox <4 [software/homer] - 10https://gerrit.wikimedia.org/r/1134712 (owner: 10Volans) [13:09:23] we can see how long the full deploy takes [13:09:36] !log lucaswerkmeister-wmde@deploy1003 superpes, lucaswerkmeister-wmde: Backport for [[gerrit:1134776|[ptwiktionary] Create a Wikisaurus namespace (T391299)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:09:38] T391299: Add Wikisaurus namespace to Portuguese Wiktionary - https://phabricator.wikimedia.org/T391299 [13:09:42] Superpes: please test :) [13:10:24] Looks fine! Thanks Lucas_WMDE [13:10:30] !log lucaswerkmeister-wmde@deploy1003 superpes, lucaswerkmeister-wmde: Continuing with sync [13:10:32] ok, thanks! 
[13:10:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P74712 and previous config saved to /var/cache/conftool/dbconfig/20250408-131056-fceratto.json [13:11:16] ok sync-canaries-k8s only took 34s, so that was fine [13:12:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [13:12:07] (03PS11) 10Tiziano Fogli: netbox-hiera: adding pdu type [puppet] - 10https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) [13:12:07] (03PS41) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [13:12:07] (03PS1) 10Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231) [13:12:39] (03CR) 10CI reject: [V:04-1] pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [13:12:55] (03CR) 10CI reject: [V:04-1] pdu_config_netbox: also fetch older PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [13:13:13] (03CR) 10Elukey: [C:03+1] capirca: optimization refactor [software/homer] - 10https://gerrit.wikimedia.org/r/1134713 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [13:13:24] (03CR) 10MVernon: [C:03+1] "LGTM, thanks! I added a suggested comment." 
[labs/private] - 10https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:13:28] I think once this change is done (and I’ve run namespaceDupes), we could probably deploy the changes by seanleong-wmde, dcausse and myself all together [13:13:30] they look harmless enough [13:13:43] Okie [13:13:44] (03CR) 10MVernon: [C:03+1] "Done" [labs/private] - 10https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:13:48] yes mine is a noop [13:14:17] (03CR) 10CI reject: [V:04-1] netbox-hiera: adding pdu type [puppet] - 10https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [13:14:24] (03PS3) 10Jelto: ceph: add gitlab dummy credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) [13:14:28] 06SRE, 10Dumps-Generation, 10Wikidata: various weekly and daily dumps run from systemd timers are broken - https://phabricator.wikimedia.org/T281267#10721388 (10fgiunchedi) I'm untagging o11y for now, please reach out as needed [13:14:30] (03CR) 10Jelto: ceph: add gitlab dummy credentials (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:14:30] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Can confirm that the code is gone from wmf.23+:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134064 (https://phabricator.wikimedia.org/T389429) (owner: 10Ebernhardson) [13:15:15] (03CR) 10Elukey: [C:03+1] homer: move NetboxData initialization [software/homer] - 10https://gerrit.wikimedia.org/r/1134714 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [13:17:41] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134776|[ptwiktionary] Create a Wikisaurus namespace (T391299)]] (duration: 15m 24s) [13:17:44] T391299: Add Wikisaurus namespace to Portuguese 
Wiktionary - https://phabricator.wikimedia.org/T391299 [13:18:00] woof, that’s a lot of links to fix [13:18:14] but no issues apparently [13:18:53] !log lucaswerkmeister-wmde@deploy1003 ~ $ mwscript-k8s --comment=T391299 --follow -- namespaceDupes ptwiktionary --fix [13:18:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:59] (03CR) 10MVernon: [C:03+1] "I feel gerrit shouldn't remove the +1 when you apply my suggestion, but there we are :-)" [labs/private] - 10https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:19:35] sync-prod-k8s finished in 5m58s btw, which feels like a normal duration [13:19:39] (03PS2) 10Majavah: bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - 10https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) [13:19:39] (03PS1) 10Majavah: P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - 10https://gerrit.wikimedia.org/r/1135023 [13:19:41] (03CR) 10Bking: [C:03+2] cirrus: disable completion indices in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1134969 (https://phabricator.wikimedia.org/T388610) (owner: 10DCausse) [13:19:42] Yep They used the prefix without having a namespace lmao [13:19:59] (03PS2) 10Majavah: P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - 10https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) [13:20:01] (03PS3) 10Majavah: bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - 10https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) [13:21:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133317 (https://phabricator.wikimedia.org/T384455) (owner: 10Seanleong-wmde) [13:21:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by 
lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134064 (https://phabricator.wikimedia.org/T389429) (owner: 10Ebernhardson) [13:21:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134691 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [13:21:28] Thanks for your assistance Lucas_WMDE :3 [13:21:34] np :) [13:22:17] (03Merged) 10jenkins-bot: Increase entityAccessLimit from 400 to 500 for all wikis except commons. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133317 (https://phabricator.wikimedia.org/T384455) (owner: 10Seanleong-wmde) [13:22:21] (03Merged) 10jenkins-bot: Remove unused config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134064 (https://phabricator.wikimedia.org/T389429) (owner: 10Ebernhardson) [13:22:24] (03Merged) 10jenkins-bot: Fix EntitySchema propertyType on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134691 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [13:22:34] (03CR) 10CI reject: [V:04-1] P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - 10https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [13:22:49] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1133317|Increase entityAccessLimit from 400 to 500 for all wikis except commons. 
(T384455)]], [[gerrit:1134064|Remove unused config vars (T389429)]], [[gerrit:1134691|Fix EntitySchema propertyType on Test Wikidata (T371196)]] [13:22:54] (03CR) 10CI reject: [V:04-1] P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - 10https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [13:22:55] T384455: Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455 [13:22:55] T389429: Investigate whether it’s intentional / correct that default CirrusSearch setups run cirrusSearchElasticaWrite as separate jobs - https://phabricator.wikimedia.org/T389429 [13:22:55] T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196 [13:24:22] (03PS3) 10Majavah: P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - 10https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) [13:24:22] (03PS4) 10Majavah: bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - 10https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) [13:25:11] (03CR) 10Bking: "We're OK with temporarily adding these flags. 
We should review after the maintenance...which reminds me, I need to start a task for undoin" [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T383811) (owner: 10Bking) [13:25:59] (03PS1) 10Stevemunene: zookeeper: onboard an-conf1004 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135025 (https://phabricator.wikimedia.org/T374922) [13:26:01] (03PS1) 10Stevemunene: zookeeper: onboard an-conf1005 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135026 (https://phabricator.wikimedia.org/T374922) [13:26:02] (03PS1) 10Stevemunene: zookeeper: onboard an-conf1006 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135027 (https://phabricator.wikimedia.org/T374922) [13:26:04] (03PS1) 10Stevemunene: zookeeper: remove an-conf100[1-3] from the cluster [puppet] - 10https://gerrit.wikimedia.org/r/1135028 (https://phabricator.wikimedia.org/T374922) [13:26:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T391056)', diff saved to https://phabricator.wikimedia.org/P74714 and previous config saved to /var/cache/conftool/dbconfig/20250408-132603-fceratto.json [13:26:07] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [13:26:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1178.eqiad.wmnet with reason: Maintenance [13:26:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1178 (T391056)', diff saved to https://phabricator.wikimedia.org/P74715 and previous config saved to /var/cache/conftool/dbconfig/20250408-132626-fceratto.json [13:26:38] (03CR) 10CI reject: [V:04-1] P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - 10https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [13:26:52] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1 NOOP 2 CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compile" [puppet] - 10https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [13:29:30] sync-testservers-k8s feels fairly slow again o_O [13:29:33] FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:29:34] yeah, just finished after 4m23s [13:30:03] (03PS3) 10AOkoth: site: revert releases2003 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) [13:30:16] (03CR) 10AOkoth: site: revert releases2003 to insetup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [13:30:20] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, ebernhardson, seanleong-wmde: Backport for [[gerrit:1133317|Increase entityAccessLimit from 400 to 500 for all wikis except commons. 
(T384455)]], [[gerrit:1134064|Remove unused config vars (T389429)]], [[gerrit:1134691|Fix EntitySchema propertyType on Test Wikidata (T371196)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:30:25] T384455: Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455 [13:30:26] T389429: Investigate whether it’s intentional / correct that default CirrusSearch setups run cirrusSearchElasticaWrite as separate jobs - https://phabricator.wikimedia.org/T389429 [13:30:26] T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196 [13:30:26] (03PS4) 10Majavah: P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - 10https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) [13:30:26] (03PS5) 10Majavah: bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - 10https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) [13:30:36] my test is working as expected on testwikidatawiki [13:30:52] and I realized I can’t 100% test it on wikidatawiki because the code hasn’t rolled out there [13:30:52] Lucas_WMDE: I can't test mine [13:31:04] mine is working correctly [13:31:10] it’s supposed to have no difference, but at the moment I can’t be sure if it has no difference because the config is doing the right thing or because the code isn’t there yet [13:31:15] but I’ll just hope that it’s fine [13:31:17] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, ebernhardson, seanleong-wmde: Continuing with sync [13:32:45] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 2 DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compile" [puppet] - 10https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [13:34:35] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device 
ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10721500 (10phaultfinder) [13:34:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10721501 (10phaultfinder) [13:35:24] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135029 [13:36:27] (03PS42) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [13:36:27] (03PS2) 10Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231) [13:36:53] (03CR) 10CI reject: [V:04-1] pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [13:37:09] (03CR) 10Btullis: [C:03+1] elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T383811) (owner: 10Bking) [13:37:44] (03CR) 10Cathal Mooney: [C:03+1] bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - 10https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [13:37:56] (03CR) 10Arnaudb: [C:03+1] ceph: add gitlab dummy credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:38:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T391056)', diff saved to https://phabricator.wikimedia.org/P74716 and previous config saved to /var/cache/conftool/dbconfig/20250408-133814-fceratto.json [13:38:18] T391056: Drop afl_patrolled_by from abuse_filter_log in 
production - https://phabricator.wikimedia.org/T391056 [13:38:19] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133317|Increase entityAccessLimit from 400 to 500 for all wikis except commons. (T384455)]], [[gerrit:1134064|Remove unused config vars (T389429)]], [[gerrit:1134691|Fix EntitySchema propertyType on Test Wikidata (T371196)]] (duration: 15m 30s) [13:38:24] T384455: Increase entityAccessLimit for WikibaseClient wikis - https://phabricator.wikimedia.org/T384455 [13:38:24] T389429: Investigate whether it’s intentional / correct that default CirrusSearch setups run cirrusSearchElasticaWrite as separate jobs - https://phabricator.wikimedia.org/T389429 [13:38:25] T371196: The EntitySchema type URI is missing from the Wikibase ontology - https://phabricator.wikimedia.org/T371196 [13:38:31] right, time for abijeet :) [13:38:34] \o/ [13:38:39] Lucas_WMDE: thanks! [13:38:40] can the backports for the two branches be deployed at the same time? [13:38:53] (03CR) 10CI reject: [V:04-1] pdu_config_netbox: also fetch older PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [13:38:57] dcausse: np :) [13:38:58] Lucas_WMDE Thanks! 
[13:39:10] Lucas_WMDE, sounds good [13:39:20] Lucas_WMDE, we can deploy both at the same time, sure [13:39:22] * lucaswerkmeister is also amused by the course T389429 has taken ;) [13:39:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10721515 (10phaultfinder) [13:39:44] ok [13:39:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134976 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [13:39:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1134977 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [13:39:56] I should’ve remembered to +2 them in advance, meh [13:40:13] (03CR) 10Cathal Mooney: [C:03+1] "Looks good, should sort out the source IP anyway." 
[puppet] - 10https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [13:40:29] (03CR) 10Ssingh: "How does this tie in to:" [puppet] - 10https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [13:41:24] (03CR) 10Majavah: [V:03+1 C:03+2] P:wmcs::cloud_private_subnet: Set correct v6 BGP local address [puppet] - 10https://gerrit.wikimedia.org/r/1135023 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [13:41:48] (03CR) 10Elukey: [C:03+2] services: use the kafka svc endpoint for Tegola [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133142 (https://phabricator.wikimedia.org/T373115) (owner: 10Elukey) [13:41:48] (03Merged) 10jenkins-bot: ArticleFooterEntrypointCard: Change the way codex is loaded [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134976 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [13:41:51] (03Merged) 10jenkins-bot: ArticleFooterEntrypointCard: Change the way codex is loaded [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1134977 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [13:41:56] Lucas_WMDE: CI is super fast now :) [13:42:07] (03CR) 10Volans: [C:03+2] tox.ini: remove optimization for tox <4 [software/homer] - 10https://gerrit.wikimedia.org/r/1134712 (owner: 10Volans) [13:42:08] nice! 
[13:42:15] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1134976|ArticleFooterEntrypointCard: Change the way codex is loaded (T389176)]], [[gerrit:1134977|ArticleFooterEntrypointCard: Change the way codex is loaded (T389176)]] [13:42:18] T389176: Re-enable footer entry point to MinT for Wiki Readers - https://phabricator.wikimedia.org/T389176 [13:44:12] (03CR) 10Majavah: "My understanding is that configuration ensures that when the system boots up, `anycast-healthchecker.service` is started before `bird.serv" [puppet] - 10https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [13:45:18] (03CR) 10Ssingh: [C:03+1] jobrunner, videoscaler: remove from lvs, backends [puppet] - 10https://gerrit.wikimedia.org/r/1135008 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [13:45:25] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10721540 (10Jhancock.wm) @elukey all good! yesterday was rack unpacking day and i did almost nothing else =# i replaced a random drive... 
[13:45:43] !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on releases2003.codfw.wmnet with reason: Bookworm Re-image [13:47:02] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1134740/5233/" [puppet] - 10https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [13:48:24] (03PS1) 10Stevemunene: hdfs: replace an-conf100[1-3] with an-conf100[4-6] [puppet] - 10https://gerrit.wikimedia.org/r/1135031 (https://phabricator.wikimedia.org/T374922) [13:49:12] !log lucaswerkmeister-wmde@deploy1003 abi, lucaswerkmeister-wmde: Backport for [[gerrit:1134976|ArticleFooterEntrypointCard: Change the way codex is loaded (T389176)]], [[gerrit:1134977|ArticleFooterEntrypointCard: Change the way codex is loaded (T389176)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:49:15] T389176: Re-enable footer entry point to MinT for Wiki Readers - https://phabricator.wikimedia.org/T389176 [13:49:20] Lucas_WMDE, testing [13:49:51] elukey: I don’t know if it counts as stuck but sync-testservers-k8s took 4m22s, 4m23s and 4m01s, which seems unusually slow [13:50:00] abijeet: thanks! 
[13:50:41] hm, that’s suspiciously close to the “4 minutes” mentioned in T374907 🤔 [13:50:41] T374907: sync-testservers-k8s takes 4 minutes when deploying a mediawiki-config change - https://phabricator.wikimedia.org/T374907 [13:51:06] Lucas_WMDE: that I don't know, but if it didn't block I am happy [13:51:25] ok [13:51:58] (03CR) 10Btullis: cirrussearch: Add regex data for cirrussearch hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1134765 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:53:08] (03CR) 10Btullis: [C:03+1] cirrussearch: Add row A hosts to new cirrussearch role [puppet] - 10https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:53:19] (03Merged) 10jenkins-bot: tox.ini: remove optimization for tox <4 [software/homer] - 10https://gerrit.wikimedia.org/r/1134712 (owner: 10Volans) [13:53:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P74717 and previous config saved to /var/cache/conftool/dbconfig/20250408-135321-fceratto.json [13:54:25] RESOLVED: SystemdUnitFailed: git_pull_charts.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:55:27] (03CR) 10AOkoth: [C:03+2] site: revert releases2003 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [13:56:24] Lucas_WMDE, we can keep this patch, but there is a separate issue :-( bit silly and messy: 1135032: ArticleFooterEntrypointCard: Add @wikimedia/codex as a dependency | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/1135032 [13:56:42] I see… [13:56:59] I was going to ask why that change was made in the first place, I didn’t understand it [13:57:06] (I guess I still don’t 
understand it) [13:57:09] but let’s roll it out then… [13:57:15] !log lucaswerkmeister-wmde@deploy1003 abi, lucaswerkmeister-wmde: Continuing with sync [13:57:33] abijeet: do you have a reviewer? or do you want to try to explain it to me until I’m confident to +2 the change for backporting? :D [13:57:44] well, I suppose we have very little time left in the window, meh [13:57:45] jouncebot: next [13:57:45] In 1 hour(s) and 2 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1500) [13:57:48] ok we have some time afterwards [13:57:49] (03CR) 10Hnowlan: [C:03+2] jobrunner, videoscaler: remove from lvs, backends [puppet] - 10https://gerrit.wikimedia.org/r/1135008 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [13:58:21] I thought codex.js was the recommended way to load codex per https://www.mediawiki.org/wiki/Codex#Loading_a_subset_of_Codex_components_(recommended_for_skins_and_extensions) [13:59:39] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/tegola-vector-tiles: sync [13:59:46] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/tegola-vector-tiles: sync [14:00:19] abijeet: did you try other paths to codex.js? I wonder if it maybe needed to be ./codex.js or ../../codex.js instead of ../codex.js [14:00:52] e.g. https://gerrit.wikimedia.org/g/mediawiki/extensions/Wikibase/+/7a17e84550f9d3adaefa175363774fb98e3ebb80/repo/resources/wikibase.vector.scopedtypeaheadsearch/ScopedTypeaheadSearch.vue#44 has ../../codex.js [14:01:23] (03CR) 10Bking: [C:03+2] elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T383811) (owner: 10Bking) [14:01:31] based on "localBasePath": "minT/entrypoints" and minT/entrypoints/ArticleFooterEntrypointCard.vue, I would suspect you need ./codex.js [14:01:35] rather than ..
[14:01:47] since it should end up in the same directory (minT/entrypoints/) [14:01:48] Lucas_WMDE, the rest of the codebase uses require( '@wikimedia/codex' ); - https://gerrit.wikimedia.org/g/mediawiki/extensions/ContentTranslation/+/0fda23770042887d2530018092844de2ee5b6913/minT/src/ConfirmTopicPage.vue#135 [14:02:08] !log aokoth@cumin1002 START - Cookbook sre.hosts.reimage for host releases2003.codfw.wmnet with OS bookworm [14:02:37] 06SRE, 10SRE-swift-storage: Q4 Thanos hardware refresh - https://phabricator.wikimedia.org/T391352 (10MatthewVernon) 03NEW [14:03:21] It just made sense to use the same approach in this file as the rest of the extension. [14:04:39] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134976|ArticleFooterEntrypointCard: Change the way codex is loaded (T389176)]], [[gerrit:1134977|ArticleFooterEntrypointCard: Change the way codex is loaded (T389176)]] (duration: 22m 23s) [14:04:41] T389176: Re-enable footer entry point to MinT for Wiki Readers - https://phabricator.wikimedia.org/T389176 [14:04:54] ok… [14:05:10] but then why do several RL modules still use the CodexModule class? [14:06:23] Lucas_WMDE, thanks though. I'll try to get this reviewed and tested. It's not possible to test this locally, hence the back and forth. [14:06:39] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [14:06:49] ok [14:06:56] then I guess we’re done with the window for now?
[14:07:10] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [14:07:24] (03PS1) 10Anzx: madwiktionary: add logo, icon, wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135035 (https://phabricator.wikimedia.org/T391318) [14:07:50] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [14:08:10] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [14:08:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135035 (https://phabricator.wikimedia.org/T391318) (owner: 10Anzx) [14:08:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178', diff saved to https://phabricator.wikimedia.org/P74718 and previous config saved to /var/cache/conftool/dbconfig/20250408-140828-fceratto.json [14:10:36] !log setting jobrunner and videoscaler to service_setup in puppet [14:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:59] some IPVS alerts expected [14:11:26] !log UTC afternoon backport+config window done [14:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:41] 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354 (10MatthewVernon) 03NEW [14:12:48] !log restarting pybal on A:lvs-secondary-eqiad to pick up removal of jobrunner and videoscaler [14:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:43] (03PS2) 10Anzx: arywiki: enable wgMinervaEnableSiteNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135036 [14:14:26] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC late backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135036 (owner: 10Anzx) [14:17:10] (03CR) 10Aklapper: [V:03+2 C:03+2] Penalize on nonsensical large story point values [phabricator/antivandalism] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1134379 (https://phabricator.wikimedia.org/T391204) (owner: 10Aklapper) [14:19:25] !log fnegri@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddumps1001.wikimedia.org with reason: down for maintenance [14:19:29] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10721810 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=591bcb32-8025-4bce-af2c-49d023d1b4ca) set by fnegri@cumin1002 for 1 da... [14:19:35] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10721812 (10phaultfinder) [14:21:34] 06SRE, 10SRE-swift-storage, 10Ceph: Q4 object storage hardware tasks - https://phabricator.wikimedia.org/T391354#10721827 (10Aklapper) [14:22:16] !log restarting pybal on lvs1019 (low-traffic primary) to pick up removal of jobrunner and videoscaler [14:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:44] (03CR) 10Superpes15: [C:03+1] madwiktionary: add logo, icon, wordmark and tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135035 (https://phabricator.wikimedia.org/T391318) (owner: 10Anzx) [14:23:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1178 (T391056)', diff saved to https://phabricator.wikimedia.org/P74720 and previous config saved to /var/cache/conftool/dbconfig/20250408-142335-fceratto.json [14:23:39] T391056: Drop afl_patrolled_by from abuse_filter_log in production - 
https://phabricator.wikimedia.org/T391056 [14:23:41] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1192.eqiad.wmnet with reason: Maintenance [14:23:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1192 (T391056)', diff saved to https://phabricator.wikimedia.org/P74721 and previous config saved to /var/cache/conftool/dbconfig/20250408-142347-fceratto.json [14:24:29] (03CR) 10Scott French: "Thank you both for the review! Ahmon, any concerns about giving this a try today?" [puppet] - 10https://gerrit.wikimedia.org/r/1134758 (https://phabricator.wikimedia.org/T390225) (owner: 10Scott French) [14:26:19] (03PS1) 10Xcollazo: Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) [14:26:48] (03CR) 10CI reject: [V:04-1] Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: 10Xcollazo) [14:27:12] Lucas_WMDE, I changed the code to use './codex.js'; thanks for that recommendation. I think we need to review some of the other RL modules in the extension and how we are using Codex there. 
Patch: 1135032: ArticleFooterEntrypointCard: Fix path to codex.js | https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ContentTranslation/+/1135032 [14:27:36] (03PS1) 10Stevemunene: airflow: cleanup deployment charts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135045 (https://phabricator.wikimedia.org/T391359) [14:28:26] (03PS2) 10Xcollazo: Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) [14:28:27] (03PS1) 10Kamila Součková: Revert^2 "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1135046 [14:28:33] * Lucas_WMDE looks [14:28:44] (03CR) 10CI reject: [V:04-1] Revert^2 "k8s::client: Allow for install of all kubectl versions" [puppet] - 10https://gerrit.wikimedia.org/r/1135046 (owner: 10Kamila Součková) [14:28:48] (03CR) 10CI reject: [V:04-1] Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: 10Xcollazo) [14:28:59] worth a try imho… can it be tested on beta? [14:29:32] (03CR) 10Btullis: Absent systemd timers to stop attempting to generate enterprise HTML dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: 10Xcollazo) [14:31:17] (03CR) 10Jelto: [V:03+2 C:03+2] ceph: add gitlab dummy credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1132643 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [14:31:19] !log restarting pybal on A:lvs-secondary-codfw [14:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:48] Testing on beta would require a config change. I was able to test it locally by changing some code. Did not see any errors in the console. I should not have been lazy and should have done that in the first place.
[14:32:26] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 3 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10721936 (10Jelto) [14:33:07] ok [14:36:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T391056)', diff saved to https://phabricator.wikimedia.org/P74722 and previous config saved to /var/cache/conftool/dbconfig/20250408-143628-fceratto.json [14:36:32] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [14:36:46] !log restarting pybal on A:lvs-low-traffic-codfw to remove jobrunner and videoscaler [14:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:31] (03CR) 10Ssingh: [C:03+1] bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - 10https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:42:16] (03PS1) 10Stevemunene: replace an-conf100[1-3] with an-conf100[4-6] [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135049 (https://phabricator.wikimedia.org/T374922) [14:45:33] (03PS1) 10Vgutierrez: sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) [14:49:06] (03CR) 10Alexandros Kosiaris: [C:03+1] "One minor nitpick, otherwise LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 (owner: 10Elukey) [14:50:00] (03CR) 10Ahmon Dancy: [C:03+1] "Looks reasonable to me. No concerns about giving it a try today." 
[puppet] - 10https://gerrit.wikimedia.org/r/1134758 (https://phabricator.wikimedia.org/T390225) (owner: 10Scott French) [14:51:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P74723 and previous config saved to /var/cache/conftool/dbconfig/20250408-145136-fceratto.json [14:54:13] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host releases2003.codfw.wmnet with OS bookworm [14:54:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3:rack/setup/install elastic1111-elastic1122, relforge1008-1010 - https://phabricator.wikimedia.org/T384966#10722024 (10Gehel) [14:57:44] (03CR) 10Elukey: [C:03+1] "Left a question but everything looks really good, I like the refactoring." [software/homer] - 10https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [14:57:59] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install frban1002 - https://phabricator.wikimedia.org/T369947#10722040 (10Jgreen) a:05Jgreen→03None [14:58:10] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fran1002 - https://phabricator.wikimedia.org/T369940#10722043 (10Jgreen) a:05Jgreen→03None [14:58:31] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install frdb1007 - https://phabricator.wikimedia.org/T369922#10722047 (10Jgreen) a:05Jgreen→03None [14:58:44] (03CR) 10Elukey: [C:03+1] "This is my take as well yes, the callback was split into ask_approval() and print(device_diff), I assume we'll see why in the next set of " [software/homer] - 10https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [14:58:54] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install franio100[1-3] - https://phabricator.wikimedia.org/T367820#10722055 (10Jgreen) a:05Jgreen→03None [14:59:05] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransc1001 - 
https://phabricator.wikimedia.org/T367814#10722059 (10Jgreen) a:05Jgreen→03None [14:59:21] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install fransw1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T367801#10722060 (10Jgreen) a:05Jgreen→03None [15:00:05] jelto, arnoldokoth, and mutante: It is that lovely time of the day again! You are hereby commanded to deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1500). [15:01:48] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: T391357 [15:01:51] T391357: Deploy Phabricator/Phorge 2025-04-08 - https://phabricator.wikimedia.org/T391357 [15:02:06] !log brennen@deploy1003 Started deploy [phabricator/deployment@99aa712]: test deploy phab2002 for T391357 [15:02:08] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab2002.codfw.wmnet with reason: T391357 [15:02:11] (03PS1) 10Clément Goubert: CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) [15:02:25] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [15:02:35] (03CR) 10CI reject: [V:04-1] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [15:02:48] !log brennen@deploy1003 Finished deploy [phabricator/deployment@99aa712]: test deploy phab2002 for T391357 (duration: 00m 42s) [15:03:07] !log brennen@deploy1003 Started deploy [phabricator/deployment@99aa712]: deploy phab1004 for T391357 [15:03:45] !log brennen@deploy1003 Finished deploy [phabricator/deployment@99aa712]: deploy phab1004 for T391357 (duration: 00m 38s) [15:03:53] (03PS13) 10Elukey: 
services: enable ingress for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 [15:04:04] (03CR) 10Elukey: services: enable ingress for Kartotherian (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 (owner: 10Elukey) [15:05:03] (03PS2) 10Clément Goubert: CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) [15:06:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192', diff saved to https://phabricator.wikimedia.org/P74724 and previous config saved to /var/cache/conftool/dbconfig/20250408-150643-fceratto.json [15:07:15] (03CR) 10CI reject: [V:04-1] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [15:07:52] (03PS1) 10Ssingh: sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [15:10:42] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:13:18] (03PS2) 10Vgutierrez: sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) [15:13:25] (03CR) 10Vgutierrez: sre: Add LibericaEtcdErrors alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [15:14:35] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10722178 (10phaultfinder) [15:14:49] (03PS1) 10Kevin Bazira: ml-services: update RRLA output stream name 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1135054 (https://phabricator.wikimedia.org/T326179) [15:15:50] (03CR) 10CI reject: [V:04-1] sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [15:16:12] (03CR) 10Ssingh: "Ok sorry, that didn't work. Let's look at it again." [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [15:19:23] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [15:20:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10722265 (10phaultfinder) [15:21:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1192 (T391056)', diff saved to https://phabricator.wikimedia.org/P74725 and previous config saved to /var/cache/conftool/dbconfig/20250408-152150-fceratto.json [15:21:54] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:22:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1203.eqiad.wmnet with reason: Maintenance [15:22:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T391056)', diff saved to https://phabricator.wikimedia.org/P74726 and previous config saved to /var/cache/conftool/dbconfig/20250408-152212-fceratto.json [15:22:20] (03PS1) 10Hnowlan: spec: update tests to account for jobrunner service being removed [puppet] - 10https://gerrit.wikimedia.org/r/1135056 (https://phabricator.wikimedia.org/T354791) [15:24:47] (03CR) 10CI reject: [V:04-1] spec: update tests to account for jobrunner service being removed [puppet] - 10https://gerrit.wikimedia.org/r/1135056 
(https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [15:25:21] (03CR) 10Elukey: [C:03+1] "Left some questions/comments to better clarify my understanding, but it looks really good, feel free to proceed :)" [software/homer] - 10https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [15:26:26] (03CR) 10Elukey: [C:03+1] "I trust that it does what you advertised, I don't have a lot of knowledge about sphinx but it looks consistent :)" [software/homer] - 10https://gerrit.wikimedia.org/r/1134717 (owner: 10Volans) [15:27:17] (03PS1) 10Jgiannelos: proton: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135057 [15:28:20] (03CR) 10Clément Goubert: [V:03+2] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [15:28:28] (03CR) 10Kamila Součková: [C:03+1] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [15:28:38] (03CR) 10Hnowlan: [C:03+1] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [15:28:51] (03CR) 10Clément Goubert: [V:03+2 C:03+2] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [15:29:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10722312 (10phaultfinder) [15:29:50] (03CR) 10Scott French: [C:03+1] CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) (owner: 10Clément Goubert) [15:30:07] (03PS2) 10Hnowlan: spec: update tests to 
account for jobrunner service being removed [puppet] - 10https://gerrit.wikimedia.org/r/1135056 (https://phabricator.wikimedia.org/T354791) [15:30:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10722319 (10phaultfinder) [15:31:31] (03CR) 10Ssingh: [C:03+1] spec: update tests to account for jobrunner service being removed [puppet] - 10https://gerrit.wikimedia.org/r/1135056 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [15:31:32] (03CR) 10Clément Goubert: [C:03+1] "Sink it!" [puppet] - 10https://gerrit.wikimedia.org/r/1135056 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [15:31:47] (03CR) 10Hnowlan: [C:03+2] spec: update tests to account for jobrunner service being removed [puppet] - 10https://gerrit.wikimedia.org/r/1135056 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [15:31:51] jouncebot: now [15:31:51] For the next 0 hour(s) and 28 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1500) [15:32:19] (03CR) 10Jgiannelos: [C:03+2] proton: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135057 (owner: 10Jgiannelos) [15:32:38] (03PS3) 10Clément Goubert: CampaignEvents: Migrate updateutcts-test2wiki [puppet] - 10https://gerrit.wikimedia.org/r/1135051 (https://phabricator.wikimedia.org/T385867) [15:32:51] (03CR) 10Ahmon Dancy: [C:03+1] "Btw, you can test before this gets merged by running something like `scap sync-world -Dmediawiki_runtime_image:docker-registry.wikimedia.o" [puppet] - 10https://gerrit.wikimedia.org/r/1134758 (https://phabricator.wikimedia.org/T390225) (owner: 10Scott French) [15:33:36] (03PS4) 10Elukey: services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron 
Schulz) [15:33:47] (03Merged) 10jenkins-bot: proton: Bump to latest image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135057 (owner: 10Jgiannelos) [15:34:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T391056)', diff saved to https://phabricator.wikimedia.org/P74727 and previous config saved to /var/cache/conftool/dbconfig/20250408-153446-fceratto.json [15:34:52] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:35:42] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:23] (03CR) 10Elukey: [C:03+2] services: update eqiad changeprop-jobqueue Docker image to one using node 20 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1126217 (https://phabricator.wikimedia.org/T381588) (owner: 10Aaron Schulz) [15:37:38] (03PS3) 10Xcollazo: Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) [15:37:49] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: sync [15:38:09] (03CR) 10CI reject: [V:04-1] Absent systemd timers to stop attempting to generate enterprise HTML dumps [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: 10Xcollazo) [15:39:11] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: sync [15:40:52] !incidents [15:40:53] 6026 (UNACKED) Host db1246 (paged) - PING - Packet loss = 100% [15:40:53] 6025 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [15:40:53] 6024 (RESOLVED) [2x] ProbeDown sre (upload-https:443 probes/service eqsin) [15:41:47] !ack 6026 
[15:41:48] 6026 (ACKED) Host db1246 (paged) - PING - Packet loss = 100% [15:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:43:35] (03PS1) 10Ladsgroup: Revert "Temporarily enable mobile sitenotice for fawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135066 [15:43:59] Amir1 marostegui I'll plan to dbctl depool db1246 in a moment unless I hear otherwise [15:44:11] one sec [15:44:24] Amir1: ok [15:44:25] yes please [15:44:30] please depool [15:44:30] ok doing [15:44:41] this is the same host that goes down constantly [15:45:10] !log herron@cumin1002 dbctl commit (dc=all): 'depooling db1246', diff saved to https://phabricator.wikimedia.org/P74728 and previous config saved to /var/cache/conftool/dbconfig/20250408-154509-herron.json [15:47:19] herron: is it me or I didn't see the page in here? [15:48:02] !log cmooney@cumin1002 START - Cookbook sre.hosts.dhcp for host nokiatest2001.codfw.wmnet [15:48:10] volans: same happened to me, I'm checking on the bot [15:49:15] (03CR) 10Superpes15: [C:03+1] arywiki: enable wgMinervaEnableSiteNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135036 (owner: 10Anzx) [15:49:23] Thanks herron [15:49:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P74729 and previous config saved to /var/cache/conftool/dbconfig/20250408-154954-fceratto.json [15:50:09] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10722436 (10fnegri) > I'm gonna shut down the server tomorrow for about 1 hour, to check if there's any unexpected impact, then take it back online... 
[15:54:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10722472 (10phaultfinder) [15:56:06] 10ops-eqiad, 06DBA, 06DC-Ops: db1246 went down - https://phabricator.wikimedia.org/T391372 (10Marostegui) 03NEW [15:56:13] 10ops-eqiad, 06DBA, 06DC-Ops: db1246 went down - https://phabricator.wikimedia.org/T391372#10722498 (10Marostegui) p:05Triage→03Medium [15:58:17] (03PS1) 10Marostegui: db1246: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1135070 (https://phabricator.wikimedia.org/T391372) [15:58:57] (03CR) 10Marostegui: [C:03+2] db1246: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1135070 (https://phabricator.wikimedia.org/T391372) (owner: 10Marostegui) [16:00:05] jhathaway and rzl: Time to snap out of that daydream and deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:17] (03CR) 10Majavah: [C:03+2] bird: Ensure anycast_healthchecker service is restarted before bird [puppet] - 10https://gerrit.wikimedia.org/r/1135018 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [16:03:09] (03CR) 10Volans: "reply inline" [software/homer] - 10https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [16:04:51] 10ops-eqiad, 06DBA, 06DC-Ops, 13Patch-For-Review: db1246 went down - https://phabricator.wikimedia.org/T391372#10722543 (10Marostegui) File system is corrupted so it was a hard crash (presumably storage?): ` [ 1261.563104] XFS (dm-0): Metadata corruption detected at xfs_agi_verify+0x11a/0x170 [xfs], xfs_ag... 
[16:05:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P74730 and previous config saved to /var/cache/conftool/dbconfig/20250408-160501-fceratto.json [16:05:34] (03PS3) 10Ahmon Dancy: scap.cfg.erb: Allow users in spiderpig-access LDAP group [puppet] - 10https://gerrit.wikimedia.org/r/1134291 (https://phabricator.wikimedia.org/T383947) [16:05:49] (03PS3) 10Ahmon Dancy: idp: spiderpig: Add spiderpig-access to required_groups [puppet] - 10https://gerrit.wikimedia.org/r/1134292 (https://phabricator.wikimedia.org/T383947) [16:06:27] (03CR) 10Volans: "thanks for the reviews, replies inline" [software/homer] - 10https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [16:06:56] !incidents [16:06:56] 6026 (ACKED) Host db1246 (paged) - PING - Packet loss = 100% [16:06:57] 6025 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [16:06:57] 6024 (RESOLVED) [2x] ProbeDown sre (upload-https:443 probes/service eqsin) [16:07:15] I am going to resolve db1246 because this will take long to fix, so it doesn't keep paging everyday [16:07:22] !resolve 6026 [16:07:23] 6026 (RESOLVED) Host db1246 (paged) - PING - Packet loss = 100% [16:07:36] !incidents [16:07:36] 6026 (RESOLVED) Host db1246 (paged) - PING - Packet loss = 100% [16:07:36] 6025 (RESOLVED) HaproxyUnavailable cache_upload global sre (thanos-rule) [16:07:37] 6024 (RESOLVED) [2x] ProbeDown sre (upload-https:443 probes/service eqsin) [16:07:51] thanks marostegui [16:12:46] (03PS1) 10DDesouza: miscweb(research & design/strategy): bump versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135071 (https://phabricator.wikimedia.org/T344471) [16:14:00] herron: thank you for depooling! [16:14:10] np! 
[16:15:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10722563 (10phaultfinder) [16:16:54] (03CR) 10DDesouza: [C:03+2] miscweb(research & design/strategy): bump versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135071 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [16:17:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:18:39] (03Merged) 10jenkins-bot: miscweb(research & design/strategy): bump versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135071 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [16:20:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T391056)', diff saved to https://phabricator.wikimedia.org/P74731 and previous config saved to /var/cache/conftool/dbconfig/20250408-162007-fceratto.json [16:20:11] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [16:20:23] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1209.eqiad.wmnet with reason: Maintenance [16:20:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T391056)', diff saved to https://phabricator.wikimedia.org/P74732 and previous config saved to /var/cache/conftool/dbconfig/20250408-162029-fceratto.json [16:20:50] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [16:21:07] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:21:08] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:21:28] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: 
apply [16:21:29] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [16:21:46] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [16:22:41] !log running 'ipvsadm --delete-service --tcp-service 10.2.2.26:443 && ipvsadm --delete-service --tcp-service 10.2.2.5:443' on codfw lvs to remove videoscaler and jobrunner services [16:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:44] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:23:56] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [16:24:07] (03PS43) 10Tiziano Fogli: pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) [16:24:07] (03PS3) 10Tiziano Fogli: pdu_config_netbox: also fetch older PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387231) [16:24:13] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:24:14] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:24:33] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [16:24:35] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [16:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:24:41] !log running 'ipvsadm --delete-service --tcp-service 10.2.2.26:443 && ipvsadm --delete-service --tcp-service 10.2.2.5:443' on eqiad lvs to remove videoscaler and jobrunner services [16:24:42] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:51] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [16:25:08] RECOVERY - PyBal IPVS diff check on lvs2013 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:26:06] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:26:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10722637 (10phaultfinder) [16:26:44] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10722638 (10phaultfinder) [16:28:26] (03PS1) 10Hnowlan: service, conftool: remove videoscaler and jobrunner services [puppet] - 10https://gerrit.wikimedia.org/r/1135072 (https://phabricator.wikimedia.org/T354791) [16:29:01] jouncebot: nowandnext [16:29:01] For the next 0 hour(s) and 30 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1600) [16:29:01] In 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1700) [16:29:08] RECOVERY - PyBal IPVS diff check on lvs2014 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:29:15] nothing is being merged for puppet, deploying stuff now [16:29:53] (03CR) 10Ladsgroup: [C:03+2] Revert "Temporarily enable mobile sitenotice for fawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135066 (owner: 10Ladsgroup) [16:30:24] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135066 (owner: 10Ladsgroup) [16:30:39] (03Merged) 10jenkins-bot: Revert "Temporarily enable mobile 
sitenotice for fawiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135066 (owner: 10Ladsgroup) [16:31:04] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1135066|Revert "Temporarily enable mobile sitenotice for fawiki"]] [16:31:07] (03CR) 10Tiziano Fogli: "I fixed the inline comments and also split this patch into two separate ones:" [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [16:32:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T391056)', diff saved to https://phabricator.wikimedia.org/P74733 and previous config saved to /var/cache/conftool/dbconfig/20250408-163210-fceratto.json [16:32:13] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [16:32:39] (03PS1) 10DDesouza: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135073 (https://phabricator.wikimedia.org/T344471) [16:33:29] (03PS1) 10Abijeet Patro: ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1135074 (https://phabricator.wikimedia.org/T389176) [16:33:46] (03PS1) 10Abijeet Patro: ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135075 (https://phabricator.wikimedia.org/T389176) [16:34:02] (03CR) 10Kamila Součková: [C:03+1] service, conftool: remove videoscaler and jobrunner services [puppet] - 10https://gerrit.wikimedia.org/r/1135072 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [16:34:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - 
10https://gerrit.wikimedia.org/r/1135074 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [16:34:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135075 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [16:35:00] (03PS1) 10Cwhite: statsd: remove ferm rule for statsd port 8125 [puppet] - 10https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) [16:37:08] (03CR) 10CI reject: [V:04-1] statsd: remove ferm rule for statsd port 8125 [puppet] - 10https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: 10Cwhite) [16:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [16:37:59] (03PS2) 10Cwhite: statsd: remove ferm rule for statsd port 8125 [puppet] - 10https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) [16:38:10] (03CR) 10Scott French: "Ah, that's a great idea! Yeah, I'll do that first." 
[puppet] - 10https://gerrit.wikimedia.org/r/1134758 (https://phabricator.wikimedia.org/T390225) (owner: 10Scott French) [16:38:11] (03CR) 10DDesouza: [C:03+2] miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135073 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [16:38:12] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1135066|Revert "Temporarily enable mobile sitenotice for fawiki"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:38:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10722678 (10phaultfinder) [16:39:47] (03Merged) 10jenkins-bot: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135073 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [16:40:27] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [16:40:29] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:40:30] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:40:32] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [16:40:33] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [16:40:36] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [16:40:46] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [16:40:59] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:41:01] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:41:17] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [16:41:18] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [16:41:37] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: 
apply [16:41:48] (03CR) 10Scott French: [C:03+1] service, conftool: remove videoscaler and jobrunner services [puppet] - 10https://gerrit.wikimedia.org/r/1135072 (https://phabricator.wikimedia.org/T354791) (owner: 10Hnowlan) [16:44:57] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [16:45:11] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [16:45:52] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [16:47:13] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:47:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P74734 and previous config saved to /var/cache/conftool/dbconfig/20250408-164717-fceratto.json [16:50:19] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [16:50:26] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [16:50:37] !log dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [16:50:42] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [16:50:47] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [16:50:57] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [16:51:54] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135066|Revert "Temporarily enable mobile sitenotice for fawiki"]] (duration: 20m 49s) [16:55:27] (03CR) 10Slyngshede: [C:03+2] idp: spiderpig: Add spiderpig-access to required_groups [puppet] - 10https://gerrit.wikimedia.org/r/1134292 (https://phabricator.wikimedia.org/T383947) (owner: 10Ahmon Dancy) [16:56:55] (03PS3) 10Ssingh: sre: 
Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [16:57:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [16:59:29] (03CR) 10CI reject: [V:04-1] sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [17:00:05] swfrench-wmf: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1700). [17:02:21] o/ [17:02:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P74735 and previous config saved to /var/cache/conftool/dbconfig/20250408-170224-fceratto.json [17:05:27] just wrapping up a couple of checks. should be starting in the next 10m or so. 
[17:14:24] !log swfrench@deploy1003 Started scap sync-world: Pilot stop-before-sync scap run using PHP 8.1 container image for maintenance scripts - T390225 [17:14:28] T390225: Migrate scap's maintenance script invocations to PHP 8.1 - https://phabricator.wikimedia.org/T390225 [17:15:25] !log swfrench@deploy1003 Stopping before sync operations [17:17:01] (03PS4) 10Ssingh: sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [17:17:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [17:17:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T391056)', diff saved to https://phabricator.wikimedia.org/P74736 and previous config saved to /var/cache/conftool/dbconfig/20250408-171731-fceratto.json [17:17:35] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [17:17:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1211.eqiad.wmnet with reason: Maintenance [17:17:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T391056)', diff saved to https://phabricator.wikimedia.org/P74737 and previous config saved to /var/cache/conftool/dbconfig/20250408-171753-fceratto.json [17:19:34] (03CR) 10CI reject: [V:04-1] sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [17:20:21] (03CR) 10Dzahn: "The best reviewers would be the people involved in creating this group and subscribed to the linked tickets." 
[puppet] - 10https://gerrit.wikimedia.org/r/1134291 (https://phabricator.wikimedia.org/T383947) (owner: 10Ahmon Dancy) [17:20:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10722768 (10phaultfinder) [17:21:19] (03PS5) 10Ssingh: sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [17:22:20] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host nokiatest2001.codfw.wmnet [17:23:28] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:23:28] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:23:37] (03CR) 10Alexandros Kosiaris: [C:03+1] "Assuming the group exists, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1134291 (https://phabricator.wikimedia.org/T383947) (owner: 10Ahmon Dancy) [17:23:51] (03CR) 10CI reject: [V:04-1] sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [17:24:28] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:25:18] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 08 Jun 2025 10:16:06 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:26:18] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53800 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:26:18] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:26:56] 10ops-ulsfo, 06SRE, 06DC-Ops: cp4047 flapped (host went down) - https://phabricator.wikimedia.org/T387238#10722814 (10RobH) Summary of updates: * Engineer went to pickup the shipment from a FedEx point and was told it was dispatched to the office. * Engineer provided me with the FedEx tracking numbers. * IT... [17:29:13] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [17:29:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T391056)', diff saved to https://phabricator.wikimedia.org/P74738 and previous config saved to /var/cache/conftool/dbconfig/20250408-172929-fceratto.json [17:29:33] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [17:29:33] FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:30:36] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [17:30:45] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [17:32:02] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [17:35:02] !log swfrench@deploy1003 Started scap sync-world: Pilot scap run using PHP 8.1 container image for maintenance scripts - T390225 [17:35:06] 
T390225: Migrate scap's maintenance script invocations to PHP 8.1 - https://phabricator.wikimedia.org/T390225 [17:37:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10722857 (10phaultfinder) [17:38:22] !log swfrench@deploy1003 Finished scap sync-world: Pilot scap run using PHP 8.1 container image for maintenance scripts - T390225 (duration: 03m 19s) [17:39:24] jouncebot nowandnext [17:39:24] For the next 0 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1700) [17:39:24] In 0 hour(s) and 20 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1800) [17:42:27] (03PS1) 10BCornwall: Remove varnish-staging, add varnish6 components [puppet] - 10https://gerrit.wikimedia.org/r/1135080 (https://phabricator.wikimedia.org/T391334) [17:44:10] (03CR) 10Cwhite: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: 10Cwhite) [17:44:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P74739 and previous config saved to /var/cache/conftool/dbconfig/20250408-174436-fceratto.json [17:44:54] (03CR) 10BCornwall: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5234/console" [puppet] - 10https://gerrit.wikimedia.org/r/1135080 (https://phabricator.wikimedia.org/T391334) (owner: 10BCornwall) [17:44:59] dancy: fwiw i was planning to roll train 10 or 15 minutes after the hour. need to stretch my legs a bit. [17:45:25] ack. 
[17:47:39] FYI, I'm out of the way for today [17:50:27] (03PS6) 10Ssingh: sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [17:51:21] (03CR) 10Ssingh: [C:03+1] Remove varnish-staging, add varnish6 components [puppet] - 10https://gerrit.wikimedia.org/r/1135080 (https://phabricator.wikimedia.org/T391334) (owner: 10BCornwall) [17:53:00] (03CR) 10CI reject: [V:04-1] sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [17:56:54] (03CR) 10BCornwall: [V:03+1 C:03+2] Remove varnish-staging, add varnish6 components [puppet] - 10https://gerrit.wikimedia.org/r/1135080 (https://phabricator.wikimedia.org/T391334) (owner: 10BCornwall) [17:59:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P74740 and previous config saved to /var/cache/conftool/dbconfig/20250408-175944-fceratto.json [17:59:47] 07Puppet, 06Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10722979 (10bking) @jhathaway in addition to site.pp (which everyone uses), we are also using it to add row/rack awareness to our Elastic ([[ https://phabricator.wikimedia... [18:00:05] brennen and dancy: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T1800). 
[18:03:51] !log import varnish 6.0.13-1wm1 to component/varnish6 bullseye-wikimedia (T391334) [18:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:54] T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334 [18:06:02] o/ [18:08:11] !log 1.44.0-wmf.24 train status: no current blockers, moving to group0 [18:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T391056)', diff saved to https://phabricator.wikimedia.org/P74741 and previous config saved to /var/cache/conftool/dbconfig/20250408-181450-fceratto.json [18:14:54] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [18:15:06] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1214.eqiad.wmnet with reason: Maintenance [18:15:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T391056)', diff saved to https://phabricator.wikimedia.org/P74742 and previous config saved to /var/cache/conftool/dbconfig/20250408-181513-fceratto.json [18:16:21] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135083 (https://phabricator.wikimedia.org/T386219) [18:16:22] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135083 (https://phabricator.wikimedia.org/T386219) (owner: 10TrainBranchBot) [18:17:10] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135083 (https://phabricator.wikimedia.org/T386219) (owner: 10TrainBranchBot) [18:22:14] (03PS1) 10Jforrester: Move to new async Parsoid fragment provision [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135084 (https://phabricator.wikimedia.org/T373253) [18:23:47] 06SRE, 06DBA, 
10vm-requests: Requesting a VM as for a database - https://phabricator.wikimedia.org/T389089#10723150 (10Ladsgroup) We need to do some more work on this. I'll get there. [18:26:29] hrm: Check 'check_testservers_baremetal-1_of_1' failed: Sending to 4 hosts... [18:26:42] having a look at mwdebug1001 [18:26:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T391056)', diff saved to https://phabricator.wikimedia.org/P74743 and previous config saved to /var/cache/conftool/dbconfig/20250408-182654-fceratto.json [18:26:58] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [18:27:54] succeeded on a retry [18:28:16] (and was unable to reproduce any errors) [18:28:37] (03CR) 10Wargo: "Anyway, this change still can be accepted. It will work both if we modify portals or not. It fixes the main issue. And to prevent situatio" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134984 (https://phabricator.wikimedia.org/T391297) (owner: 10Wargo) [18:29:34] brennen: Was it a 500 error? [18:29:38] yeah [18:30:06] Sadly https://phabricator.wikimedia.org/T380958 [18:30:22] https://phabricator.wikimedia.org/P74744 [18:30:30] ah, right. 
[18:34:23] !log brennen@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.24 refs T386219 [18:34:26] T386219: 1.44.0-wmf.24 deployment blockers - https://phabricator.wikimedia.org/T386219 [18:42:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P74745 and previous config saved to /var/cache/conftool/dbconfig/20250408-184201-fceratto.json [18:44:13] !log dancy@deploy1003 Installing scap version "4.152.0" for 2 host(s) [18:46:02] !log dancy@deploy1003 Installation of scap version "4.152.0" completed for 2 hosts [18:47:32] 06SRE-OnFire, 06Release-Engineering-Team, 10Scap, 06serviceops, and 2 others: Should scap be able to update helmfile-defaults when -Dbuild_mw_container_image:False ? - https://phabricator.wikimedia.org/T390531#10723243 (10dancy) scap 4.152.0 has been deployed to address the `update_helmfile_files()` issue. [18:55:29] (03PS1) 10AOkoth: releases: add force puppet 7 hiera [puppet] - 10https://gerrit.wikimedia.org/r/1135089 (https://phabricator.wikimedia.org/T384595) [18:57:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214', diff saved to https://phabricator.wikimedia.org/P74746 and previous config saved to /var/cache/conftool/dbconfig/20250408-185708-fceratto.json [19:12:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1214 (T391056)', diff saved to https://phabricator.wikimedia.org/P74747 and previous config saved to /var/cache/conftool/dbconfig/20250408-191215-fceratto.json [19:12:18] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [19:12:30] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1216.eqiad.wmnet with reason: Maintenance [19:21:40] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1226.eqiad.wmnet with 
reason: Maintenance [19:21:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1226 (T391056)', diff saved to https://phabricator.wikimedia.org/P74748 and previous config saved to /var/cache/conftool/dbconfig/20250408-192147-fceratto.json [19:21:50] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [19:27:00] (03CR) 10Dzahn: [C:04-1] "You don't need it because it's already in hieradata/role/common/insetup/collaboration_services_nftables.yaml on the role level" [puppet] - 10https://gerrit.wikimedia.org/r/1135089 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [19:31:38] (03CR) 10Eamedina: [C:03+1] ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135075 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [19:31:46] (03CR) 10Eamedina: [C:03+1] ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1135074 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [19:33:10] (03PS3) 10Bking: cirrussearch: Add regex data for cirrussearch hosts [puppet] - 10https://gerrit.wikimedia.org/r/1134765 (https://phabricator.wikimedia.org/T388610) [19:33:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T391056)', diff saved to https://phabricator.wikimedia.org/P74749 and previous config saved to /var/cache/conftool/dbconfig/20250408-193324-fceratto.json [19:33:27] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [19:33:34] (03CR) 10CI reject: [V:04-1] cirrussearch: Add regex data for cirrussearch hosts [puppet] - 10https://gerrit.wikimedia.org/r/1134765 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [19:33:58] !log aokoth@cumin1002 START - Cookbook sre.hosts.reimage for host 
releases2003.codfw.wmnet with OS bookworm [19:35:42] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:36:06] (03PS4) 10Bking: cirrussearch: Add regex data for cirrussearch hosts [puppet] - 10https://gerrit.wikimedia.org/r/1134765 (https://phabricator.wikimedia.org/T388610) [19:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:44:12] (03CR) 10Bking: cirrussearch: Add regex data for cirrussearch hosts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1134765 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [19:45:44] (03CR) 10Bking: [C:03+2] cirrussearch: Add regex data for cirrussearch hosts [puppet] - 10https://gerrit.wikimedia.org/r/1134765 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [19:48:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P74750 and previous config saved to /var/cache/conftool/dbconfig/20250408-194831-fceratto.json [19:53:15] !log aokoth@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on releases2003.codfw.wmnet with reason: host reimage [19:56:09] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on releases2003.codfw.wmnet with reason: host reimage [19:57:04] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Wikimedia-Incident: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10723434 (10Quiddity) Here's a representative example of 2 emails that I noticed are missing from my inbox, but 
included in the [[https://list... [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T2000). [20:00:05] anzx and abijeet: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:30] hello o/ [20:03:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226', diff saved to https://phabricator.wikimedia.org/P74751 and previous config saved to /var/cache/conftool/dbconfig/20250408-200338-fceratto.json [20:03:53] hello, is anyone around to help with the deployment? [20:06:24] what's up [20:06:34] I can take care of it [20:07:35] (03CR) 10Ladsgroup: [C:03+2] ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1135074 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [20:07:39] (03CR) 10Ladsgroup: [C:03+2] ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135075 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [20:09:46] (03Merged) 10jenkins-bot: ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1135074 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [20:09:48] (03Merged) 10jenkins-bot: ArticleFooterEntrypointCard: Fix display of entrypoint [extensions/ContentTranslation] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135075 (https://phabricator.wikimedia.org/T389176) (owner: 10Abijeet Patro) [20:09:51] that was fast [20:10:09] woah [20:10:15] thanks Amir1 [20:10:36] We skip 
browser tests in branches I think [20:11:22] sorry, I did not understand. I can still verify it on testservers with the wmf.23 branch right? [20:11:44] yeah [20:11:49] I meant CI tests [20:12:02] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host releases2003.codfw.wmnet with OS bookworm [20:12:08] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1135074|ArticleFooterEntrypointCard: Fix display of entrypoint (T389176)]], [[gerrit:1135075|ArticleFooterEntrypointCard: Fix display of entrypoint (T389176)]] [20:12:10] T389176: Re-enable footer entry point to MinT for Wiki Readers - https://phabricator.wikimedia.org/T389176 [20:12:32] ah understood [20:14:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10723507 (10phaultfinder) [20:16:09] Amir1, abijeet: That's "success caching" at work -- https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/message/KTP34HIR5D66QLGHC3ZAIZKQWE46O5F4/ [20:17:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:17:19] !log ladsgroup@deploy1003 abi, ladsgroup: Backport for [[gerrit:1135074|ArticleFooterEntrypointCard: Fix display of entrypoint (T389176)]], [[gerrit:1135075|ArticleFooterEntrypointCard: Fix display of entrypoint (T389176)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:17:21] T389176: Re-enable footer entry point to MinT for Wiki Readers - https://phabricator.wikimedia.org/T389176 [20:17:29] abijeet: it's in test servers [20:17:31] testing [20:18:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1226 (T391056)', diff saved to https://phabricator.wikimedia.org/P74752 and 
previous config saved to /var/cache/conftool/dbconfig/20250408-201845-fceratto.json [20:18:50] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:19:01] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1255.eqiad.wmnet with reason: Maintenance [20:19:42] Amir1, looks good. [20:19:47] !log ladsgroup@deploy1003 abi, ladsgroup: Continuing with sync [20:21:48] I'm not seeing the second person who scheduled patches [20:21:51] bd808, that's a big QOL improvement. Thanks [20:22:07] !log import libvmod-re2 1.5.3-4 to component/varnish6 bullseye-wikimedia (T391334) [20:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:10] T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334 [20:22:24] mutante: Would you be willing to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134291 now that Alexandros has approved? [20:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:25:49] !log import varnishkafka 1.1.0-4 to component/varnish6 bullseye-wikimedia (T391334) [20:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:25] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135074|ArticleFooterEntrypointCard: Fix display of entrypoint (T389176)]], [[gerrit:1135075|ArticleFooterEntrypointCard: Fix display of entrypoint (T389176)]] (duration: 14m 16s) [20:26:27] T389176: Re-enable footer entry point to MinT for Wiki Readers - https://phabricator.wikimedia.org/T389176 [20:27:35] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
db1256.eqiad.wmnet with reason: Maintenance [20:27:54] Amir1, thanks for your help! [20:28:01] \o/ [20:31:52] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 went down - https://phabricator.wikimedia.org/T391372#10723635 (10Jclark-ctr) a:03VRiley-WMF [20:34:29] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Wikimedia-Incident: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10723653 (10bd808) >>! In T391330#10720826, @Jelto wrote: > ` > Apr 07 09:06:41 lists1004 mailman3[2696297]: (pymysql.err.OperationalError) (1... [20:35:26] (03PS2) 10Bking: search: allow any cirrussearch host to join cluster [puppet] - 10https://gerrit.wikimedia.org/r/1134764 (owner: 10Ryan Kemper) [20:35:30] (03CR) 10Bking: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134764 (owner: 10Ryan Kemper) [20:35:43] (03PS3) 10Bking: search: allow any cirrussearch host to join cluster [puppet] - 10https://gerrit.wikimedia.org/r/1134764 (owner: 10Ryan Kemper) [20:35:46] (03CR) 10Bking: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134764 (owner: 10Ryan Kemper) [20:36:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1257.eqiad.wmnet with reason: Maintenance [20:36:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1257 (T391056)', diff saved to https://phabricator.wikimedia.org/P74753 and previous config saved to /var/cache/conftool/dbconfig/20250408-203618-fceratto.json [20:36:21] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:37:07] !log import varnish-modules 0.15.0-3 to component/varnish6 bullseye-wikimedia (T391334) [20:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:09] T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334 [20:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch 
Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [20:39:40] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10723669 (10phaultfinder) [20:39:51] (03PS1) 10Jforrester: [BETA CLUSTER] Decommission Beta Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135100 (https://phabricator.wikimedia.org/T362200) [20:44:13] (03CR) 10Bking: [C:03+2] search: allow any cirrussearch host to join cluster [puppet] - 10https://gerrit.wikimedia.org/r/1134764 (owner: 10Ryan Kemper) [20:44:44] (03PS1) 10Ladsgroup: mariadb: Add cn_notice_projects to the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1135101 (https://phabricator.wikimedia.org/T363581) [20:46:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1257 (T391056)', diff saved to https://phabricator.wikimedia.org/P74754 and previous config saved to /var/cache/conftool/dbconfig/20250408-204615-fceratto.json [20:46:18] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:47:13] FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:49:44] (03PS2) 10Ladsgroup: mariadb: Add cn_notice_projects to the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1135101 (https://phabricator.wikimedia.org/T363581) [20:49:50] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Add cn_notice_projects to the table catalog [puppet] - 10https://gerrit.wikimedia.org/r/1135101 
(https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [20:51:33] !log import libvmod-querysort 0.4-2 to component/varnish6 bullseye-wikimedia (T391334) [20:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:36] T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334 [20:53:36] (03PS3) 10Bking: cirrussearch: Add row A hosts to new cirrussearch role [puppet] - 10https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) [20:53:39] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [20:54:20] (03PS1) 10Ladsgroup: openstack: wikireplica_dns: Add termstore aliases for s8 [puppet] - 10https://gerrit.wikimedia.org/r/1135107 (https://phabricator.wikimedia.org/T390954) [20:55:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10723743 (10phaultfinder) [20:55:55] (03CR) 10Bking: [C:04-1] "The role for cirrussearch is incorrect. Fixing..." 
[puppet] - 10https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T2100) [21:01:08] (03PS1) 10Ladsgroup: LoginSignupSpecialPage: Get a login token before persisting the session [core] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135109 (https://phabricator.wikimedia.org/T390514) [21:01:20] (03PS1) 10Ladsgroup: LoginSignupSpecialPage: Get a login token before persisting the session [core] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1135110 (https://phabricator.wikimedia.org/T390514) [21:01:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1257', diff saved to https://phabricator.wikimedia.org/P74755 and previous config saved to /var/cache/conftool/dbconfig/20250408-210121-fceratto.json [21:01:25] jouncebot: nowandnext [21:01:25] For the next 0 hour(s) and 58 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250408T2100) [21:01:25] In 8 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250409T0600) [21:01:35] (03CR) 10Ladsgroup: [C:03+2] LoginSignupSpecialPage: Get a login token before persisting the session [core] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135109 (https://phabricator.wikimedia.org/T390514) (owner: 10Ladsgroup) [21:01:35] (03PS4) 10Bking: cirrussearch: Add row A hosts to new cirrussearch role [puppet] - 10https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) [21:01:38] (03CR) 10Ladsgroup: [C:03+2] LoginSignupSpecialPage: Get a login token before persisting the session [core] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1135110 (https://phabricator.wikimedia.org/T390514) (owner: 10Ladsgroup) [21:02:04] FIRING: [2x] DatasourceNoData: - 
https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [21:02:50] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:03:24] (03PS7) 10Andrea Denisse: sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [21:03:35] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10723774 (10Jdforrester-WMF) >>! In T355914#10717142, @Ladsgroup wrote: > It'd be nice to add this to next week's tech news. Worth mentioning this has bee... [21:05:59] (03CR) 10CI reject: [V:04-1] sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [21:06:42] (03PS5) 10Bking: cirrussearch: Add row A hosts to new cirrussearch role [puppet] - 10https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) [21:06:45] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:08:00] Amir1: I was going to sling out https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1135100 but I don't want to step on your deployment toes. ;-) [21:08:57] I can deploy it [21:09:03] <3 [21:09:06] the backport patches take a while [21:09:11] Yeah. [21:09:13] (03CR) 10Ladsgroup: [C:03+2] [BETA CLUSTER] Decommission Beta Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135100 (https://phabricator.wikimedia.org/T362200) (owner: 10Jforrester) [21:09:17] Whee. [21:09:28] Now I need to drop the servers from horizon. 
[21:10:17] (03Merged) 10jenkins-bot: [BETA CLUSTER] Decommission Beta Wikifunctions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135100 (https://phabricator.wikimedia.org/T362200) (owner: 10Jforrester) [21:10:24] (03PS8) 10Andrea Denisse: sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [21:12:23] (03Merged) 10jenkins-bot: LoginSignupSpecialPage: Get a login token before persisting the session [core] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135109 (https://phabricator.wikimedia.org/T390514) (owner: 10Ladsgroup) [21:12:57] (03CR) 10CI reject: [V:04-1] sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [21:14:03] (03Merged) 10jenkins-bot: LoginSignupSpecialPage: Get a login token before persisting the session [core] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1135110 (https://phabricator.wikimedia.org/T390514) (owner: 10Ladsgroup) [21:16:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1257', diff saved to https://phabricator.wikimedia.org/P74756 and previous config saved to /var/cache/conftool/dbconfig/20250408-211629-fceratto.json [21:18:04] urandom: deploying right now [21:18:21] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1135110|LoginSignupSpecialPage: Get a login token before persisting the session (T390514)]], [[gerrit:1135109|LoginSignupSpecialPage: Get a login token before persisting the session (T390514)]], [[gerrit:1135100|[BETA CLUSTER] Decommission Beta Wikifunctions (T362200 T363397 T368161 T373464 T389274)]] [21:18:31] T362200: [QA task] wikifunction betacluster failures - https://phabricator.wikimedia.org/T362200 [21:18:32] T363397: wasmedge CLI Resource Limits Break Beta Cluster - https://phabricator.wikimedia.org/T363397 [21:18:32] T368161: Creation of 
object fails in betacluster with Unspecified error - https://phabricator.wikimedia.org/T368161 [21:18:33] T373464: Port routing on deployment-docker-wikifunctions01 port routing (?) seems broken, making Beta Cluster Wikifunctions orchestrator unable to talk to its evaluator - https://phabricator.wikimedia.org/T373464 [21:18:33] T389274: "Exec error in changeprop" for wikifunctions.beta.wmflabs.org - https://phabricator.wikimedia.org/T389274 [21:19:24] !log import libvmod-netmapper 1.9-4 to component/varnish6 bullseye-wikimedia (T391334) [21:19:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:26] T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334 [21:19:53] (03PS1) 10Dzahn: cloud: re-add gitlab runner docker_gc Hiera settings in cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1135114 (https://phabricator.wikimedia.org/T390948) [21:19:55] (03PS1) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - 10https://gerrit.wikimedia.org/r/1135115 [21:20:05] Amir1: 👍 [21:21:31] (03CR) 10Dzahn: [C:03+2] "partial revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135114" [puppet] - 10https://gerrit.wikimedia.org/r/1133992 (https://phabricator.wikimedia.org/T390948) (owner: 10Dzahn) [21:21:31] (03PS9) 10Andrea Denisse: sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [21:21:37] (03PS2) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - 10https://gerrit.wikimedia.org/r/1135115 [21:21:42] (03CR) 10Dzahn: [C:03+2] cloud: re-add gitlab runner docker_gc Hiera settings in cloud.yaml [puppet] - 10https://gerrit.wikimedia.org/r/1135114 (https://phabricator.wikimedia.org/T390948) (owner: 10Dzahn) [21:22:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [21:22:36] (03CR) 10Aleksandar Mastilovic: Absent systemd timers to 
stop attempting to generate enterprise HTML dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: 10Xcollazo) [21:23:50] (03CR) 10Aleksandar Mastilovic: Absent systemd timers to stop attempting to generate enterprise HTML dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1135042 (https://phabricator.wikimedia.org/T390556) (owner: 10Xcollazo) [21:24:03] (03CR) 10CI reject: [V:04-1] sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [21:25:39] !log ladsgroup@deploy1003 ladsgroup, jforrester: Backport for [[gerrit:1135110|LoginSignupSpecialPage: Get a login token before persisting the session (T390514)]], [[gerrit:1135109|LoginSignupSpecialPage: Get a login token before persisting the session (T390514)]], [[gerrit:1135100|[BETA CLUSTER] Decommission Beta Wikifunctions (T362200 T363397 T368161 T373464 T389274)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:25:47] T362200: [QA task] wikifunction betacluster failures - https://phabricator.wikimedia.org/T362200 [21:25:47] T363397: wasmedge CLI Resource Limits Break Beta Cluster - https://phabricator.wikimedia.org/T363397 [21:25:47] T368161: Creation of object fails in betacluster with Unspecified error - https://phabricator.wikimedia.org/T368161 [21:25:48] T373464: Port routing on deployment-docker-wikifunctions01 port routing (?)
seems broken, making Beta Cluster Wikifunctions orchestrator unable to talk to its evaluator - https://phabricator.wikimedia.org/T373464 [21:25:48] T389274: "Exec error in changeprop" for wikifunctions.beta.wmflabs.org - https://phabricator.wikimedia.org/T389274 [21:26:58] (03PS3) 10JHathaway: run_ci_locally.sh: use bind mounts for local runs [puppet] - 10https://gerrit.wikimedia.org/r/1135115 [21:27:20] !log ladsgroup@deploy1003 ladsgroup, jforrester: Continuing with sync [21:27:35] (03PS10) 10Andrea Denisse: sre: Add LibericaEtcdErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [21:27:37] (03PS1) 10Andrew Bogott: Add cloudcontrol1011 as an eqiad1 cloudcontrol node [puppet] - 10https://gerrit.wikimedia.org/r/1135117 (https://phabricator.wikimedia.org/T391300) [21:27:39] (03PS1) 10Andrew Bogott: Replace cloudcontrol1005 with cloudcontrol1011 [puppet] - 10https://gerrit.wikimedia.org/r/1135118 (https://phabricator.wikimedia.org/T391300) [21:28:35] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135117 (https://phabricator.wikimedia.org/T391300) (owner: 10Andrew Bogott) [21:29:33] FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [21:29:59] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [21:30:36] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [21:30:40] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [21:31:27] (03CR) 10Andrew Bogott: [C:03+2] Add 
cloudcontrol1011 as an eqiad1 cloudcontrol node [puppet] - 10https://gerrit.wikimedia.org/r/1135117 (https://phabricator.wikimedia.org/T391300) (owner: 10Andrew Bogott) [21:31:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1257 (T391056)', diff saved to https://phabricator.wikimedia.org/P74757 and previous config saved to /var/cache/conftool/dbconfig/20250408-213136-fceratto.json [21:31:39] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [21:31:41] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [21:31:52] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [21:32:27] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [21:33:46] (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: Add row A hosts to new cirrussearch role [puppet] - 10https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:33:49] (03CR) 10Bking: [C:03+2] "I fixed the role designation...merging" [puppet] - 10https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:34:03] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135110|LoginSignupSpecialPage: Get a login token before persisting the session (T390514)]], [[gerrit:1135109|LoginSignupSpecialPage: Get a login token before persisting the session (T390514)]], [[gerrit:1135100|[BETA CLUSTER] Decommission Beta Wikifunctions (T362200 T363397 T368161 T373464 T389274)]] (duration: 15m 42s) [21:34:10] T362200: [QA task] wikifunction betacluster failures - https://phabricator.wikimedia.org/T362200 [21:34:11] T363397: wasmedge CLI Resource Limits Break Beta Cluster - https://phabricator.wikimedia.org/T363397 [21:34:11] T368161: 
Creation of object fails in betacluster with Unspecified error - https://phabricator.wikimedia.org/T368161 [21:34:11] T373464: Port routing on deployment-docker-wikifunctions01 port routing (?) seems broken, making Beta Cluster Wikifunctions orchestrator unable to talk to its evaluator - https://phabricator.wikimedia.org/T373464 [21:34:11] T389274: "Exec error in changeprop" for wikifunctions.beta.wmflabs.org - https://phabricator.wikimedia.org/T389274 [21:35:30] the deployment just finished [21:37:57] (03CR) 10Andrea Denisse: "Hey! Quick heads-up on the `LibericaEtcdErrors` test, I had to make a few changes to get CI to pass." [alerts] - 10https://gerrit.wikimedia.org/r/1135050 (https://phabricator.wikimedia.org/T391340) (owner: 10Vgutierrez) [21:37:59] (03PS2) 10Andrew Bogott: Replace cloudcontrol1005 with cloudcontrol1011 [puppet] - 10https://gerrit.wikimedia.org/r/1135118 (https://phabricator.wikimedia.org/T391300) [21:38:00] (03PS1) 10Andrew Bogott: Add cloudcontrol role to cloudcontrol1011 [puppet] - 10https://gerrit.wikimedia.org/r/1135120 (https://phabricator.wikimedia.org/T391300) [21:38:24] https://usercontent.irccloud-cdn.com/file/SBGPrPqR/grafik.png [21:38:54] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135120 (https://phabricator.wikimedia.org/T391300) (owner: 10Andrew Bogott) [21:39:01] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135118 (https://phabricator.wikimedia.org/T391300) (owner: 10Andrew Bogott) [21:39:43] Amir1: nice! is that the rate of "persisting for unknown reason" events? 
[21:40:05] swfrench-wmf: POST to sessionstore altogether [21:40:06] https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&from=now-3h&to=now&viewPanel=11 [21:40:41] (03CR) 10Andrew Bogott: [C:03+2] Add cloudcontrol role to cloudcontrol1011 [puppet] - 10https://gerrit.wikimedia.org/r/1135120 (https://phabricator.wikimedia.org/T391300) (owner: 10Andrew Bogott) [21:40:43] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2152.codfw.wmnet with reason: Maintenance [21:40:45] that looks quite promising :) [21:40:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2152 (T391056)', diff saved to https://phabricator.wikimedia.org/P74758 and previous config saved to /var/cache/conftool/dbconfig/20250408-214049-fceratto.json [21:40:53] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [21:43:42] (03PS1) 10Ryan Kemper: elastic: remove row A worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135121 (https://phabricator.wikimedia.org/T388610) [21:44:15] Amir1: that's a pretty significant drop [21:44:30] (03PS2) 10Ryan Kemper: elastic: remove row A worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135121 (https://phabricator.wikimedia.org/T388610) [21:45:07] it's not back to the pre-SUL3 era but much much better [21:45:42] yeah, what is that...about 20% [21:46:22] Amir1: I've not had a chance today to get up to speed on the details, but is this SUL3-specific code path that was duplicating? or was this an existing inefficiency? [21:46:29] hrmm, maybe more like 16%? [21:46:36] but for a one line code change, I like it! [21:47:09] (03CR) 10Bking: [C:03+1] elastic: remove row A worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135121 (https://phabricator.wikimedia.org/T388610) (owner: 10Ryan Kemper) [21:47:45] swfrench-wmf: I think it was existing already. 
But most importantly it might have been masked by something somewhere calling the login token [21:48:00] and some improvements unmasked it [21:48:08] regardless. It's nice to have for sure [21:48:25] got it, thanks! and yeah, very nice to have either way :) [21:48:41] Amir1: yeah, and these fell into the senseless overwrite bucket, right? [21:48:52] yup [21:48:56] (03CR) 10Ryan Kemper: [C:03+2] elastic: remove row A worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/1135121 (https://phabricator.wikimedia.org/T388610) (owner: 10Ryan Kemper) [21:49:03] yeah, eliminating these will help [21:49:49] got slightly even lower https://grafana.wikimedia.org/d/000001590/sessionstore?orgId=1&from=now-3h&to=now&viewPanel=11 [21:51:05] I think we're at the time of day when request volume is declining [21:51:31] partway between the peak and trough [21:51:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T391056)', diff saved to https://phabricator.wikimedia.org/P74759 and previous config saved to /var/cache/conftool/dbconfig/20250408-215159-fceratto.json [21:52:02] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [21:56:16] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3615 MB (3% inode=98%): /tmp 3615 MB (3% inode=98%): /var/tmp 3615 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [21:57:04] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [21:57:07] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [21:58:24] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2087 to cirrussearch2087 
[21:58:35] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:58:38] 10ops-eqiad, 06SRE, 10Ceph, 10Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10724011 (10Jclark-ctr) [21:58:58] Amir1: here it is from the other end — https://grafana-rw.wikimedia.org/d/4plhqSPGk/bagostuff-stats-by-key-group?orgId=1&var-kClass=MWSession&from=1744138716070&to=1744149516070&forceLogin=&viewPanel=40 [21:59:06] 10ops-eqiad, 06SRE, 10Ceph, 10Cloud-VPS, and 2 others: [cloudceph] test the new DELL hard drives throughput - https://phabricator.wikimedia.org/T390134#10724027 (10Jclark-ctr) @Andrew @dcaro installed 8tb ssd drive [22:00:02] yeah, I'd call that a solid 15%. Nothing to sneeze at for a one-liner! [22:00:31] I'm trying to debug further [22:02:04] Amir1: don't forget to eat and get some sleep too! [22:02:23] shit, I forgot to make dinner [22:02:45] I go eat something, I will check again afterwards [22:02:58] !log T388610 Elasticsearch->Opensearch row a data node migration ongoing [22:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:01] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [22:03:33] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2087 to cirrussearch2087 - bking@cumin2002" [22:04:31] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2087 to cirrussearch2087 - bking@cumin2002" [22:04:31] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:04:32] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2087 [22:04:51] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) 
for host cirrussearch2087 [22:05:31] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2087 to cirrussearch2087 [22:07:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P74760 and previous config saved to /var/cache/conftool/dbconfig/20250408-220706-fceratto.json [22:12:33] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [22:12:36] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [22:22:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P74761 and previous config saved to /var/cache/conftool/dbconfig/20250408-222213-fceratto.json [22:28:56] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2087.codfw.wmnet on all recursors [22:28:59] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2087.codfw.wmnet on all recursors [22:29:44] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [22:30:26] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2087.codfw.wmnet with OS bullseye [22:30:37] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2087 [22:32:52] (03PS3) 10Andrew Bogott: Replace cloudcontrol1005 with cloudcontrol1011 [puppet] - 10https://gerrit.wikimedia.org/r/1135118 (https://phabricator.wikimedia.org/T391300) [22:33:04] (03PS4) 10Andrew Bogott: Replace cloudcontrol1005 with cloudcontrol1011 [puppet] - 10https://gerrit.wikimedia.org/r/1135118 (https://phabricator.wikimedia.org/T391300) [22:33:34] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135118 
(https://phabricator.wikimedia.org/T391300) (owner: 10Andrew Bogott) [22:33:36] (03CR) 10Scott French: [C:03+1] "Thanks, Effie! This LGTM from first principles, but I'm also minimally familiar with the logstash configuration here." [puppet] - 10https://gerrit.wikimedia.org/r/1135020 (owner: 10Effie Mouzeli) [22:33:58] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2047 to codfw - jhancock@cumin2002" [22:34:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding ganeti2047 to codfw - jhancock@cumin2002" [22:34:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:34:37] !log bking@cumin2002 START - Cookbook sre.dns.netbox [22:37:05] (03PS1) 10Ryan Kemper: sre.elasticsearch.rolling-operation: handle negative caches between rename/reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1135133 (https://phabricator.wikimedia.org/T383811) [22:37:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T391056)', diff saved to https://phabricator.wikimedia.org/P74762 and previous config saved to /var/cache/conftool/dbconfig/20250408-223721-fceratto.json [22:37:24] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [22:37:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2154.codfw.wmnet with reason: Maintenance [22:37:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2154 (T391056)', diff saved to https://phabricator.wikimedia.org/P74763 and previous config saved to /var/cache/conftool/dbconfig/20250408-223744-fceratto.json [22:38:56] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update 
records for host cirrussearch2087 - bking@cumin2002" [22:39:02] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2087 - bking@cumin2002" [22:39:02] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:39:02] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2087.codfw.wmnet 90.0.192.10.in-addr.arpa 0.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [22:39:06] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2087.codfw.wmnet 90.0.192.10.in-addr.arpa 0.9.0.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [22:39:06] (03CR) 10Bking: [C:03+1] sre.elasticsearch.rolling-operation: handle negative caches between rename/reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1135133 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper) [22:39:06] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2087 [22:39:18] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2087 [22:39:19] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2087 [22:40:08] (03CR) 10Andrew Bogott: [C:03+2] Replace cloudcontrol1005 with cloudcontrol1011 [puppet] - 10https://gerrit.wikimedia.org/r/1135118 (https://phabricator.wikimedia.org/T391300) (owner: 10Andrew Bogott) [22:40:49] !log dani@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [22:40:51] !log dani@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [22:40:52] !log dani@deploy1003 helmfile [eqiad] START helmfile.d/services/miscweb: apply [22:40:54] !log dani@deploy1003 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [22:40:55] !log 
dani@deploy1003 helmfile [codfw] START helmfile.d/services/miscweb: apply [22:40:58] !log dani@deploy1003 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [22:41:18] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134764 (owner: 10Ryan Kemper) [22:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [22:48:01] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [22:48:03] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [22:49:08] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2069 to cirrussearch2069 [22:49:19] (03CR) 10Scott French: [C:03+1] "Thank you, Effie!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1135021 (owner: 10Effie Mouzeli) [22:49:19] !log bking@cumin2002 START - Cookbook sre.dns.netbox [22:50:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T391056)', diff saved to https://phabricator.wikimedia.org/P74764 and previous config saved to /var/cache/conftool/dbconfig/20250408-225028-fceratto.json [22:50:31] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [22:53:41] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2069 to cirrussearch2069 - bking@cumin2002" [22:54:12] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2069 to cirrussearch2069 - bking@cumin2002" [22:54:12] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:54:13] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2069 [22:54:59] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2069 [22:55:02] (03PS1) 10Andrew Bogott: Remove final traces of cloudcontrol1005.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1135136 (https://phabricator.wikimedia.org/T391413) [22:55:39] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2069 to cirrussearch2069 [22:55:40] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2069.codfw.wmnet on all recursors [22:55:43] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2069.codfw.wmnet on all recursors [22:56:32] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2069.codfw.wmnet with OS bullseye [22:56:43] !log bking@cumin2002 START - Cookbook 
sre.hosts.move-vlan for host cirrussearch2069 [22:56:52] !log bking@cumin2002 START - Cookbook sre.dns.netbox [22:56:54] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2087.codfw.wmnet with reason: host reimage [23:02:14] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2069 - bking@cumin2002" [23:02:23] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2069 - bking@cumin2002" [23:02:23] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:02:24] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2069.codfw.wmnet 142.0.192.10.in-addr.arpa 2.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [23:02:27] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2069.codfw.wmnet 142.0.192.10.in-addr.arpa 2.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [23:02:28] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2069 [23:02:38] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2087.codfw.wmnet with reason: host reimage [23:02:40] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2069 [23:02:40] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2069 [23:05:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P74765 and previous config saved to /var/cache/conftool/dbconfig/20250408-230535-fceratto.json [23:16:16] PROBLEM - Disk space 
on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3531 MB (3% inode=98%): /tmp 3531 MB (3% inode=98%): /var/tmp 3531 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [23:20:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P74766 and previous config saved to /var/cache/conftool/dbconfig/20250408-232042-fceratto.json [23:22:42] RESOLVED: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [23:24:52] (03PS2) 10Effie Mouzeli: logging: add support for php 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1135020 [23:28:10] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2087.codfw.wmnet with OS bullseye [23:29:32] (03PS1) 10Bking: cirrussearch: fix rack a7 regex [puppet] - 10https://gerrit.wikimedia.org/r/1135140 (https://phabricator.wikimedia.org/T388610) [23:31:42] (03CR) 10Bking: [C:03+2] "Self-merging interest of time" [puppet] - 10https://gerrit.wikimedia.org/r/1135140 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [23:33:40] FIRING: KubernetesRsyslogDown: rsyslog on wikikube-worker1132:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=wikikube-worker1132 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [23:35:20] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2069.codfw.wmnet with reason: host reimage [23:35:43] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:35:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T391056)', diff saved to https://phabricator.wikimedia.org/P74767 and previous config saved to /var/cache/conftool/dbconfig/20250408-233549-fceratto.json [23:35:53] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [23:36:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2162.codfw.wmnet with reason: Maintenance [23:36:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2162 (T391056)', diff saved to https://phabricator.wikimedia.org/P74768 and previous config saved to /var/cache/conftool/dbconfig/20250408-233611-fceratto.json [23:38:57] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2069.codfw.wmnet with reason: host reimage [23:39:44] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1135141 [23:39:44] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1135141 (owner: 10TrainBranchBot) [23:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:47:42] FIRING: AlertLintProblem: Linting problems found for CirrusBackendErrorRateTooHigh - https://wikitech.wikimedia.org/wiki/Alertmanager#Alert_linting_found_problems - TODO - https://alerts.wikimedia.org/?q=alertname%3DAlertLintProblem [23:48:51] !log fceratto@cumin1002 dbctl 
commit (dc=all): 'Repooling after maintenance db2162 (T391056)', diff saved to https://phabricator.wikimedia.org/P74769 and previous config saved to /var/cache/conftool/dbconfig/20250408-234850-fceratto.json [23:48:54] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [23:51:27] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1135141 (owner: 10TrainBranchBot) [23:59:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2069.codfw.wmnet with OS bullseye [23:59:36] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (1 nodes at a time) for ElasticSearch cluster search_codfw: reimage row A - bking@cumin2002 - T388610 [23:59:41] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610