[00:04:43] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:05:39] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[00:09:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:09:49] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1136139
[00:09:50] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1136139 (owner: TrainBranchBot)
[00:10:47] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 639.86 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:20:03] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:25:08] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[00:30:47] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1136139 (owner: TrainBranchBot)
[00:43:54] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:46:35] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[00:56:37] <icinga-wm>	 PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/4dc58a2470693bde7218013f86951eceb81d1c9e87f9ef816f49591d04626c20/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:36:37] <icinga-wm>	 RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[02:10:47] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:32:11] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[03:40:03] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:45:03] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:52:11] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[04:09:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:20:03] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:25:03] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[04:45:03] <jinxer-wm>	 FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:46:35] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[05:08:54] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:48:27] <wikibugs>	 (PS4) Arnaudb: gerrit: failover cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666)
[05:49:02] <wikibugs>	 (PS1) KartikMistry: Update MinT to 2025-04-09-054213-production [deployment-charts] - https://gerrit.wikimedia.org/r/1136148
[05:51:11] <wikibugs>	 (PS5) Arnaudb: gerrit: failover cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666)
[05:52:11] <wikibugs>	 (CR) Arnaudb: gerrit: failover cookbook (7 comments) [cookbooks] - https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: Arnaudb)
[05:57:04] <wikibugs>	 (PS1) Ayounsi: Host BGP: ignore hosts with no primary IP [software/homer/deploy] - https://gerrit.wikimedia.org/r/1136150
[05:57:41] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:57:46] <wikibugs>	 (CR) CI reject: [V:-1] gerrit: failover cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: Arnaudb)
[05:58:39] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:58:54] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:01:26] <arnaudb>	 gitui
[06:02:07] <wikibugs>	 (PS6) Arnaudb: gerrit: failover cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666)
[06:04:37] <wikibugs>	 (PS1) Ayounsi: magru: remove novaacore/momentum [homer/public] - https://gerrit.wikimedia.org/r/1136152 (https://phabricator.wikimedia.org/T381913)
[06:12:48] <_joe_>	 !log uploaded conftool 5.1.0
[06:12:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:15:08] <moritzm>	 !log installing perl security updates
[06:15:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[06:23:50] <wikibugs>	 (PS1) Muehlenhoff: Add record for jvanderhoop LDAP access [puppet] - https://gerrit.wikimedia.org/r/1136155
[06:26:05] <wikibugs>	 (CR) Slyngshede: [C:+1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/1136155 (owner: Muehlenhoff)
[06:27:50] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add record for jvanderhoop LDAP access [puppet] - https://gerrit.wikimedia.org/r/1136155 (owner: Muehlenhoff)
[06:27:57] <wikibugs>	 ops-codfw, DC-Ops: cr2-codfw: 2/4 PSU down - https://phabricator.wikimedia.org/T391790 (ayounsi) NEW p:Triage→High
[06:35:00] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136104 (https://phabricator.wikimedia.org/T348611) (owner: Anzx)
[06:39:29] <wikibugs>	 (PS1) Muehlenhoff: Track LDAP access for bcampbell804 [puppet] - https://gerrit.wikimedia.org/r/1136247
[06:41:10] <logmsgbot>	 !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply
[06:46:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1043.eqiad.wmnet
[06:47:18] <kart_>	 Testing MinT change, not deploying yet.
[06:48:14] <wikibugs>	 (CR) KartikMistry: [C:+2] Update MinT to 2025-04-09-054213-production [deployment-charts] - https://gerrit.wikimedia.org/r/1136148 (owner: KartikMistry)
[06:48:27] <wikibugs>	 (CR) Slyngshede: [C:+1] Track LDAP access for bcampbell804 [puppet] - https://gerrit.wikimedia.org/r/1136247 (owner: Muehlenhoff)
[06:50:04] <wikibugs>	 (Merged) jenkins-bot: Update MinT to 2025-04-09-054213-production [deployment-charts] - https://gerrit.wikimedia.org/r/1136148 (owner: KartikMistry)
[06:50:29] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1043.eqiad.wmnet
[06:50:54] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Track LDAP access for bcampbell804 [puppet] - https://gerrit.wikimedia.org/r/1136247 (owner: Muehlenhoff)
[06:51:18] <logmsgbot>	 !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[06:52:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc3 T391454', diff saved to https://phabricator.wikimedia.org/P74908 and previous config saved to /var/cache/conftool/dbconfig/20250414-065203-marostegui.json
[06:52:06] <stashbot>	 T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454
[06:52:18] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10737177 (VRiley-WMF) Dell is currently with their level 3 engineers and looking at this ticket. They have laid out this plan of action on this server  "Plan of Action  Apply the latest iDRAC firmware up...
[06:54:15] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti1043.eqiad.wmnet
[06:54:55] <wikibugs>	 (PS1) Marostegui: mariadb: pc2, upgrade to mariadb 10.11 [puppet] - https://gerrit.wikimedia.org/r/1136249 (https://phabricator.wikimedia.org/T391454)
[06:55:10] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance
[06:58:19] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10737181 (Marostegui) Thanks @VRiley-WMF - hopefully the plan is not to upgrade to that latest firmware and then wait again a few months to see exactly the same crash. Can you double check that there are...
[06:58:25] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: pc2, upgrade to mariadb 10.11 [puppet] - https://gerrit.wikimedia.org/r/1136249 (https://phabricator.wikimedia.org/T391454) (owner: Marostegui)
[06:59:31] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1043.eqiad.wmnet
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T0700).
[07:00:05] <jouncebot>	 anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:50] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10737185 (VRiley-WMF) Understood, I will be relaying this information to Dell to inquire if there are additional plans of action. As, I do know we have similar servers with similar configuration (if not...
[07:01:36] <wikibugs>	 ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10737186 (Marostegui) Thank you!
[07:01:47] <moritzm>	 !log installing subversion security updates
[07:01:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:02:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc3 T391454', diff saved to https://phabricator.wikimedia.org/P74909 and previous config saved to /var/cache/conftool/dbconfig/20250414-070220-marostegui.json
[07:02:24] <stashbot>	 T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454
[07:04:29] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[07:04:45] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[07:05:43] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:06:39] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:10:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1044.eqiad.wmnet
[07:13:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1044.eqiad.wmnet
[07:15:19] <wikibugs>	 (PS1) Muehlenhoff: Remove now obsolete Cumin aliases for job runners [puppet] - https://gerrit.wikimedia.org/r/1136253 (https://phabricator.wikimedia.org/T354791)
[07:15:55] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance
[07:16:41] <wikibugs>	 (PS1) Marostegui: mariadb: Upgrade pc4 to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1136254 (https://phabricator.wikimedia.org/T391454)
[07:16:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc4 T391454', diff saved to https://phabricator.wikimedia.org/P74910 and previous config saved to /var/cache/conftool/dbconfig/20250414-071653-marostegui.json
[07:16:56] <stashbot>	 T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454
[07:19:19] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Upgrade pc4 to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1136254 (https://phabricator.wikimedia.org/T391454) (owner: Marostegui)
[07:24:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc4 T391454', diff saved to https://phabricator.wikimedia.org/P74911 and previous config saved to /var/cache/conftool/dbconfig/20250414-072437-marostegui.json
[07:24:42] <stashbot>	 T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454
[07:25:24] <wikibugs>	 (CR) Elukey: [C:+2] services: update proton's docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1135910 (owner: Elukey)
[07:25:31] <elukey>	 jouncebot: nowandnext
[07:25:31] <jouncebot>	 For the next 0 hour(s) and 34 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T0700)
[07:25:31] <jouncebot>	 In 2 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1000)
[07:26:16] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host db1178.eqiad.wmnet with OS bullseye
[07:26:27] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10737233 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1178.eqiad.wmnet with OS bullseye
[07:27:05] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/proton: sync
[07:27:45] <logmsgbot>	 !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: sync
[07:36:19] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/proton: sync
[07:37:11] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[07:37:29] <logmsgbot>	 !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: sync
[07:37:44] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: sync
[07:39:00] <logmsgbot>	 !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: sync
[07:42:54] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti1044.eqiad.wmnet
[07:45:03] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:46:21] <wikibugs>	 (CR) KartikMistry: [C:+2] "Bumping chart so that we can test the T386889" [deployment-charts] - https://gerrit.wikimedia.org/r/1136148 (owner: KartikMistry)
[07:48:10] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1044.eqiad.wmnet
[07:49:10] <wikibugs>	 (PS4) Volans: remote: make RemoteHosts iterable [software/spicerack] - https://gerrit.wikimedia.org/r/803243 (owner: Jbond)
[07:49:10] <wikibugs>	 (PS1) Volans: mysql: make MysqlRemoteHosts iterable [software/spicerack] - https://gerrit.wikimedia.org/r/1136325
[07:49:10] <wikibugs>	 (PS1) Volans: cookbook modules: use docstring for title [software/spicerack] - https://gerrit.wikimedia.org/r/1136326
[07:49:47] <icinga-wm>	 ACKNOWLEDGEMENT - SSH on db1246 is CRITICAL: connect to address 10.64.48.172 and port 22: Connection refused Marostegui Host crashed https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:50:08] <wikibugs>	 (CR) Volans: "Resumed John's CR as I got some request to iterate over RemoteHosts instances. Added tests." [software/spicerack] - https://gerrit.wikimedia.org/r/803243 (owner: Jbond)
[07:50:36] <wikibugs>	 (CR) Volans: "As requested on another CR." [software/spicerack] - https://gerrit.wikimedia.org/r/1136325 (owner: Volans)
[07:52:07] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB memory on db2220 is CRITICAL: CRIT Memory 98% used. Largest process: mysqld (1575) = 97.3% Marostegui https://phabricator.wikimedia.org/T391795 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:53:03] <XioNoX>	 !log gnmic: bump `num-workers` to 12 on netflow1002 - T388641
[07:53:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:06] <stashbot>	 T388641: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641
[07:57:11] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[07:58:28] <moritzm>	 !log rebalance ganeti/B T391243
[07:58:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:31] <stashbot>	 T391243: Configure sandbox vlan on ganeti1043 and 1044 - https://phabricator.wikimedia.org/T391243
[08:00:42] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1175 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:05:00] <wikibugs>	 (CR) Jelto: [C:+1] "looks good to me but I'd prefer a solution which depools Gerrit properly instead of running the sync multiple times. But this could be a l" [cookbooks] - https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: Arnaudb)
[08:08:42] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1175 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[08:11:49] <moritzm>	 !log restarting clamav on vrts to pick up liblzma security updates
[08:11:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:17] <wikibugs>	 (PS1) Slyngshede: IDP: Add dummy secret for Phabricator (test) [labs/private] - https://gerrit.wikimedia.org/r/1136327
[08:16:30] <wikibugs>	 (PS2) Slyngshede: IDP: Add dummy secret for Phabricator (test) [labs/private] - https://gerrit.wikimedia.org/r/1136327 (https://phabricator.wikimedia.org/T377061)
[08:20:03] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:20:47] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1178.eqiad.wmnet with OS bullseye
[08:20:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10737377 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host db1178.eqiad.wmnet with OS bullseye exe...
[08:22:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1178', diff saved to https://phabricator.wikimedia.org/P74912 and previous config saved to /var/cache/conftool/dbconfig/20250414-082235-marostegui.json
[08:23:34] <wikibugs>	 (CR) Slyngshede: [V:+2 C:+1] IDP: Add dummy secret for Phabricator (test) [labs/private] - https://gerrit.wikimedia.org/r/1136327 (https://phabricator.wikimedia.org/T377061) (owner: Slyngshede)
[08:23:36] <wikibugs>	 (CR) Slyngshede: [V:+2 C:+2] IDP: Add dummy secret for Phabricator (test) [labs/private] - https://gerrit.wikimedia.org/r/1136327 (https://phabricator.wikimedia.org/T377061) (owner: Slyngshede)
[08:25:00] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5274/console" [puppet] - https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: Aklapper)
[08:25:03] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[08:25:04] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1178.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[08:26:35] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1178.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[08:26:47] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10737412 (Marostegui) Why was db1178 reimaged? This is a production host that is serving traffic.
[08:26:57] <marostegui>	 VRiley: check -sre please :)
[08:27:24] <fabfur>	 !log disable-puppet on A:cp to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135827 (T391670)
[08:27:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:27:27] <wikibugs>	 (PS1) Muehlenhoff: Failover irc.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1136328
[08:27:27] <stashbot>	 T391670: Staticize haproxy directives from hiera to template - https://phabricator.wikimedia.org/T391670
[08:30:34] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5275/console" [puppet] - https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: Aklapper)
[08:30:43] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2220 - Upgrading host
[08:31:13] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2220 - Upgrading host
[08:31:35] <wikibugs>	 (PS7) Arnaudb: gerrit: failover cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666)
[08:31:42] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp1111.eqiad.wmnet
[08:32:03] <wikibugs>	 (PS8) Arnaudb: gerrit: failover cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666)
[08:32:04] <wikibugs>	 (CR) Fabfur: [C:+2] haproxy: staticize haproxy acls into template [puppet] - https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: Fabfur)
[08:32:04] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2220.codfw.wmnet
[08:33:11] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10737471 (Marostegui) >>! In T377878#10737412, @Marostegui wrote: > Why was db1178 reimaged? This is a production host that is serving traffic....
[08:33:17] <icinga-wm>	 RECOVERY - MariaDB memory on db2220 is OK: OK Memory 58% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[08:33:48] <wikibugs>	 (CR) Elukey: [C:+1] "Ok for me to failover, but I am wondering if it would be better for clients just to re-connect after a restart (rather than failover two t" [dns] - https://gerrit.wikimedia.org/r/1136328 (owner: Muehlenhoff)
[08:34:07] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1178.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[08:35:26] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp1111.eqiad.wmnet
[08:36:43] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10737477 (phaultfinder)
[08:36:44] <wikibugs>	 (CR) Volans: [C:-1] "I think there are 2 logic errors that would make the cookbook not behave as expected but are easily fixable. The rest LGTM." [cookbooks] - https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: Arnaudb)
[08:36:48] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet
[08:38:59] <wikibugs>	 (CR) Muehlenhoff: "It's a good point actually, with ircstream we can just as well simply restart and have them reconnect, the failover is only really needed " [dns] - https://gerrit.wikimedia.org/r/1136328 (owner: Muehlenhoff)
[08:39:27] <logmsgbot>	 !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for db2220.codfw.wmnet
[08:39:31] <moritzm>	 !log restarting ircstream on irc1003, clients will reconnect automatically
[08:39:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:34] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet
[08:40:54] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1178.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[08:41:11] <wikibugs>	 (CR) Kosta Harlan: [C:+1] alertmanager: route T&S tasks to their Slack [puppet] - https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T388542) (owner: Kamila Součková)
[08:41:50] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1178.eqiad.wmnet with OS bullseye
[08:41:57] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10737487 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1178.eqiad.wmnet with OS b...
[08:42:20] <wikibugs>	 (CR) Kosta Harlan: [C:+1] CommonSettings: remove outdated SecurePoll comment [mediawiki-config] - https://gerrit.wikimedia.org/r/1135835 (https://phabricator.wikimedia.org/T209892) (owner: Novem Linguae)
[08:42:29] <wikibugs>	 (Abandoned) Muehlenhoff: Failover irc.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1136328 (owner: Muehlenhoff)
[08:44:20] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5276/console" [puppet] - https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: Aklapper)
[08:45:03] <jinxer-wm>	 FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:45:13] <moritzm>	 !log restart Postfix/Dovecot on outbound MXes to pick up xz security updates
[08:45:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:49] <wikibugs>	 (03CR) 10Elukey: [C:03+1] remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond)
[08:46:34] <fabfur>	 !log enable-puppet on A:cp (T391670)
[08:46:36] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[08:46:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:38] <stashbot>	 T391670: Staticize haproxy directives from hiera to template - https://phabricator.wikimedia.org/T391670
[08:47:15] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10737505 (10hgzh) I'm not really happy that an enwiki discussion 'decided' this for all other projects that now get a notice three days before the change.
[08:47:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74914 and previous config saved to /var/cache/conftool/dbconfig/20250414-084716-root.json
[08:47:24] <wikibugs>	 (03CR) 10Elukey: [C:03+1] mysql: make MysqlRemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136325 (owner: 10Volans)
[08:47:53] <moritzm>	 !log installing Postgres 15 security updates
[08:47:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:20] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136104 (https://phabricator.wikimedia.org/T348611) (owner: 10Anzx)
[08:48:44] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2220 gradually with 4 steps - Finished upgrading host
[08:51:24] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[08:54:01] <wikibugs>	 (03PS10) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805)
[08:57:48] <wikibugs>	 (03Merged) 10jenkins-bot: gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[09:00:31] <XioNoX>	 !log gnmic: bump `num-workers` to 16 on netflow1002 - T388641
[09:00:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:35] <stashbot>	 T388641: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641
[09:02:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P74917 and previous config saved to /var/cache/conftool/dbconfig/20250414-090222-root.json
[09:03:03] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5277/console" [puppet] - 10https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: 10Aklapper)
[09:04:30] <jinxer-wm>	 FIRING: Emergency syslog message: Alert for device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[09:05:44] <wikibugs>	 (03PS3) 10Slyngshede: idp-test: add Phabricator test instance client [puppet] - 10https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: 10Aklapper)
[09:06:14] <wikibugs>	 (03CR) 10Federico Ceratto: "I simplified the change keeping the original handling of Puppet and alerting." [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto)
[09:06:18] <wikibugs>	 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 2 others: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#10737609 (10ABran-WMF) This first iteration is still fairly manual but will give us a stepping stone to build upon.  I'll r...
[09:06:28] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5278/console" [puppet] - 10https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: 10Aklapper)
[09:09:30] <jinxer-wm>	 RESOLVED: Emergency syslog message: Device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[09:11:18] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: enable object storage on one of the replicas [puppet] - 10https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[09:11:51] <wikibugs>	 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10737627 (10Volans) Ack, I can confirm the pages I was having trouble with are now found in search (at the cost of a larger index, I think is around...
[09:14:35] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1135714 (https://phabricator.wikimedia.org/T391577) (owner: 10Federico Ceratto)
[09:15:20] <wikibugs>	 (03PS11) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805)
[09:15:50] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2230.codfw.wmnet
[09:17:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74919 and previous config saved to /var/cache/conftool/dbconfig/20250414-091727-root.json
[09:20:57] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2230.codfw.wmnet
[09:23:38] <wikibugs>	 (03CR) 10Federico Ceratto: "Added support for the test cluster (skipping dbctl completely) and did a full run against db2230" [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto)
[09:24:15] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] pool.py: In dry-run mode do not monitor connection drain [cookbooks] - 10https://gerrit.wikimedia.org/r/1135714 (https://phabricator.wikimedia.org/T391577) (owner: 10Federico Ceratto)
[09:24:19] <wikibugs>	 (03PS1) 10Vgutierrez: sre: Add LibericaUnhealthyRealserverPooled alert [alerts] - 10https://gerrit.wikimedia.org/r/1136334 (https://phabricator.wikimedia.org/T391697)
[09:29:45] <wikibugs>	 (03CR) 10Vgutierrez: sre: Add LibericaUnhealthyRealserverPooled alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1136334 (https://phabricator.wikimedia.org/T391697) (owner: 10Vgutierrez)
[09:31:45] <vgutierrez>	 !log restarting acme-chief to catch up on liblzma updates
[09:31:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P74922 and previous config saved to /var/cache/conftool/dbconfig/20250414-093232-root.json
[09:33:45] <vgutierrez>	 !log restarting acme-chief API servers to catch up on liblzma updates
[09:33:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:27] <icinga-wm>	 PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:34:27] <icinga-wm>	 PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:35:07] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2220 gradually with 4 steps - Finished upgrading host
[09:37:53] <wikibugs>	 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10737757 (10Aklapper) Sounds like this should be set to `declined` status again?
[09:39:56] <wikibugs>	 (03PS1) 10Hashar: CI: diff against parent commit instead of remote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781)
[09:43:02] <wikibugs>	 (03PS1) 10Brouberol: airflow: convert the scheduler liveness/readiness checks to a tcpCheck [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136336 (https://phabricator.wikimedia.org/T391497)
[09:44:39] <wikibugs>	 (03CR) 10CI reject: [V:04-1] airflow: convert the scheduler liveness/readiness checks to a tcpCheck [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136336 (https://phabricator.wikimedia.org/T391497) (owner: 10Brouberol)
[09:45:58] <wikibugs>	 (03PS2) 10Brouberol: airflow: convert the scheduler liveness/readiness checks to a tcpCheck [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136336 (https://phabricator.wikimedia.org/T391497)
[09:47:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P74924 and previous config saved to /var/cache/conftool/dbconfig/20250414-094737-root.json
[09:55:09] <wikibugs>	 (03CR) 10Volans: "I'm a little bit confused as this patch and I4ce9217392a7795940c981e1ee7da52df026cb5c are both performing substantial changes to the same " [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto)
[09:58:56] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[09:59:45] <wikibugs>	 (03CR) 10Marostegui: upgrade.py: Depool, repool, update Phabricator (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto)
[10:00:03] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1000)
[10:00:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc1', diff saved to https://phabricator.wikimedia.org/P74925 and previous config saved to /var/cache/conftool/dbconfig/20250414-100038-marostegui.json
[10:01:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc1', diff saved to https://phabricator.wikimedia.org/P74927 and previous config saved to /var/cache/conftool/dbconfig/20250414-100135-marostegui.json
[10:02:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74928 and previous config saved to /var/cache/conftool/dbconfig/20250414-100242-root.json
[10:04:05] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1157.eqiad.wmnet with reason: Maintenance
[10:04:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T391056)', diff saved to https://phabricator.wikimedia.org/P74929 and previous config saved to /var/cache/conftool/dbconfig/20250414-100412-fceratto.json
[10:04:16] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[10:04:40] <wikibugs>	 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10737856 (10A_smart_kitten) >>! In T332220#10737757, @Aklapper wrote: > Sounds like this should be set to `declined` status again?  Would `stalled` on a reply be better? As it sounds like acquiring...
[10:05:40] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10737860 (10Ladsgroup) This is not really because of English Wikipedia. This has been requested many many times by many communities. For example:  - Engli...
[10:08:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T391056)', diff saved to https://phabricator.wikimedia.org/P74930 and previous config saved to /var/cache/conftool/dbconfig/20250414-100809-fceratto.json
[10:09:09] <Amir1>	 jouncebot: nowandnext
[10:09:09] <jouncebot>	 For the next 0 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1000)
[10:09:10] <jouncebot>	 In 1 hour(s) and 50 minute(s): Wikifunctions MediaWiki integration backport (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1200)
[10:09:20] <Amir1>	 nothing is happening on infra side?
[10:11:09] <wikibugs>	 (03PS1) 10Ladsgroup: Bump thumbnail steps to 90% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136339 (https://phabricator.wikimedia.org/T360589)
[10:12:06] <wikibugs>	 (03PS1) 10Ayounsi: gNMIc: bump num-workers to 16 [puppet] - 10https://gerrit.wikimedia.org/r/1136341 (https://phabricator.wikimedia.org/T388641)
[10:13:48] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136339 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[10:13:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135835 (https://phabricator.wikimedia.org/T209892) (owner: 10Novem Linguae)
[10:14:56] <wikibugs>	 (03Merged) 10jenkins-bot: Bump thumbnail steps to 90% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136339 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup)
[10:15:01] <wikibugs>	 (03Merged) 10jenkins-bot: CommonSettings: remove outdated SecurePoll comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135835 (https://phabricator.wikimedia.org/T209892) (owner: 10Novem Linguae)
[10:15:36] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136339|Bump thumbnail steps to 90% (T360589)]], [[gerrit:1135835|CommonSettings: remove outdated SecurePoll comment (T209892)]]
[10:15:39] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[10:15:39] <stashbot>	 T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892
[10:17:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P74931 and previous config saved to /var/cache/conftool/dbconfig/20250414-101748-root.json
[10:19:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[10:20:57] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] gNMIc: bump num-workers to 16 [puppet] - 10https://gerrit.wikimedia.org/r/1136341 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[10:21:24] <wikibugs>	 (03Abandoned) 10Clément Goubert: scap::scripts: Add mwscript-mwcron wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1135912 (https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert)
[10:22:26] <wikibugs>	 (03PS2) 10Clément Goubert: mw:periodic_jobs: Add mw-cron boilerplate [puppet] - 10https://gerrit.wikimedia.org/r/1135936 (https://phabricator.wikimedia.org/T341555)
[10:22:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: sre: Add LibericaUnhealthyRealserverPooled alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1136334 (https://phabricator.wikimedia.org/T391697) (owner: 10Vgutierrez)
[10:23:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P74932 and previous config saved to /var/cache/conftool/dbconfig/20250414-102316-fceratto.json
[10:24:23] <wikibugs>	 (03PS1) 10Jelto: ceph: move apus_keys to ceph folder [labs/private] - 10https://gerrit.wikimedia.org/r/1136344 (https://phabricator.wikimedia.org/T378922)
[10:26:52] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] gNMIc: bump num-workers to 16 [puppet] - 10https://gerrit.wikimedia.org/r/1136341 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi)
[10:31:17] <wikibugs>	 (03CR) 10MVernon: "check experimental" [labs/private] - 10https://gerrit.wikimedia.org/r/1136344 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[10:32:18] <wikibugs>	 (03CR) 10Vgutierrez: sre: Add LibericaUnhealthyRealserverPooled alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1136334 (https://phabricator.wikimedia.org/T391697) (owner: 10Vgutierrez)
[10:32:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74933 and previous config saved to /var/cache/conftool/dbconfig/20250414-103253-root.json
[10:35:38] <wikibugs>	 (03CR) 10Slyngshede: [V:03+1 C:03+2] idp-test: add Phabricator test instance client [puppet] - 10https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: 10Aklapper)
[10:35:55] <vgutierrez>	 !log upload varnish 7.1.1-1.1~bpo11+wmf3 to apt.wm.o (bullseye-wikimedia) - T391334
[10:35:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:58] <stashbot>	 T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334
[10:37:41] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mw-cron: Add statsd-exporter release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135929 (https://phabricator.wikimedia.org/T391672) (owner: 10Clément Goubert)
[10:39:05] <wikibugs>	 (03CR) 10Elukey: [C:03+1] "Nice I like it!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136326 (owner: 10Volans)
[10:39:09] <wikibugs>	 (03Merged) 10jenkins-bot: mw-cron: Add statsd-exporter release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135929 (https://phabricator.wikimedia.org/T391672) (owner: 10Clément Goubert)
[10:39:36] <Amir1>	 claime: hii, sorry to bother, let me know when I can do a scap :D
[10:39:47] <claime>	 Amir1: gimme a couple minutes
[10:40:01] <Amir1>	 no worries. Thanks!
[10:40:03] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply
[10:40:14] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply
[10:40:27] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[10:40:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:40:36] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[10:40:45] <wikibugs>	 (03PS1) 10Hnowlan: rest-gateway: add mobileapps/PCS endpoints that don't use internal cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033)
[10:41:28] <claime>	 Amir1: you can go ahead
[10:41:33] <Amir1>	 <3
[10:41:51] <logmsgbot>	 !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136339|Bump thumbnail steps to 90% (T360589)]], [[gerrit:1135835|CommonSettings: remove outdated SecurePoll comment (T209892)]]
[10:41:55] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[10:41:56] <stashbot>	 T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892
[10:42:00] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox
[10:43:25] <wikibugs>	 (03PS3) 10Federico Ceratto: Add zarcillo (aux k8s) CNAME [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212)
[10:43:29] <vgutierrez>	 !log rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in ulsfo - T391334
[10:43:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:32] <stashbot>	 T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334
[10:43:53] <wikibugs>	 (03PS1) 10Hnowlan: rest-gateway: correct mobileapps path ordering when revision is used [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136347 (https://phabricator.wikimedia.org/T264670)
[10:44:19] <logmsgbot>	 !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough
[10:45:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:46:49] <wikibugs>	 (03PS4) 10Federico Ceratto: Add zarcillo (aux k8s) CNAME [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212)
[10:47:06] <wikibugs>	 (03CR) 10Federico Ceratto: "Added to codfw as well." [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[10:47:22] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup, novemlinguae: Backport for [[gerrit:1136339|Bump thumbnail steps to 90% (T360589)]], [[gerrit:1135835|CommonSettings: remove outdated SecurePoll comment (T209892)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:47:26] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[10:47:27] <stashbot>	 T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892
[10:47:28] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Updating docker-pkg to 4.0.4 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1134727 (owner: 10Elukey)
[10:47:42] <wikibugs>	 (03CR) 10Federico Ceratto: "Updated" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[10:47:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P74935 and previous config saved to /var/cache/conftool/dbconfig/20250414-104758-root.json
[10:48:00] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_ulsfo
[10:48:33] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_ulsfo and not P{cp4047.ulsfo.wmnet} and A:cp
[10:49:43] <logmsgbot>	 !log ladsgroup@deploy1003 ladsgroup, novemlinguae: Continuing with sync
[10:50:35] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] Add namespace for zarcillo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[10:50:37] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] Add namespace for zarcillo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[10:51:19] <icinga-wm>	 PROBLEM - OSPF status on cloudsw2-d5-eqiad.mgmt is CRITICAL: OSPFv2: 1/1 UP : OSPFv3: 0/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:51:43] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:52:57] <logmsgbot>	 !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=1) rolling upgrade of Varnish on A:cp-upload_ulsfo and not P{cp4047.ulsfo.wmnet} and A:cp
[10:53:03] <logmsgbot>	 !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=1) rolling upgrade of Varnish on A:cp-text_ulsfo
[10:53:30] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T391056)', diff saved to https://phabricator.wikimedia.org/P74936 and previous config saved to /var/cache/conftool/dbconfig/20250414-105329-fceratto.json
[10:53:34] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[10:53:45] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[10:53:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T391056)', diff saved to https://phabricator.wikimedia.org/P74937 and previous config saved to /var/cache/conftool/dbconfig/20250414-105351-fceratto.json
[10:57:10] <wikibugs>	 (03CR) 10Jgiannelos: [C:03+1] rest-gateway: correct mobileapps path ordering when revision is used [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136347 (https://phabricator.wikimedia.org/T264670) (owner: 10Hnowlan)
[10:57:31] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough
[10:57:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T391056)', diff saved to https://phabricator.wikimedia.org/P74938 and previous config saved to /var/cache/conftool/dbconfig/20250414-105741-fceratto.json
[10:58:03] <wikibugs>	 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10738065 (10Ladsgroup) eqiad containers are much bigger and it'll take way more time to clean them. 24 days have passed and only roughly 30% have been removed from 0x containers. Now...
[10:59:18] <logmsgbot>	 !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136339|Bump thumbnail steps to 90% (T360589)]], [[gerrit:1135835|CommonSettings: remove outdated SecurePoll comment (T209892)]] (duration: 17m 26s)
[10:59:22] <stashbot>	 T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589
[10:59:23] <stashbot>	 T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892
[11:01:09] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10738083 (10hgzh) Thanks for the links, most of the requests are based on a local discussion and also the global ones seem to come mainly from individual...
[11:01:22] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] Add zarcillo (aux k8s) CNAME [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[11:03:14] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] Remove now obsolete Cumin aliases for job runners [puppet] - 10https://gerrit.wikimedia.org/r/1136253 (https://phabricator.wikimedia.org/T354791) (owner: 10Muehlenhoff)
[11:09:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[11:10:02] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] mw:periodic_jobs: Add mw-cron boilerplate [puppet] - 10https://gerrit.wikimedia.org/r/1135936 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[11:11:37] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Enable SUL3 on most remaining beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135850
[11:12:03] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] mw:periodic_jobs: Add mw-cron boilerplate [puppet] - 10https://gerrit.wikimedia.org/r/1135936 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert)
[11:12:05] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: Clean up obsolete SUL3 settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135851
[11:12:27] <moritzm>	 !log restart spamassassin on lists* to pick up Perl security updates
[11:12:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:12:34] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C:04-1] "Needs to wait for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1135964 to be deployed now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135851 (owner: 10Bartosz Dziewoński)
[11:12:48] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P74939 and previous config saved to /var/cache/conftool/dbconfig/20250414-111247-fceratto.json
[11:17:07] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] rest-gateway: correct mobileapps path ordering when revision is used [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136347 (https://phabricator.wikimedia.org/T264670) (owner: 10Hnowlan)
[11:18:55] <wikibugs>	 (03Merged) 10jenkins-bot: rest-gateway: correct mobileapps path ordering when revision is used [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136347 (https://phabricator.wikimedia.org/T264670) (owner: 10Hnowlan)
[11:19:01] <logmsgbot>	 !log fceratto@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[11:19:03] <logmsgbot>	 !log fceratto@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[11:19:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove now obsolete Cumin aliases for job runners [puppet] - 10https://gerrit.wikimedia.org/r/1136253 (https://phabricator.wikimedia.org/T354791) (owner: 10Muehlenhoff)
[11:20:29] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[11:20:35] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[11:24:13] <vgutierrez>	 !log upload varnishkafka 1.2.0-3 to apt.wm.o (bullseye-wikimedia) - T391334
[11:24:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:17] <stashbot>	 T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334
[11:24:20] <wikibugs>	 (03CR) 10Hashar: "Indeed for the release Jenkins, there is no service defined in Puppet.  The systemd unit is installed by the Debian package which is itsel" [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn)
[11:24:44] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[11:25:00] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[11:25:02] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp4037.ulsfo.wmnet} and A:cp
[11:25:04] <wikibugs>	 (03PS1) 10Brouberol: airflow-test-k8s: adjust dag/file processing timeout to account for large v1 dumps dags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136351 (https://phabricator.wikimedia.org/T391744)
[11:25:08] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[11:25:17] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp4045.ulsfo.wmnet} and A:cp
[11:26:18] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[11:26:31] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[11:26:41] <wikibugs>	 (03CR) 10Jelto: [V:03+2 C:03+2] ceph: move apus_keys to ceph folder [labs/private] - 10https://gerrit.wikimedia.org/r/1136344 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[11:27:25] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'.
[11:27:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P74940 and previous config saved to /var/cache/conftool/dbconfig/20250414-112754-fceratto.json
[11:28:37] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'.
[11:29:21] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[11:29:35] <wikibugs>	 (03PS1) 10Clément Goubert: team-sre/mw-cron: Fix logstash link [alerts] - 10https://gerrit.wikimedia.org/r/1136352
[11:29:44] <logmsgbot>	 !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[11:30:08] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp4037.ulsfo.wmnet} and A:cp
[11:30:11] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+2] alertmanager: route T&S tasks to their Slack [puppet] - 10https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková)
[11:30:18] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp4045.ulsfo.wmnet} and A:cp
[11:33:46] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] team-sre/mw-cron: Fix logstash link [alerts] - 10https://gerrit.wikimedia.org/r/1136352 (owner: 10Clément Goubert)
[11:34:06] <wikibugs>	 (03CR) 10Jelto: [V:03+2 C:03+2] "this did not solve the issue, `Function lookup() did not find a value for the name 'profile::ceph::s3::client::apus_keys'`" [labs/private] - 10https://gerrit.wikimedia.org/r/1136344 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[11:37:04] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] services: enable ingress for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 (owner: 10Elukey)
[11:37:57] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:38:49] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:40:28] <logmsgbot>	 !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox
[11:42:11] <jinxer-wm>	 FIRING: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[11:42:39] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "Sure, SGTM. I don't have a strong opinion either way." [debs/helm3] (helm311) - 10https://gerrit.wikimedia.org/r/1135412 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto)
[11:43:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T391056)', diff saved to https://phabricator.wikimedia.org/P74941 and previous config saved to /var/cache/conftool/dbconfig/20250414-114300-fceratto.json
[11:43:04] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[11:43:16] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[11:43:24] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T391056)', diff saved to https://phabricator.wikimedia.org/P74942 and previous config saved to /var/cache/conftool/dbconfig/20250414-114323-fceratto.json
[11:45:03] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:46:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Create insetup role for ServiceOps with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1133927 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff)
[11:47:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T391056)', diff saved to https://phabricator.wikimedia.org/P74943 and previous config saved to /var/cache/conftool/dbconfig/20250414-114711-fceratto.json
[11:47:35] <kart_>	 OK to deploy cxserver/MinT?
[11:49:35] <wikibugs>	 (03CR) 10Jaime Nuche: "> We should run scap to deploy Jenkins+plugins on the new host that is erroring out." [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn)
[11:55:10] <wikibugs>	 (03CR) 10Volans: [C:03+2] remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond)
[11:55:59] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] team-sre/mw-cron: Fix logstash link [alerts] - 10https://gerrit.wikimedia.org/r/1136352 (owner: 10Clément Goubert)
[11:57:14] <wikibugs>	 (03Merged) 10jenkins-bot: team-sre/mw-cron: Fix logstash link [alerts] - 10https://gerrit.wikimedia.org/r/1136352 (owner: 10Clément Goubert)
[12:00:04] <jouncebot>	 James_F: OwO what's this, a deployment window?? Wikifunctions MediaWiki integration backport. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1200). nyaa~
[12:00:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136049 (https://phabricator.wikimedia.org/T391594) (owner: 10Jforrester)
[12:00:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136050 (owner: 10Jforrester)
[12:00:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136051 (https://phabricator.wikimedia.org/T391441) (owner: 10Jforrester)
[12:00:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136126 (owner: 10Jforrester)
[12:01:02] <wikibugs>	 (03PS1) 10Jelto: gitlab: create an alias for apus credentials [puppet] - 10https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922)
[12:01:17] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: pc3 on pc2013 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1539, Errmsg: Error Unknown event wmf_slave_overload on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:01:40] <wikibugs>	 (03Merged) 10jenkins-bot: logging: Allow through WikiLambdaClient logs at info level and above [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136126 (owner: 10Jforrester)
[12:02:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P74944 and previous config saved to /var/cache/conftool/dbconfig/20250414-120219-fceratto.json
[12:02:55] <wikibugs>	 (03PS1) 10Filippo Giunchedi: profile: fix prometheus cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1136360
[12:02:58] <wikibugs>	 (03CR) 10Volans: "post-merge reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[12:03:03] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gitlab: create an alias for apus credentials [puppet] - 10https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[12:03:17] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Make choice of firewall stack in insetup roles specific / Add nftables variants - https://phabricator.wikimedia.org/T389825#10738270 (10MoritzMuehlenhoff) 05Open→03Resolved Separate insetup roles have been created and an announcement was sent t...
[12:03:27] <wikibugs>	 (03Merged) 10jenkins-bot: Special pages: Don't just set userCanExecute() but actually run it [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136049 (https://phabricator.wikimedia.org/T391594) (owner: 10Jforrester)
[12:03:39] <jinxer-wm>	 RESOLVED: [2x] DatasourceNoData: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[12:05:06] <wikibugs>	 (03Merged) 10jenkins-bot: Client mode: Provide WikiLambdaClientModeOffline for SRE to disable [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136050 (owner: 10Jforrester)
[12:05:56] <wikibugs>	 (03Merged) 10jenkins-bot: Wikifunctions VE: Add loading and abort state to content editable [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136051 (https://phabricator.wikimedia.org/T391441) (owner: 10Jforrester)
[12:06:13] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1136049|Special pages: Don't just set userCanExecute() but actually run it (T391594)]], [[gerrit:1136050|Client mode: Provide WikiLambdaClientModeOffline for SRE to disable]], [[gerrit:1136051|Wikifunctions VE: Add loading and abort state to content editable (T391441)]], [[gerrit:1136126|logging: Allow through WikiLambdaClient logs at info level and above]]
[12:06:18] <stashbot>	 T391594: PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Extension\WikiLambda\ZObjectStore::fetchZObjectLabel: [1146] Table 'test2wiki.wikilamb - https://phabricator.wikimedia.org/T391594
[12:06:18] <stashbot>	 T391441: [VE WikifunctionsCall]: Adapt function call editing to new parsoid version - https://phabricator.wikimedia.org/T391441
[12:06:29] <wikibugs>	 (03Merged) 10jenkins-bot: remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond)
[12:06:56] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: failover cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[12:09:03] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: pc3 on pc2013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 616.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:09:06] <marostegui>	 On it
[12:11:16] <wikibugs>	 (03PS2) 10Hnowlan: jobrunner: clean up remaining cruft [puppet] - 10https://gerrit.wikimedia.org/r/1135465
[12:11:17] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: pc3 on pc2013 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:12:03] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: pc3 on pc2013 is OK: OK slave_sql_lag Replication lag: 0.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:13:55] <wikibugs>	 (03PS1) 10Arnaudb: gerrit: failover cookbook fix [cookbooks] - 10https://gerrit.wikimedia.org/r/1136361 (https://phabricator.wikimedia.org/T260666)
[12:13:55] <wikibugs>	 (03CR) 10Arnaudb: "missed that fix in the previous merge" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136361 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[12:14:28] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10738327 (10Ladsgroup) >>! In T355914#10738083, @hgzh wrote: > Thanks for the links, most of the requests are based on a local discussion and also the glo...
[12:15:27] <wikibugs>	 (03PS2) 10Brouberol: airflow-test-k8s: adjust dag/file processing timeout to account for large v1 dumps dags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136351 (https://phabricator.wikimedia.org/T391744)
[12:17:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P74945 and previous config saved to /var/cache/conftool/dbconfig/20250414-121726-fceratto.json
[12:18:20] <wikibugs>	 (03CR) 10Hashar: "Ah excellent thank you! :)" [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn)
[12:19:06] <wikibugs>	 (03PS4) 10Hnowlan: mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782)
[12:19:09] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] python-webapp: Update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135432 (https://phabricator.wikimedia.org/T384212) (owner: 10Clément Goubert)
[12:19:11] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136366
[12:19:12] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: failover cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[12:19:32] <wikibugs>	 (03Abandoned) 10Hashar: ci: switch jenkins deployment method on contint to scap [puppet] - 10https://gerrit.wikimedia.org/r/1136039 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn)
[12:20:55] <wikibugs>	 (03Merged) 10jenkins-bot: python-webapp: Update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135432 (https://phabricator.wikimedia.org/T384212) (owner: 10Clément Goubert)
[12:22:19] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[12:22:56] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[12:23:30] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' .
[12:23:39] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:24:24] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:24:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391644#10738390 (10phaultfinder)
[12:24:45] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:24:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1136360 (owner: 10Filippo Giunchedi)
[12:25:13] <logmsgbot>	 !log cgoubert@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:25:50] <wikibugs>	 (03PS1) 10Jforrester: Complete our RecentChanges entry generation and formatting [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136368 (https://phabricator.wikimedia.org/T386020)
[12:28:24] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] profile: fix prometheus cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1136360 (owner: 10Filippo Giunchedi)
[12:28:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[12:29:09] <wikibugs>	 10SRE-swift-storage, 10CX-deployments, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10738399 (10Nikerabbit)
[12:30:31] <wikibugs>	 (03CR) 10Volans: sanitarium_restart.py: restart Sanitarium hosts (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto)
[12:30:52] <wikibugs>	 (03Abandoned) 10AOkoth: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth)
[12:31:29] <wikibugs>	 (03CR) 10Volans: [C:03+1] "LGTM, thx" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136361 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[12:31:47] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: failover cookbook fix [cookbooks] - 10https://gerrit.wikimedia.org/r/1136361 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[12:31:54] <wikibugs>	 (03CR) 10Effie Mouzeli: "LGTM! one question, do we still need them defined in common.yaml (under wikimedia_clusters)?" [puppet] - 10https://gerrit.wikimedia.org/r/1135465 (owner: 10Hnowlan)
[12:31:57] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] jobrunner: clean up remaining cruft [puppet] - 10https://gerrit.wikimedia.org/r/1135465 (owner: 10Hnowlan)
[12:32:21] <wikibugs>	 (03CR) 10Marostegui: sanitarium_restart.py: restart Sanitarium hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto)
[12:32:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T391056)', diff saved to https://phabricator.wikimedia.org/P74946 and previous config saved to /var/cache/conftool/dbconfig/20250414-123234-fceratto.json
[12:32:38] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[12:32:50] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[12:32:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T391056)', diff saved to https://phabricator.wikimedia.org/P74947 and previous config saved to /var/cache/conftool/dbconfig/20250414-123255-fceratto.json
[12:36:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T391056)', diff saved to https://phabricator.wikimedia.org/P74948 and previous config saved to /var/cache/conftool/dbconfig/20250414-123649-fceratto.json
[12:36:52] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1136049|Special pages: Don't just set userCanExecute() but actually run it (T391594)]], [[gerrit:1136050|Client mode: Provide WikiLambdaClientModeOffline for SRE to disable]], [[gerrit:1136051|Wikifunctions VE: Add loading and abort state to content editable (T391441)]], [[gerrit:1136126|logging: Allow through WikiLambdaClient logs at info level and above]]
[12:36:56] <stashbot>	 T391594: PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Extension\WikiLambda\ZObjectStore::fetchZObjectLabel: [1146] Table 'test2wiki.wikilamb - https://phabricator.wikimedia.org/T391594
[12:36:57] <stashbot>	 T391441: [VE WikifunctionsCall]: Adapt function call editing to new parsoid version - https://phabricator.wikimedia.org/T391441
[12:36:59] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lena Meintrup - https://phabricator.wikimedia.org/T391820 (10Lena_WMDE) 03NEW
[12:37:16] <wikibugs>	 (03PS3) 10KartikMistry: MinT: Increase liveness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125093 (https://phabricator.wikimedia.org/T386889)
[12:38:56] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lena Meintrup - https://phabricator.wikimedia.org/T391820#10738496 (10Lena_WMDE)
[12:39:40] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lena Meintrup - https://phabricator.wikimedia.org/T391820#10738498 (10WMDE-leszek) On WMDE's behalf I approve this request, and confirm @Lena_WMDE is who she claims to be.
[12:40:18] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lena Meintrup - https://phabricator.wikimedia.org/T391820#10738499 (10Lena_WMDE)
[12:40:21] <wikibugs>	 (03CR) 10Volans: sanitarium_restart.py: restart Sanitarium hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto)
[12:42:25] <wikibugs>	 (03CR) 10Marostegui: sanitarium_restart.py: restart Sanitarium hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto)
[12:43:45] <moritzm>	 !log remove ganeti01.svc.eqsin.wmnet cert (replaced by cfssl cert) T357750
[12:43:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:50] <stashbot>	 T357750: Phase out cergen - https://phabricator.wikimedia.org/T357750
[12:44:24] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1136049|Special pages: Don't just set userCanExecute() but actually run it (T391594)]], [[gerrit:1136050|Client mode: Provide WikiLambdaClientModeOffline for SRE to disable]], [[gerrit:1136051|Wikifunctions VE: Add loading and abort state to content editable (T391441)]], [[gerrit:1136126|logging: Allow through WikiLambdaClient logs at info level and above]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:44:29] <stashbot>	 T391594: PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Extension\WikiLambda\ZObjectStore::fetchZObjectLabel: [1146] Table 'test2wiki.wikilamb - https://phabricator.wikimedia.org/T391594
[12:44:29] <stashbot>	 T391441: [VE WikifunctionsCall]: Adapt function call editing to new parsoid version - https://phabricator.wikimedia.org/T391441
[12:44:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391644#10738536 (10phaultfinder)
[12:46:05] <moritzm>	 !log remove ganeti01.svc.ulsfo.wmnet cert (replaced by cfssl cert) T357750
[12:46:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:22] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Continuing with sync
[12:48:04] <wikibugs>	 (03PS2) 10Hashar: ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430)
[12:48:27] <wikibugs>	 (03CR) 10CI reject: [V:04-1] ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar)
[12:48:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:48:44] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:48:53] <jinxer-wm>	 FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[12:48:53] <wikibugs>	 (03CR) 10Hashar: "From https://gerrit.wikimedia.org/r/c/operations/puppet/+/676008/comment/ccc95a45_71fd7099/ , the logic is shared with production and SRE " [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar)
[12:49:37] <moritzm>	 !log remove ganeti01.svc.esams.wmnet cert (replaced by cfssl cert) T357750
[12:49:40] <wikibugs>	 (03PS3) 10Hashar: ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430)
[12:49:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:49:41] <stashbot>	 T357750: Phase out cergen - https://phabricator.wikimedia.org/T357750
[12:50:45] <godog>	 !log upgrade prometheus2005 to thanos 0.38.0 - T383966
[12:50:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:49] <stashbot>	 T383966: Upgrade Thanos to 0.38.0 - https://phabricator.wikimedia.org/T383966
[12:51:34] <godog>	 !log upgrade prometheus2007 to thanos 0.38.0 - T383966
[12:51:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:51:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P74949 and previous config saved to /var/cache/conftool/dbconfig/20250414-125156-fceratto.json
[12:53:06] <moritzm>	 !log remove ganeti01.svc.codfw.wmnet cert (replaced by cfssl cert) T357750
[12:53:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:53:39] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:54:51] <wikibugs>	 (03PS2) 10Jelto: gitlab: create an alias for apus credentials [puppet] - 10https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922)
[12:55:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc5 T391454', diff saved to https://phabricator.wikimedia.org/P74950 and previous config saved to /var/cache/conftool/dbconfig/20250414-125511-marostegui.json
[12:55:15] <stashbot>	 T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454
[12:56:14] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_ulsfo and not P{cp4047.ulsfo.wmnet} and not P{cp4045.ulsfo.wmnet} and A:cp
[12:56:20] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136049|Special pages: Don't just set userCanExecute() but actually run it (T391594)]], [[gerrit:1136050|Client mode: Provide WikiLambdaClientModeOffline for SRE to disable]], [[gerrit:1136051|Wikifunctions VE: Add loading and abort state to content editable (T391441)]], [[gerrit:1136126|logging: Allow through WikiLambdaClient logs at info level and above]] (duration: 19m 27s)
[12:56:21] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance
[12:56:22] <James_F>	 Just in time for the backport window.
[12:56:24] <stashbot>	 T391594: PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Extension\WikiLambda\ZObjectStore::fetchZObjectLabel: [1146] Table 'test2wiki.wikilamb - https://phabricator.wikimedia.org/T391594
[12:56:24] <stashbot>	 T391441: [VE WikifunctionsCall]: Adapt function call editing to new parsoid version - https://phabricator.wikimedia.org/T391441
[12:56:34] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_ulsfo and not P{cp4037.ulsfo.wmnet} and A:cp
[12:56:58] <wikibugs>	 (03PS1) 10Marostegui: mariadb: pc5 upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1136373 (https://phabricator.wikimedia.org/T391454)
[12:57:08] <wikibugs>	 (03CR) 10CI reject: [V:04-1] gitlab: create an alias for apus credentials [puppet] - 10https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[12:58:39] <jinxer-wm>	 FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[12:58:42] <wikibugs>	 (03CR) 10Marostegui: [C:03+2] mariadb: pc5 upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1136373 (https://phabricator.wikimedia.org/T391454) (owner: 10Marostegui)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1300).
[13:00:05] <jouncebot>	 MatmaRex and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:16] <anzx>	 o/
[13:00:21] <Lucas_WMDE>	 I can’t deploy today, forgot to bring my yubikey to the office 😔
[13:00:35] <MatmaRex>	 hi
[13:00:58] <moritzm>	 !log remove ganeti01.svc.eqiad.wmnet cert (replaced by cfssl cert) T357750
[13:01:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:01:02] <stashbot>	 T357750: Phase out cergen - https://phabricator.wikimedia.org/T357750
[13:01:22] <wikibugs>	 (03CR) 10Majavah: Add wmcs-bastionless utility script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1118526 (https://phabricator.wikimedia.org/T379550) (owner: 10Andrew Bogott)
[13:02:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc5 T391454', diff saved to https://phabricator.wikimedia.org/P74951 and previous config saved to /var/cache/conftool/dbconfig/20250414-130222-marostegui.json
[13:02:26] <stashbot>	 T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454
[13:02:57] <wikibugs>	 (03PS1) 10Volans: sre.hosts.reimage: check dbctl [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878)
[13:03:06] * TheresNoTime can deploy
[13:03:11] <Lucas_WMDE>	 \o/
[13:03:25] <TheresNoTime>	 MatmaRex: starting with your CA patch
[13:03:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135993 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński)
[13:03:59] <MatmaRex>	 thanks
[13:04:29] <MatmaRex>	 nothing to test on mwdebug here, we don't have a way to reproduce these failures
[13:04:42] <TheresNoTime>	 ack
[13:06:03] <wikibugs>	 (03CR) 10Elukey: [V:03+2 C:03+2] Updating docker-pkg to 4.0.4 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1134727 (owner: 10Elukey)
[13:07:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P74952 and previous config saved to /var/cache/conftool/dbconfig/20250414-130703-fceratto.json
[13:08:39] <jinxer-wm>	 RESOLVED: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:08:42] <wikibugs>	 (03CR) 10Volans: "Tested with test-cookbook in dry-run on a pooled host:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: 10Volans)
[13:09:53] <wikibugs>	 (03PS1) 10Ssingh: hiera: durum: add dummy ECH private key [labs/private] - 10https://gerrit.wikimedia.org/r/1136376 (https://phabricator.wikimedia.org/T205378)
[13:10:18] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove from list of approvers [puppet] - 10https://gerrit.wikimedia.org/r/1136377
[13:11:03] <wikibugs>	 (03CR) 10Ssingh: [V:03+2 C:03+2] hiera: durum: add dummy ECH private key [labs/private] - 10https://gerrit.wikimedia.org/r/1136376 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[13:13:02] <godog>	 !log remove old LVs from prometheus[12]00[56] - T383232
[13:13:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:13:05] <stashbot>	 T383232: Move k8s Prometheus instances to new Prometheus hw in eqiad/codfw - https://phabricator.wikimedia.org/T383232
[13:13:14] <logmsgbot>	 !log elukey@deploy1003 Started deploy [docker-pkg/deploy@a555b7b]: Upgrade to 4.0.4
[13:13:29] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10738719 (10hgzh) I tried an onwiki answer, so thank you for the reply here. But IMO this could have been announced earlier and more detailed, keeping in...
[13:13:47] <logmsgbot>	 !log elukey@deploy1003 Finished deploy [docker-pkg/deploy@a555b7b]: Upgrade to 4.0.4 (duration: 00m 38s)
[13:14:33] <wikibugs>	 (03Merged) 10jenkins-bot: CentralAuthTokenManager: Log failures for write operations [extensions/CentralAuth] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135993 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński)
[13:14:52] <logmsgbot>	 !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1135993|CentralAuthTokenManager: Log failures for write operations (T390784)]]
[13:14:56] <stashbot>	 T390784: Error when logging-in on auth.wikimedia.org: "The provided authentication token is either expired or invalid." - https://phabricator.wikimedia.org/T390784
[13:15:10] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: 10Volans)
[13:16:55] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_magru
[13:17:06] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_magru
[13:17:07] <wikibugs>	 (03CR) 10Elukey: "LGTM, I left a nit about the log message, the rest looks good and safer." [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: 10Volans)
[13:17:25] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - 10https://gerrit.wikimedia.org/r/1135746 (owner: 10Elukey)
[13:17:30] <vgutierrez>	 !log rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in magru - T391334
[13:17:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:17:33] <stashbot>	 T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334
[13:18:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from cirrussearch2014 to cirrussearch2104
[13:18:25] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10738759 (10Papaul) @VRiley-WMF yes it is OK to apply 7.20 to the server. My personal opinion: I don't think applying this latest IDRAC upgrade to the server will provide us with any information then wha...
[13:18:27] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[13:19:47] <logmsgbot>	 !log samtar@deploy1003 samtar, matmarex: Backport for [[gerrit:1135993|CentralAuthTokenManager: Log failures for write operations (T390784)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:19:49] <logmsgbot>	 !log samtar@deploy1003 samtar, matmarex: Continuing with sync
[13:19:58] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Release for conftool 5.1.0 [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1136378
[13:20:58] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Release for conftool 5.1.0 [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1136378 (owner: 10Giuseppe Lavagetto)
[13:21:29] <logmsgbot>	 !log oblivian@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Compatibility with conftool 5.1.0 - oblivian@cumin2002"
[13:21:31] <logmsgbot>	 !log oblivian@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Compatibility with conftool 5.1.0 - oblivian@cumin2002
[13:22:01] <logmsgbot>	 !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Compatibility with conftool 5.1.0 - oblivian@cumin2002
[13:22:03] <logmsgbot>	 !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Compatibility with conftool 5.1.0 - oblivian@cumin2002"
[13:22:11] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T391056)', diff saved to https://phabricator.wikimedia.org/P74953 and previous config saved to /var/cache/conftool/dbconfig/20250414-132210-fceratto.json
[13:22:14] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[13:22:26] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[13:22:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T391056)', diff saved to https://phabricator.wikimedia.org/P74954 and previous config saved to /var/cache/conftool/dbconfig/20250414-132232-fceratto.json
[13:22:34] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming cirrussearch2014 to cirrussearch2104 - bking@cumin2002"
[13:22:55] <logmsgbot>	 !log oblivian@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Compatibility with conftool 5.1.0 (take 2) - oblivian@cumin2002"
[13:22:59] <logmsgbot>	 !log oblivian@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Compatibility with conftool 5.1.0 (take 2) - oblivian@cumin2002
[13:23:35] <logmsgbot>	 !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Compatibility with conftool 5.1.0 (take 2) - oblivian@cumin2002
[13:23:37] <logmsgbot>	 !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Compatibility with conftool 5.1.0 (take 2) - oblivian@cumin2002"
[13:24:57] <icinga-wm>	 RECOVERY - Juniper alarms on cr2-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[13:25:18] <wikibugs>	 (03CR) 10Volans: sre.hosts.reimage: check dbctl (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: 10Volans)
[13:26:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T391056)', diff saved to https://phabricator.wikimedia.org/P74955 and previous config saved to /var/cache/conftool/dbconfig/20250414-132625-fceratto.json
[13:26:32] <logmsgbot>	 !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135993|CentralAuthTokenManager: Log failures for write operations (T390784)]] (duration: 11m 39s)
[13:26:35] <stashbot>	 T390784: Error when logging-in on auth.wikimedia.org: "The provided authentication token is either expired or invalid." - https://phabricator.wikimedia.org/T390784
[13:26:53] <TheresNoTime>	 MatmaRex: anzx: running your two config patches together
[13:27:13] <anzx>	 ok
[13:27:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135850 (owner: 10Bartosz Dziewoński)
[13:27:18] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136104 (https://phabricator.wikimedia.org/T348611) (owner: 10Anzx)
[13:27:34] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming cirrussearch2014 to cirrussearch2104 - bking@cumin2002"
[13:27:35] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:27:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2104
[13:27:40] <MatmaRex>	 thanks
[13:27:41] <jinxer-wm>	 FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[13:27:45] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2104
[13:27:47] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1135465 (owner: 10Hnowlan)
[13:28:06] <wikibugs>	 (03Merged) 10jenkins-bot: Enable SUL3 on most remaining beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135850 (owner: 10Bartosz Dziewoński)
[13:28:11] <wikibugs>	 (03Merged) 10jenkins-bot: punjabiwikimedia, maiwikimedia: fix tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136104 (https://phabricator.wikimedia.org/T348611) (owner: 10Anzx)
[13:28:25] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from cirrussearch2014 to cirrussearch2104
[13:28:27] <wikibugs>	 (03PS1) 10Jforrester: Switch test Wikifunctions client deployment from test2wiki to test2iki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136379 (https://phabricator.wikimedia.org/T391584)
[13:28:28] <logmsgbot>	 !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1135850|Enable SUL3 on most remaining beta cluster wikis]], [[gerrit:1136104|punjabiwikimedia, maiwikimedia: fix tagline (T348611)]]
[13:28:29] <wikibugs>	 (03PS1) 10Jforrester: Document Wikifunctions options, adding wgWikiLambdaClientModeOffline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136380 (https://phabricator.wikimedia.org/T391584)
[13:28:31] <stashbot>	 T348611: [Deployment] Fix logo clipping issues in mai and punjabi wikis - https://phabricator.wikimedia.org/T348611
[13:28:54] <wikibugs>	 (03CR) 10Elukey: sre.hosts.reimage: check dbctl (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: 10Volans)
[13:30:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2104.codfw.wmnet with OS bullseye
[13:30:53] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2104
[13:30:54] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2104
[13:30:54] <jinxer-wm>	 FIRING: [5x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:33:04] <logmsgbot>	 !log samtar@deploy1003 matmarex, anzx, samtar: Backport for [[gerrit:1135850|Enable SUL3 on most remaining beta cluster wikis]], [[gerrit:1136104|punjabiwikimedia, maiwikimedia: fix tagline (T348611)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:33:17] <anzx>	 looking
[13:33:22] <TheresNoTime>	 ack
[13:33:39] <jinxer-wm>	 RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive
[13:33:43] <anzx>	 TheresNoTime: logos on both wikis look good
[13:33:47] <logmsgbot>	 !log samtar@deploy1003 matmarex, anzx, samtar: Continuing with sync
[13:33:51] <wikibugs>	 (03PS2) 10Volans: sre.hosts.reimage: check dbctl [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878)
[13:34:17] <wikibugs>	 (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: 10Volans)
[13:34:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10738897 (10phaultfinder)
[13:34:58] <wikibugs>	 (03CR) 10Fabfur: [C:03+1] sre: Add LibericaUnhealthyRealserverPooled alert [alerts] - 10https://gerrit.wikimedia.org/r/1136334 (https://phabricator.wikimedia.org/T391697) (owner: 10Vgutierrez)
[13:35:34] <wikibugs>	 (03PS1) 10Elukey: profile::pyrra: fix Istio SLO metrics template [puppet] - 10https://gerrit.wikimedia.org/r/1136381
[13:35:54] <jinxer-wm>	 FIRING: [6x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:37:20] <wikibugs>	 (03PS1) 10Fabfur: data-engineering: duplicating varnishkafka alerts [alerts] - 10https://gerrit.wikimedia.org/r/1136383 (https://phabricator.wikimedia.org/T391810)
[13:38:05] <sukhe>	 !log reprepro -C component/nginx-ech include bookworm-wikimedia nginx_1.22.1-9+deb12u1+ech2_amd64.changes: T205378
[13:38:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:08] <stashbot>	 T205378: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378
[13:39:26] <wikibugs>	 (03PS7) 10Ssingh: P:durum: add conditional to enable ECH [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[13:40:28] <logmsgbot>	 !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135850|Enable SUL3 on most remaining beta cluster wikis]], [[gerrit:1136104|punjabiwikimedia, maiwikimedia: fix tagline (T348611)]] (duration: 12m 00s)
[13:40:32] <stashbot>	 T348611: [Deployment] Fix logo clipping issues in mai and punjabi wikis - https://phabricator.wikimedia.org/T348611
[13:40:40] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[13:40:41] <TheresNoTime>	 anzx: live and logo purged :)
[13:40:55] <anzx>	 TheresNoTime: thank you for deploying
[13:41:18] <TheresNoTime>	 !log UTC afternoon backport window done
[13:41:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:41:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P74956 and previous config saved to /var/cache/conftool/dbconfig/20250414-134132-fceratto.json
[13:42:11] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::pyrra: fix Istio SLO metrics template [puppet] - 10https://gerrit.wikimedia.org/r/1136381 (owner: 10Elukey)
[13:42:11] <Lucas_WMDE>	 thanks for deploying TheresNoTime!
[13:42:57] <wikibugs>	 (03CR) 10Volans: [C:03+2] commit: refactor asking for approval [software/homer] - 10https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans)
[13:43:22] <TheresNoTime>	 Lucas_WMDE: np! :)
[13:43:54] <wikibugs>	 (03CR) 10DCausse: [C:03+1] cirrussearch: remove no-longer-existing master-eligibles. [puppet] - 10https://gerrit.wikimedia.org/r/1136026 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[13:45:38] <wikibugs>	 (03PS1) 10Bking: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610)
[13:45:40] <wikibugs>	 (03PS1) 10Bking: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1136387 (https://phabricator.wikimedia.org/T388610)
[13:46:25] <wikibugs>	 (03Abandoned) 10Bking: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1136387 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[13:46:54] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] "cheers, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1135465 (owner: 10Hnowlan)
[13:47:04] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.gerrit.failover from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org
[13:47:09] <logmsgbot>	 !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.gerrit.failover (exit_code=97) from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org
[13:47:51] <wikibugs>	 (03PS2) 10Bking: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610)
[13:49:23] <hnowlan>	 jouncebot: nowandnext
[13:49:23] <jouncebot>	 For the next 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1300)
[13:49:23] <jouncebot>	 In 1 hour(s) and 40 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1530)
[13:49:33] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrussearch: remove no-longer-existing master-eligibles. [puppet] - 10https://gerrit.wikimedia.org/r/1136026 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[13:49:37] <wikibugs>	 (03PS8) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[13:50:16] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] Host BGP: ignore hosts with no primary IP [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1136150 (owner: 10Ayounsi)
[13:50:33] <wikibugs>	 (03CR) 10Cathal Mooney: [C:03+1] "sorry that's probably my fault with the nokia test servers is it?" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1136150 (owner: 10Ayounsi)
[13:50:43] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[13:51:36] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] P:durum: add conditional to enable ECH (durum2002) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[13:53:27] <wikibugs>	 (03PS9) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[13:53:56] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+2] sre: Add LibericaUnhealthyRealserverPooled alert [alerts] - 10https://gerrit.wikimedia.org/r/1136334 (https://phabricator.wikimedia.org/T391697) (owner: 10Vgutierrez)
[13:54:30] <wikibugs>	 (03PS2) 10Arnaudb: gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1136385 (https://phabricator.wikimedia.org/T260666)
[13:55:16] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[13:55:20] <wikibugs>	 (03Merged) 10jenkins-bot: commit: refactor asking for approval [software/homer] - 10https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans)
[13:55:27] <wikibugs>	 (03CR) 10Cathal Mooney: "LGTM overall, I think we probably should get a list of what hosts are using this role and run PCC against them just to check there is no c" [puppet] - 10https://gerrit.wikimedia.org/r/1135455 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah)
[13:55:40] <wikibugs>	 (03CR) 10Volans: [C:03+2] commit: refactor asking for approval (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans)
[13:55:54] <jinxer-wm>	 FIRING: [8x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[13:55:57] <wikibugs>	 (03PS3) 10Jelto: gitlab: fix type of s3 credentials [puppet] - 10https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922)
[13:55:58] <wikibugs>	 (03Merged) 10jenkins-bot: sre: Add LibericaUnhealthyRealserverPooled alert [alerts] - 10https://gerrit.wikimedia.org/r/1136334 (https://phabricator.wikimedia.org/T391697) (owner: 10Vgutierrez)
[13:56:16] <wikibugs>	 (03CR) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[13:56:41] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P74960 and previous config saved to /var/cache/conftool/dbconfig/20250414-135640-fceratto.json
[13:56:49] <wikibugs>	 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10739003 (10hnowlan) 05Open→03Resolved All jobrunner hardware decommissioned or reclaimed, services torn down, puppet cleaned up.
[13:57:49] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1178.eqiad.wmnet with OS bullseye
[13:57:57] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 13Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10739008 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1178.eqi...
[13:58:42] <wikibugs>	 (03CR) 10Bking: [V:04-1] "Do not merge until elastic2115 has been reimaged to cirrussearch2115" [puppet] - 10https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[13:59:16] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10739013 (10MoritzMuehlenhoff) p:05Triage→03Medium
[13:59:41] <godog>	 I accidentally thanos, it is coming back
[13:59:46] <wikibugs>	 (03PS10) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[14:00:23] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5284/console" [puppet] - 10https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[14:00:54] <jinxer-wm>	 FIRING: [8x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[14:01:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2104.codfw.wmnet with reason: host reimage
[14:01:41] <godog>	 !log temp disable "backend time" panel using unaggregated big mediawiki metric on "reading web performance" dashboard - T391677
[14:01:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:44] <stashbot>	 T391677: Audit dashboards using histogram_quantile on mediawiki_WikimediaEvents_editResponseTime - https://phabricator.wikimedia.org/T391677
[14:01:49] <wikibugs>	 (03PS9) 10Federico Ceratto: Add switchover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810)
[14:01:50] <wikibugs>	 (03CR) 10Federico Ceratto: "Basic cookbook moving the existing code from switchmaster." [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto)
[14:03:23] <wikibugs>	 (03CR) 10Volans: [C:03+2] commit: allow to approve/reject diffs globally (033 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans)
[14:03:39] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:03:40] <wikibugs>	 (03CR) 10Volans: [C:03+2] doc: update documentation configuration [software/homer] - 10https://gerrit.wikimedia.org/r/1134717 (owner: 10Volans)
[14:04:05] <wikibugs>	 (03PS3) 10Arnaudb: gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1136385 (https://phabricator.wikimedia.org/T260666)
[14:04:48] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2104.codfw.wmnet with reason: host reimage
[14:04:55] <wikibugs>	 (03CR) 10Federico Ceratto: "The CR received a +1, is it ok if I set the required changes as Resolved?" [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[14:04:57] <wikibugs>	 (03CR) 10Arnaudb: gerrit: failover cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1136385 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb)
[14:05:35] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: cr2-codfw: 2/4 PSU down - https://phabricator.wikimedia.org/T391790#10739035 (10Jhancock.wm) 05Open→03Resolved reseated the cables to the two downed CPU. direct result of tension from the fiber drop connected the cage in DH7 to DH5. pointed out the issue to the maintenan...
[14:06:16] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] P:durum: add conditional to enable ECH (durum2002) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:06:20] <wikibugs>	 (03PS1) 10Ssingh: modules: move durum.yaml to secret snake oil [labs/private] - 10https://gerrit.wikimedia.org/r/1136389
[14:07:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Remove from list of approvers [puppet] - 10https://gerrit.wikimedia.org/r/1136377 (owner: 10Muehlenhoff)
[14:11:21] <wikibugs>	 (03PS2) 10Ssingh: modules: move durum.yaml to secret snake oil [labs/private] - 10https://gerrit.wikimedia.org/r/1136389
[14:11:48] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T391056)', diff saved to https://phabricator.wikimedia.org/P74961 and previous config saved to /var/cache/conftool/dbconfig/20250414-141148-fceratto.json
[14:11:51] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[14:12:04] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[14:12:22] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[14:12:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T391056)', diff saved to https://phabricator.wikimedia.org/P74962 and previous config saved to /var/cache/conftool/dbconfig/20250414-141227-fceratto.json
[14:14:01] <wikibugs>	 (03Merged) 10jenkins-bot: commit: allow to approve/reject diffs globally [software/homer] - 10https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans)
[14:14:23] <wikibugs>	 (03Merged) 10jenkins-bot: doc: update documentation configuration [software/homer] - 10https://gerrit.wikimedia.org/r/1134717 (owner: 10Volans)
[14:15:06] <wikibugs>	 (03PS1) 10Jelto: gitlab: use a wmflib::expand_path compatible path for apus keys [labs/private] - 10https://gerrit.wikimedia.org/r/1136391 (https://phabricator.wikimedia.org/T378922)
[14:15:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] logstash: cast 'error' to object too [puppet] - 10https://gerrit.wikimedia.org/r/1135917 (owner: 10Filippo Giunchedi)
[14:15:34] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto)
[14:15:52] <wikibugs>	 (03CR) 10Ssingh: [V:03+2 C:03+2] modules: move durum.yaml to secret snake oil [labs/private] - 10https://gerrit.wikimedia.org/r/1136389 (owner: 10Ssingh)
[14:16:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T391056)', diff saved to https://phabricator.wikimedia.org/P74963 and previous config saved to /var/cache/conftool/dbconfig/20250414-141639-fceratto.json
[14:16:55] <wikibugs>	 (03CR) 10Ssingh: [C:03+1] "It's a yes from me!" [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[14:18:29] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10739100 (10VRiley-WMF) Understood. I will be reaching out to them again to see if we can request that plan of action that you've recommended. I can ask them about the mainboard to see if they would replac...
[14:19:54] <wikibugs>	 (03PS11) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[14:21:23] <wikibugs>	 (03CR) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:21:39] <wikibugs>	 (03CR) 10Nikerabbit: Catalog ContentTranslation tables (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: 10Nik Gkountas)
[14:23:05] <wikibugs>	 (03CR) 10Ssingh: [C:04-1] P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:23:39] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic2115:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:24:21] <wikibugs>	 (03PS12) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[14:25:23] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:26:04] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2104.codfw.wmnet with OS bullseye
[14:28:01] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.9.0 [software/homer] - 10https://gerrit.wikimedia.org/r/1136392
[14:28:29] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I believe we can abandon this now" [alerts] - 10https://gerrit.wikimedia.org/r/1135673 (owner: 10Slyngshede)
[14:28:53] <wikibugs>	 (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v0.9.0 [software/homer] - 10https://gerrit.wikimedia.org/r/1136392 (owner: 10Volans)
[14:29:13] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1136022 (https://phabricator.wikimedia.org/T374711) (owner: 10JHathaway)
[14:30:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] netbox-hiera: adding pdu type [puppet] - 10https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[14:31:32] <wikibugs>	 (03PS13) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[14:31:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P74964 and previous config saved to /var/cache/conftool/dbconfig/20250414-143146-fceratto.json
[14:32:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli)
[14:35:07] <wikibugs>	 (03PS12) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805)
[14:40:26] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] CHANGELOG: add changelogs for release v0.9.0 [software/homer] - 10https://gerrit.wikimedia.org/r/1136392 (owner: 10Volans)
[14:40:55] <wikibugs>	 (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.9.0 [software/homer] - 10https://gerrit.wikimedia.org/r/1136392 (owner: 10Volans)
[14:42:48] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (1) discovery-search elastic1067 out of eqiad D6 - https://phabricator.wikimedia.org/T391542#10739150 (10Gehel)
[14:45:06] <wikibugs>	 (03PS1) 10Scott French: Remove PHP 8.1 migration WikimediaEvents settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135507 (https://phabricator.wikimedia.org/T391421)
[14:45:08] <wikibugs>	 (03PS1) 10Scott French: hieradata: remove mw-php-migration.lua from plugin chains [puppet] - 10https://gerrit.wikimedia.org/r/1135504 (https://phabricator.wikimedia.org/T391421)
[14:45:39] <wikibugs>	 (03PS14) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[14:46:17] <wikibugs>	 (03PS1) 10Herron: logstash: increase refresh_interval to 10s in index templates [puppet] - 10https://gerrit.wikimedia.org/r/1136394 (https://phabricator.wikimedia.org/T391714)
[14:46:40] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5290/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:46:54] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P74965 and previous config saved to /var/cache/conftool/dbconfig/20250414-144653-fceratto.json
[14:47:09] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10739175 (10JTweed-WMF)
[14:51:54] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] P:durum: add conditional to enable ECH (durum2002) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:53:43] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] "Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135507 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French)
[14:54:37] <wikibugs>	 (03CR) 10Effie Mouzeli: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1135504 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French)
[14:54:59] <wikibugs>	 (03PS15) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[14:58:07] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5291/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:58:55] <wikibugs>	 (03CR) 10Ahmon Dancy: [C:03+1] "Looks reasonable to me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: 10Hashar)
[14:59:11] <wikibugs>	 (03PS16) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[14:59:37] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5292/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:59:41] <wikibugs>	 (03PS1) 10Volans: Release v0.9.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1136396
[15:00:03] <wikibugs>	 (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[15:02:01] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T391056)', diff saved to https://phabricator.wikimedia.org/P74966 and previous config saved to /var/cache/conftool/dbconfig/20250414-150200-fceratto.json
[15:02:04] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[15:02:16] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[15:05:31] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1129179 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi)
[15:07:48] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[15:08:39] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:11:43] <wikibugs>	 (03PS2) 10Eevans: restbase: bootstrap restbase1044 (refresh for restbase1029) [puppet] - 10https://gerrit.wikimedia.org/r/1130174 (https://phabricator.wikimedia.org/T389423)
[15:11:43] <wikibugs>	 (03PS2) 10Eevans: restbase: bootstrap restbase1045 (refresh for restbase1030) [puppet] - 10https://gerrit.wikimedia.org/r/1130175 (https://phabricator.wikimedia.org/T389423)
[15:11:44] <wikibugs>	 (03CR) 10Ayounsi: [C:03+1] Release v0.9.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1136396 (owner: 10Volans)
[15:11:52] <wikibugs>	 (03CR) 10Ayounsi: [C:03+2] Host BGP: ignore hosts with no primary IP [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1136150 (owner: 10Ayounsi)
[15:12:34] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130174 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans)
[15:12:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] logstash: increase refresh_interval to 10s in index templates [puppet] - 10https://gerrit.wikimedia.org/r/1136394 (https://phabricator.wikimedia.org/T391714) (owner: 10Herron)
[15:13:10] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2149.codfw.wmnet with reason: Maintenance
[15:13:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T391056)', diff saved to https://phabricator.wikimedia.org/P74967 and previous config saved to /var/cache/conftool/dbconfig/20250414-151316-fceratto.json
[15:13:19] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[15:15:31] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet
[15:18:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] snmp_exporter: replace prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1129179 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi)
[15:18:23] <wikibugs>	 (03PS2) 10Filippo Giunchedi: snmp_exporter: replace prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1129179 (https://phabricator.wikimedia.org/T389170)
[15:18:46] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] snmp_exporter: replace prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1129179 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi)
[15:20:18] <wikibugs>	 (03CR) 10Volans: [C:03+2] Release v0.9.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1136396 (owner: 10Volans)
[15:20:22] <wikibugs>	 (03CR) 10Eevans: [C:03+2] restbase: bootstrap restbase1044 (refresh for restbase1029) [puppet] - 10https://gerrit.wikimedia.org/r/1130174 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans)
[15:22:17] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+1] "My understanding is that refresh time affects how long it takes for indexed documents to be available for search; worth adding "high frequ" [puppet] - 10https://gerrit.wikimedia.org/r/1136394 (https://phabricator.wikimedia.org/T391714) (owner: 10Herron)
[15:22:24] <wikibugs>	 (03PS1) 10Gergő Tisza: private: Drop $wgCentralAuthSul3SharedDomainRestrictions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136397 (https://phabricator.wikimedia.org/T390329)
[15:22:57] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] mysql: make MysqlRemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136325 (owner: 10Volans)
[15:22:58] <wikibugs>	 (03CR) 10Gergő Tisza: [C:04-2] "Needs to wait a week for the train." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136397 (https://phabricator.wikimedia.org/T390329) (owner: 10Gergő Tisza)
[15:23:58] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.9.0 - volans@cumin1002
[15:24:48] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet
[15:24:49] <wikibugs>	 (03CR) 10Volans: [C:03+2] mysql: make MysqlRemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136325 (owner: 10Volans)
[15:24:59] <wikibugs>	 (03CR) 10Volans: [C:03+2] cookbook modules: use docstring for title [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136326 (owner: 10Volans)
[15:25:39] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.9.0 - volans@cumin1002
[15:25:46] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1181.eqiad.wmnet with OS bullseye
[15:25:59] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 13Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10739522 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1181...
[15:26:18] <volans>	 !log deployed homer v0.9.0 to cumin hosts
[15:26:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[15:29:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T391056)', diff saved to https://phabricator.wikimedia.org/P74968 and previous config saved to /var/cache/conftool/dbconfig/20250414-152911-fceratto.json
[15:29:16] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[15:29:57] <wikibugs>	 (03CR) 10Ssingh: [V:03+1 C:04-1] "Still figuring out the correlation between outer and inner SNI." [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[15:30:05] <jouncebot>	 jan_drewniak: Your horoscope predicts another Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1530).
[15:30:05] <wikibugs>	 (03PS1) 10Herron: logstash: set "index.translog.durability": "async" as template default [puppet] - 10https://gerrit.wikimedia.org/r/1136400 (https://phabricator.wikimedia.org/T391714)
[15:30:25] <jan_drewniak>	 skipping portal deployments this week
[15:30:38] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10739548 (10phaultfinder)
[15:30:50] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_ulsfo and not P{cp4047.ulsfo.wmnet} and not P{cp4045.ulsfo.wmnet} and A:cp
[15:31:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] logstash: set "index.translog.durability": "async" as template default [puppet] - 10https://gerrit.wikimedia.org/r/1136400 (https://phabricator.wikimedia.org/T391714) (owner: 10Herron)
[15:32:58] <logmsgbot>	 !log eevans@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1044.eqiad.wmnet with reason: Bootstrapping — T389423
[15:33:02] <stashbot>	 T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423
[15:34:02] <wikibugs>	 (03PS1) 10Elukey: profile::pyrra: fix latency total count metric for Istio SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1136401
[15:34:27] <wikibugs>	 (03Merged) 10jenkins-bot: mysql: make MysqlRemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136325 (owner: 10Volans)
[15:35:16] <wikibugs>	 (03Merged) 10jenkins-bot: cookbook modules: use docstring for title [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136326 (owner: 10Volans)
[15:35:25] <wikibugs>	 (03CR) 10Elukey: dnsdisc: make it compatible with bookworm (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133808 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans)
[15:35:37] <wikibugs>	 (03CR) 10Elukey: [C:03+1] dnsdisc: make it compatible with bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133808 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans)
[15:36:23] <wikibugs>	 (03CR) 10Elukey: [C:03+2] profile::pyrra: fix latency total count metric for Istio SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1136401 (owner: 10Elukey)
[15:36:37] <wikibugs>	 (03CR) 10Volans: [C:03+2] dnsdisc: make it compatible with bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133808 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans)
[15:36:40] <wikibugs>	 (03CR) 10Herron: [C:03+1] profile::pyrra: fix latency total count metric for Istio SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1136401 (owner: 10Elukey)
[15:36:42] <wikibugs>	 (03PS1) 10Hashar: Gemfile: update rspec-puppet to 2.10.x [puppet] - 10https://gerrit.wikimedia.org/r/1136403
[15:36:44] <wikibugs>	 (03PS17) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[15:37:23] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+1] Add zarcillo (aux k8s) CNAME (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[15:37:29] <wikibugs>	 (03CR) 10Federico Ceratto: [C:03+2] Add zarcillo (aux k8s) CNAME [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto)
[15:37:41] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.dns.netbox
[15:37:45] <wikibugs>	 (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5294/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[15:37:48] <urandom>	 !log bootstrapping Cassandra/restbase1044-a — T389423
[15:37:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:38:39] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:40:17] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:41:31] <logmsgbot>	 !log fceratto@cumin1002 START - Cookbook sre.dns.netbox
[15:42:58] <wikibugs>	 (03PS3) 10Dzahn: jenkins: fix puppet error, systemd override requires systemd service [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595)
[15:44:05] <logmsgbot>	 !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:44:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P74969 and previous config saved to /var/cache/conftool/dbconfig/20250414-154419-fceratto.json
[15:45:31] <wikibugs>	 (03CR) 10Dzahn: "So you are saying the flag isn't actually transitory and should stay around forever? That's also a valid answer, but there would need to b" [puppet] - 10https://gerrit.wikimedia.org/r/1136039 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn)
[15:45:47] <wikibugs>	 (03Merged) 10jenkins-bot: dnsdisc: make it compatible with bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133808 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans)
[15:47:04] <wikibugs>	 (03CR) 10Dzahn: "I did not make the claim that it was easy. I was trying to start a discussion how we can move forward here. The answer can be many things," [puppet] - 10https://gerrit.wikimedia.org/r/1136039 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn)
[15:47:49] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:48:29] <sukhe>	 federico3: hi. did you run authdns-update? thanks!
[15:48:32] <wikibugs>	 06SRE, 10observability: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852 (10elukey) 03NEW
[15:48:33] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:48:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:48:51] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] Remove PHP 8.1 migration WikimediaEvents settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135507 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French)
[15:49:01] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:49:41] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:49:41] <wikibugs>	 06SRE, 10observability: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10739720 (10elukey) Code changes merged so far:  https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135746 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136...
[15:49:43] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:49:51] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: 10Cwhite)
[15:50:00] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C:03+1] private: Drop $wgCentralAuthSul3SharedDomainRestrictions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136397 (https://phabricator.wikimedia.org/T390329) (owner: 10Gergő Tisza)
[15:50:02] <wikibugs>	 06SRE, 10MediaWiki-Core-HTTP-Cache, 06Traffic-Icebox, 07Wikimedia-Performance-recommendation: Separate Cache-Control header for proxy and client - https://phabricator.wikimedia.org/T50835#10739722 (10Seb35) There is the [[https://datatracker.ietf.org/doc/html/rfc9213|RFC 9213 "Targeted HTTP Cache Control"]...
[15:50:03] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:50:11] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:50:21] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:50:26] <wikibugs>	 (03Restored) 10Dzahn: ci: switch jenkins deployment method on contint to scap [puppet] - 10https://gerrit.wikimedia.org/r/1136039 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn)
[15:50:29] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:50:29] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:50:34] <wikibugs>	 (03PS2) 10Dzahn: ci: switch jenkins deployment method on contint to scap [puppet] - 10https://gerrit.wikimedia.org/r/1136039 (https://phabricator.wikimedia.org/T384595)
[15:50:39] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:50:43] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:51:03] <wikibugs>	 (03Abandoned) 10Dzahn: ci: switch jenkins deployment method on contint to scap [puppet] - 10https://gerrit.wikimedia.org/r/1136039 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn)
[15:51:11] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:51:25] <federico3>	 sukhe: no
[15:51:29] <sukhe>	 please do :)
[15:51:31] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:51:37] <sukhe>	 this is what the above alert is about
[15:51:49] <wikibugs>	 (03Restored) 10Dzahn: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth)
[15:51:58] <wikibugs>	 (03PS8) 10Dzahn: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth)
[15:52:03] <wikibugs>	 (03Abandoned) 10Dzahn: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth)
[15:52:03] <federico3>	 how? I've been told to run sre.dns.netbox but it's showing "Nothing to commit"
[15:52:14] <sukhe>	 federico3: no worries
[15:52:31] <icinga-wm>	 PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:52:32] <sukhe>	 https://wikitech.wikimedia.org/wiki/DNS#Deploying_DNS_changes
[15:52:33] <federico3>	 I'll follow the authdns-update run as per the wiki, ok?
[15:52:36] <sukhe>	 yep
[15:53:21] <claime>	 Ah, our doc is bad
[15:53:23] <claime>	 Editing
[15:53:32] <wikibugs>	 (CR) Vgutierrez: P:durum: add conditional to enable ECH (durum2002) (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[15:53:39] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service restbase1044-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:53:39] <federico3>	 sudo -i authdns-update from dns1004.wikimedia.org, sounds good?
[15:53:50] <sukhe>	 yep
[15:53:53] <logmsgbot>	 !log fceratto@dns1004 START - running authdns-update
[15:54:57] <claime>	 https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#DNS_changes fixed
[15:55:18] <sukhe>	 claime: thanks!
[15:55:21] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:55:29] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:55:29] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:55:39] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:55:42] <federico3>	 thanks claime
[15:55:43] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:56:11] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:56:23] <logmsgbot>	 !log fceratto@dns1004 END - running authdns-update
[15:56:31] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:56:44] <federico3>	 ok, the tool ran without errors
[15:56:50] <sukhe>	 nice thanks
[15:57:02] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1181.eqiad.wmnet with OS bullseye
[15:57:14] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02), Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10739758 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1181.eqi...
[15:57:31] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:57:49] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:58:33] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:58:58] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[15:59:01] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:59:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P74970 and previous config saved to /var/cache/conftool/dbconfig/20250414-155925-fceratto.json
[15:59:38] <wikibugs>	 (CR) Cwhite: [C:+1] "I don't think this is the problem, but this won't hurt." [puppet] - https://gerrit.wikimedia.org/r/1136394 (https://phabricator.wikimedia.org/T391714) (owner: Herron)
[15:59:41] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[15:59:43] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[16:00:03] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[16:00:11] <icinga-wm>	 RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run
[16:00:27] <wikibugs>	 ops-eqiad, Data-Persistence, DC-Ops, Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854 (elukey) NEW
[16:01:43] <wikibugs>	 SRE, Infrastructure-Foundations, Patch-For-Review: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10739807 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[16:03:38] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_ulsfo and not P{cp4037.ulsfo.wmnet} and A:cp
[16:03:57] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:04:53] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:05:05] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL
[16:06:06] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1181.eqiad.wmnet with OS bullseye
[16:06:07] <wikibugs>	 (PS3) Bking: sre.elasticsearch.rolling-operation: use tftp, run puppet with new hostname [cookbooks] - https://gerrit.wikimedia.org/r/1135826 (https://phabricator.wikimedia.org/T383811) (owner: Ryan Kemper)
[16:06:15] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02), Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10739853 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1181...
[16:06:45] <wikibugs>	 (Abandoned) Slyngshede: Netbox: Temporarily remove Netbox alerting [alerts] - https://gerrit.wikimedia.org/r/1135673 (owner: Slyngshede)
[16:10:35] <wikibugs>	 (PS4) Hashar: ci: always add eatmydata to cow images [puppet] - https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430)
[16:11:07] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:11:25] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:11:52] <wikibugs>	 (CR) CI reject: [V:-1] ci: always add eatmydata to cow images [puppet] - https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: Hashar)
[16:13:56] <wikibugs>	 (CR) CI reject: [V:-1] sre.elasticsearch.rolling-operation: use tftp, run puppet with new hostname [cookbooks] - https://gerrit.wikimedia.org/r/1135826 (https://phabricator.wikimedia.org/T383811) (owner: Ryan Kemper)
[16:14:26] <wikibugs>	 (Abandoned) Hashar: Gemfile: update to puppet7 [puppet] - https://gerrit.wikimedia.org/r/976304 (owner: Jbond)
[16:14:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T391056)', diff saved to https://phabricator.wikimedia.org/P74971 and previous config saved to /var/cache/conftool/dbconfig/20250414-161432-fceratto.json
[16:14:36] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[16:14:42] <wikibugs>	 SRE, collaboration-services, Wikimedia-Mailing-lists, Wikimedia-Incident: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10739874 (Dzahn) Checking now the mail queue is much smaller than before. (hundreds vs thousands). So missing mail might have been delivered...
[16:14:49] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2156.codfw.wmnet with reason: Maintenance
[16:15:05] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2186.codfw.wmnet with reason: Maintenance
[16:15:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T391056)', diff saved to https://phabricator.wikimedia.org/P74972 and previous config saved to /var/cache/conftool/dbconfig/20250414-161512-fceratto.json
[16:15:44] <wikibugs>	 (PS4) Bking: sre.elasticsearch.rolling-operation: use tftp, run puppet with new hostname [cookbooks] - https://gerrit.wikimedia.org/r/1135826 (https://phabricator.wikimedia.org/T383811) (owner: Ryan Kemper)
[16:18:18] <wikibugs>	 (PS5) Hashar: ci: always add eatmydata to cow images [puppet] - https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430)
[16:19:36] <wikibugs>	 (CR) CI reject: [V:-1] ci: always add eatmydata to cow images [puppet] - https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: Hashar)
[16:19:48] <hashar>	 seriously ...
[16:20:53] <wikibugs>	 (PS6) Hashar: ci: always add eatmydata to cow images [puppet] - https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430)
[16:20:57] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:21:15] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:21:29] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1181.eqiad.wmnet with OS bullseye
[16:21:44] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02), Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10739892 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1181.eqi...
[16:22:09] <wikibugs>	 (CR) CI reject: [V:-1] ci: always add eatmydata to cow images [puppet] - https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: Hashar)
[16:22:37] <wikibugs>	 ops-codfw, SRE, SRE-swift-storage, DC-Ops, Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10739895 (elukey) I can confirm that using `start initialization` and stopping it right afterwards makes `set jbod` working, without a...
[16:23:25] <wikibugs>	 (PS3) Herron: logstash: set "index.translog.durability": "async" as template default [puppet] - https://gerrit.wikimedia.org/r/1136400 (https://phabricator.wikimedia.org/T391714)
[16:24:12] <wikibugs>	 (PS7) Hashar: ci: always add eatmydata to cow images [puppet] - https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430)
[16:26:44] <wikibugs>	 (CR) CI reject: [V:-1] ci: always add eatmydata to cow images [puppet] - https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: Hashar)
[16:28:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[16:28:53] <wikibugs>	 (PS1) Ebernhardson: Revert "mjolnir: temp remove msearch daemon from codfw" [puppet] - https://gerrit.wikimedia.org/r/1136414
[16:29:17] <wikibugs>	 (CR) CI reject: [V:-1] Revert "mjolnir: temp remove msearch daemon from codfw" [puppet] - https://gerrit.wikimedia.org/r/1136414 (owner: Ebernhardson)
[16:29:46] <wikibugs>	 (PS2) Ebernhardson: Revert "mjolnir: temp remove msearch daemon from codfw" [puppet] - https://gerrit.wikimedia.org/r/1136414
[16:30:38] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T391056)', diff saved to https://phabricator.wikimedia.org/P74973 and previous config saved to /var/cache/conftool/dbconfig/20250414-163037-fceratto.json
[16:30:41] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[16:31:57] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:32:33] <wikibugs>	 (PS8) Hashar: ci: always add eatmydata to cow images [puppet] - https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430)
[16:37:00] <wikibugs>	 SRE, observability: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10739985 (elukey) @herron something is off in one of the recording rules, see for example https://w.wiki/Doru. Do you have an idea why this is so different? I didn't...
[16:38:08] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1181.eqiad.wmnet with OS bullseye
[16:38:18] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02), Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10739991 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1181...
[16:39:09] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:39:57] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1208 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:41:45] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:42:08] <wikibugs>	 (CR) Hashar: "Done as of patchset 8" [puppet] - https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: Hashar)
[16:43:38] <wikibugs>	 (CR) Hashar: [C:+1] "I have cherry picked it on `integration-puppetserver-01.integration.eqiad1.wikimedia.cloud` and ran Puppet on the two CI instances buildin" [puppet] - https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: Hashar)
[16:43:48] <wikibugs>	 (CR) Hashar: [C:+1] "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: Hashar)
[16:45:43] <wikibugs>	 ops-codfw, DC-Ops: Power Supply - PS Redundancy - issue on wikikube-worker2042:9290 - https://phabricator.wikimedia.org/T391860 (phaultfinder) NEW
[16:45:45] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:45:45] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P74974 and previous config saved to /var/cache/conftool/dbconfig/20250414-164545-fceratto.json
[16:45:57] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1208 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:45:57] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:47:47] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861 (KColeman-WMF) NEW
[16:47:55] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:49:09] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:52:41] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[16:56:25] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_magru
[16:56:53] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_magru
[16:59:49] <Amir1>	 sirenbot: sing
[16:59:56] <Amir1>	 _joe_: :( 
[17:00:05] <jouncebot>	 swfrench-wmf: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1700).
[17:00:05] <jouncebot>	 ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1700).
[17:00:15] <Amir1>	 !sing
[17:00:15] <sirenbot>	 Never gonna give you up
[17:00:16] <sirenbot>	 Never gonna let you down
[17:00:16] <sirenbot>	 Never gonna run around and desert you
[17:00:17] <sirenbot>	 Never gonna make you cry
[17:00:18] <sirenbot>	 Never gonna say goodbye
[17:00:19] <sirenbot>	 Never gonna tell a lie and hurt you
[17:00:25] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:00:25] <Amir1>	 *chef's kiss*
[17:00:40] <James_F>	 Amir1: BTW, Dexbot seems to not be active on wikitech any more?
[17:00:45] <swfrench-wmf>	 o/
[17:00:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P74975 and previous config saved to /var/cache/conftool/dbconfig/20250414-170052-fceratto.json
[17:00:57] <Amir1>	 https://phabricator.wikimedia.org/T391346
[17:01:06] <Amir1>	 James_F: I think it's something to do with the SUL3 rollout
[17:01:07] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:01:16] <James_F>	 Amir1: Aha, yes, that'd break things.
[17:01:41] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:01:57] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:02:11] <swfrench-wmf>	 FYI, I'll be starting a backport deployment for some PHP 8.1 migration cleanup shortly.
[17:02:12] <wikibugs>	 (PS18) Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[17:02:15] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:03:10] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by swfrench@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135507 (https://phabricator.wikimedia.org/T391421) (owner: Scott French)
[17:03:11] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:03:11] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5296/co" [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[17:03:55] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:03:56] <wikibugs>	 (Merged) jenkins-bot: Remove PHP 8.1 migration WikimediaEvents settings [mediawiki-config] - https://gerrit.wikimedia.org/r/1135507 (https://phabricator.wikimedia.org/T391421) (owner: Scott French)
[17:04:05] <wikibugs>	 (CR) Ssingh: [V:+1] P:durum: add conditional to enable ECH (durum2002) (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[17:04:13] <logmsgbot>	 !log swfrench@deploy1003 Started scap sync-world: Backport for [[gerrit:1135507|Remove PHP 8.1 migration WikimediaEvents settings (T391421)]]
[17:04:16] <stashbot>	 T391421: Clean up and abstract PHP_ENGINE 8.1 routing - https://phabricator.wikimedia.org/T391421
[17:04:35] <wikibugs>	 (PS1) Ebernhardson: search: Update envoy alerts for discovery dns names [alerts] - https://gerrit.wikimedia.org/r/1136422 (https://phabricator.wikimedia.org/T143553)
[17:04:45] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10740204 (phaultfinder)
[17:05:57] <wikibugs>	 (CR) Dzahn: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[17:06:04] <wikibugs>	 (PS2) Ebernhardson: search: Update envoy alerts for discovery dns names [alerts] - https://gerrit.wikimedia.org/r/1136422 (https://phabricator.wikimedia.org/T143553)
[17:06:11] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:06:13] <wikibugs>	 (CR) Ssingh: [V:+1 C:-1] "2025/04/14 17:05:51 [emerg] 2928385#2928385: "http" directive is not allowed here in /etc/nginx/sites-enabled/durum:10" [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[17:08:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[17:08:53] <logmsgbot>	 !log swfrench@deploy1003 swfrench: Backport for [[gerrit:1135507|Remove PHP 8.1 migration WikimediaEvents settings (T391421)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:10:00] <logmsgbot>	 !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1181.eqiad.wmnet with OS bullseye
[17:10:12] <wikibugs>	 (CR) Ebernhardson: [C:+1] cirrussearch: enable knn native lib [puppet] - https://gerrit.wikimedia.org/r/1135441 (https://phabricator.wikimedia.org/T388549) (owner: DCausse)
[17:10:14] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02), Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10740234 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1181.eqi...
[17:10:38] <logmsgbot>	 !log swfrench@deploy1003 swfrench: Continuing with sync
[17:12:41] <jinxer-wm>	 FIRING: [7x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[17:13:21] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1110 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:13:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[17:14:33] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1148 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:14:50] <wikibugs>	 (Abandoned) Ebernhardson: tlsproxy: Reload nginx when on-disk and served cert don't match [puppet] - https://gerrit.wikimedia.org/r/1133187 (https://phabricator.wikimedia.org/T390599) (owner: Ebernhardson)
[17:15:33] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1148 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:15:33] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:15:58] <wikibugs>	 (CR) Dzahn: [C:+2] phabricator weekly changes email: Lower "in progress" threshold to 1y [puppet] - https://gerrit.wikimedia.org/r/1136028 (https://phabricator.wikimedia.org/T380300) (owner: Aklapper)
[17:15:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T391056)', diff saved to https://phabricator.wikimedia.org/P74976 and previous config saved to /var/cache/conftool/dbconfig/20250414-171558-fceratto.json
[17:16:02] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[17:16:04] <wikibugs>	 (CR) Bking: "Plugins have been updated across CODFW, so we are clear to revert." [puppet] - https://gerrit.wikimedia.org/r/1136414 (owner: Ebernhardson)
[17:16:06] <wikibugs>	 (CR) Bking: [C:+2] Revert "mjolnir: temp remove msearch daemon from codfw" [puppet] - https://gerrit.wikimedia.org/r/1136414 (owner: Ebernhardson)
[17:16:15] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2177.codfw.wmnet with reason: Maintenance
[17:16:21] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:16:22] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T391056)', diff saved to https://phabricator.wikimedia.org/P74977 and previous config saved to /var/cache/conftool/dbconfig/20250414-171622-fceratto.json
[17:17:23] <logmsgbot>	 !log swfrench@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135507|Remove PHP 8.1 migration WikimediaEvents settings (T391421)]] (duration: 13m 10s)
[17:17:27] <stashbot>	 T391421: Clean up and abstract PHP_ENGINE 8.1 routing - https://phabricator.wikimedia.org/T391421
[17:17:41] <jinxer-wm>	 FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[17:18:33] <swfrench-wmf>	 FYI, I have a couple of other cleanups to fit in during this window, but I'm done with deployments
[17:18:43] <wikibugs>	 (CR) Bking: [C:+2] cirrussearch: enable knn native lib [puppet] - https://gerrit.wikimedia.org/r/1135441 (https://phabricator.wikimedia.org/T388549) (owner: DCausse)
[17:20:46] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10740369 (phaultfinder)
[17:20:53] <swfrench-wmf>	 !log running: cumin 'A:cp-text' 'disable-puppet "merging ATS config change - T391421"'
[17:20:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:21:48] <wikibugs>	 (CR) Scott French: [C:+2] hieradata: remove mw-php-migration.lua from plugin chains [puppet] - https://gerrit.wikimedia.org/r/1135504 (https://phabricator.wikimedia.org/T391421) (owner: Scott French)
[17:22:27] <wikibugs>	 SRE, observability: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10740391 (herron) First thing I notice is the first panel (using recording rule) applies rate(sum()) and the second panel sum(rate()). Seems like a similar issue to...
[17:22:33] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:23:07] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1194 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:25:36] <swfrench-wmf>	 !log running: run-puppet-agent -e "merging ATS config change - T391421" on cp4040
[17:25:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391644#10740420 (phaultfinder)
[17:25:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:39] <stashbot>	 T391421: Clean up and abstract PHP_ENGINE 8.1 routing - https://phabricator.wikimedia.org/T391421
[17:25:48] <logmsgbot>	 !log hashar@deploy1003 Started deploy [integration/docroot@e92740c]: opensource: remove OOjs Router - T358813
[17:25:51] <stashbot>	 T358813: Document mediawiki-router, move oojs-router into core - https://phabricator.wikimedia.org/T358813
[17:25:59] <logmsgbot>	 !log hashar@deploy1003 Finished deploy [integration/docroot@e92740c]: opensource: remove OOjs Router - T358813 (duration: 00m 10s)
[17:30:47] <swfrench-wmf>	 !log running: cumin -b8 -s60 'A:cp-text' 'run-puppet-agent -e "merging ATS config change - T391421"'
[17:30:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:30:50] <stashbot>	 T391421: Clean up and abstract PHP_ENGINE 8.1 routing - https://phabricator.wikimedia.org/T391421
[17:32:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T391056)', diff saved to https://phabricator.wikimedia.org/P74978 and previous config saved to /var/cache/conftool/dbconfig/20250414-173218-fceratto.json
[17:32:22] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[17:32:36] <wikibugs>	 (PS1) Scott French: P:trafficserver::backend: absent mw-php-migration [puppet] - https://gerrit.wikimedia.org/r/1135505 (https://phabricator.wikimedia.org/T391421)
[17:32:38] <wikibugs>	 (PS1) Scott French: P:trafficserver::backend: remove mw-php-migration [puppet] - https://gerrit.wikimedia.org/r/1135506 (https://phabricator.wikimedia.org/T391421)
[17:33:39] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:37:56] <wikibugs>	 (CR) Ssingh: [C:+1] P:trafficserver::backend: absent mw-php-migration [puppet] - https://gerrit.wikimedia.org/r/1135505 (https://phabricator.wikimedia.org/T391421) (owner: Scott French)
[17:38:02] <wikibugs>	 (CR) Ssingh: [C:+1] P:trafficserver::backend: remove mw-php-migration [puppet] - https://gerrit.wikimedia.org/r/1135506 (https://phabricator.wikimedia.org/T391421) (owner: Scott French)
[17:38:13] <wikibugs>	 (CR) Herron: [C:+1] "🌅" [puppet] - https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: Cwhite)
[17:47:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P74979 and previous config saved to /var/cache/conftool/dbconfig/20250414-174725-fceratto.json
[17:49:10] <wikibugs>	 (PS1) Ebernhardson: search: Remove CirrusSearchJVMGCYoungPoolInsufficient alert [alerts] - https://gerrit.wikimedia.org/r/1136426
[17:49:23] <wikibugs>	 (CR) Herron: [C:+2] "Thanks for the reviews! Good idea, cc-ing releng for awareness" [puppet] - https://gerrit.wikimedia.org/r/1136394 (https://phabricator.wikimedia.org/T391714) (owner: Herron)
[17:50:08] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1194 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:52:27] <wikibugs>	 (CR) Ahmon Dancy: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1136394 (https://phabricator.wikimedia.org/T391714) (owner: Herron)
[17:52:33] <wikibugs>	 (PS19) Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[17:53:34] <wikibugs>	 (CR) Scott French: [C:+2] P:trafficserver::backend: absent mw-php-migration [puppet] - https://gerrit.wikimedia.org/r/1135505 (https://phabricator.wikimedia.org/T391421) (owner: Scott French)
[17:53:41] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5297/co" [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[17:54:55] <wikibugs>	 (CR) Ssingh: "Changes since last time:" [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:00:04] <jouncebot>	 James_F: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikifunctions MediaWiki integration backport II deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1800).
[18:00:17] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1136368 (https://phabricator.wikimedia.org/T386020) (owner: Jforrester)
[18:00:17] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136379 (https://phabricator.wikimedia.org/T391584) (owner: Jforrester)
[18:00:18] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136380 (https://phabricator.wikimedia.org/T391584) (owner: Jforrester)
[18:00:54] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[18:01:11] <wikibugs>	 (Merged) jenkins-bot: Switch test Wikifunctions client deployment from test2wiki to test2iki [mediawiki-config] - https://gerrit.wikimedia.org/r/1136379 (https://phabricator.wikimedia.org/T391584) (owner: Jforrester)
[18:01:15] <wikibugs>	 (Merged) jenkins-bot: Document Wikifunctions options, adding wgWikiLambdaClientModeOffline [mediawiki-config] - https://gerrit.wikimedia.org/r/1136380 (https://phabricator.wikimedia.org/T391584) (owner: Jforrester)
[18:01:39] <wikibugs>	 (CR) Ssingh: "I think this is ready for review. Thanks a lot for the feedback and rubber ducking, @vgutierrez@wikimedia.org!" [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:02:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P74980 and previous config saved to /var/cache/conftool/dbconfig/20250414-180232-fceratto.json
[18:04:05] <wikibugs>	 (CR) Ssingh: "Dropping ssl_dhparam too. Not really required for TLS1.3." [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:04:47] <wikibugs>	 (Merged) jenkins-bot: Complete our RecentChanges entry generation and formatting [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - https://gerrit.wikimedia.org/r/1136368 (https://phabricator.wikimedia.org/T386020) (owner: Jforrester)
[18:05:04] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1136368|Complete our RecentChanges entry generation and formatting (T386020)]], [[gerrit:1136379|Switch test Wikifunctions client deployment from test2wiki to test2iki (T391584)]], [[gerrit:1136380|Document Wikifunctions options, adding wgWikiLambdaClientModeOffline (T391584)]]
[18:05:11] <stashbot>	 T386020: Implement design for change propagation when WF function calls change - https://phabricator.wikimedia.org/T386020
[18:05:11] <stashbot>	 T391584: Complete the successful deployment of embedded Wikifunctions to testwiki - https://phabricator.wikimedia.org/T391584
[18:05:15] <wikibugs>	 (PS20) Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[18:06:21] <wikibugs>	 (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5298/co" [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:15:07] <wikibugs>	 (CR) Vgutierrez: P:durum: add conditional to enable ECH (durum2002) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:15:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391644#10740691 (phaultfinder)
[18:16:27] <wikibugs>	 (CR) Ssingh: [V:+1] P:durum: add conditional to enable ECH (durum2002) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:17:40] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T391056)', diff saved to https://phabricator.wikimedia.org/P74981 and previous config saved to /var/cache/conftool/dbconfig/20250414-181740-fceratto.json
[18:17:44] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[18:17:56] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2190.codfw.wmnet with reason: Maintenance
[18:18:03] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T391056)', diff saved to https://phabricator.wikimedia.org/P74982 and previous config saved to /var/cache/conftool/dbconfig/20250414-181802-fceratto.json
[18:18:10] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:19:11] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10740730 (Jclark-ctr) a:VRiley-WMF
[18:19:24] <wikibugs>	 (CR) Ssingh: [V:+1] P:durum: add conditional to enable ECH (durum2002) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:19:54] <wikibugs>	 (PS21) Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[18:19:57] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10740731 (Jclark-ctr) a:VRiley-WMF
[18:20:15] <wikibugs>	 (CR) Ssingh: "If we are removing CSP, I removed cache-control here as well." [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:20:45] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10740733 (Jclark-ctr) a:VRiley-WMF
[18:20:54] <wikibugs>	 ops-codfw, SRE, DC-Ops: Power Supply - PS Redundancy - issue on wikikube-worker2042:9290 - https://phabricator.wikimedia.org/T391860#10740735 (Jhancock.wm) Open→Resolved a:Jhancock.wm Another instance of a third party loosening power cables in our rack. Reseated.
[18:23:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:24:37] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1136368|Complete our RecentChanges entry generation and formatting (T386020)]], [[gerrit:1136379|Switch test Wikifunctions client deployment from test2wiki to test2iki (T391584)]], [[gerrit:1136380|Document Wikifunctions options, adding wgWikiLambdaClientModeOffline (T391584)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[18:24:41] <stashbot>	 T386020: Implement design for change propagation when WF function calls change - https://phabricator.wikimedia.org/T386020
[18:24:42] <stashbot>	 T391584: Complete the successful deployment of embedded Wikifunctions to testwiki - https://phabricator.wikimedia.org/T391584
[18:25:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10740771 (phaultfinder)
[18:27:01] <James_F>	 !log Run `mwscript sql --wiki=testwiki /srv/mediawiki-staging/php-1.44.0-wmf.24/extensions/WikiLambda/sql/mysql/table-usage.sql` for T391885
[18:27:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:27:05] <stashbot>	 T391885: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'testwiki.wikifunctionsclient_usage' doesn't exist Function: MediaWiki\Extension\WikiLambda\WikifunctionsClientStore::deleteWikifunctionsUsage Query: DELETE FROM `wikifunctionscli - https://phabricator.wikimedia.org/T391885
[18:27:42] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Continuing with sync
[18:34:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T391056)', diff saved to https://phabricator.wikimedia.org/P74983 and previous config saved to /var/cache/conftool/dbconfig/20250414-183411-fceratto.json
[18:34:15] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[18:35:10] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[18:36:04] <wikibugs>	 (CR) Scott French: [C:+2] P:trafficserver::backend: remove mw-php-migration [puppet] - https://gerrit.wikimedia.org/r/1135506 (https://phabricator.wikimedia.org/T391421) (owner: Scott French)
[18:36:33] <wikibugs>	 (PS2) Scott French: P:trafficserver::backend: remove mw-php-migration [puppet] - https://gerrit.wikimedia.org/r/1135506 (https://phabricator.wikimedia.org/T391421)
[18:37:29] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136368|Complete our RecentChanges entry generation and formatting (T386020)]], [[gerrit:1136379|Switch test Wikifunctions client deployment from test2wiki to test2iki (T391584)]], [[gerrit:1136380|Document Wikifunctions options, adding wgWikiLambdaClientModeOffline (T391584)]] (duration: 32m 25s)
[18:37:33] <stashbot>	 T386020: Implement design for change propagation when WF function calls change - https://phabricator.wikimedia.org/T386020
[18:37:34] <stashbot>	 T391584: Complete the successful deployment of embedded Wikifunctions to testwiki - https://phabricator.wikimedia.org/T391584
[18:39:38] <wikibugs>	 (CR) Scott French: [C:+2] P:trafficserver::backend: remove mw-php-migration [puppet] - https://gerrit.wikimedia.org/r/1135506 (https://phabricator.wikimedia.org/T391421) (owner: Scott French)
[18:46:16] <jinxer-wm>	 FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.268s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:49:19] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P74984 and previous config saved to /var/cache/conftool/dbconfig/20250414-184918-fceratto.json
[18:51:16] <jinxer-wm>	 RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.268s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[18:54:39] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10740852 (phaultfinder)
[18:55:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row D - bking@cumin2002 - T388610
[18:55:58] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[19:00:22] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops, Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10740880 (Jclark-ctr) @elukey would you like to shut it down or can we shutdown on our own?
[19:00:40] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, DC-Ops, Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10740885 (Jclark-ctr) a:Jclark-ctr
[19:02:29] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2109 to cirrussearch2109
[19:02:51] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[19:03:24] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391654#10740894 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[19:04:21] <wikibugs>	 (CR) Kamila Součková: [C:+1] php-fpm-multiversion-base: Stop copying mwcron scripts [docker-images/production-images] - https://gerrit.wikimedia.org/r/1135922 (https://phabricator.wikimedia.org/T391665) (owner: Clément Goubert)
[19:04:27] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P74985 and previous config saved to /var/cache/conftool/dbconfig/20250414-190426-fceratto.json
[19:04:36] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10740896 (phaultfinder)
[19:05:43] <wikibugs>	 (CR) Bking: [C:+2] "This should really help reduce alert noise, thanks!" [alerts] - https://gerrit.wikimedia.org/r/1136426 (owner: Ebernhardson)
[19:06:54] <wikibugs>	 SRE-OnFire, Incident Tooling: Incident documents are less visible with Corto - https://phabricator.wikimedia.org/T390126#10740901 (Eevans) >>! In T390126#10719499, @jhathaway wrote: > reached out to ITS in a follow-up task: https://wikimediainternal.zendesk.com/hc/en-us/requests/111894  Just following up...
[19:07:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2109 to cirrussearch2109 - bking@cumin2002"
[19:07:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2109 to cirrussearch2109 - bking@cumin2002"
[19:07:44] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:07:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2109
[19:07:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2109
[19:08:36] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2109 to cirrussearch2109
[19:10:36] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2109.codfw.wmnet with OS bullseye
[19:10:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2109
[19:12:59] <wikibugs>	 (CR) Dwisehaupt: "@jhathaway@wikimedia.org I think we are ready to roll this out when possible (maybe tomorrow 4/15). I'm not 100% certain that the prod mx-" [puppet] - https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[19:13:01] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.netbox
[19:13:46] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391644#10740934 (Jclark-ctr) Open→Resolved a:Jclark-ctr Rebalanced pdu
[19:14:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10740947 (phaultfinder)
[19:17:09] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2109 - bking@cumin2002"
[19:17:14] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2109 - bking@cumin2002"
[19:17:14] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:17:15] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2109.codfw.wmnet 160.48.192.10.in-addr.arpa 0.6.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[19:17:18] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2109.codfw.wmnet 160.48.192.10.in-addr.arpa 0.6.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[19:17:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2109
[19:17:41] <wikibugs>	 (PS1) Eevans: Upgade data-gateway to v1.0.12 [deployment-charts] - https://gerrit.wikimedia.org/r/1136436 (https://phabricator.wikimedia.org/T370470)
[19:18:17] <urandom>	 mforns: Ok, first step: upgrading data-gateway to v1.0.12 (matching what is already in staging) ^^^
[19:19:13] <urandom>	 (as soon as helm-lint has had its say ofc...)
[19:19:34] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T391056)', diff saved to https://phabricator.wikimedia.org/P74986 and previous config saved to /var/cache/conftool/dbconfig/20250414-191933-fceratto.json
[19:19:37] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[19:19:50] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2194.codfw.wmnet with reason: Maintenance
[19:19:55] <wikibugs>	 (CR) Jgreen: [C:+1] Enable mail and dovecot services for community_civicrm [puppet] - https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: Dwisehaupt)
[19:19:57] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T391056)', diff saved to https://phabricator.wikimedia.org/P74987 and previous config saved to /var/cache/conftool/dbconfig/20250414-191957-fceratto.json
[19:20:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2109
[19:20:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2109
[19:20:22] <wikibugs>	 (CR) Eevans: [C:+2] Upgade data-gateway to v1.0.12 [deployment-charts] - https://gerrit.wikimedia.org/r/1136436 (https://phabricator.wikimedia.org/T370470) (owner: Eevans)
[19:21:42] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Fix "changeme" cable labels - https://phabricator.wikimedia.org/T390818#10740976 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[19:21:49] <wikibugs>	 (Merged) jenkins-bot: Upgade data-gateway to v1.0.12 [deployment-charts] - https://gerrit.wikimedia.org/r/1136436 (https://phabricator.wikimedia.org/T370470) (owner: Eevans)
[19:23:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[19:23:43] <logmsgbot>	 !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/data-gateway: apply
[19:24:02] <logmsgbot>	 !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply
[19:24:27] <logmsgbot>	 !log eevans@deploy1003 helmfile [eqiad] START helmfile.d/services/data-gateway: apply
[19:24:45] <logmsgbot>	 !log eevans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply
[19:26:01] <urandom>	 mforns: ok, the data-gateway service is at v1.0.12, so I'm going to drop those 8 tables
[19:28:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[19:31:32] <urandom>	 !log dropped & recreated 8 commons impact metrics tables — https://phabricator.wikimedia.org/T370470#10687053
[19:31:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:31:42] <urandom>	 mforns: you are good to start reloading
[19:34:56] <logmsgbot>	 !log mforns@deploy1003 helmfile [eqiad] START helmfile.d/services/commons-impact-analytics: apply
[19:35:11] <logmsgbot>	 !log mforns@deploy1003 helmfile [eqiad] DONE helmfile.d/services/commons-impact-analytics: apply
[19:35:19] <logmsgbot>	 !log mforns@deploy1003 helmfile [codfw] START helmfile.d/services/commons-impact-analytics: apply
[19:35:33] <logmsgbot>	 !log mforns@deploy1003 helmfile [codfw] DONE helmfile.d/services/commons-impact-analytics: apply
[19:36:10] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T391056)', diff saved to https://phabricator.wikimedia.org/P74988 and previous config saved to /var/cache/conftool/dbconfig/20250414-193610-fceratto.json
[19:36:15] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[19:36:45] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2109.codfw.wmnet with reason: host reimage
[19:40:18] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2109.codfw.wmnet with reason: host reimage
[19:43:48] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Uploading, and 2 others: Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of {limit} seconds was exceeded - https://phabricator.wikimedia.org/T381109#10741064 (Umherirrender) a:Umherirrender
[19:47:45] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-File-management, MediaWiki-Uploading, and 2 others: Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of {limit} seconds was exceeded (via Special:UploadStash) - https://phabricator.wikimedia.org/T381109#10741073 (Umherirrender)
[19:47:47] <logmsgbot>	 !log mforns@deploy1003 Started deploy [analytics/refinery@6fe5a7e] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6fe5a7e3]
[19:48:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:50:31] <logmsgbot>	 !log mforns@deploy1003 Finished deploy [analytics/refinery@6fe5a7e] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6fe5a7e3] (duration: 02m 44s)
[19:51:18] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P74989 and previous config saved to /var/cache/conftool/dbconfig/20250414-195117-fceratto.json
[19:53:35] <wikibugs>	 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10741088 (10VRiley-WMF) After working with Dell a bit more on this, I pushed back on their request regarding the iDRAC. They initially wanted to check if the newer firmware would collect more in-depth logs...
[19:53:39] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service restbase1044-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:56:31] <James_F>	 Nothing in the deploy window, so I may steal it.
[19:56:36] <wikibugs>	 (03CR) 10JHathaway: [C:03+1] "looks good, let me know if you need help in the rollout" [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt)
[19:57:12] <logmsgbot>	 !log mforns@deploy1003 Started deploy [analytics/refinery@6fe5a7e]: Regular analytics weekly train [analytics/refinery@6fe5a7e3]
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T2000).
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:00:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10741111 (10phaultfinder)
[20:00:43] <logmsgbot>	 !log mforns@deploy1003 Finished deploy [analytics/refinery@6fe5a7e]: Regular analytics weekly train [analytics/refinery@6fe5a7e3] (duration: 03m 31s)
[20:00:45] <wikibugs>	 (03PS1) 10Jforrester: FunctionCalls: Use base64url encoding rather than raw base64 [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136447 (https://phabricator.wikimedia.org/T391584)
[20:00:56] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136447 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester)
[20:01:22] <wikibugs>	 (03PS1) 10Jforrester: FunctionCalls: Don't error if Wikifunctions.org isn't in client mode yet [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136448 (https://phabricator.wikimedia.org/T391584)
[20:01:27] <logmsgbot>	 !log mforns@deploy1003 Started deploy [analytics/refinery@6fe5a7e] (thin): Regular analytics weekly train THIN [analytics/refinery@6fe5a7e3]
[20:01:32] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136448 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester)
[20:01:55] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row D - bking@cumin2002 - T388610
[20:01:59] <stashbot>	 T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610
[20:02:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2109.codfw.wmnet with OS bullseye
[20:02:36] <logmsgbot>	 !log mforns@deploy1003 Finished deploy [analytics/refinery@6fe5a7e] (thin): Regular analytics weekly train THIN [analytics/refinery@6fe5a7e3] (duration: 01m 09s)
[20:03:27] <wikibugs>	 (03PS1) 10Jforrester: FunctionCalls: Throw an explicable error if json_encode returns null [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136449 (https://phabricator.wikimedia.org/T391584)
[20:03:38] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136447 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester)
[20:03:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136448 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester)
[20:03:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136449 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester)
[20:03:42] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136449 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester)
[20:06:25] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P74990 and previous config saved to /var/cache/conftool/dbconfig/20250414-200624-fceratto.json
[20:08:49] <wikibugs>	 (03Merged) 10jenkins-bot: FunctionCalls: Use base64url encoding rather than raw base64 [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136447 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester)
[20:08:52] <wikibugs>	 (03Merged) 10jenkins-bot: FunctionCalls: Don't error if Wikifunctions.org isn't in client mode yet [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136448 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester)
[20:08:54] <wikibugs>	 (03Merged) 10jenkins-bot: FunctionCalls: Throw an explicable error if json_encode returns null [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136449 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester)
[20:09:13] <logmsgbot>	 !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1136447|FunctionCalls: Use base64url encoding rather than raw base64 (T391584)]], [[gerrit:1136448|FunctionCalls: Don't error if Wikifunctions.org isn't in client mode yet (T391584)]], [[gerrit:1136449|FunctionCalls: Throw an explicable error if json_encode returns null (T391584)]]
[20:09:16] <stashbot>	 T391584: Complete the successful deployment of embedded Wikifunctions to testwiki - https://phabricator.wikimedia.org/T391584
[20:14:03] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1136447|FunctionCalls: Use base64url encoding rather than raw base64 (T391584)]], [[gerrit:1136448|FunctionCalls: Don't error if Wikifunctions.org isn't in client mode yet (T391584)]], [[gerrit:1136449|FunctionCalls: Throw an explicable error if json_encode returns null (T391584)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:17:02] <logmsgbot>	 !log jforrester@deploy1003 jforrester: Continuing with sync
[20:21:32] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T391056)', diff saved to https://phabricator.wikimedia.org/P74991 and previous config saved to /var/cache/conftool/dbconfig/20250414-202131-fceratto.json
[20:21:38] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[20:21:47] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2205.codfw.wmnet with reason: Maintenance
[20:21:53] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2205 (T391056)', diff saved to https://phabricator.wikimedia.org/P74992 and previous config saved to /var/cache/conftool/dbconfig/20250414-202152-fceratto.json
[20:23:33] <logmsgbot>	 !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136447|FunctionCalls: Use base64url encoding rather than raw base64 (T391584)]], [[gerrit:1136448|FunctionCalls: Don't error if Wikifunctions.org isn't in client mode yet (T391584)]], [[gerrit:1136449|FunctionCalls: Throw an explicable error if json_encode returns null (T391584)]] (duration: 14m 20s)
[20:23:37] <stashbot>	 T391584: Complete the successful deployment of embedded Wikifunctions to testwiki - https://phabricator.wikimedia.org/T391584
[20:38:00] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T391056)', diff saved to https://phabricator.wikimedia.org/P74993 and previous config saved to /var/cache/conftool/dbconfig/20250414-203800-fceratto.json
[20:38:04] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[20:39:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10741316 (10phaultfinder)
[20:53:07] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P74994 and previous config saved to /var/cache/conftool/dbconfig/20250414-205307-fceratto.json
[20:56:16] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3636 MB (3% inode=98%): /tmp 3636 MB (3% inode=98%): /var/tmp 3636 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[21:00:04] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T2100).
[21:03:20] <wikibugs>	 06SRE-OnFire, 10Incident Tooling: corto: track responders - https://phabricator.wikimedia.org/T391897 (10Eevans) 03NEW
[21:05:55] <wikibugs>	 06SRE-OnFire, 10Incident Tooling: Incident documents are less visible with Corto - https://phabricator.wikimedia.org/T390126#10741431 (10jhathaway) not yet, but I asked for an update.
[21:08:14] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P74995 and previous config saved to /var/cache/conftool/dbconfig/20250414-210814-fceratto.json
[21:13:39] <jinxer-wm>	 FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[21:15:34] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] keyholder: restart proxy after arming a key [puppet] - 10https://gerrit.wikimedia.org/r/1136022 (https://phabricator.wikimedia.org/T374711) (owner: 10JHathaway)
[21:16:00] <wikibugs>	 07Puppet, 06SRE, 06Infrastructure-Foundations, 10Keyholder, 13Patch-For-Review: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10741473 (10jhathaway) 05Open→03Resolved a:03jhathaway
[21:17:41] <jinxer-wm>	 FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[21:23:21] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T391056)', diff saved to https://phabricator.wikimedia.org/P74996 and previous config saved to /var/cache/conftool/dbconfig/20250414-212320-fceratto.json
[21:23:25] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[21:23:37] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2227.codfw.wmnet with reason: Maintenance
[21:23:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2227 (T391056)', diff saved to https://phabricator.wikimedia.org/P74997 and previous config saved to /var/cache/conftool/dbconfig/20250414-212344-fceratto.json
[21:24:01] <wikibugs>	 (03CR) 10JHathaway: [C:03+2] puppetmaster tests: remove resolving www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1134289 (owner: 10JHathaway)
[21:34:37] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10741549 (10phaultfinder)
[21:39:58] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T391056)', diff saved to https://phabricator.wikimedia.org/P74998 and previous config saved to /var/cache/conftool/dbconfig/20250414-213957-fceratto.json
[21:40:01] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[21:45:04] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[21:45:46] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[21:45:46] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[21:50:18] <icinga-wm>	 PROBLEM - SSH on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:50:24] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[21:50:38] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 0.602 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[21:50:42] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 555 bytes in 6.172 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[21:50:54] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Mon 05 May 2025 06:42:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[21:51:08] <icinga-wm>	 RECOVERY - SSH on grafana1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:51:14] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Mon 05 May 2025 06:42:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[21:51:39] <denisse>	 ^ looking, I can't access Grafana.
[21:53:46] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[21:53:46] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[21:54:04] <icinga-wm>	 PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[21:55:05] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P74999 and previous config saved to /var/cache/conftool/dbconfig/20250414-215504-fceratto.json
[21:55:18] <icinga-wm>	 PROBLEM - SSH on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:55:21] <wikibugs>	 (03PS1) 10Bking: cirrussearch: Add row C hosts [puppet] - 10https://gerrit.wikimedia.org/r/1136464 (https://phabricator.wikimedia.org/T388610)
[21:55:24] <icinga-wm>	 PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[21:55:45] <wikibugs>	 (03CR) 10CI reject: [V:04-1] cirrussearch: Add row C hosts [puppet] - 10https://gerrit.wikimedia.org/r/1136464 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[21:57:13] <wikibugs>	 (03PS2) 10Bking: cirrussearch: Add row C hosts [puppet] - 10https://gerrit.wikimedia.org/r/1136464 (https://phabricator.wikimedia.org/T388610)
[21:58:30] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: Add row C hosts [puppet] - 10https://gerrit.wikimedia.org/r/1136464 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[21:58:47] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrussearch: Add row C hosts [puppet] - 10https://gerrit.wikimedia.org/r/1136464 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[22:00:54] <jinxer-wm>	 FIRING: [2x] SLOMetricAbsent: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[22:01:38] <wikibugs>	 (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: Add row D non-master hosts to elasticsearch pools [puppet] - 10https://gerrit.wikimedia.org/r/1135976 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[22:01:48] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrussearch: Add row D non-master hosts to elasticsearch pools [puppet] - 10https://gerrit.wikimedia.org/r/1135976 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking)
[22:04:43] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10741650 (10phaultfinder)
[22:05:02] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Mon 05 May 2025 06:42:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[22:05:08] <icinga-wm>	 RECOVERY - SSH on grafana1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[22:05:14] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Mon 05 May 2025 06:42:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[22:05:36] <icinga-wm>	 RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[22:05:36] <icinga-wm>	 RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[22:06:41] <logmsgbot>	 !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch2060.codfw.wmnet|cirrussearch2067.codfw.wmnet|cirrussearch2068.codfw.wmnet|cirrussearch2072.codfw.wmnet|cirrussearch2085.codfw.wmnet|cirrussearch2104.codfw.wmnet|cirrussearch2105.codfw.wmnet|cirrussearch2107.codfw.wmnet|cirrussearch2109.codfw.wmnet|cirrussearch2114.codfw.wmnet|cirrussearch2115.codfw.wmnet
[22:07:41] <jinxer-wm>	 FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed
[22:10:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P75000 and previous config saved to /var/cache/conftool/dbconfig/20250414-221012-fceratto.json
[22:13:18] <sbassett>	 Hey all - currently deploying one security patch for today’s window: T391343
[22:16:16] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3510 MB (3% inode=98%): /tmp 3510 MB (3% inode=98%): /var/tmp 3510 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[22:19:44] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10741660 (10phaultfinder)
[22:20:32] <sbassett>	 !log Deployment of security patch for T391343 halted
[22:20:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:23:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service,opensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:25:20] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T391056)', diff saved to https://phabricator.wikimedia.org/P75001 and previous config saved to /var/cache/conftool/dbconfig/20250414-222519-fceratto.json
[22:25:24] <stashbot>	 T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056
[22:25:25] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2239.codfw.wmnet with reason: Maintenance
[22:27:16] <wikibugs>	 (03CR) 10Dzahn: "Ah, transitory in _that_ way. I see now, ok. thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn)
[22:29:47] <jinxer-wm>	 FIRING: [5x] HelmReleaseBadStatus: Helm release mw-api-ext/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:30:09] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "Ideally, let's avoid a pattern where setting up a new machine requires coordination between teams (and using both puppet and scap)." [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn)
[22:30:43] <sbassett>	 !log Deployed previous good versions of affected files for T391343
[22:30:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:33:50] <logmsgbot>	 !log dzahn@deploy1003 Installing scap version "4.153.0" for 1 host(s)
[22:34:47] <jinxer-wm>	 RESOLVED: [5x] HelmReleaseBadStatus: Helm release mw-api-ext/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[22:34:49] <logmsgbot>	 !log dzahn@deploy1003 Installation of scap version "4.153.0" completed for 1 hosts
[22:34:54] <mutante>	 !log deploy1003 - scap install-world -l releases2003.codfw.wmnet T391590
[22:34:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:57] <stashbot>	 T391590: PuppetFailure - releases2003 - https://phabricator.wikimedia.org/T391590
[22:35:34] <icinga-wm>	 PROBLEM - MD RAID on aqs1015 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[22:35:35] <icinga-wm>	 ACKNOWLEDGEMENT - MD RAID on aqs1015 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T391903 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[22:35:45] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T391903 (10ops-monitoring-bot) 03NEW
[22:37:18] <wikibugs>	 (03PS1) 10Ladsgroup: wikimedia.org: Add gitlab.wm.o HIBP TXT record, remove lists [dns] - 10https://gerrit.wikimedia.org/r/1136465
[22:37:52] <wikibugs>	 (03CR) 10CI reject: [V:04-1] wikimedia.org: Add gitlab.wm.o HIBP TXT record, remove lists [dns] - 10https://gerrit.wikimedia.org/r/1136465 (owner: 10Ladsgroup)
[22:39:08] <wikibugs>	 (03CR) 10Ladsgroup: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1136465 (owner: 10Ladsgroup)
[22:39:49] <wikibugs>	 (03PS2) 10Ladsgroup: wikimedia.org: Add gitlab.wm.o HIBP TXT record, remove lists [dns] - 10https://gerrit.wikimedia.org/r/1136465
[22:41:55] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] wikimedia.org: Add gitlab.wm.o HIBP TXT record, remove lists [dns] - 10https://gerrit.wikimedia.org/r/1136465 (owner: 10Ladsgroup)
[22:42:16] <logmsgbot>	 !log ladsgroup@dns1004 START - running authdns-update
[22:44:40] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10741683 (10phaultfinder)
[22:44:45] <logmsgbot>	 !log ladsgroup@dns1004 END - running authdns-update
[22:46:13] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "What scap command would you actually run?" [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn)
[22:53:46] <wikibugs>	 (03PS1) 10MusikAnimal: testwiki: enable wgUseCodexSpecialBlock and wgEnableMultiBlocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136466 (https://phabricator.wikimedia.org/T377121)
[22:54:53] <wikibugs>	 (03CR) 10Tim Starling: [C:03+1] testwiki: enable wgUseCodexSpecialBlock and wgEnableMultiBlocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136466 (https://phabricator.wikimedia.org/T377121) (owner: 10MusikAnimal)
[22:56:16] <icinga-wm>	 PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3448 MB (3% inode=98%): /tmp 3448 MB (3% inode=98%): /var/tmp 3448 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops
[22:58:43] <jinxer-wm>	 FIRING: [5x] ProbeDown: Service restbase1044-a:9042 has failed probes (tcp_cassandra_a_cql_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T2300)
[23:00:46] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "deploy1003:~] $ scap deploy -v -l releases2003.codfw.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn)
[23:03:29] <wikibugs>	 06SRE: archiva1002 - disk 98% full - https://phabricator.wikimedia.org/T391904 (10Dzahn) 03NEW
[23:12:02] <zabe>	 !log zabe@mwmaint1002:~$ cat group2.dblist | xargs -I{} bash -c "echo {}; mwscript extensions/WikimediaMaintenance/migrateESRefToContentTableStage2.php {} --delete /home/zabe/afl_text_table_deletedump/{} --sleep 0.3" # T381599
[23:12:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:12:05] <stashbot>	 T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599
[23:22:37] <urandom>	 !log bootstrapping Cassandra/restbase1044-b — T389423
[23:22:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:22:41] <stashbot>	 T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423
[23:23:39] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service restbase1044-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:28:39] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[23:40:00] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1136472
[23:40:00] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1136472 (owner: 10TrainBranchBot)
[23:40:33] <wikibugs>	 (03CR) 10Creynolds: [C:03+1] dumps enterprise copy update [puppet] - 10https://gerrit.wikimedia.org/r/1135522 (owner: 10Creynolds)
[23:48:39] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:52:26] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1136472 (owner: 10TrainBranchBot)
[23:56:00] <wikibugs>	 (03PS2) 10Scott French: hieradata: switch parsoidtest1001 to PHP 8.1 [puppet] - 10https://gerrit.wikimedia.org/r/1136413 (https://phabricator.wikimedia.org/T380485)