[00:04:43] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:05:39] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [00:09:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:09:49] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1136139 [00:09:50] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1136139 (owner: 10TrainBranchBot) [00:10:47] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 639.86 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:20:03] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [00:25:08] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [00:30:47] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1136139 (owner: 10TrainBranchBot) [00:43:54] FIRING: [2x] 
SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:46:35] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [00:56:37] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/4dc58a2470693bde7218013f86951eceb81d1c9e87f9ef816f49591d04626c20/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [01:36:37] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops [02:10:47] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:32:11] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [03:40:03] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:45:03] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:52:11] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [04:09:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:20:03] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [04:25:03] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:45:03] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:46:35] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [05:08:54] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:48:27] (03PS4) 10Arnaudb: gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) [05:49:02] (03PS1) 10KartikMistry: Update MinT to 2025-04-09-054213-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136148 [05:51:11] (03PS5) 10Arnaudb: gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) [05:52:11] (03CR) 10Arnaudb: gerrit: failover cookbook (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [05:57:04] (03PS1) 10Ayounsi: Host BGP: ignore hosts with no primary IP [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1136150 [05:57:41] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:57:46] (03CR) 10CI reject: [V:04-1] gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [05:58:39] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [05:58:54] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:01:26] gitui [06:02:07] (03PS6) 10Arnaudb: gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) [06:04:37] (03PS1) 10Ayounsi: magru: remove novaacore/momentum [homer/public] - 10https://gerrit.wikimedia.org/r/1136152 (https://phabricator.wikimedia.org/T381913) [06:12:48] <_joe_> !log uploaded conftool 5.1.0 [06:12:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:15:08] !log installing perl security updates [06:15:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:23:50] (03PS1) 10Muehlenhoff: Add record for jvanderhoop LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/1136155 [06:26:05] (03CR) 10Slyngshede: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1136155 (owner: 10Muehlenhoff) [06:27:50] (03CR) 10Muehlenhoff: [C:03+2] Add record for jvanderhoop LDAP access [puppet] - 10https://gerrit.wikimedia.org/r/1136155 (owner: 10Muehlenhoff) [06:27:57] 10ops-codfw, 06DC-Ops: cr2-codfw: 2/4 PSU down - https://phabricator.wikimedia.org/T391790 (10ayounsi) 03NEW p:05Triage→03High [06:35:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136104 (https://phabricator.wikimedia.org/T348611) (owner: 10Anzx) [06:39:29] (03PS1) 10Muehlenhoff: Track LDAP access for bcampbell804 [puppet] - 10https://gerrit.wikimedia.org/r/1136247 [06:41:10] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/miscweb: apply [06:46:42] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1043.eqiad.wmnet [06:47:18] Testing MinT change, not deploying yet. 
[06:48:14] (03CR) 10KartikMistry: [C:03+2] Update MinT to 2025-04-09-054213-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136148 (owner: 10KartikMistry) [06:48:27] (03CR) 10Slyngshede: [C:03+1] Track LDAP access for bcampbell804 [puppet] - 10https://gerrit.wikimedia.org/r/1136247 (owner: 10Muehlenhoff) [06:50:04] (03Merged) 10jenkins-bot: Update MinT to 2025-04-09-054213-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136148 (owner: 10KartikMistry) [06:50:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1043.eqiad.wmnet [06:50:54] (03CR) 10Muehlenhoff: [C:03+2] Track LDAP access for bcampbell804 [puppet] - 10https://gerrit.wikimedia.org/r/1136247 (owner: 10Muehlenhoff) [06:51:18] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/miscweb: apply [06:52:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc3 T391454', diff saved to https://phabricator.wikimedia.org/P74908 and previous config saved to /var/cache/conftool/dbconfig/20250414-065203-marostegui.json [06:52:06] T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454 [06:52:18] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10737177 (10VRiley-WMF) Dell is currently with their level 3 engineers and looking at this ticket. They have laid out this plan of action on this server "Plan of Action Apply the latest iDRAC firmware up... 
[06:54:15] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti1043.eqiad.wmnet [06:54:55] (03PS1) 10Marostegui: mariadb: pc2, upgrade to mariadb 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1136249 (https://phabricator.wikimedia.org/T391454) [06:55:10] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2013.codfw.wmnet,pc1013.eqiad.wmnet with reason: Maintenance [06:58:19] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10737181 (10Marostegui) Thanks @VRiley-WMF - hopefully the plan is not to upgrade to that latest firmware and then wait again a few months to see exactly the same crash. Can you double check that their are... [06:58:25] (03CR) 10Marostegui: [C:03+2] mariadb: pc2, upgrade to mariadb 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1136249 (https://phabricator.wikimedia.org/T391454) (owner: 10Marostegui) [06:59:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1043.eqiad.wmnet [07:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T0700). [07:00:05] anzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:50] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10737185 (10VRiley-WMF) Understood, I will be relaying this information to Dell to inquire if there are additional plans of action. As, I do know we have similar servers with similar configuration (if not... [07:01:36] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10737186 (10Marostegui) Thank you! 
[07:01:47] !log installing subversion security updates [07:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc3 T391454', diff saved to https://phabricator.wikimedia.org/P74909 and previous config saved to /var/cache/conftool/dbconfig/20250414-070220-marostegui.json [07:02:24] T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454 [07:04:29] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [07:04:45] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [07:05:43] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:06:39] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:10:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1044.eqiad.wmnet [07:13:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1044.eqiad.wmnet [07:15:19] (03PS1) 10Muehlenhoff: Remove now obsolete Cumin aliases for job runners [puppet] - 10https://gerrit.wikimedia.org/r/1136253 (https://phabricator.wikimedia.org/T354791) [07:15:55] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2014.codfw.wmnet,pc1014.eqiad.wmnet with reason: Maintenance [07:16:41] (03PS1) 10Marostegui: mariadb: Upgrade pc4 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1136254 (https://phabricator.wikimedia.org/T391454) [07:16:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc4 T391454', diff saved to https://phabricator.wikimedia.org/P74910 and previous config saved to 
/var/cache/conftool/dbconfig/20250414-071653-marostegui.json [07:16:56] T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454 [07:19:19] (03CR) 10Marostegui: [C:03+2] mariadb: Upgrade pc4 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1136254 (https://phabricator.wikimedia.org/T391454) (owner: 10Marostegui) [07:24:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc4 T391454', diff saved to https://phabricator.wikimedia.org/P74911 and previous config saved to /var/cache/conftool/dbconfig/20250414-072437-marostegui.json [07:24:42] T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454 [07:25:24] (03CR) 10Elukey: [C:03+2] services: update proton's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135910 (owner: 10Elukey) [07:25:31] jouncebot: nowandnext [07:25:31] For the next 0 hour(s) and 34 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T0700) [07:25:31] In 2 hour(s) and 34 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1000) [07:26:16] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host db1178.eqiad.wmnet with OS bullseye [07:26:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10737233 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1178.eqiad.wmnet with OS bullseye [07:27:05] !log elukey@deploy1003 helmfile [staging] START helmfile.d/services/proton: sync [07:27:45] !log elukey@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: sync [07:36:19] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/services/proton: sync [07:37:11] FIRING: [2x] DatasourceNoData: - 
https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [07:37:29] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: sync [07:37:44] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: sync [07:39:00] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: sync [07:42:54] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reboot-single for host ganeti1044.eqiad.wmnet [07:45:03] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:46:21] (03CR) 10KartikMistry: [C:03+2] "Bumping chart so that we can test the T386889" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136148 (owner: 10KartikMistry) [07:48:10] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1044.eqiad.wmnet [07:49:10] (03PS4) 10Volans: remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond) [07:49:10] (03PS1) 10Volans: mysql: make MysqlRemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136325 [07:49:10] (03PS1) 10Volans: cookbook modules: use docstring for title [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136326 [07:49:47] ACKNOWLEDGEMENT - SSH on db1246 is CRITICAL: connect to address 10.64.48.172 and port 22: Connection refused Marostegui Host crashed https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:50:08] (03CR) 10Volans: "Resumed John's CR as I got some request to iterate over RemoteHosts instances. Added tests." [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond) [07:50:36] (03CR) 10Volans: "As requested on another CR." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1136325 (owner: 10Volans) [07:52:07] ACKNOWLEDGEMENT - MariaDB memory on db2220 is CRITICAL: CRIT Memory 98% used. Largest process: mysqld (1575) = 97.3% Marostegui https://phabricator.wikimedia.org/T391795 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [07:53:03] !log gnmic: bump `num-workers` to 12 on netflow1002 - T388641 [07:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:06] T388641: Migrate network icinga alerts to gNMI/prometheus - https://phabricator.wikimedia.org/T388641 [07:57:11] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [07:58:28] !log rebalance ganeti/B T391243 [07:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:31] T391243: Configure sandbox vlan on ganeti1043 and 1044 - https://phabricator.wikimedia.org/T391243 [08:00:42] PROBLEM - Hadoop NodeManager on an-worker1175 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:05:00] (03CR) 10Jelto: [C:03+1] "looks good to me but I'd prefer a solution which depools Gerrit properly instead of running the sync multiple times. 
But this could be a l" [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [08:08:42] RECOVERY - Hadoop NodeManager on an-worker1175 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [08:11:49] !log restarting clamav on vrts to pick up liblzma security updates [08:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:17] (03PS1) 10Slyngshede: IDP: Add dummy secret for Phabricator (test) [labs/private] - 10https://gerrit.wikimedia.org/r/1136327 [08:16:30] (03PS2) 10Slyngshede: IDP: Add dummy secret for Phabricator (test) [labs/private] - 10https://gerrit.wikimedia.org/r/1136327 (https://phabricator.wikimedia.org/T377061) [08:20:03] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [08:20:47] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host db1178.eqiad.wmnet with OS bullseye [08:20:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10737377 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host db1178.eqiad.wmnet with OS bullseye exe... 
[08:22:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1178', diff saved to https://phabricator.wikimedia.org/P74912 and previous config saved to /var/cache/conftool/dbconfig/20250414-082235-marostegui.json [08:23:34] (03CR) 10Slyngshede: [V:03+2 C:03+1] IDP: Add dummy secret for Phabricator (test) [labs/private] - 10https://gerrit.wikimedia.org/r/1136327 (https://phabricator.wikimedia.org/T377061) (owner: 10Slyngshede) [08:23:36] (03CR) 10Slyngshede: [V:03+2 C:03+2] IDP: Add dummy secret for Phabricator (test) [labs/private] - 10https://gerrit.wikimedia.org/r/1136327 (https://phabricator.wikimedia.org/T377061) (owner: 10Slyngshede) [08:25:00] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5274/console" [puppet] - 10https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: 10Aklapper) [08:25:03] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:25:04] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1178.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [08:26:35] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1178.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [08:26:47] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10737412 (10Marostegui) Why was db1178 reimaged? This is a production host that is serving traffic. 
[08:26:57] VRiley: check -sre please :) [08:27:24] !log disable-puppet on A:cp to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135827 (T391670) [08:27:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:27] (03PS1) 10Muehlenhoff: Failover irc.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1136328 [08:27:27] T391670: Staticize haproxy directives from hiera to template - https://phabricator.wikimedia.org/T391670 [08:30:34] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5275/console" [puppet] - 10https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: 10Aklapper) [08:30:43] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2220 - Upgrading host [08:31:13] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2220 - Upgrading host [08:31:35] (03PS7) 10Arnaudb: gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) [08:31:42] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp1111.eqiad.wmnet [08:32:03] (03PS8) 10Arnaudb: gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) [08:32:04] (03CR) 10Fabfur: [C:03+2] haproxy: staticize haproxy acls into template [puppet] - 10https://gerrit.wikimedia.org/r/1135827 (https://phabricator.wikimedia.org/T391670) (owner: 10Fabfur) [08:32:04] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2220.codfw.wmnet [08:33:11] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10737471 (10Marostegui) >>! In T377878#10737412, @Marostegui wrote: > Why was db1178 reimaged? This is a production host that is serving traffic.... 
[08:33:17] RECOVERY - MariaDB memory on db2220 is OK: OK Memory 58% used https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [08:33:48] (03CR) 10Elukey: [C:03+1] "Ok for me to failover, but I am wondering if it would be better for clients just to re-connect after a restart (rather than failover two t" [dns] - 10https://gerrit.wikimedia.org/r/1136328 (owner: 10Muehlenhoff) [08:34:07] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1178.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [08:35:26] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp1111.eqiad.wmnet [08:36:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10737477 (10phaultfinder) [08:36:44] (03CR) 10Volans: [C:04-1] "I think there are 2 logic errors that would make the cookbook not behave as expected but are easily fixable. The rest LGTM." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [08:36:48] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [08:38:59] (03CR) 10Muehlenhoff: "It's a good point actually, with ircstream we can just as well simply restart and have them reconnect, the failover is only really needed " [dns] - 10https://gerrit.wikimedia.org/r/1136328 (owner: 10Muehlenhoff) [08:39:27] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for db2220.codfw.wmnet [08:39:31] !log restarting ircstream on irc1003, clients will reconnect automatically [08:39:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:34] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [08:40:54] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1178.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [08:41:11] (03CR) 10Kosta Harlan: [C:03+1] alertmanager: route T&S tasks to their Slack [puppet] - 10https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [08:41:50] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1178.eqiad.wmnet with OS bullseye [08:41:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10737487 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1178.eqiad.wmnet with OS b... 
[08:42:20] (03CR) 10Kosta Harlan: [C:03+1] CommonSettings: remove outdated SecurePoll comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135835 (https://phabricator.wikimedia.org/T209892) (owner: 10Novem Linguae) [08:42:29] (03Abandoned) 10Muehlenhoff: Failover irc.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1136328 (owner: 10Muehlenhoff) [08:44:20] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5276/console" [puppet] - 10https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: 10Aklapper) [08:45:03] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:45:13] !log restart Postfix/Dovecot on outbound MXes to pick up xz security updates [08:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:49] (03CR) 10Elukey: [C:03+1] remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond) [08:46:34] !log enable-puppet on A:cp (T391670) [08:46:36] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [08:46:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:38] T391670: Staticize haproxy directives from hiera to template - https://phabricator.wikimedia.org/T391670 [08:47:15] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: 
Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10737505 (10hgzh) I'm not really happy that an enwiki discussion 'decided' this for all other projects that now get a notice three days before the change. [08:47:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P74914 and previous config saved to /var/cache/conftool/dbconfig/20250414-084716-root.json [08:47:24] (03CR) 10Elukey: [C:03+1] mysql: make MysqlRemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136325 (owner: 10Volans) [08:47:53] !log installing Postgres 15 security updates [08:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136104 (https://phabricator.wikimedia.org/T348611) (owner: 10Anzx) [08:48:44] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2220 gradually with 4 steps - Finished upgrading host [08:51:24] (03CR) 10Arnaudb: [C:03+2] gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [08:54:01] (03PS10) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) [08:57:48] (03Merged) 10jenkins-bot: gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [09:00:31] !log gnmic: bump `num-workers` to 16 on netflow1002 - T388641 [09:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:35] T388641: Migrate network icinga alerts to gNMI/prometheus - 
https://phabricator.wikimedia.org/T388641 [09:02:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P74917 and previous config saved to /var/cache/conftool/dbconfig/20250414-090222-root.json [09:03:03] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5277/console" [puppet] - 10https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: 10Aklapper) [09:04:30] FIRING: Emergency syslog message: Alert for device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [09:05:44] (03PS3) 10Slyngshede: idp-test: add Phabricator test instance client [puppet] - 10https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: 10Aklapper) [09:06:14] (03CR) 10Federico Ceratto: "I simplified the change keeping the original handling of Puppet and alerting." [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [09:06:18] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 2 others: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#10737609 (10ABran-WMF) This first iteration is still fairly manual but will give us a stepping stone to build upon. I'll r... 
[09:06:28] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5278/console" [puppet] - 10https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: 10Aklapper) [09:09:30] RESOLVED: Emergency syslog message: Device cloudsw1-f4-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [09:11:18] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: enable object storage on one of the replicas [puppet] - 10https://gerrit.wikimedia.org/r/1135919 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [09:11:51] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10737627 (10Volans) Ack, I can confirm the pages I was having trouble with are now found in search (at the cost of a larger index, I think is around... 
[09:14:35] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1135714 (https://phabricator.wikimedia.org/T391577) (owner: 10Federico Ceratto) [09:15:20] (03PS11) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) [09:15:50] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2230.codfw.wmnet [09:17:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P74919 and previous config saved to /var/cache/conftool/dbconfig/20250414-091727-root.json [09:20:57] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2230.codfw.wmnet [09:23:38] (03CR) 10Federico Ceratto: "Added support for the test cluster (skipping dbctl completely) and did a full run against db2230" [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [09:24:15] (03CR) 10Federico Ceratto: [C:03+2] pool.py: In dry-run mode do not monitor connection drain [cookbooks] - 10https://gerrit.wikimedia.org/r/1135714 (https://phabricator.wikimedia.org/T391577) (owner: 10Federico Ceratto) [09:24:19] (03PS1) 10Vgutierrez: sre: Add LibericaUnhealthyRealserverPooled alert [alerts] - 10https://gerrit.wikimedia.org/r/1136334 (https://phabricator.wikimedia.org/T391697) [09:29:45] (03CR) 10Vgutierrez: sre: Add LibericaUnhealthyRealserverPooled alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1136334 (https://phabricator.wikimedia.org/T391697) (owner: 10Vgutierrez) [09:31:45] !log restarting acme-chief to catch up on liblzma updates [09:31:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P74922 and 
previous config saved to /var/cache/conftool/dbconfig/20250414-093232-root.json [09:33:45] !log restarting acme-chief API servers to catch up on liblzma updates [09:33:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:27] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:34:27] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:35:07] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2220 gradually with 4 steps - Finished upgrading host [09:37:53] 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10737757 (10Aklapper) Sounds like this should be set to `declined` status again? [09:39:56] (03PS1) 10Hashar: CI: diff against parent commit instead of remote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) [09:43:02] (03PS1) 10Brouberol: airflow: convert the scheduler liveness/readiness checks to a tcpCheck [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136336 (https://phabricator.wikimedia.org/T391497) [09:44:39] (03CR) 10CI reject: [V:04-1] airflow: convert the scheduler liveness/readiness checks to a tcpCheck [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136336 (https://phabricator.wikimedia.org/T391497) (owner: 10Brouberol) [09:45:58] (03PS2) 10Brouberol: airflow: convert the scheduler liveness/readiness checks to a tcpCheck [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136336 (https://phabricator.wikimedia.org/T391497) [09:47:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P74924 and previous config saved to 
/var/cache/conftool/dbconfig/20250414-094737-root.json [09:55:09] (03CR) 10Volans: "I'm a little bit confused as this patch and I4ce9217392a7795940c981e1ee7da52df026cb5c are both performing substantial changes to the same " [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [09:58:56] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1150.eqiad.wmnet with reason: Maintenance [09:59:45] (03CR) 10Marostegui: upgrade.py: Depool, repool, update Phabricator (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: 10Federico Ceratto) [10:00:03] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1000) [10:00:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc1', diff saved to https://phabricator.wikimedia.org/P74925 and previous config saved to /var/cache/conftool/dbconfig/20250414-100038-marostegui.json [10:01:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc1', diff saved to https://phabricator.wikimedia.org/P74927 and previous config saved to /var/cache/conftool/dbconfig/20250414-100135-marostegui.json [10:02:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P74928 and previous config saved to /var/cache/conftool/dbconfig/20250414-100242-root.json [10:04:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1157.eqiad.wmnet with reason: Maintenance [10:04:12] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T391056)', diff saved to https://phabricator.wikimedia.org/P74929 and previous config saved to /var/cache/conftool/dbconfig/20250414-100412-fceratto.json [10:04:16] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [10:04:40] 06SRE, 10Domains, 06Traffic, 13Patch-For-Review: Acquire enwp.org - https://phabricator.wikimedia.org/T332220#10737856 (10A_smart_kitten) >>! In T332220#10737757, @Aklapper wrote: > Sounds like this should be set to `declined` status again? Would `stalled` on a reply be better? As it sounds like acquiring... [10:05:40] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10737860 (10Ladsgroup) This is not really because of English Wikipedia. This has been requested many many times by many communities. For example: - Engli... [10:08:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T391056)', diff saved to https://phabricator.wikimedia.org/P74930 and previous config saved to /var/cache/conftool/dbconfig/20250414-100809-fceratto.json [10:09:09] jouncebot: nowandnext [10:09:09] For the next 0 hour(s) and 50 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1000) [10:09:10] In 1 hour(s) and 50 minute(s): Wikifunctions MediaWiki integration backport (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1200) [10:09:20] nothing is happening on infra side? 
[10:11:09] (03PS1) 10Ladsgroup: Bump thumbnail steps to 90% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136339 (https://phabricator.wikimedia.org/T360589) [10:12:06] (03PS1) 10Ayounsi: gNMIc: bump num-workers to 16 [puppet] - 10https://gerrit.wikimedia.org/r/1136341 (https://phabricator.wikimedia.org/T388641) [10:13:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136339 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:13:49] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135835 (https://phabricator.wikimedia.org/T209892) (owner: 10Novem Linguae) [10:14:56] (03Merged) 10jenkins-bot: Bump thumbnail steps to 90% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136339 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:15:01] (03Merged) 10jenkins-bot: CommonSettings: remove outdated SecurePoll comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135835 (https://phabricator.wikimedia.org/T209892) (owner: 10Novem Linguae) [10:15:36] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136339|Bump thumbnail steps to 90% (T360589)]], [[gerrit:1135835|CommonSettings: remove outdated SecurePoll comment (T209892)]] [10:15:39] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:15:39] T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892 [10:17:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P74931 and previous config saved to /var/cache/conftool/dbconfig/20250414-101748-root.json [10:19:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:20:57] (03CR) 10Cathal Mooney: [C:03+1] gNMIc: bump num-workers to 16 [puppet] - 10https://gerrit.wikimedia.org/r/1136341 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [10:21:24] (03Abandoned) 10Clément Goubert: scap::scripts: Add mwscript-mwcron wrapper [puppet] - 10https://gerrit.wikimedia.org/r/1135912 (https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert) [10:22:26] (03PS2) 10Clément Goubert: mw:periodic_jobs: Add mw-cron boilerplate [puppet] - 10https://gerrit.wikimedia.org/r/1135936 (https://phabricator.wikimedia.org/T341555) [10:22:40] (03CR) 10Filippo Giunchedi: sre: Add LibericaUnhealthyRealserverPooled alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1136334 (https://phabricator.wikimedia.org/T391697) (owner: 10Vgutierrez) [10:23:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P74932 and previous config saved to /var/cache/conftool/dbconfig/20250414-102316-fceratto.json [10:24:23] (03PS1) 10Jelto: ceph: move apus_keys to ceph folder [labs/private] - 10https://gerrit.wikimedia.org/r/1136344 (https://phabricator.wikimedia.org/T378922) [10:26:52] (03CR) 10Ayounsi: [C:03+2] gNMIc: bump num-workers to 16 [puppet] - 10https://gerrit.wikimedia.org/r/1136341 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [10:31:17] (03CR) 10MVernon: "check experimental" [labs/private] - 10https://gerrit.wikimedia.org/r/1136344 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [10:32:18] (03CR) 10Vgutierrez: sre: Add LibericaUnhealthyRealserverPooled alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1136334 (https://phabricator.wikimedia.org/T391697) (owner: 10Vgutierrez) [10:32:54] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P74933 and previous config saved to /var/cache/conftool/dbconfig/20250414-103253-root.json [10:35:38] (03CR) 10Slyngshede: [V:03+1 C:03+2] idp-test: add Phabricator test instance client [puppet] - 10https://gerrit.wikimedia.org/r/1117842 (https://phabricator.wikimedia.org/T377061) (owner: 10Aklapper) [10:35:55] !log upload varnish 7.1.1-1.1~bpo11+wmf3 to apt.wm.o (bullseye-wikimedia) - T391334 [10:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:58] T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334 [10:37:41] (03CR) 10Clément Goubert: [C:03+2] mw-cron: Add statsd-exporter release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135929 (https://phabricator.wikimedia.org/T391672) (owner: 10Clément Goubert) [10:39:05] (03CR) 10Elukey: [C:03+1] "Nice I like it!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136326 (owner: 10Volans) [10:39:09] (03Merged) 10jenkins-bot: mw-cron: Add statsd-exporter release [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135929 (https://phabricator.wikimedia.org/T391672) (owner: 10Clément Goubert) [10:39:36] claime: hii, sorry to bother, let me know when I can do a scap :D [10:39:47] Amir1: gimme a couple minutes [10:40:01] no worries. Thanks! 
[10:40:03] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [10:40:14] !log cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [10:40:27] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [10:40:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:40:36] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [10:40:45] (03PS1) 10Hnowlan: rest-gateway: add mobileapps/PCS endpoints that don't use internal cache [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136346 (https://phabricator.wikimedia.org/T385033) [10:41:28] Amir1: you can go ahead [10:41:33] <3 [10:41:51] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1136339|Bump thumbnail steps to 90% (T360589)]], [[gerrit:1135835|CommonSettings: remove outdated SecurePoll comment (T209892)]] [10:41:55] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:41:56] T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892 [10:42:00] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart rolling restart_daemons on A:dnsbox [10:43:25] (03PS3) 10Federico Ceratto: Add zarcillo (aux k8s) CNAME [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) [10:43:29] !log rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in ulsfo - T391334 [10:43:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:32] T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334 [10:43:53] (03PS1) 10Hnowlan: 
rest-gateway: correct mobileapps path ordering when revision is used [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136347 (https://phabricator.wikimedia.org/T264670) [10:44:19] !log sukhe@cumin1002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough [10:45:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:46:49] (03PS4) 10Federico Ceratto: Add zarcillo (aux k8s) CNAME [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) [10:47:06] (03CR) 10Federico Ceratto: "Added to codfw as well." [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [10:47:22] !log ladsgroup@deploy1003 ladsgroup, novemlinguae: Backport for [[gerrit:1136339|Bump thumbnail steps to 90% (T360589)]], [[gerrit:1135835|CommonSettings: remove outdated SecurePoll comment (T209892)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:47:26] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:47:27] T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892 [10:47:28] (03CR) 10Clément Goubert: [C:03+1] Updating docker-pkg to 4.0.4 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1134727 (owner: 10Elukey) [10:47:42] (03CR) 10Federico Ceratto: "Updated" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [10:47:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1178 (re)pooling @ 100%: Repooling', diff saved 
to https://phabricator.wikimedia.org/P74935 and previous config saved to /var/cache/conftool/dbconfig/20250414-104758-root.json [10:48:00] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_ulsfo [10:48:33] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_ulsfo and not P{cp4047.ulsfo.wmnet} and A:cp [10:49:43] !log ladsgroup@deploy1003 ladsgroup, novemlinguae: Continuing with sync [10:50:35] (03CR) 10Federico Ceratto: [C:03+1] Add namespace for zarcillo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [10:50:37] (03CR) 10Federico Ceratto: [C:03+2] Add namespace for zarcillo [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135696 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [10:51:19] PROBLEM - OSPF status on cloudsw2-d5-eqiad.mgmt is CRITICAL: OSPFv2: 1/1 UP : OSPFv3: 0/1 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:51:43] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:52:57] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=1) rolling upgrade of Varnish on A:cp-upload_ulsfo and not P{cp4047.ulsfo.wmnet} and A:cp [10:53:03] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=1) rolling upgrade of Varnish on A:cp-text_ulsfo [10:53:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T391056)', diff saved to https://phabricator.wikimedia.org/P74936 and previous config saved to /var/cache/conftool/dbconfig/20250414-105329-fceratto.json [10:53:34] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [10:53:45] !log 
fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1166.eqiad.wmnet with reason: Maintenance [10:53:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T391056)', diff saved to https://phabricator.wikimedia.org/P74937 and previous config saved to /var/cache/conftool/dbconfig/20250414-105351-fceratto.json [10:57:10] (03CR) 10Jgiannelos: [C:03+1] rest-gateway: correct mobileapps path ordering when revision is used [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136347 (https://phabricator.wikimedia.org/T264670) (owner: 10Hnowlan) [10:57:31] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough [10:57:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T391056)', diff saved to https://phabricator.wikimedia.org/P74938 and previous config saved to /var/cache/conftool/dbconfig/20250414-105741-fceratto.json [10:58:03] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10738065 (10Ladsgroup) eqiad containers are much bigger and it'll take way more time to clean them. 24 days have passed and only roughly 30% have been removed from 0x containers. Now... 
[10:59:18] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136339|Bump thumbnail steps to 90% (T360589)]], [[gerrit:1135835|CommonSettings: remove outdated SecurePoll comment (T209892)]] (duration: 17m 26s) [10:59:22] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:59:23] T209892: SecurePoll is not compatible with GPG 2.1+ - https://phabricator.wikimedia.org/T209892 [11:01:09] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10738083 (10hgzh) Thanks for the links, most of the requests are based on a local discussion and also the global ones seem to come mainly from individual... [11:01:22] (03CR) 10Ssingh: [C:03+1] Add zarcillo (aux k8s) CNAME [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [11:03:14] (03CR) 10Hnowlan: [C:03+1] Remove now obsolete Cumin aliases for job runners [puppet] - 10https://gerrit.wikimedia.org/r/1136253 (https://phabricator.wikimedia.org/T354791) (owner: 10Muehlenhoff) [11:09:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:10:02] (03CR) 10Kamila Součková: [C:03+1] mw:periodic_jobs: Add mw-cron boilerplate [puppet] - 10https://gerrit.wikimedia.org/r/1135936 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [11:11:37] (03PS2) 10Bartosz Dziewoński: Enable SUL3 on most remaining beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135850 [11:12:03] (03CR) 10Clément Goubert: [C:03+2] mw:periodic_jobs: Add mw-cron boilerplate [puppet] - 
10https://gerrit.wikimedia.org/r/1135936 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [11:12:05] (03PS2) 10Bartosz Dziewoński: Clean up obsolete SUL3 settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135851 [11:12:27] !log restart spamassassin on lists* to pick up Perl security updates [11:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:34] (03CR) 10Bartosz Dziewoński: [C:04-1] "Needs to wait for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CentralAuth/+/1135964 to be deployed now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135851 (owner: 10Bartosz Dziewoński) [11:12:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P74939 and previous config saved to /var/cache/conftool/dbconfig/20250414-111247-fceratto.json [11:17:07] (03CR) 10Hnowlan: [C:03+2] rest-gateway: correct mobileapps path ordering when revision is used [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136347 (https://phabricator.wikimedia.org/T264670) (owner: 10Hnowlan) [11:18:55] (03Merged) 10jenkins-bot: rest-gateway: correct mobileapps path ordering when revision is used [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136347 (https://phabricator.wikimedia.org/T264670) (owner: 10Hnowlan) [11:19:01] !log fceratto@deploy1003 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [11:19:03] !log fceratto@deploy1003 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[11:19:50] (03CR) 10Muehlenhoff: [C:03+2] Remove now obsolete Cumin aliases for job runners [puppet] - 10https://gerrit.wikimedia.org/r/1136253 (https://phabricator.wikimedia.org/T354791) (owner: 10Muehlenhoff) [11:20:29] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [11:20:35] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:24:13] !log upload varnishkafka 1.2.0-3 to apt.wm.o (bullseye-wikimedia) - T391334 [11:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:17] T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334 [11:24:20] (03CR) 10Hashar: "Indeed for the release Jenkins, there is no service defined in Puppet. The systemd unit is installed by the Debian package which is itsel" [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [11:24:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:25:00] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:25:02] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp4037.ulsfo.wmnet} and A:cp [11:25:04] (03PS1) 10Brouberol: airflow-test-k8s: adjust dag/file processing timeout to account for large v1 dumps dags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136351 (https://phabricator.wikimedia.org/T391744) [11:25:08] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:25:17] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp4045.ulsfo.wmnet} and 
A:cp [11:26:18] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:26:31] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:26:41] (03CR) 10Jelto: [V:03+2 C:03+2] ceph: move apus_keys to ceph folder [labs/private] - 10https://gerrit.wikimedia.org/r/1136344 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [11:27:25] !log fceratto@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [11:27:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P74940 and previous config saved to /var/cache/conftool/dbconfig/20250414-112754-fceratto.json [11:28:37] !log fceratto@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [11:29:21] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [11:29:35] (03PS1) 10Clément Goubert: team-sre/mw-cron: Fix logstash link [alerts] - 10https://gerrit.wikimedia.org/r/1136352 [11:29:44] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[11:30:08] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp4037.ulsfo.wmnet} and A:cp [11:30:11] (03CR) 10Kamila Součková: [C:03+2] alertmanager: route T&S tasks to their Slack [puppet] - 10https://gerrit.wikimedia.org/r/1135005 (https://phabricator.wikimedia.org/T388542) (owner: 10Kamila Součková) [11:30:18] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp4045.ulsfo.wmnet} and A:cp [11:33:46] (03CR) 10Kamila Součková: [C:03+1] team-sre/mw-cron: Fix logstash link [alerts] - 10https://gerrit.wikimedia.org/r/1136352 (owner: 10Clément Goubert) [11:34:06] (03CR) 10Jelto: [V:03+2 C:03+2] "this did not solve the issue, `Function lookup() did not find a value for the name 'profile::ceph::s3::client::apus_keys'`" [labs/private] - 10https://gerrit.wikimedia.org/r/1136344 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [11:37:04] (03CR) 10Kamila Součková: [C:03+1] services: enable ingress for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 (owner: 10Elukey) [11:37:57] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:38:49] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:40:28] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.roll-restart (exit_code=0) rolling restart_daemons on A:dnsbox [11:42:11] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [11:42:39] (03CR) 10Kamila Součková: [C:03+1] "Sure, SGTM. I don't have a strong opinion either way." 
[debs/helm3] (helm311) - 10https://gerrit.wikimedia.org/r/1135412 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [11:43:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T391056)', diff saved to https://phabricator.wikimedia.org/P74941 and previous config saved to /var/cache/conftool/dbconfig/20250414-114300-fceratto.json [11:43:04] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [11:43:16] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1175.eqiad.wmnet with reason: Maintenance [11:43:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T391056)', diff saved to https://phabricator.wikimedia.org/P74942 and previous config saved to /var/cache/conftool/dbconfig/20250414-114323-fceratto.json [11:45:03] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:46:47] (03CR) 10Muehlenhoff: [C:03+2] Create insetup role for ServiceOps with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1133927 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [11:47:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T391056)', diff saved to https://phabricator.wikimedia.org/P74943 and previous config saved to /var/cache/conftool/dbconfig/20250414-114711-fceratto.json [11:47:35] OK to deploy cxserver/MinT? [11:49:35] (03CR) 10Jaime Nuche: "> We should run scap to deploy Jenkins+plugins on the new host that is erroring out." 
[puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [11:55:10] (03CR) 10Volans: [C:03+2] remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond) [11:55:59] (03CR) 10Clément Goubert: [C:03+2] team-sre/mw-cron: Fix logstash link [alerts] - 10https://gerrit.wikimedia.org/r/1136352 (owner: 10Clément Goubert) [11:57:14] (03Merged) 10jenkins-bot: team-sre/mw-cron: Fix logstash link [alerts] - 10https://gerrit.wikimedia.org/r/1136352 (owner: 10Clément Goubert) [12:00:04] James_F: OwO what's this, a deployment window?? Wikifunctions MediaWiki integration backport. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1200). nyaa~ [12:00:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136049 (https://phabricator.wikimedia.org/T391594) (owner: 10Jforrester) [12:00:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136050 (owner: 10Jforrester) [12:00:45] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136051 (https://phabricator.wikimedia.org/T391441) (owner: 10Jforrester) [12:00:46] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136126 (owner: 10Jforrester) [12:01:02] (03PS1) 10Jelto: gitlab: create an alias for apus credentials [puppet] - 10https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922) [12:01:17] PROBLEM - MariaDB Replica SQL: pc3 on pc2013 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1539, Errmsg: Error Unknown event 
wmf_slave_overload on query. Default database: . [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:01:40] (03Merged) 10jenkins-bot: logging: Allow through WikiLambdaClient logs at info level and above [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136126 (owner: 10Jforrester) [12:02:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P74944 and previous config saved to /var/cache/conftool/dbconfig/20250414-120219-fceratto.json [12:02:55] (03PS1) 10Filippo Giunchedi: profile: fix prometheus cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1136360 [12:02:58] (03CR) 10Volans: "post-merge reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [12:03:03] (03CR) 10CI reject: [V:04-1] gitlab: create an alias for apus credentials [puppet] - 10https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [12:03:17] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Make choice of firewall stack in insetup roles specific / Add nftables variants - https://phabricator.wikimedia.org/T389825#10738270 (10MoritzMuehlenhoff) 05Open→03Resolved Separate insetup roles have been created and an announcement was sent t... 
[12:03:27] (03Merged) 10jenkins-bot: Special pages: Don't just set userCanExecute() but actually run it [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136049 (https://phabricator.wikimedia.org/T391594) (owner: 10Jforrester) [12:03:39] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [12:05:06] (03Merged) 10jenkins-bot: Client mode: Provide WikiLambdaClientModeOffline for SRE to disable [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136050 (owner: 10Jforrester) [12:05:56] (03Merged) 10jenkins-bot: Wikifunctions VE: Add loading and abort state to content editable [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136051 (https://phabricator.wikimedia.org/T391441) (owner: 10Jforrester) [12:06:13] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1136049|Special pages: Don't just set userCanExecute() but actually run it (T391594)]], [[gerrit:1136050|Client mode: Provide WikiLambdaClientModeOffline for SRE to disable]], [[gerrit:1136051|Wikifunctions VE: Add loading and abort state to content editable (T391441)]], [[gerrit:1136126|logging: Allow through WikiLambdaClient logs at info level and [12:06:14] above]] [12:06:18] T391594: PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Extension\WikiLambda\ZObjectStore::fetchZObjectLabel: [1146] Table 'test2wiki.wikilamb - https://phabricator.wikimedia.org/T391594 [12:06:18] T391441: [VE WikifunctionsCall]: Adapt function call editing to new parsoid version - https://phabricator.wikimedia.org/T391441 [12:06:29] (03Merged) 10jenkins-bot: remote: make RemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/803243 (owner: 10Jbond) [12:06:56] (03CR) 10Arnaudb: [C:03+2] gerrit: failover cookbook (031 comment) [cookbooks] - 
10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [12:09:03] PROBLEM - MariaDB Replica Lag: pc3 on pc2013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 616.85 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:09:06] On it [12:11:16] (03PS2) 10Hnowlan: jobrunner: clean up remaining cruft [puppet] - 10https://gerrit.wikimedia.org/r/1135465 [12:11:17] RECOVERY - MariaDB Replica SQL: pc3 on pc2013 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:12:03] RECOVERY - MariaDB Replica Lag: pc3 on pc2013 is OK: OK slave_sql_lag Replication lag: 0.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:13:55] (03PS1) 10Arnaudb: gerrit: failover cookbook fix [cookbooks] - 10https://gerrit.wikimedia.org/r/1136361 (https://phabricator.wikimedia.org/T260666) [12:13:55] (03CR) 10Arnaudb: "missed that fix in the previous merge" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136361 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [12:14:28] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10738327 (10Ladsgroup) >>! In T355914#10738083, @hgzh wrote: > Thanks for the links, most of the requests are based on a local discussion and also the glo... 
[12:15:27] (03PS2) 10Brouberol: airflow-test-k8s: adjust dag/file processing timeout to account for large v1 dumps dags [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136351 (https://phabricator.wikimedia.org/T391744) [12:17:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P74945 and previous config saved to /var/cache/conftool/dbconfig/20250414-121726-fceratto.json [12:18:20] (03CR) 10Hashar: "Ah excellent thank you! :)" [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [12:19:06] (03PS4) 10Hnowlan: mw::periodic_jobs: migrate purgeExpiredMentorStatus to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1135789 (https://phabricator.wikimedia.org/T385782) [12:19:09] (03CR) 10Clément Goubert: [C:03+2] python-webapp: Update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135432 (https://phabricator.wikimedia.org/T384212) (owner: 10Clément Goubert) [12:19:11] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136366 [12:19:12] (03CR) 10Arnaudb: [C:03+2] gerrit: failover cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1135043 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [12:19:32] (03Abandoned) 10Hashar: ci: switch jenkins deployment method on contint to scap [puppet] - 10https://gerrit.wikimedia.org/r/1136039 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [12:20:55] (03Merged) 10jenkins-bot: python-webapp: Update modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1135432 (https://phabricator.wikimedia.org/T384212) (owner: 10Clément Goubert) [12:22:19] !log cgoubert@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . 
[12:22:56] !log cgoubert@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' . [12:23:30] !log cgoubert@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'ores-legacy' for release 'main' . [12:23:39] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:24:24] !log cgoubert@deploy1003 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:24:35] ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391644#10738390 (phaultfinder) [12:24:45] !log cgoubert@deploy1003 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [12:24:50] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1136360 (owner: Filippo Giunchedi) [12:25:13] !log cgoubert@deploy1003 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[12:25:50] (03PS1) 10Jforrester: Complete our RecentChanges entry generation and formatting [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136368 (https://phabricator.wikimedia.org/T386020) [12:28:24] (03CR) 10Filippo Giunchedi: [C:03+2] profile: fix prometheus cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1136360 (owner: 10Filippo Giunchedi) [12:28:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:29:09] 10SRE-swift-storage, 10CX-deployments, 10MinT, 10LPL Essential (LPL Essential 2025 Apr-Jun: CX): Provide better long-term storage for translation models - https://phabricator.wikimedia.org/T335491#10738399 (10Nikerabbit) [12:30:31] (03CR) 10Volans: sanitarium_restart.py: restart Sanitarium hosts (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [12:30:52] (03Abandoned) 10AOkoth: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [12:31:29] (03CR) 10Volans: [C:03+1] "LGTM, thx" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136361 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [12:31:47] (03CR) 10Arnaudb: [C:03+2] gerrit: failover cookbook fix [cookbooks] - 10https://gerrit.wikimedia.org/r/1136361 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [12:31:54] (03CR) 10Effie Mouzeli: "LGTM! one question, do we still need them defined in common.yaml (under wikimedia_clusters)?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1135465 (owner: 10Hnowlan) [12:31:57] (03CR) 10Effie Mouzeli: [C:03+1] jobrunner: clean up remaining cruft [puppet] - 10https://gerrit.wikimedia.org/r/1135465 (owner: 10Hnowlan) [12:32:21] (03CR) 10Marostegui: sanitarium_restart.py: restart Sanitarium hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [12:32:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T391056)', diff saved to https://phabricator.wikimedia.org/P74946 and previous config saved to /var/cache/conftool/dbconfig/20250414-123234-fceratto.json [12:32:38] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [12:32:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1189.eqiad.wmnet with reason: Maintenance [12:32:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T391056)', diff saved to https://phabricator.wikimedia.org/P74947 and previous config saved to /var/cache/conftool/dbconfig/20250414-123255-fceratto.json [12:36:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T391056)', diff saved to https://phabricator.wikimedia.org/P74948 and previous config saved to /var/cache/conftool/dbconfig/20250414-123649-fceratto.json [12:36:52] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1136049|Special pages: Don't just set userCanExecute() but actually run it (T391594)]], [[gerrit:1136050|Client mode: Provide WikiLambdaClientModeOffline for SRE to disable]], [[gerrit:1136051|Wikifunctions VE: Add loading and abort state to content editable (T391441)]], [[gerrit:1136126|logging: Allow through WikiLambdaClient logs at info level and [12:36:52] above]] [12:36:56] T391594: PHP Deprecated: Caller from 
MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Extension\WikiLambda\ZObjectStore::fetchZObjectLabel: [1146] Table 'test2wiki.wikilamb - https://phabricator.wikimedia.org/T391594 [12:36:57] T391441: [VE WikifunctionsCall]: Adapt function call editing to new parsoid version - https://phabricator.wikimedia.org/T391441 [12:36:59] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lena Meintrup - https://phabricator.wikimedia.org/T391820 (10Lena_WMDE) 03NEW [12:37:16] (03PS3) 10KartikMistry: MinT: Increase liveness probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1125093 (https://phabricator.wikimedia.org/T386889) [12:38:56] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lena Meintrup - https://phabricator.wikimedia.org/T391820#10738496 (10Lena_WMDE) [12:39:40] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lena Meintrup - https://phabricator.wikimedia.org/T391820#10738498 (10WMDE-leszek) On WMDE's behalf I approve this request, and confirm @Lena_WMDE is who she claims to be. 
[12:40:18] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Lena Meintrup - https://phabricator.wikimedia.org/T391820#10738499 (10Lena_WMDE) [12:40:21] (03CR) 10Volans: sanitarium_restart.py: restart Sanitarium hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [12:42:25] (03CR) 10Marostegui: sanitarium_restart.py: restart Sanitarium hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [12:43:45] !log remove ganeti01.svc.eqsin.wmnet cert (replaced by cfssl cert) T357750 [12:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:50] T357750: Phase out cergen - https://phabricator.wikimedia.org/T357750 [12:44:24] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1136049|Special pages: Don't just set userCanExecute() but actually run it (T391594)]], [[gerrit:1136050|Client mode: Provide WikiLambdaClientModeOffline for SRE to disable]], [[gerrit:1136051|Wikifunctions VE: Add loading and abort state to content editable (T391441)]], [[gerrit:1136126|logging: Allow through WikiLambdaClient logs at info level and above]] sync [12:44:24] ed to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:44:29] T391594: PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Extension\WikiLambda\ZObjectStore::fetchZObjectLabel: [1146] Table 'test2wiki.wikilamb - https://phabricator.wikimedia.org/T391594 [12:44:29] T391441: [VE WikifunctionsCall]: Adapt function call editing to new parsoid version - https://phabricator.wikimedia.org/T391441 [12:44:35] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391644#10738536 (10phaultfinder) 
[12:46:05] !log remove ganeti01.svc.ulsfo.wmnet cert (replaced by cfssl cert) T357750 [12:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:22] !log jforrester@deploy1003 jforrester: Continuing with sync [12:48:04] (03PS2) 10Hashar: ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) [12:48:27] (03CR) 10CI reject: [V:04-1] ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [12:48:39] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:48:44] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:48:53] FIRING: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [12:48:53] (03CR) 10Hashar: "From https://gerrit.wikimedia.org/r/c/operations/puppet/+/676008/comment/ccc95a45_71fd7099/ , the logic is shared with production and SRE " [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [12:49:37] !log remove ganeti01.svc.esams.wmnet cert (replaced by cfssl cert) T357750 [12:49:40] (03PS3) 10Hashar: ci: always add eatmydata to cow images [puppet] - 
https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) [12:49:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:41] T357750: Phase out cergen - https://phabricator.wikimedia.org/T357750 [12:50:45] !log upgrade prometheus2005 to thanos 0.38.0 - T383966 [12:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:49] T383966: Upgrade Thanos to 0.38.0 - https://phabricator.wikimedia.org/T383966 [12:51:34] !log upgrade prometheus2007 to thanos 0.38.0 - T383966 [12:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P74949 and previous config saved to /var/cache/conftool/dbconfig/20250414-125156-fceratto.json [12:53:06] !log remove ganeti01.svc.codfw.wmnet cert (replaced by cfssl cert) T357750 [12:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:39] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:54:51] (PS2) Jelto: gitlab: create an alias for apus credentials [puppet] - https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922) [12:55:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool pc5 T391454', diff saved to https://phabricator.wikimedia.org/P74950 and previous config saved to /var/cache/conftool/dbconfig/20250414-125511-marostegui.json [12:55:15] T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454 [12:56:14] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_ulsfo and not P{cp4047.ulsfo.wmnet} and not
P{cp4045.ulsfo.wmnet} and A:cp [12:56:20] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136049|Special pages: Don't just set userCanExecute() but actually run it (T391594)]], [[gerrit:1136050|Client mode: Provide WikiLambdaClientModeOffline for SRE to disable]], [[gerrit:1136051|Wikifunctions VE: Add loading and abort state to content editable (T391441)]], [[gerrit:1136126|logging: Allow through WikiLambdaClient logs at info level an [12:56:20] d above]] (duration: 19m 27s) [12:56:21] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on pc2015.codfw.wmnet,pc1015.eqiad.wmnet with reason: Maintenance [12:56:22] Just in time for the backport window. [12:56:24] T391594: PHP Deprecated: Caller from MediaWiki\Exception\MWExceptionHandler::rollbackPrimaryChanges ignored an error originally raised from MediaWiki\Extension\WikiLambda\ZObjectStore::fetchZObjectLabel: [1146] Table 'test2wiki.wikilamb - https://phabricator.wikimedia.org/T391594 [12:56:24] T391441: [VE WikifunctionsCall]: Adapt function call editing to new parsoid version - https://phabricator.wikimedia.org/T391441 [12:56:34] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_ulsfo and not P{cp4037.ulsfo.wmnet} and A:cp [12:56:58] (03PS1) 10Marostegui: mariadb: pc5 upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1136373 (https://phabricator.wikimedia.org/T391454) [12:57:08] (03CR) 10CI reject: [V:04-1] gitlab: create an alias for apus credentials [puppet] - 10https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [12:58:39] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:58:42] (03CR) 10Marostegui: 
[C:03+2] mariadb: pc5 upgrade to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1136373 (https://phabricator.wikimedia.org/T391454) (owner: 10Marostegui) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: Time to snap out of that daydream and deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1300). [13:00:05] MatmaRex and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] o/ [13:00:21] I can’t deploy today, forgot to bring my yubikey to the office 😔 [13:00:35] hi [13:00:58] !log remove ganeti01.svc.eqiad.wmnet cert (replaced by cfssl cert) T357750 [13:01:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:02] T357750: Phase out cergen - https://phabricator.wikimedia.org/T357750 [13:01:22] (03CR) 10Majavah: Add wmcs-bastionless utility script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1118526 (https://phabricator.wikimedia.org/T379550) (owner: 10Andrew Bogott) [13:02:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repool pc5 T391454', diff saved to https://phabricator.wikimedia.org/P74951 and previous config saved to /var/cache/conftool/dbconfig/20250414-130222-marostegui.json [13:02:26] T391454: Migrate pcX sections to MariaDB 10.11 - https://phabricator.wikimedia.org/T391454 [13:02:57] (03PS1) 10Volans: sre.hosts.reimage: check dbctl [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) [13:03:06] * TheresNoTime can deploy [13:03:11] \o/ [13:03:25] MatmaRex: starting with your CA patch [13:03:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by samtar@deploy1003 using scap backport" [extensions/CentralAuth] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135993 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński) 
[13:03:59] thanks [13:04:29] nothing to test on mwdebug here, we don't have a way to reproduce these failures [13:04:42] ack [13:06:03] (03CR) 10Elukey: [V:03+2 C:03+2] Updating docker-pkg to 4.0.4 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1134727 (owner: 10Elukey) [13:07:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P74952 and previous config saved to /var/cache/conftool/dbconfig/20250414-130703-fceratto.json [13:08:39] RESOLVED: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:08:42] (03CR) 10Volans: "Tested with test-cookbook in dry-run on a pooled host:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: 10Volans) [13:09:53] (03PS1) 10Ssingh: hiera: durum: add dummy ECH private key [labs/private] - 10https://gerrit.wikimedia.org/r/1136376 (https://phabricator.wikimedia.org/T205378) [13:10:18] (03PS1) 10Muehlenhoff: Remove from list of approvers [puppet] - 10https://gerrit.wikimedia.org/r/1136377 [13:11:03] (03CR) 10Ssingh: [V:03+2 C:03+2] hiera: durum: add dummy ECH private key [labs/private] - 10https://gerrit.wikimedia.org/r/1136376 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [13:13:02] !log remove old LVs from prometheus[12]00[56] - T383232 [13:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:05] T383232: Move k8s Prometheus instances to new Prometheus hw in eqiad/codfw - https://phabricator.wikimedia.org/T383232 [13:13:14] !log elukey@deploy1003 Started deploy [docker-pkg/deploy@a555b7b]: Upgrade to 4.0.4 [13:13:29] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size 
- https://phabricator.wikimedia.org/T355914#10738719 (10hgzh) I tried an onwiki answer, so thank you for the reply here. But IMO this could have been announced earlier and more detailed, keeping in... [13:13:47] !log elukey@deploy1003 Finished deploy [docker-pkg/deploy@a555b7b]: Upgrade to 4.0.4 (duration: 00m 38s) [13:14:33] (03Merged) 10jenkins-bot: CentralAuthTokenManager: Log failures for write operations [extensions/CentralAuth] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1135993 (https://phabricator.wikimedia.org/T390784) (owner: 10Bartosz Dziewoński) [13:14:52] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1135993|CentralAuthTokenManager: Log failures for write operations (T390784)]] [13:14:56] T390784: Error when logging-in on auth.wikimedia.org: "The provided authentication token is either expired or invalid." - https://phabricator.wikimedia.org/T390784 [13:15:10] (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: 10Volans) [13:16:55] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_magru [13:17:06] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_magru [13:17:07] (03CR) 10Elukey: "LGTM, I left a nit about the log message, the rest looks good and safer." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: 10Volans) [13:17:25] (03CR) 10Elukey: [C:03+2] profile::pyrra::filesystem::slo: add citoid's SLO [puppet] - 10https://gerrit.wikimedia.org/r/1135746 (owner: 10Elukey) [13:17:30] !log rolling upgrade to varnish 7.1.1-1.1~bpo11+wmf3 in magru - T391334 [13:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:33] T391334: varnish 7.1.1 crash - https://phabricator.wikimedia.org/T391334 [13:18:16] !log bking@cumin2002 START - Cookbook sre.hosts.rename from cirrussearch2014 to cirrussearch2104 [13:18:25] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10738759 (10Papaul) @VRiley-WMF yes it is OK to apply 7.20 to the server. My personally opinion I don't think applying this latest IDRAC upgrade to the server will provide us with any information then wha... [13:18:27] !log bking@cumin2002 START - Cookbook sre.dns.netbox [13:19:47] !log samtar@deploy1003 samtar, matmarex: Backport for [[gerrit:1135993|CentralAuthTokenManager: Log failures for write operations (T390784)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:19:49] !log samtar@deploy1003 samtar, matmarex: Continuing with sync [13:19:58] (03PS1) 10Giuseppe Lavagetto: Release for conftool 5.1.0 [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1136378 [13:20:58] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Release for conftool 5.1.0 [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1136378 (owner: 10Giuseppe Lavagetto) [13:21:29] !log oblivian@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Compatibility with conftool 5.1.0 - oblivian@cumin2002" [13:21:31] !log oblivian@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Compatibility with conftool 
5.1.0 - oblivian@cumin2002 [13:22:01] !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Compatibility with conftool 5.1.0 - oblivian@cumin2002 [13:22:03] !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Compatibility with conftool 5.1.0 - oblivian@cumin2002" [13:22:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T391056)', diff saved to https://phabricator.wikimedia.org/P74953 and previous config saved to /var/cache/conftool/dbconfig/20250414-132210-fceratto.json [13:22:14] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [13:22:26] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1198.eqiad.wmnet with reason: Maintenance [13:22:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T391056)', diff saved to https://phabricator.wikimedia.org/P74954 and previous config saved to /var/cache/conftool/dbconfig/20250414-132232-fceratto.json [13:22:34] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming cirrussearch2014 to cirrussearch2104 - bking@cumin2002" [13:22:55] !log oblivian@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Compatibility with conftool 5.1.0 (take 2) - oblivian@cumin2002" [13:22:59] !log oblivian@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Compatibility with conftool 5.1.0 (take 2) - oblivian@cumin2002 [13:23:35] !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Compatibility with conftool 5.1.0 (take 2) 
- oblivian@cumin2002 [13:23:37] !log oblivian@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Compatibility with conftool 5.1.0 (take 2) - oblivian@cumin2002" [13:24:57] RECOVERY - Juniper alarms on cr2-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:25:18] (03CR) 10Volans: sre.hosts.reimage: check dbctl (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: 10Volans) [13:26:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T391056)', diff saved to https://phabricator.wikimedia.org/P74955 and previous config saved to /var/cache/conftool/dbconfig/20250414-132625-fceratto.json [13:26:32] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135993|CentralAuthTokenManager: Log failures for write operations (T390784)]] (duration: 11m 39s) [13:26:35] T390784: Error when logging-in on auth.wikimedia.org: "The provided authentication token is either expired or invalid." 
- https://phabricator.wikimedia.org/T390784 [13:26:53] MatmaRex: anzx: running your two config patches together [13:27:13] ok [13:27:17] (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1135850 (owner: Bartosz Dziewoński) [13:27:18] (CR) TrainBranchBot: [C:+2] "Approved by samtar@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1136104 (https://phabricator.wikimedia.org/T348611) (owner: Anzx) [13:27:34] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming cirrussearch2014 to cirrussearch2104 - bking@cumin2002" [13:27:35] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:27:36] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2104 [13:27:40] thanks [13:27:41] FIRING: [6x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [13:27:45] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2104 [13:27:47] (CR) Hnowlan: [C:+2] "Thanks!"
[puppet] - 10https://gerrit.wikimedia.org/r/1135465 (owner: 10Hnowlan) [13:28:06] (03Merged) 10jenkins-bot: Enable SUL3 on most remaining beta cluster wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135850 (owner: 10Bartosz Dziewoński) [13:28:11] (03Merged) 10jenkins-bot: punjabiwikimedia, maiwikimedia: fix tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136104 (https://phabricator.wikimedia.org/T348611) (owner: 10Anzx) [13:28:25] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from cirrussearch2014 to cirrussearch2104 [13:28:27] (03PS1) 10Jforrester: Switch test Wikifunctions client deployment from test2wiki to test2iki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136379 (https://phabricator.wikimedia.org/T391584) [13:28:28] !log samtar@deploy1003 Started scap sync-world: Backport for [[gerrit:1135850|Enable SUL3 on most remaining beta cluster wikis]], [[gerrit:1136104|punjabiwikimedia, maiwikimedia: fix tagline (T348611)]] [13:28:29] (03PS1) 10Jforrester: Document Wikifunctions options, adding wgWikiLambdaClientModeOffline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136380 (https://phabricator.wikimedia.org/T391584) [13:28:31] T348611: [Deployment] Fix logo clipping issues in mai and punjabi wikis - https://phabricator.wikimedia.org/T348611 [13:28:54] (03CR) 10Elukey: sre.hosts.reimage: check dbctl (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: 10Volans) [13:30:49] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2104.codfw.wmnet with OS bullseye [13:30:53] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2104 [13:30:54] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2104 [13:30:54] FIRING: [5x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:33:04] !log samtar@deploy1003 
matmarex, anzx, samtar: Backport for [[gerrit:1135850|Enable SUL3 on most remaining beta cluster wikis]], [[gerrit:1136104|punjabiwikimedia, maiwikimedia: fix tagline (T348611)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:33:17] looking [13:33:22] ack [13:33:39] RESOLVED: NetworkDeviceAlarmActive: Alarm active on cr2-codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#Juniper_alarm - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DNetworkDeviceAlarmActive [13:33:43] TheresNoTime: logos on both wikis look good [13:33:47] !log samtar@deploy1003 matmarex, anzx, samtar: Continuing with sync [13:33:51] (03PS2) 10Volans: sre.hosts.reimage: check dbctl [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) [13:34:17] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1136374 (https://phabricator.wikimedia.org/T377878) (owner: 10Volans) [13:34:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10738897 (10phaultfinder) [13:34:58] (03CR) 10Fabfur: [C:03+1] sre: Add LibericaUnhealthyRealserverPooled alert [alerts] - 10https://gerrit.wikimedia.org/r/1136334 (https://phabricator.wikimedia.org/T391697) (owner: 10Vgutierrez) [13:35:34] (03PS1) 10Elukey: profile::pyrra: fix Istio SLO metrics template [puppet] - 10https://gerrit.wikimedia.org/r/1136381 [13:35:54] FIRING: [6x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:37:20] (03PS1) 10Fabfur: data-engineering: duplicating varnishkafka alerts [alerts] - 10https://gerrit.wikimedia.org/r/1136383 (https://phabricator.wikimedia.org/T391810) [13:38:05] !log reprepro -C component/nginx-ech include bookworm-wikimedia nginx_1.22.1-9+deb12u1+ech2_amd64.changes: T205378 
[13:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:08] T205378: Support ECH on Wikimedia servers - https://phabricator.wikimedia.org/T205378 [13:39:26] (03PS7) 10Ssingh: P:durum: add conditional to enable ECH [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [13:40:28] !log samtar@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135850|Enable SUL3 on most remaining beta cluster wikis]], [[gerrit:1136104|punjabiwikimedia, maiwikimedia: fix tagline (T348611)]] (duration: 12m 00s) [13:40:32] T348611: [Deployment] Fix logo clipping issues in mai and punjabi wikis - https://phabricator.wikimedia.org/T348611 [13:40:40] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [13:40:41] anzx: live and logo purged :) [13:40:55] TheresNoTime: thank you for deploying [13:41:18] !log UTC afternoon backport window done [13:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P74956 and previous config saved to /var/cache/conftool/dbconfig/20250414-134132-fceratto.json [13:42:11] (03CR) 10Elukey: [C:03+2] profile::pyrra: fix Istio SLO metrics template [puppet] - 10https://gerrit.wikimedia.org/r/1136381 (owner: 10Elukey) [13:42:11] thanks for deploying TheresNoTime! [13:42:57] (03CR) 10Volans: [C:03+2] commit: refactor asking for approval [software/homer] - 10https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [13:43:22] Lucas_WMDE: np! :) [13:43:54] (03CR) 10DCausse: [C:03+1] cirrussearch: remove no-longer-existing master-eligibles. 
[puppet] - 10https://gerrit.wikimedia.org/r/1136026 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:45:38] (03PS1) 10Bking: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) [13:45:40] (03PS1) 10Bking: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1136387 (https://phabricator.wikimedia.org/T388610) [13:46:25] (03Abandoned) 10Bking: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1136387 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:46:54] (03CR) 10Effie Mouzeli: [C:03+1] "cheers, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1135465 (owner: 10Hnowlan) [13:47:04] !log arnaudb@cumin1002 START - Cookbook sre.gerrit.failover from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org [13:47:09] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.gerrit.failover (exit_code=97) from gerrit1003.wikimedia.org to gerrit2003.wikimedia.org [13:47:51] (03PS2) 10Bking: cirrussearch: Add new master-eligibles [puppet] - 10https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) [13:49:23] jouncebot: nowandnext [13:49:23] For the next 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1300) [13:49:23] In 1 hour(s) and 40 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1530) [13:49:33] (03CR) 10Bking: [C:03+2] cirrussearch: remove no-longer-existing master-eligibles. 
[puppet] - 10https://gerrit.wikimedia.org/r/1136026 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [13:49:37] (03PS8) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [13:50:16] (03CR) 10Cathal Mooney: [C:03+1] Host BGP: ignore hosts with no primary IP [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1136150 (owner: 10Ayounsi) [13:50:33] (03CR) 10Cathal Mooney: [C:03+1] "sorry that's probably my fault with the nokia test servers is it?" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1136150 (owner: 10Ayounsi) [13:50:43] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [13:51:36] (03CR) 10Vgutierrez: [C:04-1] P:durum: add conditional to enable ECH (durum2002) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [13:53:27] (03PS9) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [13:53:56] (03CR) 10Vgutierrez: [C:03+2] sre: Add LibericaUnhealthyRealserverPooled alert [alerts] - 10https://gerrit.wikimedia.org/r/1136334 (https://phabricator.wikimedia.org/T391697) (owner: 10Vgutierrez) [13:54:30] (03PS2) 10Arnaudb: gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1136385 (https://phabricator.wikimedia.org/T260666) [13:55:16] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 
10Ssingh) [13:55:20] (03Merged) 10jenkins-bot: commit: refactor asking for approval [software/homer] - 10https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [13:55:27] (03CR) 10Cathal Mooney: "LGTM overall, I think we probably should get a list of what hosts are using this role and run PCC against them just to check there is no c" [puppet] - 10https://gerrit.wikimedia.org/r/1135455 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [13:55:40] (03CR) 10Volans: [C:03+2] commit: refactor asking for approval (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [13:55:54] FIRING: [8x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:55:57] (03PS3) 10Jelto: gitlab: fix type of s3 credentials [puppet] - 10https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922) [13:55:58] (03Merged) 10jenkins-bot: sre: Add LibericaUnhealthyRealserverPooled alert [alerts] - 10https://gerrit.wikimedia.org/r/1136334 (https://phabricator.wikimedia.org/T391697) (owner: 10Vgutierrez) [13:56:16] (03CR) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [13:56:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P74960 and previous config saved to /var/cache/conftool/dbconfig/20250414-135640-fceratto.json [13:56:49] 06SRE, 10MW-on-K8s, 06serviceops, 13Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791#10739003 (10hnowlan) 05Open→03Resolved All jobrunner hardware decommissioned or reclaimed, services torn down, puppet cleaned up. 
[13:57:49] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1178.eqiad.wmnet with OS bullseye [13:57:57] ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02), Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10739008 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1178.eqi... [13:58:42] (CR) Bking: [V:-1] "Do not merge until elastic2115 has been reimaged to cirrussearch2115" [puppet] - https://gerrit.wikimedia.org/r/1136386 (https://phabricator.wikimedia.org/T388610) (owner: Bking) [13:59:16] SRE, Infrastructure-Foundations, Patch-For-Review: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10739013 (MoritzMuehlenhoff) p:Triage→Medium [13:59:41] I accidentally thanos, it is coming back [13:59:46] (PS10) Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [14:00:23] (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5284/console" [puppet] - https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922) (owner: Jelto) [14:00:54] FIRING: [8x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [14:01:01] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2104.codfw.wmnet with reason: host reimage [14:01:41] !log temp disable "backend time" panel using unaggregated big mediawiki metric on "reading web performance" dashboard - T391677 [14:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:44] T391677: Audit dashboards using histogram_quantile on mediawiki_WikimediaEvents_editResponseTime - 
https://phabricator.wikimedia.org/T391677 [14:01:49] (03PS9) 10Federico Ceratto: Add switchover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) [14:01:50] (03CR) 10Federico Ceratto: "Basic cookbook moving the existing code from switchmaster." [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [14:03:23] (03CR) 10Volans: [C:03+2] commit: allow to approve/reject diffs globally (033 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [14:03:39] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:03:40] (03CR) 10Volans: [C:03+2] doc: update documentation configuration [software/homer] - 10https://gerrit.wikimedia.org/r/1134717 (owner: 10Volans) [14:04:05] (03PS3) 10Arnaudb: gerrit: failover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1136385 (https://phabricator.wikimedia.org/T260666) [14:04:48] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2104.codfw.wmnet with reason: host reimage [14:04:55] (03CR) 10Federico Ceratto: "The CR received a +1, is it ok if I set the required changes as Resolved?" 
[dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [14:04:57] (03CR) 10Arnaudb: gerrit: failover cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1136385 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [14:05:35] 10ops-codfw, 06SRE, 06DC-Ops: cr2-codfw: 2/4 PSU down - https://phabricator.wikimedia.org/T391790#10739035 (10Jhancock.wm) 05Open→03Resolved reseated the cables to the two downed CPU. direct result of tension from the fiber drop connected the cage in DH7 to DH5. pointed out the issue to the maintenan... [14:06:16] (03CR) 10Vgutierrez: [C:04-1] P:durum: add conditional to enable ECH (durum2002) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:06:20] (03PS1) 10Ssingh: modules: move durum.yaml to secret snake oil [labs/private] - 10https://gerrit.wikimedia.org/r/1136389 [14:07:10] (03CR) 10Muehlenhoff: [C:03+2] Remove from list of approvers [puppet] - 10https://gerrit.wikimedia.org/r/1136377 (owner: 10Muehlenhoff) [14:11:21] (03PS2) 10Ssingh: modules: move durum.yaml to secret snake oil [labs/private] - 10https://gerrit.wikimedia.org/r/1136389 [14:11:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T391056)', diff saved to https://phabricator.wikimedia.org/P74961 and previous config saved to /var/cache/conftool/dbconfig/20250414-141148-fceratto.json [14:11:51] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [14:12:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1212.eqiad.wmnet with reason: Maintenance [14:12:22] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [14:12:28] 
!log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T391056)', diff saved to https://phabricator.wikimedia.org/P74962 and previous config saved to /var/cache/conftool/dbconfig/20250414-141227-fceratto.json [14:14:01] (03Merged) 10jenkins-bot: commit: allow to approve/reject diffs globally [software/homer] - 10https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [14:14:23] (03Merged) 10jenkins-bot: doc: update documentation configuration [software/homer] - 10https://gerrit.wikimedia.org/r/1134717 (owner: 10Volans) [14:15:06] (03PS1) 10Jelto: gitlab: use a wmflib::expand_path compatible path for apus keys [labs/private] - 10https://gerrit.wikimedia.org/r/1136391 (https://phabricator.wikimedia.org/T378922) [14:15:16] (03CR) 10Filippo Giunchedi: [C:03+2] logstash: cast 'error' to object too [puppet] - 10https://gerrit.wikimedia.org/r/1135917 (owner: 10Filippo Giunchedi) [14:15:34] (03CR) 10Jelto: [V:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [14:15:52] (03CR) 10Ssingh: [V:03+2 C:03+2] modules: move durum.yaml to secret snake oil [labs/private] - 10https://gerrit.wikimedia.org/r/1136389 (owner: 10Ssingh) [14:16:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T391056)', diff saved to https://phabricator.wikimedia.org/P74963 and previous config saved to /var/cache/conftool/dbconfig/20250414-141639-fceratto.json [14:16:55] (03CR) 10Ssingh: [C:03+1] "It's a yes from me!" [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [14:18:29] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10739100 (10VRiley-WMF) Understood. I will be reaching out to them again to see if we can request that plan of action that you've recommended. 
I can ask them about the mainboard to see if they would replac... [14:19:54] (03PS11) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [14:21:23] (03CR) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:21:39] (03CR) 10Nikerabbit: Catalog ContentTranslation tables (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1135730 (https://phabricator.wikimedia.org/T386094) (owner: 10Nik Gkountas) [14:23:05] (03CR) 10Ssingh: [C:04-1] P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:23:39] FIRING: [2x] SystemdUnitFailed: elasticsearch-disable-readahead.service on elastic2115:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:24:21] (03PS12) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [14:25:23] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:26:04] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2104.codfw.wmnet with OS bullseye [14:28:01] (03PS1) 10Volans: CHANGELOG: add changelogs for release v0.9.0 [software/homer] - 10https://gerrit.wikimedia.org/r/1136392 [14:28:29] (03CR) 10Filippo Giunchedi: "I believe we can abandon this now" [alerts] - 
10https://gerrit.wikimedia.org/r/1135673 (owner: 10Slyngshede) [14:28:53] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v0.9.0 [software/homer] - 10https://gerrit.wikimedia.org/r/1136392 (owner: 10Volans) [14:29:13] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1136022 (https://phabricator.wikimedia.org/T374711) (owner: 10JHathaway) [14:30:46] (03CR) 10Filippo Giunchedi: [C:03+1] netbox-hiera: adding pdu type [puppet] - 10https://gerrit.wikimedia.org/r/1128479 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [14:31:32] (03PS13) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [14:31:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P74964 and previous config saved to /var/cache/conftool/dbconfig/20250414-143146-fceratto.json [14:32:16] (03CR) 10Filippo Giunchedi: [C:03+1] pdu_config_netbox: add new module to grab PDUs from netbox [puppet] - 10https://gerrit.wikimedia.org/r/1124083 (https://phabricator.wikimedia.org/T387231) (owner: 10Tiziano Fogli) [14:35:07] (03PS12) 10Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - 10https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) [14:40:26] (03CR) 10Ayounsi: [C:03+1] CHANGELOG: add changelogs for release v0.9.0 [software/homer] - 10https://gerrit.wikimedia.org/r/1136392 (owner: 10Volans) [14:40:55] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v0.9.0 [software/homer] - 10https://gerrit.wikimedia.org/r/1136392 (owner: 10Volans) [14:42:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): relocate (1) discovery-search elastic1067 out of eqiad D6 - https://phabricator.wikimedia.org/T391542#10739150 (10Gehel) [14:45:06] (03PS1) 10Scott French: Remove 
PHP 8.1 migration WikimediaEvents settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135507 (https://phabricator.wikimedia.org/T391421) [14:45:08] (03PS1) 10Scott French: hieradata: remove mw-php-migration.lua from plugin chains [puppet] - 10https://gerrit.wikimedia.org/r/1135504 (https://phabricator.wikimedia.org/T391421) [14:45:39] (03PS14) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [14:46:17] (03PS1) 10Herron: logstash: increase refresh_interval to 10s in index templates [puppet] - 10https://gerrit.wikimedia.org/r/1136394 (https://phabricator.wikimedia.org/T391714) [14:46:40] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5290/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:46:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P74965 and previous config saved to /var/cache/conftool/dbconfig/20250414-144653-fceratto.json [14:47:09] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10739175 (10JTweed-WMF) [14:51:54] (03CR) 10Vgutierrez: [C:04-1] P:durum: add conditional to enable ECH (durum2002) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:53:43] (03CR) 10Effie Mouzeli: [C:03+1] "Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135507 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [14:54:37] (03CR) 10Effie Mouzeli: [C:03+1] "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1135504 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [14:54:59] (03PS15) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [14:58:07] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5291/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:58:55] (03CR) 10Ahmon Dancy: [C:03+1] "Looks reasonable to me" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136335 (https://phabricator.wikimedia.org/T387781) (owner: 10Hashar) [14:59:11] (03PS16) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [14:59:37] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5292/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:59:41] (03PS1) 10Volans: Release v0.9.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1136396 [15:00:03] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:02:01] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T391056)', diff saved to https://phabricator.wikimedia.org/P74966 and previous config saved to /var/cache/conftool/dbconfig/20250414-150200-fceratto.json [15:02:04] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:02:16] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on db1240.eqiad.wmnet with reason: Maintenance [15:05:31] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1129179 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [15:07:48] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:08:39] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:43] (03PS2) 10Eevans: restbase: bootstrap restbase1044 (refresh for restbase1029) [puppet] - 10https://gerrit.wikimedia.org/r/1130174 (https://phabricator.wikimedia.org/T389423) [15:11:43] (03PS2) 10Eevans: restbase: bootstrap restbase1045 (refresh for restbase1030) [puppet] - 10https://gerrit.wikimedia.org/r/1130175 (https://phabricator.wikimedia.org/T389423) [15:11:44] (03CR) 10Ayounsi: [C:03+1] Release v0.9.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1136396 (owner: 10Volans) [15:11:52] (03CR) 10Ayounsi: [C:03+2] Host BGP: ignore hosts with no primary IP [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1136150 (owner: 10Ayounsi) [15:12:34] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1130174 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans) [15:12:41] (03CR) 10Filippo Giunchedi: [C:03+1] logstash: increase refresh_interval to 10s in index templates [puppet] - 10https://gerrit.wikimedia.org/r/1136394 (https://phabricator.wikimedia.org/T391714) (owner: 10Herron) [15:13:10] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2149.codfw.wmnet with reason: Maintenance [15:13:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T391056)', diff saved 
to https://phabricator.wikimedia.org/P74967 and previous config saved to /var/cache/conftool/dbconfig/20250414-151316-fceratto.json [15:13:19] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:15:31] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-misc2002.codfw.wmnet [15:18:16] (03CR) 10Filippo Giunchedi: [C:03+2] snmp_exporter: replace prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1129179 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [15:18:23] (03PS2) 10Filippo Giunchedi: snmp_exporter: replace prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1129179 (https://phabricator.wikimedia.org/T389170) [15:18:46] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] snmp_exporter: replace prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1129179 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [15:20:18] (03CR) 10Volans: [C:03+2] Release v0.9.0 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1136396 (owner: 10Volans) [15:20:22] (03CR) 10Eevans: [C:03+2] restbase: bootstrap restbase1044 (refresh for restbase1029) [puppet] - 10https://gerrit.wikimedia.org/r/1130174 (https://phabricator.wikimedia.org/T389423) (owner: 10Eevans) [15:22:17] (03CR) 10Filippo Giunchedi: [C:03+1] "My understanding is that refresh time affects how long it takes for indexed documents to be available for search; worth adding "high frequ" [puppet] - 10https://gerrit.wikimedia.org/r/1136394 (https://phabricator.wikimedia.org/T391714) (owner: 10Herron) [15:22:24] (03PS1) 10Gergő Tisza: private: Drop $wgCentralAuthSul3SharedDomainRestrictions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136397 (https://phabricator.wikimedia.org/T390329) [15:22:57] (03CR) 10Federico Ceratto: [C:03+1] mysql: make MysqlRemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136325 (owner: 10Volans) 
[15:22:58] (03CR) 10Gergő Tisza: [C:04-2] "Needs to wait a week for the train." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136397 (https://phabricator.wikimedia.org/T390329) (owner: 10Gergő Tisza) [15:23:58] !log volans@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.9.0 - volans@cumin1002 [15:24:48] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-misc2002.codfw.wmnet [15:24:49] (03CR) 10Volans: [C:03+2] mysql: make MysqlRemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136325 (owner: 10Volans) [15:24:59] (03CR) 10Volans: [C:03+2] cookbook modules: use docstring for title [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136326 (owner: 10Volans) [15:25:39] !log volans@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.9.0 - volans@cumin1002 [15:25:46] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1181.eqiad.wmnet with OS bullseye [15:25:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 13Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10739522 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1181... 
[15:26:18] !log deployed homer v0.9.0 to cumin hosts [15:26:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:29:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T391056)', diff saved to https://phabricator.wikimedia.org/P74968 and previous config saved to /var/cache/conftool/dbconfig/20250414-152911-fceratto.json [15:29:16] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [15:29:57] (03CR) 10Ssingh: [V:03+1 C:04-1] "Still figuring out the correlation between outer and inner SNI." [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:30:05] jan_drewniak: Your horoscope predicts another Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1530). 
[15:30:05] (03PS1) 10Herron: logstash: set "index.translog.durability": "async" as template default [puppet] - 10https://gerrit.wikimedia.org/r/1136400 (https://phabricator.wikimedia.org/T391714) [15:30:25] skipping portal deployments this week [15:30:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10739548 (10phaultfinder) [15:30:50] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_ulsfo and not P{cp4047.ulsfo.wmnet} and not P{cp4045.ulsfo.wmnet} and A:cp [15:31:52] (03CR) 10CI reject: [V:04-1] logstash: set "index.translog.durability": "async" as template default [puppet] - 10https://gerrit.wikimedia.org/r/1136400 (https://phabricator.wikimedia.org/T391714) (owner: 10Herron) [15:32:58] !log eevans@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase1044.eqiad.wmnet with reason: Bootstrapping — T389423 [15:33:02] T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423 [15:34:02] (03PS1) 10Elukey: profile::pyrra: fix latency total count metric for Istio SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1136401 [15:34:27] (03Merged) 10jenkins-bot: mysql: make MysqlRemoteHosts iterable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136325 (owner: 10Volans) [15:35:16] (03Merged) 10jenkins-bot: cookbook modules: use docstring for title [software/spicerack] - 10https://gerrit.wikimedia.org/r/1136326 (owner: 10Volans) [15:35:25] (03CR) 10Elukey: dnsdisc: make it compatible with bookworm (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133808 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [15:35:37] (03CR) 10Elukey: [C:03+1] dnsdisc: make it compatible with bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133808 
(https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [15:36:23] (03CR) 10Elukey: [C:03+2] profile::pyrra: fix latency total count metric for Istio SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1136401 (owner: 10Elukey) [15:36:37] (03CR) 10Volans: [C:03+2] dnsdisc: make it compatible with bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133808 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [15:36:40] (03CR) 10Herron: [C:03+1] profile::pyrra: fix latency total count metric for Istio SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1136401 (owner: 10Elukey) [15:36:42] (03PS1) 10Hashar: Gemfile: update rspec-puppet to 2.10.x [puppet] - 10https://gerrit.wikimedia.org/r/1136403 [15:36:44] (03PS17) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [15:37:23] (03CR) 10Federico Ceratto: [C:03+1] Add zarcillo (aux k8s) CNAME (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [15:37:29] (03CR) 10Federico Ceratto: [C:03+2] Add zarcillo (aux k8s) CNAME [dns] - 10https://gerrit.wikimedia.org/r/1135438 (https://phabricator.wikimedia.org/T384212) (owner: 10Federico Ceratto) [15:37:41] !log fceratto@cumin1002 START - Cookbook sre.dns.netbox [15:37:45] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5294/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:37:48] !log bootstrapping Cassandra/restbase1044-a — T389423 [15:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:39] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:40:17] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:41:31] !log fceratto@cumin1002 START - Cookbook sre.dns.netbox [15:42:58] (03PS3) 10Dzahn: jenkins: fix puppet error, systemd override requires systemd service [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) [15:44:05] !log fceratto@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:44:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P74969 and previous config saved to /var/cache/conftool/dbconfig/20250414-154419-fceratto.json [15:45:31] (03CR) 10Dzahn: "So you are saying the flag isn't actually transitory and should stay around forever? That's also a valid answer, but there would need to b" [puppet] - 10https://gerrit.wikimedia.org/r/1136039 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [15:45:47] (03Merged) 10jenkins-bot: dnsdisc: make it compatible with bookworm [software/spicerack] - 10https://gerrit.wikimedia.org/r/1133808 (https://phabricator.wikimedia.org/T389380) (owner: 10Volans) [15:47:04] (03CR) 10Dzahn: "I did not make the claim that it was easy. I was trying to start a discussion how we can move forward here. The answer can be many things," [puppet] - 10https://gerrit.wikimedia.org/r/1136039 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [15:47:49] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:48:29] federico3: hi. did you run authdns-update? thanks! 
[15:48:32] 06SRE, 10observability: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852 (10elukey) 03NEW [15:48:33] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:48:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:48:51] (03CR) 10Alexandros Kosiaris: [C:03+1] Remove PHP 8.1 migration WikimediaEvents settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135507 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [15:49:01] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:49:41] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:49:41] 06SRE, 10observability: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10739720 (10elukey) Code changes merged so far: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1135746 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1136... [15:49:43] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:49:51] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: 10Cwhite) [15:50:00] (03CR) 10Bartosz Dziewoński: [C:03+1] private: Drop $wgCentralAuthSul3SharedDomainRestrictions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136397 (https://phabricator.wikimedia.org/T390329) (owner: 10Gergő Tisza) [15:50:02] 06SRE, 10MediaWiki-Core-HTTP-Cache, 06Traffic-Icebox, 07Wikimedia-Performance-recommendation: Separate Cache-Control header for proxy and client - https://phabricator.wikimedia.org/T50835#10739722 (10Seb35) There is the [[https://datatracker.ietf.org/doc/html/rfc9213|RFC 9213 "Targeted HTTP Cache Control"]... 
[15:50:03] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:50:11] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:50:21] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:50:26] (03Restored) 10Dzahn: ci: switch jenkins deployment method on contint to scap [puppet] - 10https://gerrit.wikimedia.org/r/1136039 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [15:50:29] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:50:29] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:50:34] (03PS2) 10Dzahn: ci: 
switch jenkins deployment method on contint to scap [puppet] - 10https://gerrit.wikimedia.org/r/1136039 (https://phabricator.wikimedia.org/T384595) [15:50:39] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:50:43] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:51:03] (03Abandoned) 10Dzahn: ci: switch jenkins deployment method on contint to scap [puppet] - 10https://gerrit.wikimedia.org/r/1136039 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [15:51:11] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:51:25] sukhe: no [15:51:29] please do :) [15:51:31] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:51:37] this is what the above alert is about [15:51:49] (03Restored) 10Dzahn: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796 
(https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [15:51:58] (03PS8) 10Dzahn: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [15:52:03] (03Abandoned) 10Dzahn: releases: invert use_scap3_deployment for jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1135796 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [15:52:03] how? I've been told to run sre.dns.netbox but it's showing "Nothing to commit" [15:52:14] federico3: no worries [15:52:31] PROBLEM - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is CRITICAL: Local zone files are NOT in sync with operations/dns.git (SHA: local is 03295d95c8a084d1b2e7aebcf5e46c74ba210dc8, dns.git is adc8233852cb138ce074e783f1f46647fa4376fe) https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:52:32] https://wikitech.wikimedia.org/wiki/DNS#Deploying_DNS_changes [15:52:33] I'll follow the authdns update run as by wiki, ok? [15:52:36] yep [15:53:21] Ah, our doc is bad [15:53:23] Editing [15:53:32] (03CR) 10Vgutierrez: P:durum: add conditional to enable ECH (durum2002) (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [15:53:39] FIRING: [5x] ProbeDown: Service restbase1044-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:53:39] sudo -i authdns-update from dns1004.wikimedia.org , sounds good? [15:53:50] yep [15:53:53] !log fceratto@dns1004 START - running authdns-update [15:54:57] https://wikitech.wikimedia.org/wiki/Kubernetes/Ingress#DNS_changes fixed [15:55:18] claime: thanks! 
[15:55:21] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:55:29] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns3003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:55:29] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:55:39] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:55:42] thanks claime [15:55:43] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:56:11] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:56:23] !log fceratto@dns1004 END - running authdns-update [15:56:31] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:56:44] ok, the tool ran without errors [15:56:50] nice thanks [15:57:02] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1181.eqiad.wmnet with OS bullseye [15:57:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 
2025-05-02), 13Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10739758 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1181.eqi... [15:57:31] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns1005 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:57:49] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:58:33] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2006 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:58:58] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:59:01] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns6001 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:59:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P74970 and previous config saved to /var/cache/conftool/dbconfig/20250414-155925-fceratto.json [15:59:38] (03CR) 10Cwhite: [C:03+1] "I don't think this is the problem, but this won't hurt." 
[puppet] - 10https://gerrit.wikimedia.org/r/1136394 (https://phabricator.wikimedia.org/T391714) (owner: 10Herron) [15:59:41] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns4004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [15:59:43] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns5003 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:00:03] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns2004 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:00:11] RECOVERY - check if authdns-update was run after a change was merged to operations/dns.git on dns7002 is OK: Local zone files and operations/dns.git are in sync https://wikitech.wikimedia.org/wiki/DNS%23authdns_update_run [16:00:27] 10ops-eqiad, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854 (10elukey) 03NEW [16:01:43] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Migrate the KDCs to Bookworm - https://phabricator.wikimedia.org/T390863#10739807 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [16:03:38] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_ulsfo and not P{cp4037.ulsfo.wmnet} and A:cp [16:03:57] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:04:53] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:05:05] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1181.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [16:06:06] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1181.eqiad.wmnet with OS bullseye [16:06:07] (03PS3) 10Bking: sre.elasticsearch.rolling-operation: use tftp, run puppet with new hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/1135826 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper) [16:06:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 13Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10739853 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1181... 
[16:06:45] (03Abandoned) 10Slyngshede: Netbox: Temporarily remove Netbox alerting [alerts] - 10https://gerrit.wikimedia.org/r/1135673 (owner: 10Slyngshede) [16:10:35] (03PS4) 10Hashar: ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) [16:11:07] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:11:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:11:52] (03CR) 10CI reject: [V:04-1] ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [16:13:56] (03CR) 10CI reject: [V:04-1] sre.elasticsearch.rolling-operation: use tftp, run puppet with new hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/1135826 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper) [16:14:26] (03Abandoned) 10Hashar: Gemfile: update to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/976304 (owner: 10Jbond) [16:14:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T391056)', diff saved to https://phabricator.wikimedia.org/P74971 and previous config saved to /var/cache/conftool/dbconfig/20250414-161432-fceratto.json [16:14:36] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [16:14:42] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Wikimedia-Incident: Backlog in mailing lists is increasing - https://phabricator.wikimedia.org/T391330#10739874 (10Dzahn) Checking now the mail queue is much smaller than before. (hundreds vs thousands). So missing mail might have been delivered... 
[16:14:49] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2156.codfw.wmnet with reason: Maintenance [16:15:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2186.codfw.wmnet with reason: Maintenance [16:15:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T391056)', diff saved to https://phabricator.wikimedia.org/P74972 and previous config saved to /var/cache/conftool/dbconfig/20250414-161512-fceratto.json [16:15:44] (03PS4) 10Bking: sre.elasticsearch.rolling-operation: use tftp, run puppet with new hostname [cookbooks] - 10https://gerrit.wikimedia.org/r/1135826 (https://phabricator.wikimedia.org/T383811) (owner: 10Ryan Kemper) [16:18:18] (03PS5) 10Hashar: ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) [16:19:36] (03CR) 10CI reject: [V:04-1] ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [16:19:48] seriously ... 
[16:20:53] (03PS6) 10Hashar: ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) [16:20:57] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:21:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:21:29] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1181.eqiad.wmnet with OS bullseye [16:21:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 13Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10739892 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1181.eqi... [16:22:09] (03CR) 10CI reject: [V:04-1] ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [16:22:37] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10739895 (10elukey) I can confirm that using `start initialization` and stopping it right afterwards makes `set jbod` working, without a... 
[16:23:25] (03PS3) 10Herron: logstash: set "index.translog.durability": "async" as template default [puppet] - 10https://gerrit.wikimedia.org/r/1136400 (https://phabricator.wikimedia.org/T391714) [16:24:12] (03PS7) 10Hashar: ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) [16:26:44] (03CR) 10CI reject: [V:04-1] ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [16:28:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:28:53] (03PS1) 10Ebernhardson: Revert "mjolnir: temp remove msearch daemon from codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1136414 [16:29:17] (03CR) 10CI reject: [V:04-1] Revert "mjolnir: temp remove msearch daemon from codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1136414 (owner: 10Ebernhardson) [16:29:46] (03PS2) 10Ebernhardson: Revert "mjolnir: temp remove msearch daemon from codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1136414 [16:30:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T391056)', diff saved to https://phabricator.wikimedia.org/P74973 and previous config saved to /var/cache/conftool/dbconfig/20250414-163037-fceratto.json [16:30:41] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [16:31:57] PROBLEM - Hadoop NodeManager on an-worker1157 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:32:33] (03PS8) 10Hashar: ci: always add eatmydata to cow images [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) [16:37:00] 06SRE, 10observability: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10739985 (10elukey) @herron something is off in one of the recording rules, see for example https://w.wiki/Doru. Do you have an idea why this is so different? I didn't... [16:38:08] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1181.eqiad.wmnet with OS bullseye [16:38:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 13Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10739991 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host an-worker1181... 
[16:39:09] PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:39:57] PROBLEM - Hadoop NodeManager on an-worker1208 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:41:45] PROBLEM - Hadoop NodeManager on an-worker1081 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:42:08] (03CR) 10Hashar: "Done as of patchset 8" [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [16:43:38] (03CR) 10Hashar: [C:03+1] "I have cherry picked it on `integration-puppetserver-01.integration.eqiad1.wikimedia.cloud` and ran Puppet on the two CI instances buildin" [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [16:43:48] (03CR) 10Hashar: [C:03+1] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1135966 (https://phabricator.wikimedia.org/T240430) (owner: 10Hashar) [16:45:43] 10ops-codfw, 06DC-Ops: Power Supply - PS Redundancy - issue on wikikube-worker2042:9290 - https://phabricator.wikimedia.org/T391860 (10phaultfinder) 03NEW [16:45:45] RECOVERY - Hadoop NodeManager on an-worker1081 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:45:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2156', diff saved to https://phabricator.wikimedia.org/P74974 and previous config saved to /var/cache/conftool/dbconfig/20250414-164545-fceratto.json [16:45:57] RECOVERY - Hadoop NodeManager on an-worker1208 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:45:57] RECOVERY - Hadoop NodeManager on an-worker1157 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:47:47] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for kcoleman - https://phabricator.wikimedia.org/T391861 (10KColeman-WMF) 03NEW [16:47:55] PROBLEM - Hadoop NodeManager on an-worker1149 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:49:09] RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:52:41] PROBLEM - Hadoop NodeManager on an-worker1138 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [16:56:25] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-text_magru [16:56:53] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on 
A:cp-upload_magru [16:59:49] sirenbot: sing [16:59:56] _joe_: :( [17:00:05] swfrench-wmf: #bothumor My software never has bugs. It just develops random features. Rise for MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1700). [17:00:05] ryankemper: Your horoscope predicts another Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1700). [17:00:15] !sing [17:00:15] Never gonna give you up [17:00:16] Never gonna let you down [17:00:16] Never gonna run around and desert you [17:00:17] Never gonna make you cry [17:00:18] Never gonna say goodbye [17:00:19] Never gonna tell a lie and hurt you [17:00:25] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:00:25] *chef's kiss* [17:00:40] Amir1: BTW, Dexbot seems to not be active on wikitech any more? [17:00:45] o/ [17:00:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P74975 and previous config saved to /var/cache/conftool/dbconfig/20250414-170052-fceratto.json [17:00:57] https://phabricator.wikimedia.org/T391346 [17:01:06] James_F: I think it's something with SUL3 roll out [17:01:07] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:01:16] Amir1: Aha, yes, that'd break things.
[17:01:41] RECOVERY - Hadoop NodeManager on an-worker1138 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:01:57] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:02:11] FYI, I'll be starting a backport deployment for some PHP 8.1 migration cleanuiup shortly. [17:02:12] (03PS18) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [17:02:15] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.184 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:02:26] *cleanup [17:03:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by swfrench@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135507 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [17:03:11] PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:03:11] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5296/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [17:03:55] RECOVERY - Hadoop NodeManager on an-worker1149 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:03:56] (03Merged) 10jenkins-bot: Remove PHP 8.1 migration WikimediaEvents settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1135507 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [17:04:05] (03CR) 10Ssingh: [V:03+1] P:durum: add conditional to enable ECH (durum2002) (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [17:04:13] !log swfrench@deploy1003 Started scap sync-world: Backport for [[gerrit:1135507|Remove PHP 8.1 migration WikimediaEvents settings (T391421)]] [17:04:16] T391421: Clean up and abstract PHP_ENGINE 8.1 routing - https://phabricator.wikimedia.org/T391421 [17:04:35] (03PS1) 10Ebernhardson: search: Update envoy alerts for discovery dns names [alerts] - 10https://gerrit.wikimedia.org/r/1136422 (https://phabricator.wikimedia.org/T143553) [17:04:45] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10740204 (10phaultfinder) [17:05:57] (03CR) 10Dzahn: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1136359 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [17:06:04] (03PS2) 10Ebernhardson: search: Update envoy alerts for discovery dns names [alerts] - 10https://gerrit.wikimedia.org/r/1136422 (https://phabricator.wikimedia.org/T143553) [17:06:11] RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:06:13] (03CR) 10Ssingh: [V:03+1 C:04-1] "2025/04/14 17:05:51 [emerg] 2928385#2928385: "http" directive is not allowed here in /etc/nginx/sites-enabled/durum:10" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 
(https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [17:08:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:08:53] !log swfrench@deploy1003 swfrench: Backport for [[gerrit:1135507|Remove PHP 8.1 migration WikimediaEvents settings (T391421)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:10:00] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1181.eqiad.wmnet with OS bullseye [17:10:12] (03CR) 10Ebernhardson: [C:03+1] cirrussearch: enable knn native lib [puppet] - 10https://gerrit.wikimedia.org/r/1135441 (https://phabricator.wikimedia.org/T388549) (owner: 10DCausse) [17:10:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02), 13Patch-For-Review: Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10740234 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host an-worker1181.eqi... 
[17:10:38] !log swfrench@deploy1003 swfrench: Continuing with sync [17:12:41] FIRING: [7x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:13:21] PROBLEM - Hadoop NodeManager on an-worker1110 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:13:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:14:33] PROBLEM - Hadoop NodeManager on an-worker1148 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:14:50] (03Abandoned) 10Ebernhardson: tlsproxy: Reload nginx when on-disk and served cert don't match [puppet] - 10https://gerrit.wikimedia.org/r/1133187 (https://phabricator.wikimedia.org/T390599) (owner: 10Ebernhardson) [17:15:33] RECOVERY - Hadoop NodeManager on an-worker1148 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:15:33] PROBLEM - Hadoop NodeManager on an-worker1159 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:15:58] (03CR) 10Dzahn: [C:03+2] phabricator weekly changes email: Lower "in progress" threshold to 1y [puppet] - 10https://gerrit.wikimedia.org/r/1136028 (https://phabricator.wikimedia.org/T380300) (owner: 10Aklapper) [17:15:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T391056)', diff saved to https://phabricator.wikimedia.org/P74976 and previous config saved to /var/cache/conftool/dbconfig/20250414-171558-fceratto.json [17:16:02] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [17:16:04] (03CR) 10Bking: "Plugins have been updated across CODFW, so we are clear to revert." [puppet] - 10https://gerrit.wikimedia.org/r/1136414 (owner: 10Ebernhardson) [17:16:06] (03CR) 10Bking: [C:03+2] Revert "mjolnir: temp remove msearch daemon from codfw" [puppet] - 10https://gerrit.wikimedia.org/r/1136414 (owner: 10Ebernhardson) [17:16:15] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2177.codfw.wmnet with reason: Maintenance [17:16:21] RECOVERY - Hadoop NodeManager on an-worker1110 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:16:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T391056)', diff saved to https://phabricator.wikimedia.org/P74977 and previous config saved to /var/cache/conftool/dbconfig/20250414-171622-fceratto.json [17:17:23] !log swfrench@deploy1003 Finished scap sync-world: Backport for [[gerrit:1135507|Remove PHP 8.1 migration WikimediaEvents settings (T391421)]] (duration: 13m 10s) [17:17:27] T391421: Clean up and abstract PHP_ENGINE 8.1 routing - https://phabricator.wikimedia.org/T391421 [17:17:41] FIRING: [8x] 
ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:18:33] FYI, I have a couple of other cleanups to fit in during this window, but I'm done with deployments [17:18:43] (03CR) 10Bking: [C:03+2] cirrussearch: enable knn native lib [puppet] - 10https://gerrit.wikimedia.org/r/1135441 (https://phabricator.wikimedia.org/T388549) (owner: 10DCausse) [17:20:46] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10740369 (10phaultfinder) [17:20:53] !log running: cumin 'A:cp-text' 'disable-puppet "merging ATS config change - T391421"' [17:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:48] (03CR) 10Scott French: [C:03+2] hieradata: remove mw-php-migration.lua from plugin chains [puppet] - 10https://gerrit.wikimedia.org/r/1135504 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [17:22:27] 06SRE, 10observability: Create a Pyrra template for Istio-based K8s services and apply it to Citoid - https://phabricator.wikimedia.org/T391852#10740391 (10herron) First thing I notice is the first panel (using recording rule) applies rate(sum()) and the second panel sum(rate()) Seems like a similar issue to... 
[17:22:33] RECOVERY - Hadoop NodeManager on an-worker1159 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:23:07] PROBLEM - Hadoop NodeManager on an-worker1194 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:25:36] !log running: run-puppet-agent -e "merging ATS config change - T391421" on cp4040 [17:25:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391644#10740420 (10phaultfinder) [17:25:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:39] T391421: Clean up and abstract PHP_ENGINE 8.1 routing - https://phabricator.wikimedia.org/T391421 [17:25:48] !log hashar@deploy1003 Started deploy [integration/docroot@e92740c]: opensource: remove OOjs Router - T358813 [17:25:51] T358813: Document mediawiki-router, move oojs-router into core - https://phabricator.wikimedia.org/T358813 [17:25:59] !log hashar@deploy1003 Finished deploy [integration/docroot@e92740c]: opensource: remove OOjs Router - T358813 (duration: 00m 10s) [17:30:47] !log running: cumin -b8 -s60 'A:cp-text' 'run-puppet-agent -e "merging ATS config change - T391421"' [17:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:50] T391421: Clean up and abstract PHP_ENGINE 8.1 routing - https://phabricator.wikimedia.org/T391421 [17:32:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T391056)', diff saved to https://phabricator.wikimedia.org/P74978 and previous config saved to /var/cache/conftool/dbconfig/20250414-173218-fceratto.json [17:32:22] T391056: Drop 
afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [17:32:36] (03PS1) 10Scott French: P:trafficserver::backend: absent mw-php-migration [puppet] - 10https://gerrit.wikimedia.org/r/1135505 (https://phabricator.wikimedia.org/T391421) [17:32:38] (03PS1) 10Scott French: P:trafficserver::backend: remove mw-php-migration [puppet] - 10https://gerrit.wikimedia.org/r/1135506 (https://phabricator.wikimedia.org/T391421) [17:33:39] RESOLVED: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:37:56] (03CR) 10Ssingh: [C:03+1] P:trafficserver::backend: absent mw-php-migration [puppet] - 10https://gerrit.wikimedia.org/r/1135505 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [17:38:02] (03CR) 10Ssingh: [C:03+1] P:trafficserver::backend: remove mw-php-migration [puppet] - 10https://gerrit.wikimedia.org/r/1135506 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [17:38:13] (03CR) 10Herron: [C:03+1] "🌅" [puppet] - 10https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: 10Cwhite) [17:47:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P74979 and previous config saved to /var/cache/conftool/dbconfig/20250414-174725-fceratto.json [17:49:10] (03PS1) 10Ebernhardson: search: Remove CirrusSearchJVMGCYoungPoolInsufficient alert [alerts] - 10https://gerrit.wikimedia.org/r/1136426 [17:49:23] (03CR) 10Herron: [C:03+2] "Thanks for the reviews! 
Good idea, cc-ing releng for awareness" [puppet] - 10https://gerrit.wikimedia.org/r/1136394 (https://phabricator.wikimedia.org/T391714) (owner: 10Herron) [17:50:08] RECOVERY - Hadoop NodeManager on an-worker1194 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [17:52:27] (03CR) 10Ahmon Dancy: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1136394 (https://phabricator.wikimedia.org/T391714) (owner: 10Herron) [17:52:33] (03PS19) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [17:53:34] (03CR) 10Scott French: [C:03+2] P:trafficserver::backend: absent mw-php-migration [puppet] - 10https://gerrit.wikimedia.org/r/1135505 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [17:53:41] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5297/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [17:54:55] (03CR) 10Ssingh: "Changes since last time:" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:00:04] James_F: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikifunctions MediaWiki integration backport II deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T1800). 
[18:00:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136368 (https://phabricator.wikimedia.org/T386020) (owner: 10Jforrester) [18:00:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136379 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester) [18:00:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136380 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester) [18:00:54] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [18:01:11] (03Merged) 10jenkins-bot: Switch test Wikifunctions client deployment from test2wiki to test2iki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136379 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester) [18:01:15] (03Merged) 10jenkins-bot: Document Wikifunctions options, adding wgWikiLambdaClientModeOffline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136380 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester) [18:01:39] (03CR) 10Ssingh: "I think this is ready for review. Thanks a lot for the feedback and rubber ducking, @vgutierrez@wikimedia.org!" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:02:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P74980 and previous config saved to /var/cache/conftool/dbconfig/20250414-180232-fceratto.json [18:04:05] (03CR) 10Ssingh: "Dropping ssl_dhparam too. Not really required for TLS1.3." 
[puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:04:47] (03Merged) 10jenkins-bot: Complete our RecentChanges entry generation and formatting [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136368 (https://phabricator.wikimedia.org/T386020) (owner: 10Jforrester) [18:05:04] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1136368|Complete our RecentChanges entry generation and formatting (T386020)]], [[gerrit:1136379|Switch test Wikifunctions client deployment from test2wiki to test2iki (T391584)]], [[gerrit:1136380|Document Wikifunctions options, adding wgWikiLambdaClientModeOffline (T391584)]] [18:05:11] T386020: Implement design for change propagation when WF function calls change - https://phabricator.wikimedia.org/T386020 [18:05:11] T391584: Complete the successful deployment of embedded Wikifunctions to testwiki - https://phabricator.wikimedia.org/T391584 [18:05:15] (03PS20) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [18:06:21] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5298/co" [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:15:07] (03CR) 10Vgutierrez: P:durum: add conditional to enable ECH (durum2002) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:15:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391644#10740691 (10phaultfinder) [18:16:27] (03CR) 10Ssingh: [V:03+1] P:durum: add conditional to enable ECH (durum2002) (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:17:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T391056)', diff saved to https://phabricator.wikimedia.org/P74981 and previous config saved to /var/cache/conftool/dbconfig/20250414-181740-fceratto.json [18:17:44] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [18:17:56] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2190.codfw.wmnet with reason: Maintenance [18:18:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T391056)', diff saved to https://phabricator.wikimedia.org/P74982 and previous config saved to /var/cache/conftool/dbconfig/20250414-181802-fceratto.json [18:18:10] PROBLEM - Hadoop NodeManager on an-worker1085 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:19:11] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10740730 (10Jclark-ctr) a:03VRiley-WMF [18:19:24] (03CR) 10Ssingh: [V:03+1] P:durum: add conditional to enable ECH (durum2002) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:19:54] (03PS21) 10Ssingh: P:durum: add conditional to enable ECH (durum2002) [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) [18:19:57] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10740731 (10Jclark-ctr) a:03VRiley-WMF [18:20:15] (03CR) 10Ssingh: "If we are removing CSP, I removed cache-control here 
as well." [puppet] - 10https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [18:20:45] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10740733 (10Jclark-ctr) a:03VRiley-WMF [18:20:54] 10ops-codfw, 06SRE, 06DC-Ops: Power Supply - PS Redundancy - issue on wikikube-worker2042:9290 - https://phabricator.wikimedia.org/T391860#10740735 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm another instances of a third party loosening power cables in our rack. reseated. [18:23:39] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:24:37] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1136368|Complete our RecentChanges entry generation and formatting (T386020)]], [[gerrit:1136379|Switch test Wikifunctions client deployment from test2wiki to test2iki (T391584)]], [[gerrit:1136380|Document Wikifunctions options, adding wgWikiLambdaClientModeOffline (T391584)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [18:24:41] T386020: Implement design for change propagation when WF function calls change - https://phabricator.wikimedia.org/T386020 [18:24:42] T391584: Complete the successful deployment of embedded Wikifunctions to testwiki - https://phabricator.wikimedia.org/T391584 [18:25:48] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10740771 (10phaultfinder) [18:27:01] !log Run `mwscript sql --wiki=testwiki /srv/mediawiki-staging/php-1.44.0-wmf.24/extensions/WikiLambda/sql/mysql/table-usage.sql` for T391885 [18:27:04] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:05] T391885: Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'testwiki.wikifunctionsclient_usage' doesn't existFunction: MediaWiki\Extension\WikiLambda\WikifunctionsClientStore::deleteWikifunctionsUsageQuery: DELETE FROM `wikifunctionscli - https://phabricator.wikimedia.org/T391885 [18:27:42] !log jforrester@deploy1003 jforrester: Continuing with sync [18:34:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T391056)', diff saved to https://phabricator.wikimedia.org/P74983 and previous config saved to /var/cache/conftool/dbconfig/20250414-183411-fceratto.json [18:34:15] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [18:35:10] RECOVERY - Hadoop NodeManager on an-worker1085 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [18:36:04] (03CR) 10Scott French: [C:03+2] P:trafficserver::backend: remove mw-php-migration [puppet] - 10https://gerrit.wikimedia.org/r/1135506 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [18:36:33] (03PS2) 10Scott French: P:trafficserver::backend: remove mw-php-migration [puppet] - 10https://gerrit.wikimedia.org/r/1135506 (https://phabricator.wikimedia.org/T391421) [18:37:29] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136368|Complete our RecentChanges entry generation and formatting (T386020)]], [[gerrit:1136379|Switch test Wikifunctions client deployment from test2wiki to test2iki (T391584)]], [[gerrit:1136380|Document Wikifunctions options, adding wgWikiLambdaClientModeOffline (T391584)]] (duration: 32m 25s) [18:37:33] T386020: Implement design for change propagation when WF function calls change - https://phabricator.wikimedia.org/T386020 [18:37:34] T391584: Complete 
the successful deployment of embedded Wikifunctions to testwiki - https://phabricator.wikimedia.org/T391584 [18:39:38] (03CR) 10Scott French: [C:03+2] P:trafficserver::backend: remove mw-php-migration [puppet] - 10https://gerrit.wikimedia.org/r/1135506 (https://phabricator.wikimedia.org/T391421) (owner: 10Scott French) [18:46:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.268s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:49:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P74984 and previous config saved to /var/cache/conftool/dbconfig/20250414-184918-fceratto.json [18:51:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.268s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:54:39] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10740852 (10phaultfinder) [18:55:55] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row D - bking@cumin2002 - T388610 [18:55:58] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [19:00:22] 10ops-eqiad, 06SRE, 
06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10740880 (10Jclark-ctr) @elukey would you like to shut it down or can we shutdown on our own? [19:00:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 06Infrastructure-Foundations: Swap RAID controller on ms-be1091.eqiad.wmnet - https://phabricator.wikimedia.org/T391854#10740885 (10Jclark-ctr) a:03Jclark-ctr [19:02:29] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2109 to cirrussearch2109 [19:02:51] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:03:24] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391654#10740894 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [19:04:21] (03CR) 10Kamila Součková: [C:03+1] php-fpm-multiversion-base: Stop copying mwcron scripts [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1135922 (https://phabricator.wikimedia.org/T391665) (owner: 10Clément Goubert) [19:04:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P74985 and previous config saved to /var/cache/conftool/dbconfig/20250414-190426-fceratto.json [19:04:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10740896 (10phaultfinder) [19:05:43] (03CR) 10Bking: [C:03+2] "This should really help reduce alert noise, thanks!" [alerts] - 10https://gerrit.wikimedia.org/r/1136426 (owner: 10Ebernhardson) [19:06:54] 06SRE-OnFire, 10Incident Tooling: Incident documents are less visible with Corto - https://phabricator.wikimedia.org/T390126#10740901 (10Eevans) >>! 
In T390126#10719499, @jhathaway wrote: > reached out to ITS in a follow-up task: https://wikimediainternal.zendesk.com/hc/en-us/requests/111894 Just following up... [19:07:24] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2109 to cirrussearch2109 - bking@cumin2002" [19:07:44] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2109 to cirrussearch2109 - bking@cumin2002" [19:07:44] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:07:45] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2109 [19:07:55] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2109 [19:08:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2109 to cirrussearch2109 [19:10:36] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2109.codfw.wmnet with OS bullseye [19:10:47] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2109 [19:12:59] (03CR) 10Dwisehaupt: "@jhathaway@wikimedia.org I think we are ready to roll this out when possible (maybe tomorrow 4/15). 
I'm not 100% certain that the prod mx-" [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [19:13:01] !log bking@cumin2002 START - Cookbook sre.dns.netbox [19:13:46] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T391644#10740934 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Rebalanced pdu [19:14:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10740947 (10phaultfinder) [19:17:09] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2109 - bking@cumin2002" [19:17:14] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2109 - bking@cumin2002" [19:17:14] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:17:15] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2109.codfw.wmnet 160.48.192.10.in-addr.arpa 0.6.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:17:18] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2109.codfw.wmnet 160.48.192.10.in-addr.arpa 0.6.1.0.8.4.0.0.2.9.1.0.0.1.0.0.4.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [19:17:19] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2109 [19:17:41] (03PS1) 10Eevans: Upgade data-gateway to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136436 (https://phabricator.wikimedia.org/T370470) [19:18:17] mforns: Ok, first step: upgrading data-gateway to v1.0.12 (matching what is already in staging ) ^^^ [19:19:13] 
(as soon as helm-lint has had its say ofc...) [19:19:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T391056)', diff saved to https://phabricator.wikimedia.org/P74986 and previous config saved to /var/cache/conftool/dbconfig/20250414-191933-fceratto.json [19:19:37] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [19:19:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2194.codfw.wmnet with reason: Maintenance [19:19:55] (03CR) 10Jgreen: [C:03+1] Enable mail and dovecot services for community_civicrm [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [19:19:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T391056)', diff saved to https://phabricator.wikimedia.org/P74987 and previous config saved to /var/cache/conftool/dbconfig/20250414-191957-fceratto.json [19:20:17] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2109 [19:20:17] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2109 [19:20:22] (03CR) 10Eevans: [C:03+2] Upgade data-gateway to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136436 (https://phabricator.wikimedia.org/T370470) (owner: 10Eevans) [19:21:42] 10ops-eqiad, 06SRE, 06DC-Ops: Fix "changeme" cable labels - https://phabricator.wikimedia.org/T390818#10740976 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr [19:21:49] (03Merged) 10jenkins-bot: Upgade data-gateway to v1.0.12 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1136436 (https://phabricator.wikimedia.org/T370470) (owner: 10Eevans) [19:23:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - 
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [19:23:43] !log eevans@deploy1003 helmfile [codfw] START helmfile.d/services/data-gateway: apply [19:24:02] !log eevans@deploy1003 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [19:24:27] !log eevans@deploy1003 helmfile [eqiad] START helmfile.d/services/data-gateway: apply [19:24:45] !log eevans@deploy1003 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply [19:26:01] mforns: ok, the data-gateway service is at v1.0.12, so I'm going to drop those 8 tables [19:28:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [19:31:32] !log dropped & recreated 8 commons impact metrics tables — https://phabricator.wikimedia.org/T370470#10687053 [19:31:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:42] mforns: you are good to start reloading [19:34:56] !log mforns@deploy1003 helmfile [eqiad] START helmfile.d/services/commons-impact-analytics: apply [19:35:11] !log mforns@deploy1003 helmfile [eqiad] DONE helmfile.d/services/commons-impact-analytics: apply [19:35:19] !log mforns@deploy1003 helmfile [codfw] START helmfile.d/services/commons-impact-analytics: apply [19:35:33] !log mforns@deploy1003 helmfile [codfw] DONE helmfile.d/services/commons-impact-analytics: apply [19:36:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T391056)', diff saved to https://phabricator.wikimedia.org/P74988 and previous config saved to /var/cache/conftool/dbconfig/20250414-193610-fceratto.json [19:36:15] T391056: Drop 
afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [19:36:45] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cirrussearch2109.codfw.wmnet with reason: host reimage [19:40:18] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cirrussearch2109.codfw.wmnet with reason: host reimage [19:43:48] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 2 others: Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of {limit} seconds was exceeded - https://phabricator.wikimedia.org/T381109#10741064 (10Umherirrender) a:03Umherirrender [19:47:45] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management, 10MediaWiki-Uploading, and 2 others: Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of {limit} seconds was exceeded (via Special:UploadStash) - https://phabricator.wikimedia.org/T381109#10741073 (10Umherirrender) [19:47:47] !log mforns@deploy1003 Started deploy [analytics/refinery@6fe5a7e] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6fe5a7e3] [19:48:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:50:31] !log mforns@deploy1003 Finished deploy [analytics/refinery@6fe5a7e] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6fe5a7e3] (duration: 02m 44s) [19:51:18] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P74989 and previous config saved to /var/cache/conftool/dbconfig/20250414-195117-fceratto.json [19:53:35] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed again - https://phabricator.wikimedia.org/T391372#10741088 (10VRiley-WMF) 
After working with Dell a bit more on this, I pushed back on their request regarding the iDRAC. They initially wanted to check if the newer firmware would collect more in-depth logs... [19:53:39] FIRING: [5x] ProbeDown: Service restbase1044-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:56:31] Nothing in the deploy window, so I may steal it. [19:56:36] (03CR) 10JHathaway: [C:03+1] "looks good, let me know if you need help in the rollout" [puppet] - 10https://gerrit.wikimedia.org/r/1135513 (https://phabricator.wikimedia.org/T383715) (owner: 10Dwisehaupt) [19:57:12] !log mforns@deploy1003 Started deploy [analytics/refinery@6fe5a7e]: Regular analytics weekly train [analytics/refinery@6fe5a7e3] [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. 
[20:00:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10741111 (10phaultfinder) [20:00:43] !log mforns@deploy1003 Finished deploy [analytics/refinery@6fe5a7e]: Regular analytics weekly train [analytics/refinery@6fe5a7e3] (duration: 03m 31s) [20:00:45] (03PS1) 10Jforrester: FunctionCalls: Use base64url encoding rather than raw base64 [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136447 (https://phabricator.wikimedia.org/T391584) [20:00:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136447 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester) [20:01:22] (03PS1) 10Jforrester: FunctionCalls: Don't error if Wikifunctions.org isn't in client mode yet [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136448 (https://phabricator.wikimedia.org/T391584) [20:01:27] !log mforns@deploy1003 Started deploy [analytics/refinery@6fe5a7e] (thin): Regular analytics weekly train THIN [analytics/refinery@6fe5a7e3] [20:01:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136448 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester) [20:01:55] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REIMAGE (3 nodes at a time) for ElasticSearch cluster search_codfw: reimage row D - bking@cumin2002 - T388610 [20:01:59] T388610: Migrate production Elastic clusters to Opensearch - https://phabricator.wikimedia.org/T388610 [20:02:27] !log 
bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cirrussearch2109.codfw.wmnet with OS bullseye [20:02:36] !log mforns@deploy1003 Finished deploy [analytics/refinery@6fe5a7e] (thin): Regular analytics weekly train THIN [analytics/refinery@6fe5a7e3] (duration: 01m 09s) [20:03:27] (03PS1) 10Jforrester: FunctionCalls: Throw an explicable error if json_encode returns null [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136449 (https://phabricator.wikimedia.org/T391584) [20:03:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136447 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester) [20:03:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136448 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester) [20:03:39] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136449 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester) [20:03:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 14 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136449 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester) [20:06:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P74990 and previous config saved to /var/cache/conftool/dbconfig/20250414-200624-fceratto.json [20:08:49] (03Merged) 10jenkins-bot: FunctionCalls: Use base64url encoding rather than raw base64 [extensions/WikiLambda] 
(wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136447 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester) [20:08:52] (03Merged) 10jenkins-bot: FunctionCalls: Don't error if Wikifunctions.org isn't in client mode yet [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136448 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester) [20:08:54] (03Merged) 10jenkins-bot: FunctionCalls: Throw an explicable error if json_encode returns null [extensions/WikiLambda] (wmf/1.44.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1136449 (https://phabricator.wikimedia.org/T391584) (owner: 10Jforrester) [20:09:13] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1136447|FunctionCalls: Use base64url encoding rather than raw base64 (T391584)]], [[gerrit:1136448|FunctionCalls: Don't error if Wikifunctions.org isn't in client mode yet (T391584)]], [[gerrit:1136449|FunctionCalls: Throw an explicable error if json_encode returns null (T391584)]] [20:09:16] T391584: Complete the successful deployment of embedded Wikifunctions to testwiki - https://phabricator.wikimedia.org/T391584 [20:14:03] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1136447|FunctionCalls: Use base64url encoding rather than raw base64 (T391584)]], [[gerrit:1136448|FunctionCalls: Don't error if Wikifunctions.org isn't in client mode yet (T391584)]], [[gerrit:1136449|FunctionCalls: Throw an explicable error if json_encode returns null (T391584)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:17:02] !log jforrester@deploy1003 jforrester: Continuing with sync [20:21:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T391056)', diff saved to https://phabricator.wikimedia.org/P74991 and previous config saved to /var/cache/conftool/dbconfig/20250414-202131-fceratto.json [20:21:38] T391056: Drop afl_patrolled_by from abuse_filter_log in production - 
https://phabricator.wikimedia.org/T391056 [20:21:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2205.codfw.wmnet with reason: Maintenance [20:21:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2205 (T391056)', diff saved to https://phabricator.wikimedia.org/P74992 and previous config saved to /var/cache/conftool/dbconfig/20250414-202152-fceratto.json [20:23:33] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1136447|FunctionCalls: Use base64url encoding rather than raw base64 (T391584)]], [[gerrit:1136448|FunctionCalls: Don't error if Wikifunctions.org isn't in client mode yet (T391584)]], [[gerrit:1136449|FunctionCalls: Throw an explicable error if json_encode returns null (T391584)]] (duration: 14m 20s) [20:23:37] T391584: Complete the successful deployment of embedded Wikifunctions to testwiki - https://phabricator.wikimedia.org/T391584 [20:38:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T391056)', diff saved to https://phabricator.wikimedia.org/P74993 and previous config saved to /var/cache/conftool/dbconfig/20250414-203800-fceratto.json [20:38:04] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:39:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10741316 (10phaultfinder) [20:53:07] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P74994 and previous config saved to /var/cache/conftool/dbconfig/20250414-205307-fceratto.json [20:56:16] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3636 MB (3% inode=98%): /tmp 3636 MB (3% inode=98%): /var/tmp 3636 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [21:00:04] Reedy, sbassett, Maryum, and manfredi: It is that lovely time of the day again! You are hereby commanded to deploy Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T2100). [21:03:20] 06SRE-OnFire, 10Incident Tooling: corto: track responders - https://phabricator.wikimedia.org/T391897 (10Eevans) 03NEW [21:05:55] 06SRE-OnFire, 10Incident Tooling: Incident documents are less visible with Corto - https://phabricator.wikimedia.org/T390126#10741431 (10jhathaway) not yet, but I asked for an update. [21:08:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P74995 and previous config saved to /var/cache/conftool/dbconfig/20250414-210814-fceratto.json [21:13:39] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:15:34] (03CR) 10JHathaway: [C:03+2] keyholder: restart proxy after arming a key [puppet] - 10https://gerrit.wikimedia.org/r/1136022 (https://phabricator.wikimedia.org/T374711) (owner: 10JHathaway) [21:16:00] 07Puppet, 06SRE, 06Infrastructure-Foundations, 10Keyholder, 13Patch-For-Review: keyholder-proxy doesn't restart on config change - https://phabricator.wikimedia.org/T374711#10741473 (10jhathaway) 05Open→03Resolved a:03jhathaway [21:17:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed 
[21:23:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T391056)', diff saved to https://phabricator.wikimedia.org/P74996 and previous config saved to /var/cache/conftool/dbconfig/20250414-212320-fceratto.json [21:23:25] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [21:23:37] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2227.codfw.wmnet with reason: Maintenance [21:23:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2227 (T391056)', diff saved to https://phabricator.wikimedia.org/P74997 and previous config saved to /var/cache/conftool/dbconfig/20250414-212344-fceratto.json [21:24:01] (03CR) 10JHathaway: [C:03+2] puppetmaster tests: remove resolving www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1134289 (owner: 10JHathaway) [21:34:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10741549 (10phaultfinder) [21:39:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T391056)', diff saved to https://phabricator.wikimedia.org/P74998 and previous config saved to /var/cache/conftool/dbconfig/20250414-213957-fceratto.json [21:40:01] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [21:45:04] PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:45:46] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:45:46] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 
seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:50:18] PROBLEM - SSH on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:50:24] PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:50:38] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 564 bytes in 0.602 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:50:42] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 555 bytes in 6.172 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:50:54] RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Mon 05 May 2025 06:42:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:51:08] RECOVERY - SSH on grafana1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:51:14] RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Mon 05 May 2025 06:42:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:51:39] ^ looking, I can't access Grafana. 
[21:53:46] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:53:46] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:54:04] PROBLEM - grafana-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:55:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P74999 and previous config saved to /var/cache/conftool/dbconfig/20250414-215504-fceratto.json [21:55:18] PROBLEM - SSH on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:55:21] (03PS1) 10Bking: cirrussearch: Add row C hosts [puppet] - 10https://gerrit.wikimedia.org/r/1136464 (https://phabricator.wikimedia.org/T388610) [21:55:24] PROBLEM - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:55:45] (03CR) 10CI reject: [V:04-1] cirrussearch: Add row C hosts [puppet] - 10https://gerrit.wikimedia.org/r/1136464 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:57:13] (03PS2) 10Bking: cirrussearch: Add row C hosts [puppet] - 10https://gerrit.wikimedia.org/r/1136464 (https://phabricator.wikimedia.org/T388610) [21:58:30] (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: Add row C hosts [puppet] - 10https://gerrit.wikimedia.org/r/1136464 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:58:47] (03CR) 10Bking: [C:03+2] cirrussearch: Add row C hosts [puppet] - 10https://gerrit.wikimedia.org/r/1136464 
(https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [22:00:54] FIRING: [2x] SLOMetricAbsent: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [22:01:38] (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: Add row D non-master hosts to elasticsearch pools [puppet] - 10https://gerrit.wikimedia.org/r/1135976 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [22:01:48] (03CR) 10Bking: [C:03+2] cirrussearch: Add row D non-master hosts to elasticsearch pools [puppet] - 10https://gerrit.wikimedia.org/r/1135976 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [22:04:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10741650 (10phaultfinder) [22:05:02] RECOVERY - grafana-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Mon 05 May 2025 06:42:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:05:08] RECOVERY - SSH on grafana1002 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u5 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [22:05:14] RECOVERY - grafana-next-rw.wikimedia.org tls expiry on grafana1002 is OK: OK - Certificate grafana.discovery.wmnet will expire on Mon 05 May 2025 06:42:00 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:05:36] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:05:36] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [22:06:41] !log ryankemper@cumin2002 conftool action : set/pooled=yes:weight=10; selector: name=cirrussearch2060.codfw.wmnet|cirrussearch2067.codfw.wmnet|cirrussearch2068.codfw.wmnet|cirrussearch2072.codfw.wmnet|cirrussearch2085.codfw.wmnet|cirrussearch2104.codfw.wmnet|cirrussearch2105.codfw.wmnet|cirrussearch2107.codfw.wmnet|cirrussearch2109.codfw.wmnet|cirrussearch2114.codfw.wmnet|cirrussearch2115.codfw.wmnet [22:07:41] FIRING: [8x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_search-https.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [22:10:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P75000 and previous config saved to /var/cache/conftool/dbconfig/20250414-221012-fceratto.json [22:13:18] Hey all - currently deploying one security patch for today’s window: T391343 [22:16:16] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3510 MB (3% inode=98%): /tmp 3510 MB (3% inode=98%): /var/tmp 3510 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [22:19:44] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device 
ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10741660 (10phaultfinder) [22:20:32] !log Deployment of security patch for T391343 halted [22:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:39] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:25:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T391056)', diff saved to https://phabricator.wikimedia.org/P75001 and previous config saved to /var/cache/conftool/dbconfig/20250414-222519-fceratto.json [22:25:24] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [22:25:25] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2239.codfw.wmnet with reason: Maintenance [22:27:16] (03CR) 10Dzahn: "Ah, transitory in _that_ way. I see now, ok. thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [22:29:47] FIRING: [5x] HelmReleaseBadStatus: Helm release mw-api-ext/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:30:09] (03CR) 10Dzahn: [C:04-1] "Ideally, let's avoid a pattern where setting up a new machine requires coordination between teams (and using both puppet and scap)." 
[puppet] - 10https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: 10Dzahn) [22:30:43] !log Deployed previous good versions of affected files for T391343 [22:30:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:33:50] !log dzahn@deploy1003 Installing scap version "4.153.0" for 1 host(s) [22:34:47] RESOLVED: [5x] HelmReleaseBadStatus: Helm release mw-api-ext/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:34:49] !log dzahn@deploy1003 Installation of scap version "4.153.0" completed for 1 hosts [22:34:54] !log deploy1003 - scap install-world -l release2003.codfw.wmnet T391590 [22:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:57] T391590: PuppetFailure - releases2003 - https://phabricator.wikimedia.org/T391590 [22:35:34] PROBLEM - MD RAID on aqs1015 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:35:35] ACKNOWLEDGEMENT - MD RAID on aqs1015 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T391903 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [22:35:45] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1015 - https://phabricator.wikimedia.org/T391903 (10ops-monitoring-bot) 03NEW [22:37:18] (03PS1) 10Ladsgroup: wikimedia.org: Add gitlab.wm.o HIBP TXT record, remove lists [dns] - 10https://gerrit.wikimedia.org/r/1136465 [22:37:52] (03CR) 10CI reject: [V:04-1] wikimedia.org: Add gitlab.wm.o HIBP TXT record, remove lists [dns] - 10https://gerrit.wikimedia.org/r/1136465 
(owner: Ladsgroup) [22:39:08] (CR) Ladsgroup: "recheck" [dns] - https://gerrit.wikimedia.org/r/1136465 (owner: Ladsgroup) [22:39:49] (PS2) Ladsgroup: wikimedia.org: Add gitlab.wm.o HIBP TXT record, remove lists [dns] - https://gerrit.wikimedia.org/r/1136465 [22:41:55] (CR) Ladsgroup: [C:+2] wikimedia.org: Add gitlab.wm.o HIBP TXT record, remove lists [dns] - https://gerrit.wikimedia.org/r/1136465 (owner: Ladsgroup) [22:42:16] !log ladsgroup@dns1004 START - running authdns-update [22:44:40] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10741683 (phaultfinder) [22:44:45] !log ladsgroup@dns1004 END - running authdns-update [22:46:13] (CR) Dzahn: [C:-1] "What scap command would you actually run?" [puppet] - https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: Dzahn) [22:53:46] (PS1) MusikAnimal: testwiki: enable wgUseCodexSpecialBlock and wgEnableMultiBlocks [mediawiki-config] - https://gerrit.wikimedia.org/r/1136466 (https://phabricator.wikimedia.org/T377121) [22:54:53] (CR) Tim Starling: [C:+1] testwiki: enable wgUseCodexSpecialBlock and wgEnableMultiBlocks [mediawiki-config] - https://gerrit.wikimedia.org/r/1136466 (https://phabricator.wikimedia.org/T377121) (owner: MusikAnimal) [22:56:16] PROBLEM - Disk space on archiva1002 is CRITICAL: DISK CRITICAL - free space: / 3448 MB (3% inode=98%): /tmp 3448 MB (3% inode=98%): /var/tmp 3448 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=archiva1002&var-datasource=eqiad+prometheus/ops [22:58:43] FIRING: [5x] ProbeDown: Service restbase1044-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250414T2300) [23:00:46] (CR) Dzahn: [C:-1] "deploy1003:~] $ scap deploy -v -l releases2003.codfw.wmnet" [puppet] - https://gerrit.wikimedia.org/r/1135994 (https://phabricator.wikimedia.org/T384595) (owner: Dzahn) [23:03:29] SRE: archiva1002 - disk 98% full - https://phabricator.wikimedia.org/T391904 (Dzahn) NEW [23:12:02] !log zabe@mwmaint1002:~$ cat group2.dblist | xargs -I{} bash -c "echo {}; mwscript extensions/WikimediaMaintenance/migrateESRefToContentTableStage2.php {} --delete /home/zabe/afl_text_table_deletedump/{} --sleep 0.3" # T381599 [23:12:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:12:05] T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599 [23:22:37] !log bootstrapping Cassandra/restbase1044-b — T389423 [23:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:41] T389423: Refresh restbase10[28-30] w/ restbase104[3-5] - https://phabricator.wikimedia.org/T389423 [23:23:39] FIRING: [4x] ProbeDown: Service restbase1044-b:7000 has failed probes (tcp_cassandra_b_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:28:39] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [23:40:00] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1136472 [23:40:00] (CR)
TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1136472 (owner: TrainBranchBot) [23:40:33] (CR) Creynolds: [C:+1] dumps enterprise copy update [puppet] - https://gerrit.wikimedia.org/r/1135522 (owner: Creynolds) [23:48:39] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:52:26] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1136472 (owner: TrainBranchBot) [23:56:00] (PS2) Scott French: hieradata: switch parsoidtest1001 to PHP 8.1 [puppet] - https://gerrit.wikimedia.org/r/1136413 (https://phabricator.wikimedia.org/T380485)